S S symmetry

Article Multi-Level P2P Traffic Classification Using Heuristic and Statistical-Based Techniques: A Hybrid Approach

Max Bhatia 1, Vikrant Sharma 1, Parminder Singh 1,* and Mehedi Masud 2,*

1 Department of Computer Science Engineering, Lovely Professional University, Punjab 144001, India; [email protected] (M.B.); [email protected] (V.S.) 2 Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia * Correspondence: [email protected] (P.S.); [email protected] (M.M.)

 Received: 31 October 2020; Accepted: 17 December 2020; Published: 20 December 2020 

Abstract: Peer-to-peer (P2P) applications have been popular among users for more than a decade. They consume a lot of network bandwidth, due to the fact that network administrators face several issues such as congestion, security, managing resources, etc. Hence, its accurate classification will allow them to maintain a for various applications. Conventional classification techniques, i.e., port-based and payload-based techniques alone, have proved ineffective in accurately classifying P2P traffic as they possess significant limitations. As new P2P applications keep emerging and existing applications change their communication patterns, a single classification approach may not be sufficient to classify P2P traffic with high accuracy. Therefore, a multi-level P2P traffic classification technique is proposed in this paper, which utilizes the benefits of both heuristic and statistical-based techniques. By analyzing the behavior of various P2P applications, some heuristic rules have been proposed to classify P2P traffic. The traffic which remains unclassified as P2P undergoes further analysis, where statistical-features of traffic are used with the C4.5 decision tree for P2P classification. The proposed technique classifies P2P traffic with high accuracy (i.e., 98.30%), works with both TCP and UDP traffic, and is not affected even if the traffic is encrypted.

Keywords: heuristic-based classification; multi-level P2P traffic classification; P2P-port based classification; statistical-based classification

1. Introduction The P2P networking technology is used to share and distribute media, documents, software, etc., among peers. A decade ago, peers on the Internet used the client-server architecture, where the clients request data from the server and the server responds with the requested data. Due to this reason, the majority of the Internet traffic used to be asymmetric in nature. However, with the evolution of P2P traffic, network traffic started becoming symmetric. In such a case, a peer starts acting simultaneously as a client and server, thereby downloading and uploading the data at the same time. Due to this factor, as well as a rise in the number of P2P users, it has become one of the major contributors of internet traffic. It has ended the dominance of other numerous application protocols (for example, FTP, SMTP, HTTP, etc.), which used to rule the Internet more than a decade ago [1]. There has been a significant trend of P2P file-sharing, in recent years, through P2P applications where audios, videos, games, and software are being shared or distributed, significantly large in size [2]. The main issue with P2P traffic is that it consumes a large amount of network bandwidth [1,3–5]. Conventional network devices cannot handle the traffic of P2P applications, due to the fact that network administrators and ISPs face various challenges such as providing excellent broadband experience to customers, purchasing of backbone links, and upstreaming bandwidth, which are costly.

Symmetry 2020, 12, 2117; doi:10.3390/sym12122117 www.mdpi.com/journal/symmetry Symmetry 2020, 12, 2117 2 of 22

The main issue with P2P traffic is that it consumes a large amount of network bandwidth [1,3– 5]. Conventional network devices cannot handle the traffic of P2P applications, due to the fact that networkSymmetry 2020 administrators, 12, 2117 and ISPs face various challenges such as providing excellent broadband2 of 22 experience to customers, purchasing of backbone links, and upstreaming bandwidth, which are costly. Considering the overall network traffic, which is composed of traffic from various application Considering the overall network traffic, which is composed of traffic from various application protocols protocols (for example, SMTP, FTP, DNS, HTTP, P2P, HTTPS, etc.), traffic from P2P applications (for example, SMTP, FTP, DNS, HTTP, P2P, HTTPS, etc.), traffic from P2P applications alone consumes alone consumes a significant portion of the available network bandwidth. Due to this reason, other a significant portion of the available network bandwidth. Due to this reason, other kinds of application kinds of application protocols do not get a fair amount of network bandwidth, resulting in a poor protocols do not get a fair amount of network bandwidth, resulting in a poor Quality of Service for Quality of Service for such applications. Therefore, it is required to monitor and classify P2P traffic, such applications. Therefore, it is required to monitor and classify P2P traffic, which will help ISPs and which will help ISPs and network administrators perform various tasks, for example, implementing network administrators perform various tasks, for example, implementing network specific policies for network specific policies for providing Quality of Service to each network application, implementing providing Quality of Service to each network application, implementing billing mechanisms based on billing mechanisms based on the type of traffic used by customers, implementing network security the type of traffic used by customers, implementing network security measures, addressing network measures, addressing network congestion issues, etc. Furthermore, enterprises can either limit or ban congestion issues, etc. Furthermore, enterprises can either limit or ban P2P traffic from evading P2P traffic from evading network congestion and maintaining Quality of Service for various network congestion and maintaining Quality of Service for various applications in their network, applications in their network, as shown in Figure 1. as shown in Figure1.

Figure 1. ControllingControlling the quality of service.

Nowadays, classifying classifying P2P P2P traffic traffi accuratelyc accurately is a is difficult a diffi culttask taskas various as various P2P applications, P2P applications, either masqueradeeither masquerade or encrypt or encrypttheir traffic their to avoid traffic detectio to avoidn [6,7]. detection There [are6,7 ].some There techniques are some for techniquesclassifying thefor network classifying traffic, the such network as port-based, traffic, such payload-based, as port-based, and payload-based, Classification in andthe Dark Classification (which includes in the statistical-based,Dark (which includes pattern, statistical-based, or heuristic-based pattern, techniques) or heuristic-based [7–9]. Since techniques) many P2P [7 –applications9]. Since many are masqueradingP2P applications their are traffic masquerading either by disguising their traffi portc either numbers by disguising or encrypting port payloads, numbers port-based or encrypting and payload-basedpayloads, port-based techniques and payload-based are inefficient techniques in accurately are ineclassifyingfficient in P2P accurately traffic. classifyingClassification P2P in tra theffic. DarkClassification techniques in therely Dark on techniquesthe traffic’s rely statistica on thel trafeaturesffic’sstatistical or behavioral features patterns or behavioral to perform patterns the classificationto perform the and classification hence, do not and rely hence, on port do notnumb relyers on or port payload numbers contents or payload of the traffic. contents They of are the effectivetraffic. They in classifying are effective P2P in traffic classifying these P2P days. tra Theyffic these can days.also classify They can encrypted also classify traffic encrypted and unknown traffic applicationsand unknown from applications target classes from but target cannot classes perform but cannot the traffic perform classification the traffi withc classification high accuracy with as high the payload-basedaccuracy as the technique payload-based [6]. Theref techniqueore, to [6 ].achieve Therefore, a high to achieveclassifica ation high accuracy classification of P2P accuracy traffic, ofa singleP2P tra methodffic, a single alone method may not alone be sufficient. may not beWe su proposefficient. a We hybrid propose technique a hybrid in this technique paper, inwhich this paper,is the combinationwhich is the combinationof heuristic-based of heuristic-based and statistical-based and statistical-based techniques. techniques. The main main aim aim of of this this paper paper is to is propose to propose a hybrid a hybrid technique technique for P2P for traffic P2P classification, traffic classification, which accomplisheswhich accomplishes the following the following tasks: tasks: ability to classify P2P traffic with high accuracy. • ability to work with both TCP and UDP protocols (since various P2P applications use either TCP • or UDP or both protocols for communication). Symmetry 2020, 12, 2117 3 of 22

involves less computation in classifying P2P traffic (by not relying on the DPI approach for • classification) in comparison to various existing hybrid techniques. ability to classify P2P traffic even if it is encrypted. • The experiments performed using the proposed hybrid technique achieved a high classification accuracy of 98.30%, which is higher than other hybrid/non-hybrid techniques, and also combines the benefits of heuristic-based (less computation as compared to DPI) as well as statistical-based (scalability) techniques. Further, unlike various existing hybrid techniques, the proposed technique does not rely on the signature-based technique. Rather, it utilizes a set of heuristic rules which comparatively involves less computation in classifying P2P traffic. In addition to that, the heuristics proposed in this paper perform equally with both TCP and UDP traffic flows and are not affected even if a traffic flow is encrypted. The rest of the paper is organized as follows. Section2 discusses the related work. Section3 analyzes various P2P traffic classification techniques. Section4 discusses the multi-level P2P tra ffic classification technique. Section5 discusses the evaluation criteria and experimental results. Finally, Section6 concludes the research work.

2. Related Work P2P applications have become very popular since the past decade, and the traffic generated by such applications continues to grow as new applications keep emerging and many peers join the network to use them. P2P traffic is one the largest contributors to internet traffic [10], which has a major impact on it due to its large volume and long connection time, leading to network congestion. Its traffic flows in large amounts in both directions, i.e., P2P applications act as a client and server concurrently by downloading the data from other peers and serving the request of multiple other peers by uploading the data requested. P2P applications are generally utilized for sharing large files among various peers. Once initiated, these applications require little or no human intervention and are usually left unattended for a long time, which results in a large network activity throughout the day [11]. Therefore, such kind of traffic can be observed naturally over 24 h. Conventional traffic classification techniques such as port-based and payload-based are ineffective in classifying P2P traffic due to the various limitations associated with them. Hence, modern classification techniques such as statistical-based or heuristic-based are employed for this purpose. Reddy and Hota [12] used the heuristic-based technique by analyzing connection patterns of the host to identify P2P traffic and found the average detection rate of 99%. They achieved this detection rate by classifying the TCP flows as non-P2P, which communicates over default port number 80. However, if a P2P application masquerades using a TCP port number (e.g., 80 used by HTTP) [5] or a new P2P application protocol emerges with different communication patterns, then it may not satisfy any of the proposed heuristics. Hence, it would lead to many miss-classifications, due to which a high detection rate may not be achieved. Bozdogan et al. [13] assessed four supervised and one un-supervised ML algorithms, namely SVM, C4.5 decision tree, Ripper, Naïve Bayesian, and K-means, respectively, for the identification of P2P network applications. They found that Ripper and C4.5 algorithms have a similar performance with the detection rate ranging between 58.9–99.1% and 15.6–98.1%, respectively. However, the evaluation was performed using only three P2P applications namely BitComment, BitTorrent, and uTorrent. Tseng et al. [14] proposed a methodology to classify P2P traffic on the basis of aggregation clustering. A similar traffic flow was aggregated by determining the correlation between clusters through their distance ratio. This approach classifies both known and unknown traffic flows with an overall accuracy of 90.50%. Chuan et al. [15] utilized the Bat algorithm to search the most relevant parameters, which can be used with SVM for classifying P2P traffic and was able to achieve the classification accuracy ranging between 86.77–91.34%. Abdalla et al. [16] proposed a multi-stage method for feature selection in order to create a subset of optimal statistical traffic features that can be utilized for online classification of P2P traffic. The authors used J48 and Naïve Bayes as ML algorithms, which achieved classification accuracy and recall rates ranging between Symmetry 2020, 12, 2117 4 of 22

96.29–99.78% and 86.9–99.8%, respectively using a set of six proposed features. However, these six proposed features alone may not be effective in classifying existing P2P applications, which may have evolved (since the creation of public datasets which are used here) or newer P2P application protocols as they emerge. Jamil et al. [17] proposed an approach to develop a model which combines SNORT rules (which is based on the packet payload) and the ML algorithm for classifying P2P traffic. The technique used fuzzy-rough and Chi-square as feature selection algorithms and evaluated the performance of 3 ML algorithms namely SVM, C4.5 decision tree, and ANN and achieved a 99.7% classification accuracy using the combination of ANN and C4.5. However, the technique relies on the payload-based approach (SNORT), which has various limitations. Nazari et al. [18] proposed an approach called DSCA, which is based on the DPI technique for the identification of various P2P and non-P2P applications over an encrypted network. The proposed technique used four modules, namely feature-extractor (for maintaining the flows), inline-DPI (for labelling traffic flows and detecting new applications), stream-processor (for handling flows between the feature-extractor and stream-classifier), and stream-classifier (for building the classification function). The experimental results achieved a maximum classification accuracy of 96.75%. However, this technique also relies on the payload based approach, which has various limitations. Ye and Cho [19–21] proposed a hybrid technique to classify P2P traffic in two steps. The first step performs classification at the packet-level by combining signature-based and heuristic-based techniques. The second step performs classification at the flow-level by combining statistical-based and heuristic-based techniques to classify the remaining unknown traffic. The authors achieved an overall flow-accuracy and byte-accuracy ranging between 97.70–98.19% and 97.06–99.82%, respectively. However, their technique does not classify the UDP traffic and also relies on the payload-based approach (which has various limitations). Khan et al. [22] proposed a hybrid approach for classifying the traffic into normal P2P and P2P-. In the first stage, the non-P2P traffic is separated by using the mechanism of well-known port numbers, DNS query filtering, and flow-counting rules. The remaining traffic is considered as P2P traffic and is fed into the second stage where the wrapper method is utilized for selecting traffic features and the decision tree algorithm is employed for classifying the traffic either as normal P2P or P2P-botnet. The experimental results achieved the classification accuracy of 94.4%. However, this technique considers the network traffic to be non-P2P (in the first stage), which uses well-known port numbers (e.g., 20, 21, 80, 443, etc.) for communication. This could lead to many false negative cases, since many P2P applications can masquerade using these well-known port numbers and hence, such traffic can go undetected. A multi-level P2P traffic classification technique is proposed in this paper, which is a hybrid approach. It utilizes the combination of heuristic-based and statistical-based techniques for classifying P2P traffic. In addition, it does not rely on the payload-based technique for classification (which has various limitations), but rather utilizes a set of heuristic rules proposed in this paper, which comparatively involves less computation, performs equally with both TCP and UDP traffic flows, and is not affected even if the traffic flow is encrypted.

3. Analysis of Existing P2P Traffic Classification Techniques Earlier, the task of classifying Internet traffic was easy and simple as it required the information of port numbers only to perform the traffic classification. However, as P2P applications evolved, they started masquerading the traffic using well-known port numbers (for example: HTTP, HTTPS, etc.) or random port numbers to avert detection. Therefore, the port-based technique started becoming inefficient. As a result, another technique was utilized, which is based on the packet payload of the traffic for its classification. However, this technique also has its fair share of limitations, due to the fact that newer classification techniques are adopted currently, which are either statistically based, pattern or heuristic-based, or even a hybrid technique to overcome the drawbacks of traditional classification techniques. A brief description of these techniques along with the related work is mentioned below. Symmetry 2020, 12, 2117 5 of 22

3.1. Port-Based Traffic Classification This technique classifies the network traffic based on the TCP/UDP port number present in the transport layer of a packet header. Internet assigned numbers authority (IANA) [23] defines well-known port numbers, which are associated with each application protocol. For example, FTP traffic is transferred using port number 20 and 21, HTTP traffic is transferred using port number 80, SMTP traffic is transferred using port number 25, etc. In this technique, the TCP/UDP port number is extracted from the packet header. In the case of TCP connection, a classifier analyzes the SYN packets (packets which are used for the three-way handshake for establishing the connection) target port number from the registered list of port numbers defined by IANA [23], for classifying the network traffic to a particular type. Similarly, the UDP traffic is classified by using port numbers used by the hosts during communication, but unlike TCP, it does not involve any connection establishment. Gomes et al. [6] mentioned several well-known port numbers used by various P2P application protocols, some of which are shown in Table1. The prime advantage of this classification technique lies in its simplicity to implement as it does not involve any calculations to classify network traffic. For newer applications, only the database is required to be updated with new port numbers. However, with the proliferation of the Internet, various P2P applications have emerged and evolved. They either use random port numbers for communication (which may not be registered with IANA) [24], or use the masquerading technique to disguise their traffic so that they appear to originate from a well-known protocol (e.g., HTTP, HTTPS, etc., which is not blocked or filtered) and hence, such kind of traffic goes undetected. Therefore, the port-based technique is inefficient in classifying all P2P traffic correctly and has become obsolete now [25–27].

Table 1. List of well-known ports used by various peer-to-peer (P2P) protocols.

Protocols TCP/UDP Port Numbers BitTorrent 6881–6999 Direct Connect 411, 412, 1025–32,000 eDonkey 2323, 3306, 4242, 4500, 4501, 4661–4674, 4677, 4678, 4711, 4712, 7778 FastTrack 1214, 1215, 1331, 1337, 1683, 4329 Yahoo (messages/video/voice) 5000–5010, 5050, 5100 Napster 5555, 6257, 6666, 6677, 6688, 6699–6701 MSN (voice/file-transfer) 1863, 6891–6901 MP2P 10,240–20,480, 22,321, 41,170 Kazaa 1214 Gnutella 6346–6347 ARES Galaxy 32285 AIM (messages/video) 1024–5000, 5190

Madhukar and Williamson [28] showed that the port-based technique could not classify internet traffic correctly. Karagiannis et al. [29] found that 30% to 70% of the traffic used random port numbers, which were generated by P2P applications and various P2P applications used the well-known port 80 (i.e., HTTP) for transferring their data. Moore and Papagiannaki [26] could achieve a maximum of 70% byte accuracy using the port-based classification technique. Since the port-based traffic classification is a conventional technique, its related work is referred to in [6].

3.2. Payload-Based Traffic Classification This technique (also known as or DPI) makes use of packet payload to classify network traffic. It utilizes the database containing signatures of application protocols which Symmetry 2020, 12, 2117 6 of 22 have been stored previously. In addition, it inspects the packet payload of the traffic bit-wise to locate a bit-stream containing pre-defined byte sequence (called signatures) of the application protocol. In this way, the traffic is classified accurately when the packet-signatures extracted from network application maps with one of the packet-signatures are already stored in the database. For example, ‘ GET’ string \ is found in HTTP traffic, ‘xe3 x38 string is found in eDonkey P2P traffic, etc. The prime advantage of \ 0 this technique is classifying the network traffic with very high accuracy. However, this technique also has various limitations [6,30,31], which are mentioned below:

it involves a great amount of processing load and complexity. • it is not feasible in high-speed networks as it needs a large amount of computational resources for • inspecting the traffic. it is almost impossible to classify the network traffic if it contains a proprietary protocol or if the • traffic is encrypted. it may violate privacy policies of some organizations which do not allow a direct inspection • of packets.

Since the payload-based traffic classification is a conventional technique, its related work is referred to in [6].

3.3. Classification in the Dark This approach overcomes the drawbacks of port-based and payload-based techniques (as mentioned in previous sections). It classifies network traffic by either using statistical properties of packets associated with a traffic flow [25] (known as statistical-based technique), or by analyzing communication patterns of a traffic flow (known as heuristic-based technique). However, this technique does not give as high accurate results as the payload-based technique. The basic idea behind the heuristic-based technique is to classify the network traffic using a pre-defined set of heuristic rules. These rules are constructed by analysing the communication patterns of the traffic, for example, a number of outbound connections made by the host for communication, whether the host is acting as a client and server concurrently during the communication, etc. Perenyi et al. [32] proposed a technique to identify P2P traffic using a set of six heuristics, which achieved a 99.14% recall rate. However, they cannot achieve high classification results currently, since they were made to work with the older dataset consisting of P2P applications, which are either phased-out or have already evolved since then. Yan et al. [33] proposed a novel approach based on flow statistics and host-based heuristics to identify P2P traffic and achieved a 93.9% flow accuracy and 96.3% byte accuracy. However, the results of this approach degrade if the network address translation (NAT) is in use or if the traffic uses dynamic IP addresses. Wang et al. [34] proposed a technique to utilize behavioural features of traffic flows with the C4.5 decision tree to identify P2P traffic. The experimental results achieved precision values ranging between 90.96–93.66% and recall values ranging between 86.69–95.73% in identifying PPTV, , and Thunder. Zhang et al. [35] proposed the component-based technique (i.e., based on the graph theory) to analyze various P2P applications which use the UDP protocol for communication. They argue that component-level statistics can be used to detect P2P traffic reliably and accurately. However, they did not perform the experimental analysis to show its effectiveness. The basic idea behind the statistical-based technique is that it classifies the traffic using its flow-level or packet-level properties, for example, the total bytes received/sent, packet size, packet inter-arrival time, duration of traffic flow, etc., which can be used collectively or individually to calculate statistical measures such as the average, variance, and probability density function. The assumption here is that different applications generate traffic flows that possess unique characteristics. With the increase in the number of traffic features, mapping of the traffic features with the corresponding classes manually becomes difficult. Due to this reason, ML algorithms are generally used along with statistical features of traffic. Here, a reference model is built with the help of pre-labelled training, which is then used to classify the traffic of the testing dataset. Sun and Chen [36] utilized the C4.5 decision tree for Symmetry 2020, 12, 2117 7 of 22 classifying applications which use TCP flows. This technique analyzed the amount of data first sent by the hosts continuously during the communication and achieved classification accuracy ranging between 97.648–99.694%. However, it only classifies traffic associated with TCP flow and does not work on UDP flow. Gong et al. [37] proposed an incremental algorithm to improve the learning of existing SVM, which has good space and time complexity and achieved the identification accuracy of 87.89% in identifying P2P traffic. Deng et al. [38] proposed the ensemble learning model which uses the combination of random forests and feature weighted naive Bayes (FWNB) to classify P2P traffic. The experimental results achieved an overall classification accuracy of 92.47%. Qin et al. [39] developed a framework called CUFTI to identify P2P traffic using the payload length as well as the direction of the control packets (which appears at the start of a flow) as the flow features. However, the proposed technique used only three applications namely PPlive, BitTorrent, and Thunder for the experiment and achieved FNR and FPR rates ranging between 8.47–34.57% and 3.49–22.26%, respectively. He et al. [40] proposed a fine-grained P2P traffic classification approach by analyzing the hosts. The experimental results achieved a 97.22% true positive rate. However, the proposed technique focused only on the P2P file-sharing traffic for classification purposes. Ertam and Avci [41] used the kernel based extreme learning machine (KELM) approach combined with the genetic algorithm (GA) for feature selection in classifying the Internet traffic (which also included P2P traffic) and the experimental results achieved an average classification accuracy of 96.57%. Sun et al. [42] classified the network traffic using a model named TrAdaBoost, which is a transfer learning model and is a modified version of AdaBoost. The proposed technique classified various kinds of network traffic (such as www, mail, database, etc.) where P2P traffic was classified with a 91.8% accuracy. Lim et al. [43] utilized deep learning models namely CNN and ResNet to classify the network traffic. It used packet payloads as image data to create datasets and train deep learning models. Traffic from eight applications were used for experimental purposes, which included only two P2P applications (i.e., Skype and BitTorrent) and achieved the f1-score of 0.97. The largest contributor of overall P2P traffic in the Internet includes file-sharing applications (e.g., BitTorrent) and VoIP applications (e.g., Skype) [10]. Therefore, there are also various studies which specifically focus on classifying such P2P applications. For example, techniques proposed in [44,45] focus on classifying the BitTorrent traffic, whereas techniques proposed in [46–51] focus on classifying the VoIP traffic. Moreover, there are some hybrid techniques which classify P2P traffic. Li et al. [52] proposed a hybrid classification technique using the combination of C4.5 decision tree, port-based, and payload-based techniques in a two-step process and achieved an overall classification accuracy of 96.03%. Chen et al. [53] proposed a hybrid technique by combining the hardware classifier (based on the network processor) and software classifier based on FNT for classifying P2P traffic. The proposed technique achieves the accuracy of 95.67%, but it relies on a dedicated hardware for the classification of P2P traffic. Keralapura et al. [5] proposed a two-stage classifier known as SLTC (self-learning traffic classifier) to classify P2P traffic and achieved the detection rate of 95%. Nair and Sajeev [54] proposed a technique which uses the combination of pattern-based and statistical-based approaches to classify the traffic into P2P and non-P2P and achieved a maximum classification accuracy of 91.42%. The authors proposed another hybrid technique in [55], where they classified P2P traffic using the packet header and payload information in the statistical-based technique (which utilized the C4.5 ML algorithm) and achieved the detection rate of 95%. Most of the hybrid techniques discussed above classify P2P traffic by making use of the signature/payload-based technique which has various limitations (as mentioned in the previous section). Therefore, they may not be able to achieve a good classification accuracy if the traffic is encrypted or contains newer/proprietary application protocols. Apart from this, a single (non-hybrid) technique may not be sufficient for classifying P2P traffic, since depending on the approach to be utilized for classifying P2P traffic, it may not be applicable for real-time classification (due to the large computation involved) or may not be able to classify newer/proprietary application protocols [20]. Symmetry 2020, 12, 2117 8 of 22

Therefore, we propose a multi-level P2P traffic classification technique which is a hybrid approach. It combines heuristic-based and statistical-based techniques to achieve a high accuracy of 98.30% in classifying P2P traffic. In addition, the classification process involves less computation, since unlike other various hybrid approaches, it does not make use of the signature/payload-based technique for classifying P2P traffic. Symmetry 2020, 12, 2117 8 of 22 4. Multi-Level P2P Traffic Classification Technique 4. Multi-Level P2P Traffic Classification Technique Based on the previous analysis, a multi-level P2P traffic classification technique is proposed. It is dividedBased into twoon the steps, previous where analysis, the first a step multi-level performs P2P the traffic traffi cclassification classification technique at a packet-level is proposed. and It the secondis divided step performsinto two steps, the tra whereffic classification the first step at perf a flow-level.orms the traffic classification at a packet-level and the second step performs the traffic classification at a flow-level. 4.1. System Model for Classifying P2P Traffic 4.1. System Model for Classifying P2P Traffic Figure2 illustrates the overall system of the P2P tra ffic classification process, which is sub-divided into a two-stepFigure 2 illustrates process, the namely overall packet-level system of the process P2P traffic and classificat flow-levelion process, process. which In the is sub-divided packet-level into a two-step process, namely packet-level process and flow-level process. In the packet-level classification process, the P2P-port based technique in combination with the packet-heuristics based classification process, the P2P-port based technique in combination with the packet-heuristics based technique performs a traffic classification. The traffic which remains un-classified as P2P (in the first technique performs a traffic classification. The traffic which remains un-classified as P2P (in the first step) is then fed to the flow-level classification process where flow-heuristics are combined with the step) is then fed to the flow-level classification process where flow-heuristics are combined with the statistical-based technique to perform a classification of the remaining traffic. The proposed technique statistical-based technique to perform a classification of the remaining traffic. The proposed technique is implementedis implemented in in java java with with the the helphelp of the jNetPcap jNetPcap library library [56] [56 and] and Weka Weka [57]. [57 ].

FigureFigure 2. 2.Multi-level Multi-level P2PP2P tratrafficffic classification classification technique. technique.

WhileWhile performing performing thethe task of of traffic traffi cclassificatio classification,n, a combination a combination of five of network five network parameters parameters (i.e., (i.e.,source-IP, source-IP, destination-IP, destination-IP, source-port source-port destination-p destination-port,ort, and protocol) and protocol) are generally are generally used to used define to definethe thetraffic-flow traffic-flow [36]. [36 All]. All the the communication communication that thathappen happenss among among the two the processe two processess will share will these share same these samefive five parameters. parameters. In the In thepacket-level packet-level classification classification process, process, packets packets belonging belonging to the to thesame same flow flow are are recognizedrecognized by calculatingby calculating the hash-keythe hash-key of the of packet the throughpacket through combining combining the five-tuple the five-tuple flow information, flow as showninformation, in Figure as shown3. In this in Figure way, packets 3. In this belonging way, packets to the belonging same flow to andthe same travelling flow and in either travelling direction in willeither have direction the same will hash-key. have the same This hash-keyhash-key. This is useful hash-key to find is useful out if to the find packets out if the belonging packets belonging to the flow haveto the already flow have been already classified been as classified P2P or not. as P2P or not.

Figure 3. Calculation of the packet hash-key.

Symmetry 2020, 12, 2117 8 of 22

4. Multi-Level P2P Traffic Classification Technique Based on the previous analysis, a multi-level P2P traffic classification technique is proposed. It is divided into two steps, where the first step performs the traffic classification at a packet-level and the second step performs the traffic classification at a flow-level.

4.1. System Model for Classifying P2P Traffic Figure 2 illustrates the overall system of the P2P traffic classification process, which is sub-divided into a two-step process, namely packet-level process and flow-level process. In the packet-level classification process, the P2P-port based technique in combination with the packet-heuristics based technique performs a traffic classification. The traffic which remains un-classified as P2P (in the first step) is then fed to the flow-level classification process where flow-heuristics are combined with the statistical-based technique to perform a classification of the remaining traffic. The proposed technique is implemented in java with the help of the jNetPcap library [56] and Weka [57].

Figure 2. Multi-level P2P traffic classification technique.

While performing the task of traffic classification, a combination of five network parameters (i.e., source-IP, destination-IP, source-port destination-port, and protocol) are generally used to define the traffic-flow [36]. All the communication that happens among the two processes will share these same five parameters. In the packet-level classification process, packets belonging to the same flow are recognized by calculating the hash-key of the packet through combining the five-tuple flow information, as shown in Figure 3. In this way, packets belonging to the same flow and travelling in Symmetryeither direction2020, 12, 2117will have the same hash-key. This hash-key is useful to find out if the packets belonging9 of 22 to the flow have already been classified as P2P or not.

Symmetry 2020, 12, 2117 9 of 22 Figure 3. Calculation of the packet hash-key. We use the P2P flow table to store the flow-details of those flows, which are already classified as P2P.We The use information the P2P flow stored table in to storethis table the flow-detailswill be used of to those verify flows, whether which a particular are already traffic classified flow (underas P2P. analysis) The information is already stored classified in this earlier table as will P2P be flow used or to not. verify Moreover, whether we a particularuse a separate traffi ctable, flow namely(under analysis)the P2P destination-IP-table, is already classified to earlier store destination as P2P flow < orIP, not. port Moreover, > pair information we use a of separate those flows, table, whichnamely are the already P2P destination-IP-table, classified as P2P. This to store informatio destinationn is thepair heuristic-based information ofclassification those flows, process.which are already classified as P2P.This information is useful in the heuristic-based classification process.

4.2. Packet-Level Classification Process (First Step) 4.2. Packet-Level Classification Process (First Step) As shown in Figure2, initially a pre-processor is used which captures the network tra ffic and As shown in Figure 2, initially a pre-processor is used which captures the network traffic and filters out unwanted packets to create the traffic dataset. The traffic is then fed into the packet-level filters out unwanted packets to create the traffic dataset. The traffic is then fed into the packet-level classification process, which is illustrated in Figure4. Here, the packet-level classification process classification process, which is illustrated in Figure 4. Here, the packet-level classification process combinescombines the P2P-portP2P-port based based technique technique and and packet-heuristic packet-heuristic based based technique technique for classifying for classifying P2P tra P2Pffic. traffic.In this In level, this level, as the as network the network packet packet arrives arrive fors processing,for processing, its hash-keyits hash-key (as (as shown shown in in Figure Figure3) is3) iscalculated calculated and and mapped mapped with with the the information information stored stor ined the in P2P the flowP2P tableflow (whichtable (which contains contains the records the recordsof the already of the already classified classified P2P flows) P2P in flows) order in to order verify to whether verify whether the traffi cthe flow traffic of that flow packet of that is alreadypacket isclassified already classified as P2P flow as P2P or not. flow If or a matchnot. If a is matc found,h is thenfound, the then new the packets new packets are fetched are fetched and this and step this is steprepeated is repeated (as shown (as shown in Figure in 4Figure). 4).

FigureFigure 4. 4. Packet-levelPacket-level classification classification process process (first (first step). step).

4.2.1.4.2.1. P2P-Port P2P-Port Based Based Classification Classification TheThe packets packets are are initially initially fed fed to to the the P2P-port P2P-port ba basedsed classification classification technique, technique, where where the the TCP/UDP TCP/UDP portport number number is is extracted extracted from from the the packet packet header header and and mapped mapped with with the the data databasebase of of well-known well-known P2P P2P port numbers (shown in Table 1 in the previous section) used by various P2P applications. If a match is found, then its flow is classified as P2P. Accordingly, the flow-details are added in the P2P flow table and the destination-IP-table is updated. Although it is known that the port-based technique is inefficient in traffic classification, it has been used here for the purpose of performing an early classification of the P2P traffic, which may not be masquerading and still using well-known P2P port numbers [6,7] for communication.

Symmetry 2020, 12, 2117 10 of 22 port numbers (shown in Table1 in the previous section) used by various P2P applications. If a match is found, then its flow is classified as P2P. Accordingly, the flow-details are added in the P2P flow table and the destination-IP-table is updated. Although it is known that the port-based technique is inefficient in traffic classification, it has been used here for the purpose of performing an early classification of the P2P traffic, which may not be masquerading and still using well-known P2P port numbers [6,7] for communication.

4.2.2. Packet-Heuristic Based Classification The traffic which remains unclassified as P2P is fed to the packet-heuristic based classification technique. Here, the traffic flows are classified as P2P or non-P2P on the basis of the proposed heuristic rules. If a traffic flow satisfies any of the proposed heuristics, then it is classified as P2P flow, and this information is updated in the P2P flow table and destination-IP table, accordingly. The heuristics used in the proposed technique for classifying P2P traffic are discussed below: (1) Usage of ephemeral port numbers: In order to communicate over a network, an application makes use of the transport-layer port number. The port numbers below 1024 are called well-known privileged port numbers, whereas port numbers above 1024 are called ephemeral port numbers. It is observed that many P2P applications (e.g., BitTorrent, VoIP, etc.) use ephemeral port numbers, whereas non-P2P applications (e.g., web, , etc.) use well-known privileged port numbers for communication over the network. In client-server-based communication, the client uses an ephemeral port number (randomly chosen by the operating system) to communicate with the server and the server responds back with the requested data using a well-known port number. Therefore, if the source port and destination port of a packet are found to be ephemeral, then its flow is classified as P2P. However, this heuristic fails if a peer masquerades using the well-known port number (e.g., port 443 used by HTTPS) for communication. (2) Usage of TCP and UDP protocols simultaneously: It has been observed that most of the P2P applications such as Skype, Gnutella, etc. employ TCP and UDP protocols simultaneously for communication. Depending on the type of P2P application, TCP may be used for transferring the data, whereas UDP may be used for signaling messages and vice-versa [12,58]. For example, a Skype peer communicates with the super-peer using both TCP and UDP protocols. Therefore, if a source-IP uses TCP and UDP protocols simultaneously for communication with the destination-IP, then its flow is classified as P2P. However, some false positives may exist with this heuristic as there are some non-P2P applications such as streaming, IRC, gaming, etc. which exhibit a similar behavior [5]. (3) Communication with destination-IP which is already classified as P2P: Prior to the communication between peers, a peer waits for the incoming connections from the other peers with the help of a listening port [59]. Figure5 shows a scenario where peer-A (already classified as P2P) waits for incoming connections from the other peers. Its < IP, port > pair will act as the destination for all the other peers (i.e., peer-B, peer-C, peer-D, etc.) who want to communicate with it. Hence, the flows of all such peers are classified as P2P which communicate with the already classified P2P peer. For this purpose, we make use of the P2P destination-IP-table for storing < IP, port > pair information of those peers, which are already classified as P2P. While processing the packets, we analyze if either their source or destination < IP, port > pair maps with one of the records stored in the destination-IP-table, then the flows of such packets are also classified as P2P. (4) Usage of consecutive port numbers: It has been observed that various P2P applications actively make a number of connections with the other peers for communication. In this case, the operating system of a peer allocates successive port numbers to the application (where the first port is randomly chosen and allocated) [60]. Figure6 shows a scenario where the P2P source peer-A uses consecutive port numbers to communicate with the destination peers (i.e., peer-B, peer-C, peer-D, etc.). Therefore, we analyze that if a source-IP makes use of consecutive port numbers for communication, then its flows are classified as P2P. Symmetry 2020, 12, 2117 11 of 22 Symmetry 2020, 12, 2117 11 of 22

FigureFigure 5.5. Connection patternpattern ofof sourcesource peerspeers withwith thethe destinationdestination P2PP2P peer.peer.

FigureFigure 6.6. ConnectionConnection patternpattern ofof sourcesource P2PP2P peerpeer withwith thethe destinationdestination peers.peers. As various P2P applications communicate either via TCP or UDP (or both), it has been analyzed As various P2P applications communicate either via TCP or UDP (or both), it has been analyzed that the proposed heuristic rules work equally with both TCP and UDP traffic and are not affected even that the proposed heuristic rules work equally with both TCP and UDP traffic and are not affected if the traffic is encrypted. Algorithm 1 shows the packet-level classification process which classifies even if the traffic is encrypted. Algorithm 1 shows the packet-level classification process which P2P traffic as the first step. The traffic which remains un-classified as P2P is fed to the flow-level classifies P2P traffic as the first step. The traffic which remains un-classified as P2P is fed to the flow- classification process (i.e., second step). level classification process (i.e., second step).

Symmetry 2020, 12, 2117 12 of 22

Algorithm 1: Packet-level classification process (first step). Input: Network traffic packets Output: Traffic-flows classified as P2P and non-P2P

pkt: Packet ft: P2P_flow_table fi: Flow_information spn: Source_port_number dpn: Destination_port_number dit: Destination_IP_table wkP: Well_known_P2P_ports h1: Heuristic_1 h2: Heuristic_2 h3: Heuristic_3 h4: Heuristic_4

Begin (1) pkt = fetch_packet() (2) do (3) { (4) if (ft.contains (pkt.fi) (5) goto step 15 (6) else if (pkt.spn == wkP || pkt.dpn == wkP) (7) { (8) write: pkt.fi P2P → (9) update: dit pkt.fi ← (10) } (11) else if ((pkt.h1 || pkt.h2 || pkt.h3 || pkt.h4) == true) (12) write: pkt.fi P2P → (13) else (14) write: pkt.fi non-P2P → (15) pkt = fetch_packet() (16) } while (pkt ! = NULL) (17) goto 2nd step classification process End

4.3. Flow-Level Classification Process (Second Step) Figure7 shows the flow-level classification process, which combines the flow-heuristic based technique and statistical-based technique (using C4.5 decision tree). The traffic which remains unclassified as P2P in the packet-level classification process is fed to the flow-level classification process. Here, initially before processing a traffic-flow, its information is searched in the P2P flow table (which contains the records of the already classified P2P flows) in order to verify whether it is already classified as P2P flow or not. The flows which are not classified as P2P are fed to the flow-heuristic based classification process which is explained below. Symmetry 2020, 12, 2117 13 of 22 Symmetry 2020, 12, 2117 13 of 22

Figure 7. Flow-level classificationclassification process (second step).

4.3.1. Flow-Heuristic Based ClassificationClassification One ofof thethe properties properties of of the the P2P P2P application application is thatis that it acts it acts asboth as both a client a client and and server server at the at same the same time, i.e.,time, data i.e., are data transferred are transferred from destination-to-source from destination-to-source and source-to-destination and source-to-destination simultaneously. simultaneously. A similar behaviorA similar can behavior be detected can be in client-serverdetected in client-server applications applications as well, where as data well, are where transferred data are from transferred the client tofrom a server the client with requestto a server messages with request and the messages server responds and the with server the requestedresponds with data. the However, requested the maindata. diHowever,fference isthe that main the difference amount ofis datathat the sent amount from the of clientdata sent to a from server the (i.e., client request to a server messages) (i.e., request is very smallmessages) compared is very to small the amount compared of data to the sent amount from the of server data sent to a clientfrom (i.e.,the server data requested). to a client However,(i.e., data inrequested). the case of However, the P2P application,in the case of the the data P2P are application, sent in both the directions data are sent (i.e., in from both source-to-destination directions (i.e., from andsource-to-destination destination-to-source) and destination-to-source) in a large amount. Therefore, in a large we amount. analyze Therefore, that if in we a flow, analyze the amountthat if in of a dataflow,sent the amount in each directionof data sent (i.e., in destination-to-sourceeach direction (i.e., destination-to-source and source-to-destination) and source-to-destination) is greater than the threshold-value,is greater than the then threshold-value, the flow is classified then the asflow P2P. is Forclassified experimental as P2P. For purposes, experimental the threshold-value purposes, the takenthreshold-value here is 3 MB. taken here is 3 MB.

4.3.2. Statistical Based ClassificationClassification The tratraffic-flowsffic-flows which still remain unclassifiedunclassified as P2P (in the previous process) are fed to thethe statistical-based classificationclassification process,process, where statistical features of the tratraffic-flowsffic-flows are extracted and used with the C4.5 ML algorithm to classify the remainingremaining tratrafficffic (as(as shownshown inin FigureFigure7 7).). ThisThis processprocess involvesinvolves thethe trainingtraining phasephase asas wellwell asas thethe classificationclassification phase.phase. In thethe trainingtraining phase,phase, aa classificationclassification model is built using the training dataset which containscontains both P2P and non-P2P tratraffic-flows.ffic-flows. The ML algorithm analyzes the relationship between the flowflow features and the output class value to generate a classifierclassifier model,model, which predictspredicts thethe typetype ofof tratrafficffic flowflow byby analysinganalysing itsits statisticalstatistical features.features. In thethe classificationclassification phase, statistical features of a traffic traffic flowflow are extracted and fed into the classifier classifier model. model. IfIf thethe characteristics characteristics of of a flowa flow matches matches the distinctthe distinct characteristics characteristics of P2P of tra P2Pffic, thentraffic, the then flow isthe classified flow is asclassified P2P. as P2P. TraTraffic-flowffic-flow features are are the the numeric numeric values values calc calculatedulated over over numerous numerous packets packets belonging belonging to that to thatflow. flow. The Theflow flow features features which which are areused used with with th thee ML ML algorithm algorithm in in the the proposed proposed technique are mentioned below: • Packet inter-arrival time from source-to-destination

Symmetry 2020, 12, 2117 14 of 22

Packet inter-arrival time from source-to-destination • Packet inter-arrival time from destination-to-source • Duration of flow • Total number of packets from source-to-destination • Total number of packets from destination-to-source • Total number of bytes of all packets • Total packet bytes from source-to-destination • Total packets bytes from destination-to-source • Payload size of packets from source-to-destination • Payload size of packets from destination-to-source • These flow features have been mostly used in previous studies [20], as well. They are given as input to the ML algorithm to build a statistical-based classifier for performing the classification. The C4.5 ML algorithm is chosen for traffic classification purposes, since it is faster and better compared to other ML algorithms [61]. Algorithm 2 shows the flow-level classification process (i.e., second step) which classifies P2P traffic at a flow level.

Algorithm 2: Flow-level classification process (second step). Input: Traffic-flows classified as non-P2P in the first step Output: Traffic-flows classified as P2P and non-P2P

flw: Flow ft: P2P_flow_table fi: Flow_information std: Data_transferred_from_source_to_destination dts: Data_transferred_from_destination_to_source thld: Data_threshold (3MB) fh: Flow_heuristic = ((std + dts) > thld) ff: Flow_features MLA: Machine_learning_algorithm rst: Result

Begin (1) flw = fetch_flow() (2) do (3) { (4) if (ft.contains (flw.fi)) (5) goto step 17 (6) else if (flw.fh == true) (7) write: flw P2P → (8) else (9) { (10) fset = flw.ff (11) rst = flw.MLA (fset) (12) if (rst == “P2P”) (13) write: flw P2P → (14) else (15) write: flw non-P2P → (16) } (17) flw = fetch_flow() (18) }while (flw ! = NULL) End Symmetry 2020, 12, 2117 15 of 22

5. Verification

5.1. Evaluation Metrics The performance of a classifier can be characterized using the metrics known as: False positive (FP), false negative (FN), true positive (TP), and true negative (TN). They are described as follows:

(1) TP: Percentage of instances correctly categorized as belonging to a particular class. (2) TN: Percentage of instances correctly categorized as not belonging to a particular class. (3) False positive (FP): Percentage of instances incorrectly categorized as belonging to a particular class. (4) FN: Percentage of instances incorrectly categorized as not belonging to a particular class.

The proposed technique classifies the traffic flow as P2P or non-P2P. Accuracy (1), recall (2), and precision (3) metrics are used to evaluate the proposed methodology. Accuracy is used to measure the capability of the classifier for identifying negative and positive cases. Recall is used to measure the overall percentage of correctly classified cases. Precision is used to measure the percentage of correctly classified positive cases. They are defined as follows:

TP + TN Accuracy = (1) TP + TN + FP + FN

TP Recall = (2) TP + FN TP Precision = (3) TP + FP

5.2. Datasets, Validation, and Experimental Results To evaluate the proposed technique, two offline traffic datasets have been used, which are realistic, and consist of both P2P and non-P2P flows as shown in Table2.

Table 2. The number of flows in the datasets.

Dataset P2P (No. of Flows) Non-P2P (No. of Flows) Total Dataset-1 20,617 48,179 68,796 Dataset-2 3881 2892 6773

The first traffic dataset (i.e., Dataset-1) is UNIBS [62,63] which belongs to the University of Brescia and the second traffic dataset (i.e., Dataset-2) is collected at the campus area network in a controlled environment using the Wireshark [64] tool and their pattern of communication was observed. Therefore, the flows which belong to the P2P traffic are well-known in advance. In addition, such traffic flows are labelled accordingly with actual applications for the purpose of ground-truth verification, which consist of traffic traces of different application protocols, for example, HTTP, SMTP, BitTorrent, Skype, Dropbox, DNS, FTP, POP3, IMAP, etc., as shown in Table3.

Table 3. Summary of the collected data.

Protocol Packets Bytes POP3 13,647 918,878 IMAP 3191 213,554 HTTP 1,399,230 92,060,704 BitTorrent 379,836 329,477,265 Symmetry 2020, 12, 2117 16 of 22

Table 3. Cont.

SSH 2,586,027 141,334,606 RTMP 11,712 779,616

Symmetry 2020, 12, xDropbox FOR PEER REVIEW 6498 429,308 16 of 23 StarCraft 7 394 BitTorrent 379,836 329,477,265 FTP_CONTROLSSH 2 19,586,027 141,334,606 1274 TelnetRTMP 11 90,712 779,616 6132 Dropbox 6498 429,308 SOCKS 2487 139,650 StarCraft 7 394 SkypeFTP_CONTROL 19 30 1274 3657 OthersTelnet 402,35790 6132 26,674,298 SOCKS 2487 139,650 Skype 30 3657 We made the training and testingOthers dataset by402 combining,357 26, both674,298 datasets, as shown in Table2. In the statistical-based classification process, the datasets were divided into training and testing parts using We made the training and testing dataset by combining both datasets, as shown in Table 2. In the k-foldthe cross-validation statistical-based classification procedure. process, Nowadays, the datasets most were of the divided communication into training and between testing parts the peers over the networkusing is encryptedthe k-fold cross to- providevalidation security procedure. or Nowadays, to obfuscate most theof the tra communicationffic. Therefore, between for experimental the purposes, Dataset-2peers over the was network constructed is encrypted with t theo provide encrypted security P2P or trato obfuscateffic to test the the traffic. classification Therefore, performancefor of the proposedexperimental hybrid purpose technique.s, Dataset The-2 was results constructed show with that the the encrypted proposed P2P hybrid traffic technique to test the achieves classification performance of the proposed hybrid technique. The results show that the proposed overall accuracy,hybrid technique recall, and achieves precision overall valuesaccuracy, ranging recall, and between precision 97.4–98.3%,values ranging 97.9–98.4%, between 97.4– and98.3%, 95.9–97.6%, respectively97.9 (as–98.4% shown, and 95.9 in Figure–97.6%, 8respectively), which also(as shown show in thatFigure it 8) is, ablewhich toalso classify show that the it encryptedis able to tra ffic. Figures9 andclassify 10 theshow encrypted that the traffic. classification Figures 9 and accuracy 10 show that achieved the classification by the accuracy proposed achieved hybrid by techniquethe is higher thanproposed the various hybrid existingtechnique hybrid,is higher asthan well the asvarious non-hybrid existing hybrid P2P tra, asffi wellc classification as non-hybrid techniques.P2P traffic classification techniques.

Classification performance 100 99 98 98.4 97 97.9 98.3 97.6 97.4 96 95 95.9

94 Percentage (%) Percentage 93 92 Precision Recall Accuracy Dataset-1 95.9 98.4 98.3 Dataset-2 97.6 97.9 97.4

Figure 8. Classification performance of the proposed hybrid technique. Symmetry 2020Figure, 12, x FOR 8. PEERClassification REVIEW performance of the proposed hybrid technique. 17 of 23

Performance comparison with existing hybrid techniques 100 99 98 98.19 98.3 97 97.46 97.7 96 96.03 96 95 95.67

Percentage (%) Percentage 94 95 94.4 93 92 P2P Novel P2P P2P Hybrid 2-step P2P P2P network self- heuristics LASER classify in Proposed classificati classify similarity processors learning & ML [55] 2-stages technique on [52] [19] [22] [53] [5] [20] [21] Accuracy 95.67 96.03 95 97.46 98.19 96 97.7 94.4 98.3

Figure 9.FigureAccuracy 9. Accuracy comparison comparison of of various various hybrid P2P P2P traffic tra fficlassificationc classification techniques. techniques.

Performance comparison with non-hybrid techniques 99

97 98.03 98.3 95 95 93 93.9 93.07 91 91.34 91.8 90.5 Percentage Percentage (%) 89

87 87.89 85 SVM & Multi- P2P Increment Active P2P Transfer Bat GA-WK- stage Proposed identify al SVM learning clustering learning inspired ELM [41] identify technique [33] [37] [30] [14] [42] [15] [16] Accuracy 93.9 87.89 93.07 90.5 91.34 95 98.03 91.8 98.3

Figure 10. Accuracy comparison of proposed hybrid technique with existing non-hybrid techniques.

In the proposed hybrid technique, after analyzing the type of protocol (i.e., TCP or UDP) used by packets for communication, either TCP or UDP port numbers are extracted from packet headers to perform the P2P-port based classification. In both packet-heuristic and flow-heuristic based classifications, the proposed heuristic rules analyze behavior/communication patterns of traffic, which are not affected whether a flow uses TCP or UDP protocol for communication. At last, the statistical-based classifier uses various statistical features of traffic (with C4.5 decision tree) to perform the classification, which are independent of traffic using the TCP or UDP protocol for communication. Hence, the overall proposed hybrid technique is able to work with both TCP and UDP protocols at every step. In addition, it also involves less computation, since it does not rely on the DPI technique (which requires a large amount of computation for inspecting the traffic) to perform the classification, but rather relies on heuristic-based and statistical-based techniques which are comparatively light on resources [6]. Furthermore, the classification performance of the proposed hybrid technique at various stages is shown in Table 4. It can be seen that the packet-level process (which is a combination of P2P port- based and heuristic-based techniques) achieves an accuracy of 90.50% in classifying P2P traffic. When it is combined with the flow-level process (which is a combination of flow-heuristic and statistical-

Symmetry 2020, 12, x FOR PEER REVIEW 17 of 23

Performance comparison with existing hybrid techniques 100 99 98 98.19 98.3 97 97.46 97.7 96 96.03 96 95 95.67

Percentage (%) Percentage 94 95 94.4 93 92 P2P Novel P2P P2P Hybrid 2-step P2P P2P network self- heuristics LASER classify in Proposed classificati classify similarity processors learning & ML [55] 2-stages technique on [52] [19] [22] [53] [5] [20] [21] Accuracy 95.67 96.03 95 97.46 98.19 96 97.7 94.4 98.3 Symmetry 2020, 12, 2117 17 of 22 Figure 9. Accuracy comparison of various hybrid P2P traffic classification techniques.

Performance comparison with non-hybrid techniques 99

97 98.03 98.3 95 95 93 93.9 93.07 91 91.34 91.8 90.5 Percentage Percentage (%) 89

87 87.89 85 SVM & Multi- P2P Increment Active P2P Transfer Bat GA-WK- stage Proposed identify al SVM learning clustering learning inspired ELM [41] identify technique [33] [37] [30] [14] [42] [15] [16] Accuracy 93.9 87.89 93.07 90.5 91.34 95 98.03 91.8 98.3

FigureFigure 10. Accuracy 10. Accuracy comparison comparison of of proposed proposed hybridhybrid technique with with existing existing non non-hybrid-hybrid techniques. techniques.

In theIn the proposed proposed hybrid hybrid technique,technique, after after analyzing analyzing the thetype typeof protocol of protocol (i.e., TCP (i.e., or UDP) TCP used or UDP) usedby by packets packets for for communication communication,, either either TCP or TCP UDP or port UDP numbers port numbersare extracted are from extracted packet fromheaders packet to perform the P2P-port based classification. In both packet-heuristic and flow-heuristic based headers to perform the P2P-port based classification. In both packet-heuristic and flow-heuristic classifications, the proposed heuristic rules analyze behavior/communication patterns of traffic, based classifications, the proposed heuristic rules analyze behavior/communication patterns of which are not affected whether a flow uses TCP or UDP protocol for communication. At last, the traffic,statistical which- arebased not classifier affected use whethers various a flow statistical uses TCP features or UDP of traffic protocol (with for C4.5 communication. decision tree) Atto last, the statistical-basedperform the classification, classifier which uses various are independent statistical of features traffic using of tra theffi cTCP (with or C4.5 UDP decision protocol fortree) to performcommunication. the classification, Hence, the which overall are proposed independent hybrid oftechnique traffic usingis able theto work TCP with or UDPboth TCP protocol and for communication.UDP protocols Hence, at every the step. overall In addition, proposed it also hybrid involves technique less computation, is able to since work it does with not both rely TCP on and UDPthe protocols DPI technique at every (which step. requires In addition, a large it alsoamount involves of computation less computation, for inspecting since the it traffic) does not to rely on theperform DPI technique the classification (which, but requires rather relies a large on heuristic amount-based of computation and statistical for-based inspecting techniques the which traffi c) to performare comparatively the classification, light buton resources rather relies [6]. on heuristic-based and statistical-based techniques which Furthermore, the classification performance of the proposed hybrid technique at various stages are comparatively light on resources [6]. is shown in Table 4. It can be seen that the packet-level process (which is a combination of P2P port- Furthermore, the classification performance of the proposed hybrid technique at various stages is based and heuristic-based techniques) achieves an accuracy of 90.50% in classifying P2P traffic. When shownit is in combined Table4. It with can bethe seen flow that-level the process packet-level (which is process a combination (which isof a flow combination-heuristic and of P2P statistical port-based- and heuristic-based techniques) achieves an accuracy of 90.50% in classifying P2P traffic. When it is combined with the flow-level process (which is a combination of flow-heuristic and statistical-based techniques), then the classification accuracy reaches 98.30%. This can be attributed to the fact that some P2P applications use masquerading techniques or hide their traffic behind well-known port numbers (which could not be classified in the packet-level process) and hence such traffic is classified using the flow-level classification process.

Table 4. Classification performance at various steps (P P2P-port-based, PH packet-heuristic-based, → → FH flow-heuristic-based, S statistical-based). → → Classification Process Accuracy (%) P 11.90 P + PH 90.50 P + PH + FH 95.10 P + PH + FH + S 98.30

In Table4, it can be seen that although the P2P-port based technique (i.e., P) is ine fficient in classification, it has been utilized here since it is the fastest method to classify traffic if it does not masquerade and use well-known P2P port numbers [6,7] for communication. Therefore, Symmetry 2020, 12, 2117 18 of 22 its main purpose is to reduce the amount of traffic that needs to be analyzed by heuristic-based techniques (i.e., PH and FH) if it classifies some P2P traffic at an early stage. The advantage of using the heuristic-based technique (i.e., PH and FH) is that it classifies traffic based on its behavior/communication pattern and does not require much computation for the analysis compared to DPI statistical-based techniques. Finally, the advantage of using the statistical-based classifier (i.e., S) in the proposed technique is that it classifies any remaining P2P traffic which could not be identified by heuristics (i.e., PH and FH), where such P2P traffic may escape detection (from heuristics) using some masquerading technique, or may belong to an application which is newly emerged and has an entirely different (or new) communication pattern. However, as the statistical-based classifier performs the classification on the basis of various statistical features of traffic, therefore, its limitation is that the model needs to be trained (and updated accordingly) to identify new applications (which require some time). For example, a new P2P application with a communication pattern similar to existing P2P applications but having different traffic statistics may be classified incorrectly by the classification-model until it is re-trained. Table5 shows the summary regarding the approach used in the first and second step classification process along with the classification accuracy of various existing hybrid P2P traffic classification techniques.

Table 5. Comparison of hybrid P2P traffic classification techniques.

Second Step Ref Studies First Step Classification Accuracy (%) Classification Static feature-based (port, signature, Flexible neural [53] Chen et al. (2009) 95.65 pattern) tree-based Fine-grained [52] Li et al. (2009) Coarse-grained (C4.5 decision tree) 96.03 (signature, port) [5] Keralapura et al. (2010) TCM pattern signature 95.00 [19] Ye and Cho (2013) Signature, connection heuristics C4.5 decision tree 97.46 REPTree, pattern [20] Ye and Cho (2014) Signature, connection heuristics 98.19 heuristics [55] Sajeev and Nair (2016) Signature C4.5 decision tree 96.00 At packet-level: Signature, connection heuristics [21] Ye and Cho (2017) REPTree 97.70 At flow-level: REPTree, pattern heuristics Port filtering, DNS query filtering, [22] Khan et al. (2019) Decision tree 94.40 flow counting Flow-level heuristics, Proposed hybrid technique port, packet-level heuristics 98.30 C4.5 decision tree

During the classification process, the techniques used in [5,19–21,52,55], rely on the signature-based approach, which is computationally expensive [6,30,31] and has various other limitations as discussed in Section2. In addition, the techniques in [ 19–21] do not classify the UDP traffic. The technique used in [53] relies on dedicated hardware for the P2P classification, whereas the technique used in [22] may lead to many false negatives since during the classification process, it filters out all the traffic using well-known port numbers (such as 20, 21, 443, etc.) by considering them as non-P2P traffic. The hybrid technique proposed in this paper not only achieves high P2P classification accuracy, but also involves less computation since unlike existing various hybrid techniques (mentioned above), it does not rely on the signature-based technique which is computationally expensive and unsuitable for high-speed networks [6,30,31], but rather relies on heuristic-based and statistical-based techniques, which are comparatively light on resources. In addition, the proposed hybrid technique works with both TCP and UDP traffic flows and classifies the encrypted traffic, as well. Symmetry 2020, 12, 2117 19 of 22

6. Conclusions P2P applications have been widely used since the past decade and bring a lot of conveniences, but pose various issues to the ISPs and enterprises in the tasks related to providing QoS for various applications, addressing network congestion, security, etc. Conventional techniques for traffic classification such as port-based and payload-based are ineffective in classifying P2P traffic due to various limitations associated with them. Therefore, modern techniques need to be adopted for classifying P2P traffic with high accuracy, which will allow ISPs or network administrators to either limit or ban P2P traffic in order to maintain a Quality of Service for various applications in their network. In this work, we propose the multi-level P2P traffic classification technique which is sub-divided into the packet-level and flow-level classification process. By analyzing the behavior of various P2P applications, some heuristic rules have been proposed to classify P2P traffic and are utilized in both the packet-level and flow-level classification process. If the traffic remains unclassified as P2P, then it undergoes further analysis using statistical-features of traffic which are used with the C4.5 decision tree to classify traffic as P2P or non-P2P.The experiments performed using the proposed hybrid technique achieved high classification accuracy of 98.30%, which is higher than other hybrid/non-hybrid techniques as it combines the benefits of both heuristic-based (less computation as compared to DPI) as well as statistical-based (scalability) techniques. In addition, it also works with both TCP and UDP traffic and is not affected even if the traffic is encrypted. However, there are certain limitations of the proposed technique:

(1) It may produce some false positives (during the P2P-port based classification process) if the network traffic includes malicious applications using well-known P2P default ports that can be utilized by various P2P applications. (2) It does not perform a fine-grained classification to classify P2P traffic into specific applications. (3) It is made to work on offline datasets which consist of a limited number of P2P and non-P2P applications and hence, may produce some false positives (due to one of the proposed heuristics discussed in Section3) when tra ffic datasets involve all kinds of P2P and non-P2P application protocols.

Hence, in the future, we plan to enhance the proposed technique, which can perform a fine-grained P2P classification (i.e., identify P2P application traffic specifically), as well. In addition, broader traffic datasets (containing traffic traces of various popular P2P and non-P2P applications) would be used for analyzing the effectiveness of the technique.

Author Contributions: Conceptualization, M.B. and V.S.; methodology, M.B. and V.S.; software, M.B.; validation, P.S. and M.M.; formal analysis, M.B. and P.S.; investigation, V.S. and M.M.; resources, M.B.; data curation, M.B. and V.S.; writing–original draft preparation, M.B., P.S. and M.M.; writing–review and editing, M.B., V.S. and M.M.; visualization, M.B. and P.S.; supervision, V.S. and M.M.; project administration, P.S. and M.M.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript. Funding: Taif University Researchers Supporting Project number (TURSP-2020/10), Taif University, Taif, Saudi Arabia. Acknowledgments: We are very thankful for the support from Taif University Researchers Supporting Project (TURSP-2020/10). Conflicts of Interest: The authors declare no conflict of interest.

References

1. Mohammadi, M.; Raahemi, B.; Akbari, A.; Moeinzadeh, H.; Nasersharif, B. Genetic-based minimum classification error mapping for accurate identifying Peer-to-Peer applications in the internet traffic. Expert Syst. Appl. 2011, 38, 6417–6423. [CrossRef] 2. Sen, S.; Wang, J. Analyzing peer-to-peer traffic across large networks. ACM J. 2002, 12, 137–150. Symmetry 2020, 12, 2117 20 of 22

3. Dai, L.; Yang, J.; Lin, L. A comprehensive system for P2P classification. In Proceedings of the 2010 2nd IEEE International Conference on Network Infrastructure and Digital Content, Beijing, China, 24–26 September 2010. 4. Chu, H.; Yi, H.; Zhang, X. A new P2P traffic identification methodology based on flow statistics. In Proceedings of the 2011 IEEE 3rd International Conference on Communication Software and Networks, Xi’an, China, 27–29 May 2011. 5. Keralapura, R.; Nucci, A.; Chuah, C.-N. A novel self-learning architecture for p2p traffic classification in high speed networks. Comput. Netw. 2010, 54, 1055–1068. [CrossRef] 6. Gomes, J.V.; Inácio, P.R.M.; Pereira, M.; Freire, M. Detection and classification of peer-to-peer traffic: A survey. ACM Comput. Surv. 2013, 45.[CrossRef] 7. Bhatia, M.; Kumar, R.M. Identifying P2P traffic: A survey. Peer-to-Peer Netw. Appl. 2016, 10, 1182–1203. [CrossRef] 8. Karagiannis, T.; Papagiannaki, K.; Faloutsos, M. BLINC: Multilevel traffic classification in the dark. ACM SIGCOMM Comput. Commun. Rev. 2005, 35, 229–240. [CrossRef] 9. Turkett, W.H., Jr.; Karode, A.V.; Fulp, E.W. In-the-dark network traffic classification using support vector machines. AAAI 2008, 3, 1745–1750. 10. Global Internet Phenomena, Sandvine. 2019. Available online: https://www.sandvine.com/phenomena (accessed on 10 February 2020). 11. Controlling P2P Traffic. 2003. Available online: https://www.lightreading.com/controlling-p2p-traffic/d/d-id/ 598203&page_number=2 (accessed on 21 September 2020). 12. Reddy, J.M.; Hota, C. Heuristic-Based Real-Time P2P Traffic Identification. In Proceedings of the 2015 International Conference on Emerging Information Technology and Engineering Solutions, Pune, India, 20–21 February 2015. 13. Bozdogan, C.; Gokcen, Y.; Zincir, I. A Preliminary Investigation on the Identification of Peer to Peer Network Applications. In Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference—GECCO Companion ’15, Madrid, Spain, 11–15 July 2015. 14. Tseng, C.-M.; Huang, G.-T.; Liu, T.-J. P2P traffic classification using clustering technology. In Proceedings of the 2016 IEEE/SICE International Symposium on System Integration (SII), Sapporo, Japan, 13–15 December 2016. 15. Chuan, L.; Wang, C.; Jixiong, H.; Ye, Z. Peer to peer traffic identification using support vector machine and bat-inspired optimization algorithm. In Proceedings of the 2017 12th International Conference on Computer Science and Education (ICCSE), Houston, TX, USA, 22–25 August 2017. 16. Ali, B.M.; Jamil, H.A.; Hamdan, M.; Bassi, J.S.; Ismail, I.; Marsono, M.N. Multi-stage feature selection for on-line flow peer-to-peer traffic identification. In Proceedings of the Asian Simulation Conference, Melaka, Malaysia, 27–29 August 2017. 17. Jamil, H.A.; Ali, B.M.; Hamdan, M.; Osman, A.E. Online P2P Internet Traffic Classification and Mitigation Based on Snort and ML. Eur. J. Eng. Res. Sci. 2019, 4, 131–137. [CrossRef] 18. Nazari, Z.; Noferesti, M.; Jalili, R. DSCA: An inline and adaptive application identification approach in encrypted network traffic. In Proceedings of the 3rd International Conference on Cryptography, Security and Privacy, Kuala Lumpur, Malaysia, 19–21 January 2019. 19. Ye, W.; Cho, K. Two-Step P2P Traffic Classification with Connection Heuristics. In Proceedings of the 2013 Seventh International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing, Taichung, Taiwan, 3–5 July 2013. 20. Ye, W.; Cho, K. Hybrid P2P traffic classification with heuristic rules and machine learning. Soft Comput. 2014, 18, 1815–1827. [CrossRef] 21. Ye, W.; Cho, K. P2P and P2P botnet traffic classification in two stages. Soft Comput. 2015, 21, 1315–1326. [CrossRef] 22. Khan, R.U.; Kumar, R.; Alazab, M.; Zhang, X. A Hybrid Technique To Detect , Based on P2P Traffic Similarity. In Proceedings of the 2019 Cybersecurity and Cyberforensics Conference (CCC), Melbourne, Australia, 8–9 May 2019. 23. Service Name and Transport Protocol Port Number Registry. Available online: https://www.iana.org/ assignments/service-names-port-numbers/service-names-port-numbers.xhtml (accessed on 11 July 2020). Symmetry 2020, 12, 2117 21 of 22

24. Roughan, M.; Sen, S.; Spatscheck, O.; Duffield, N. Class of service mapping for QoS: A statistical signature-based approach to IP traffic classification. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, Taormina Sicily, Italy, 25–27 October 2004. 25. Zuev, D.; Moore, A.W. Traffic classification using a statistical approach. In International Workshop on Passive and Active Network Measurement; Springer: Berlin/Heidelberg, Germany, 2005; pp. 321–324. 26. Moore, A.W.; Papagiannaki, K. Toward the accurate identification of network applications. In International Workshop on Passive and Active Network Measurement; Springer: Berlin/Heidelberg, Germany, 2005; pp. 41–54. 27. Karagiannis, T.; Broido, A.; Brownlee, N.; Claffy, K.C.; Faloutsos, M. Is p2p dying or just hiding? [p2p traffic measurement]. In Proceedings of the IEEE Global Telecommunications Conference, Dallas, TX, USA, 29 November–3 December 2004. 28. Madhukar, A.; Williamson, C. A longitudinal study of P2P traffic classification. In Proceedings of the 14th IEEE International Symposium on Modeling, Analysis, and Simulation, Monterey, CA, USA, 11–14 September 2006. 29. Karagiannis, T.; Broido, A.; Brownlee, N.; Claffy, K.; Faloutsos, M. File-Sharing in the Internet: A Characterization of P2P Traffic in the Backbone; University of California: Riverside, CA, USA, 2003. 30. Liu, S.-M.; Sun, Z. Active learning for P2P traffic identification. Peer-to-Peer Netw. Appl. 2015, 8, 733–740. [CrossRef] 31. Gomes, J.V.P.; Inácio, P.R.M.; Freire, M.M.; Pereira, M.; Monteiro, P.P. Analysis of Peer-to-Peer Traffic Using a Behavioural Method Based on Entropy. In Proceedings of the 2008 IEEE International Performance, Computing and Communications Conference, Austin, TX, USA, 7–9 December 2008. 32. Perényi, M.; Dang, T.D.; Gefferth, A.; Molnár, S. Identification and Analysis of Peer-to-Peer Traffic. J. Commun. 2006, 1, 36–46. [CrossRef] 33. Yan, J.; Wu, Z.; Luo, H.; Zhang, S. P2P Traffic Identification Based on Host and Flow Behaviour Characteristics. Cybern. Inf. Technol. 2013, 13, 64–76. [CrossRef] 34. Wang, D.; Zhang, L.; Yuan, Z.; Xue, Y.; Dong, Y. Characterizing Application Behaviors for classifying P2P traffic. In Proceedings of the 2014 International Conference on Computing, Networking and Communications (ICNC), Honolulu, HI, USA, 3–6 February 2014. 35. Zhang, Q.; Ma, Y.; Zhang, P.; Wang, J.; Li, X. Netflow based P2P detection in UDP traffic. In Proceedings of the Fifth International Conference on Intelligent Control and Information Processing, Dalian, China, 18–20 August 2014. 36. Sun, M.-F.; Chen, J.-T. Research of the traffic characteristics for the real time online traffic classification. J. China Univ. Posts Telecommun. 2011, 18, 92–98. [CrossRef] 37. Gong, J.; Wang, W.; Wang, P.; Sun, Z. P2P traffic identification method based on an improvement incremental SVM learning algorithm. In Proceedings of the 2014 International Symposium on Wireless Personal Multimedia Communications (WPMC), Sydney, NSW, Australia, 7–10 September 2014. 38. Deng, S.; Luo, J.; Liu, Y.; Wang, X.; Yang, J. Ensemble learning model for P2P traffic identification. In Proceedings of the 2014 11th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), Xiamen, China, 19–21 August 2014. 39. Qin, T.; Wang, L.; Zhao, D.; Zhu, M. CUFTI: Methods for core users finding and traffic identification in P2P systems. Peer-to-Peer Netw. Appl. 2015, 9, 424–435. [CrossRef] 40. He, J.; Yang, Y.; Qiao, Y.; Deng, W.-P.Fine-grained P2P traffic classification by simply counting flows. Front. Inf. Technol. Electron. Eng. 2015, 16, 391–403. [CrossRef] 41. Ertam, F.; Avcı, E. A new approach for internet traffic classification: GA-WK-ELM. Measurement 2017, 95, 135–142. [CrossRef] 42. Sun, G.; Liang, L.; Chen, T.; Xiao, F.; Lang, F. Network traffic classification based on transfer learning. Comput. Electr. Eng. 2018, 69, 920–927. [CrossRef] 43. Lim, H.-K.; Kim, J.-B.; Heo, J.-S.; Kim, K.; Hong, Y.-G.; Han, Y.-H. Packet-based network traffic classification using deep learning. In Proceedings of the 2019 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Okinawa, Japan, 11–13 February 2019. 44. Park, S.; Chung, H.; Lee, C.; Lee, S.; Lee, K. Methodology and implementation for tracking the file sharers using BitTorrent. Multimedia Tools Appl. 2013, 74, 271–286. [CrossRef] Symmetry 2020, 12, 2117 22 of 22

45. Cruz, M.; Ocampo, R.; Montes, I.; Atienza, R. Fingerprinting BitTorrent Traffic in Encrypted Tunnels Using Recurrent Deep Learning. In Proceedings of the 2017 Fifth International Symposium on Computing and Networking (CANDAR), Aomori, Japan, 19–22 November 2017. 46. Jiang, Q.; Hu, H.; Hu, G. Real-Time Identification of Users under the New Structure of Skype. In Proceedings of the 2016 IEEE International Conference on Sensing, Communication and Networking (SECON Workshops), London, UK, 27–27 June 2016. 47. Munir, S.; Majeed, N.; Babu, S.; Bari, I.; Harry, J.; Masood, Z.A. A joint port and statistical analysis based technique to detect encrypted VoIP traffic. Int. J. Comput. Sci. Inf. Secur. 2016, 14, 117. 48. Lee, S.-H.; Goo, Y.-H.; Park, J.-T.; Ji, S.-H.; Kim, M.-S. Sky-Scope: Skype application traffic identification system. In Proceedings of the 2017 19th Asia-Pacific Network Operations and Management Symposium (APNOMS), Seoul, Korea, 27–29 September 2017. 49. Di Mauro, M.; Di Sarno, C. Improving SIEM capabilities through an enhanced probe for encrypted Skype traffic detection. J. Inf. Secur. Appl. 2018, 38, 85–95. [CrossRef] 50. Saqib, N.A.; Shakeel, Y.; Khan, M.A.; Mahmood, H.; Zia, M. An effective empirical approach to VoIP traffic classification. Turk. J. Electr. Eng. Comput. Sci. 2017, 25, 888–900. [CrossRef] 51. Wang, R.; Zhang, J.; Zhang, Y.; Yin, M.; Xu, J. A Real-Time Identification System for VoIP Traffic in Large-Scale Networks. In Proceedings of the 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Zhangjiajie, China, 10–12 August 2019. 52. Li, J.; Zhang, S.; Lu, Y.; Yan, J. Hybrid internet traffic classification technique. J. Electron. 2009, 26, 101–112. [CrossRef] 53. Chen, Z.; Yang, B.; Chen, Y.; Abraham, A.; Gro¸san,C.; Peng, L. Online hybrid traffic classifier for Peer-to-Peer systems based on network processors. Appl. Soft Comput. 2009, 9, 685–694. [CrossRef] 54. Nair, L.M.; Sajeev, G.P. Internet Traffic Classification by Aggregating Correlated Decision Tree Classifier. In Proceedings of the 2015 Seventh International Conference on Computational Intelligence, Modelling and Simulation (CIMSim), Kuantan, Malaysia, 27–29 July 2015. 55. Sajeev, G.P.; Nair, L.M. LASER: A novel hybrid peer to peer network traffic classification technique. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016. 56. jNetPcap. Available online: https://sourceforge.net/projects/jnetpcap/ (accessed on 10 August 2020). 57. Weka. Available online: https://www.cs.waikato.ac.nz/ml/weka (accessed on 10 August 2020). 58. Velan, P.; Cermˇ ák, M.; Celeda,ˇ P.; Drašar, M. A survey of methods for encrypted traffic classification and analysis. Int. J. Netw. Manag. 2015, 25, 355–374. [CrossRef] 59. Karagiannis, T.; Broido, A.; Faloutsos, M.; Claffy, K. Transport layer identification of P2P traffic. In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement, Taormina Sicily, Italy, 25–27 October 2004. 60. Lu, C.-N.; Huang, C.-Y.; Lin, Y.-D.; Lai, Y.-C. Session level flow classification by packet size distribution and session grouping. Comput. Netw. 2012, 56, 260–272. [CrossRef] 61. Nguyen, T.T.T.; Armitage, G. A survey of techniques for internet traffic classification using machine learning. IEEE Commun. Surv. Tutor. 2008, 10, 56–76. [CrossRef] 62. Gringoli, F.; Salgarelli, L.; Dusi, M.; Cascarano, N.; Risso, F. Gt: Picking up the truth from the ground for internet traffic. ACM SIGCOMM Comput. Commun. Rev. 2009, 39, 12–18. [CrossRef] 63. Dusi, M.; Gringoli, F.; Salgarelli, L. Quantifying the accuracy of the ground truth associated with Internet traffic traces. Comput. Netw. 2011, 55, 1158–1167. [CrossRef] 64. Wireshark. Available online: https://www.wireshark.org (accessed on 10 August 2020).

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).