Soft Computing manuscript No. (will be inserted by the editor)
Deep Packet: A Novel Approach For Encrypted Traffic Classification Using Deep Learning
Mohammad Lotfollahi · Mahdi Jafari Siavoshani · Ramin Shirali Hossein Zade · Mohammdsadegh Saberian
Received: date / Accepted: date
Abstract Internet traffic classification has become more Keywords Application Identification · Traffic important with rapid growth of current Internet net- Characterization · Deep Learning · Convolutional work and online applications. There have been numerous Neural Networks · Stacked Autoencoder · Deep Packet. studies on this topic which have led to many different approaches. Most of these approaches use predefined features extracted by an expert in order to classify net- 1 Introduction work traffic. In contrast, in this study, we propose a deep learning based approach which integrates both fea- Network traffic classification is an important task in ture extraction and classification phases into one system. modern communication networks [5]. Due to the rapid Our proposed scheme, called “Deep Packet,” can han- growth of high throughput traffic demands, to properly dle both traffic characterization in which the network manage network resources, it is vital to recognize dif- traffic is categorized into major classes (e.g., FTP and ferent types of applications utilizing network resources. P2P) and application identification in which end-user Consequently, accurate traffic classification has become applications (e.g., BitTorrent and Skype) identification one of the prerequisites for advanced network manage- is desired. Contrary to most of the current methods, ment tasks such as providing appropriate Quality-of- Deep Packet can identify encrypted traffic and also dis- Service (QoS), anomaly detection, pricing, etc. Traffic tinguishes between VPN and non-VPN network traffic. classification has attracted a lot of interests in both After an initial pre-processing phase on data, packets academia and industrial activities related to network are fed into Deep Packet framework that embeds stacked management (e.g., see [13], [15], [42] and the references autoencoder and convolution neural network in order therein). to classify network traffic. Deep packet with CNN as its As an example of the importance of network traffic classification model achieved recall of 0.98 in application classification, one can think of the asymmetric archi- identification task and 0.94 in traffic categorization task. tecture of today’s network access links, which has been To the best of our knowledge, Deep Packet outperforms arXiv:1709.02656v3 [cs.LG] 4 Jul 2018 designed based on the assumption that clients download all of the proposed classification methods on UNB ISCX more than what they upload. However, the pervasiveness VPN-nonVPN dataset. of symmetric-demand applications (such as peer-to-peer (P2P) applications, voice over IP (VoIP) and video call) Mohammad Lotfollahi has changed the clients’ demands to deviate from the Sharif University of Technology, Teharn, Iran assumption mentioned earlier. Thus, to provide a satis- Mahdi Jafari Siavoshani factory experience for the clients, other application-level Sharif University of Technology, Teharn, Iran knowledge is required to allocate adequate resources to E-mail: [email protected] such applications. Ramin Shirali Hossein Zade The emergence of new applications as well as in- Sharif University of Technology, Teharn, Iran teractions between various components on the Internet Mohammdsadegh Saberian has dramatically increased the complexity and diversity Sharif University of Technology, Teharn, Iran of this network which makes the traffic classification a 2 Mohammad Lotfollahi et al. difficult problem per se. In the following, we discuss in The rest of paper is organized as follows. In Section 2, details some of the most critical challenges of network we review some of the most important and recent stud- traffic classification. ies on traffic classification. In Section 3, we present the First, the increasing demand for user’s privacy and essential background on deep learning which is necessary data encryption has tremendously raised the amount to our work. Section 4 presents our proposed method, of encrypted traffic in today’s Internet [42]. Encryption i.e., Deep Packet. The results of the proposed scheme procedure turns the original data into a pseudo-random- on network application identification and traffic charac- like format to make it hard to decrypt. In return, it terization tasks are described in Section 5. In Section 6, causes the encrypted data scarcely contain any discrim- we provide further discussion on experimental results. inative patterns to identify network traffic. Therefore, Finally, we conclude the paper in Section 7. accurate classification of encrypted traffic has become a challenge in modern networks [13]. 2 Related Works It is also worth mentioning that many of the pro- posed network traffic classification approaches, such In this section, we provide an overview of the most as payload inspection as well as machine learning and important network traffic classification methods. In par- statistical methods, require patterns or features to be ticular, we can categorize these approaches into three extracted by experts. This process is prone to error, main categories as follows: (I) port-based, (II) payload time-consuming and costly. inspection, and (III) statistical and machine learning. Finally, many of the Internet service providers (ISPs) Here is a brief review of the most important and re- block P2P file sharing applications because of their cent studies regarding each of the approaches mentioned high bandwidth consumption and copyright issues [27]. above. These applications use protocol embedding techniques Port-based approach: Traffic classification via to bypass traffic control systems [3]. The identification port number is the oldest and the most well-known of this kind of applications is one of the most challenging method for this task [13]. Port-based classifiers use the task in network traffic classification. information in the TCP/UDP headers of the packets to There have been abundant studies on the network extract the port number which is assumed to be associ- traffic classification subject, e.g., [22, 32, 16, 30]. How- ated with a particular application. After the extraction ever, most of them have focused on classifying a proto- of the port number, it is compared with the assigned col family, also known as traffic characterization (e.g., IANA TCP/UDP port numbers for traffic classification. streaming, chat, P2P, etc.), instead of identifying a single The extraction is an easy procedure, and port numbers application, which is known as application identification will not be affected by encryption schemes. Because of (e.g., Spotify, Hangouts, BitTorrent, etc.) [20]. In con- the fast extraction process, this method is often used in trast, this work proposes a method, i.e., Deep Packet, firewall and access control list (ACL) [34]. Port-based based on the ideas recently developed in the machine classification is known to be among the simplest and learning community, namely, deep learning, [6, 23], to fastest method for network traffic identification. How- both characterize and identify the network traffic. The ever, the pervasiveness of port obfuscation, network benefits of our proposed method which make it superior address translation (NAT), port forwarding, protocol to other classification schemes are stated as follows: embedding and random ports assignments have signifi- cantly reduced the accuracy of this approach. According – In Deep Packet, there is no need for an expert to to [30, 28] only 30% to 70% of the current Internet traffic extract features related to network traffic. In the light can be classified using port-based classification methods. of this approach, the cumbersome step of finding and For these reasons, more complex traffic classification extracting distinguishing features has been omitted. methods are needed to classify modern network traffic. – Deep Packet can identify traffic at both granular Payload Inspection Techniques: These techniques levels (application identification and traffic charac- are based on the analysis of information available in terization) with state-of-the-art results compared to the application layer payload of packets [20]. Most of the other works conducted on similar dataset [16, 47]. the payload inspection methods, also known as deep – Deep Packet can accurately classify one of the hard- packet inspection (DPI), use predefined patterns like est class of applications, known to be P2P [20]. This regular expressions as signatures for each protocol (e.g., kind of applications routinely uses advanced port see [48, 36]). The derived patterns are then used to obfuscation techniques, embedding their information distinguish protocols form each other. The need for up- in well-known protocols’ packets and using random dating patterns whenever a new protocol is released, ports to circumvent ISPs’ controlling processes. and user privacy issues are among the most important Deep Packet: A Novel Approach For Encrypted Traffic Classification Using Deep Learning 3 drawbacks of this approach. Sherry et al. proposed a prone to human mistakes. Moreover, note that for the new DPI system that can inspect encrypted payload case of using k-NN classifiers, as suggested by [47], it without decryption, thus solved the user privacy issue, is known that, when used for prediction, the execution but it can only process HTTP Secure (HTTPS) traffic time of this algorithms is a major concern. [37]. To the best of our knowledge, prior to our work, Statistical and machine learning approach: only one study based on deep learning ideas has been Some of these methods, mainly known as statistical reported by Wangc [46]. They used stacked autoencoders methods, have a biased assumption that the underlying (SAE) to classify some network traffic for a large family traffic for each application has some statistical features of protocols like HTTP, SMTP, etc. However, in their which are almost unique to each application. Each sta- technical report, they did not mention the dataset they tistical method uses its functions and statistics. Crotti used. Moreover, the methodology of their scheme, the et al., [12] proposed protocol fingerprints based on the details of their implementation, and the proper report probability density function (PDF) of packets inter- of their result is missing. arrival time and normalized thresholds. They achieved up to 91% accuracy for a group of protocols such as 3 Background on Deep Neural Networks HTTP, Post Office Protocol 3 (POP3) and Simple Mail Transfer Protocol (SMTP). In a similar work, Wang et Neural networks (NNs) are computing systems made al., [45] have considered PDF of the packet size. Their up of some simple, highly interconnected processing el- scheme was able to identify a broader range of protocols ements, which process information by their dynamic including file transfer protocol (FTP), Internet Mes- state response to external inputs [8]. In practice, these sage Access Protocol (IMAP), SSH, and TELNET with networks are typically constructed from a vast number accuracy up to 87%. of building blocks called neuron where they are con- A vast number of machine learning approaches have nected via some links to each other. These links are been published to classify traffic. Auld et al. proposed called connections, and to each of them, a weight value a Bayesian neural network that was trained to clas- is associated. During the training procedure, the NN is sify most well-known P2P protocols including Kazaa, fed with a large number of data samples. The widely BitTorrent, GnuTella, and achieved 99% accuracy [4]. used learning algorithm to train such networks (called Moore et al. achieved 96% of accuracy on the same backpropagation) adjusts the weights to achieve the de- set of applications using a Naive Bayes classifier and a sired output from the NN. The deep learning framework kernel density estimator [31]. Artificial neural network can be considered as a particular kind of NNs with (ANN) approaches were proposed for traffic identifica- many (hidden) layers. Nowadays, with the rapid growth tion (e.g., see [40] and [41]). Moreover, it was shown of computational power and the availability of graphi- in [41] that the ANN approach can outperform Naive cal processing units (GPUs), training deep NNs have Bayes methods. Two of the most important papers that become more plausible. Therefore, the researchers from have been published on “ISCX VPN-nonVPN” traffic different scientific fields consider using deep learning dataset are based on machine learning methods. Gil framework in their respective area of research, e.g., see et al. used time-related features such as the duration [17, 38, 39]. In the following, we will briefly review two of the flow, flow bytes per second, forward and back- of the most important deep neural networks that have ward inter-arrival time, etc. to characterize the network been used in our proposed scheme for network traffic traffic using k-nearest neighbor (k-NN) and C4.5 deci- classification, namely, autoencoders and convolutional sion tree algorithms [16]. They achieved approximately neural networks. 92% recall, characterizing six major classes of traffic including Web browsing, email, chat, streaming, file transfer and VoIP using the C4.5 algorithm. They also 3.1 Autoencoder achieved approximately 88% recall using the C4.5 algo- rithm on the same dataset which is tunneled through An autoencoder NN is an unsupervised learning frame- VPN. Yamansavascilar et al. manually selected 111 flow work that uses backpropagation algorithm to reconstruct features described in [29] and achieved 94% of accuracy the input at the output while minimizing the reconstruc- for 14 class of applications using k-NN algorithm [47]. tion error (i.e., according to some criteria). Consider The main drawback of all these approaches is that the a training set {x1, x2, . . . , xn} where for each training feature extraction and feature selection phases are es- data we have xi ∈ Rn. The autoencoder objective is de- sentially done with the assistance of an expert. Hence, it fined to be yi = xi for i ∈ {1, 2, . . . , n}, i.e., the output makes these approaches time-consuming, expensive and of the network will be equal to its input. Considering 4 Mohammad Lotfollahi et al. this objective function, the autoencoder tries to learn a 1 1 3 21 1 1 3 compressed representation of the dataset, i.e., it approx- 3 imately learns the identity function FW ,b(x) ' x, where W and b are the whole network weights and biases 1 vectors. General form of an autoencoder’s loss function 1 1 is shown in (1), as follows
2 L(W , b) = kx − FW ,b(x)k . (1) 1 The autoencoder is mainly used as an unsupervised tech- 13 nique for automatic feature extraction. More precisely, the output of the encoder part is considered as a high- Fig. 2: Greedy layer-wise approach for training an level set of discriminative features for the classification stacked autoendcoder. task. Fig. 1 shows a typical autoencoder with n inputs and outputs. of convolutional operations. The construction of convo- lutional networks is inspired by the visual structure of
x1 x1 I1 O1 living organisms [18]. Basic building block underneath a
1 3 ! H1 H1 CNN is a convolutional layer described as follows. Con-
x2 2 x2 I2 H1 O2 sider a convolutional layer with N × N square neuron 1 3 ! layer as input and a filter ω of size m × m. The output H2 . H2 x3 x3 l I O of this layer z , is of size (N − m + 1) × (N − m + 1) 3 . . . 3 . . . . . ! and is computed as follows 2 . Hl . . . m−1 m−1 1 3 . Hm Hidden Hm . ` `−1 xn Layer 2 xn zij = f ωabz(i+a)(j+b) . (2) In Hidden Hidden On Layer 1 Layer 3 a=0 b=0 ! Input Ouput ! X X Layer Layer As it is demonstrated in (2), usually a non-linear func- tion f such as rectified linear unit (ReLU) is applied to Fig. 1: The general structure of an autoencoder. the convolution output to learn more complex features from the data. In some applications, a pooling layer In practice, to obtain a better performance, a more is also applied. The main motivation of employing a complex architecture and training procedure, called pooling layer is to aggregate multiple low-level features stacked autoencoder (SAE), is proposed [43]. This scheme in a neighborhood to obtain local invariance. Moreover, suggests to stack up several autoencoders in a manner it helps to reduce the computation cost of the network that output of each one is the input of the successive in train and test phase. layer which itself is an autoencoder. The training pro- CNNs have been successfully applied to different cedure of a stacked autoencoder is done in a greedy fields including natural language processing [35], com- layer-wise fashion [7]. First, this method trains each putational biology [2], and machine vision [38]. One of layer of the network while freezing the weights of other the most interesting applications of CNNs is in face layers. After training all the layers, to have more accu- recognition [24], where consecutive convolutional layers rate results, fine-tuning is applied to the whole NN. At are used to extract features from each image. It is ob- the fine-tuning phase, the backpropagation algorithm served that the extracted features in shallow layers are is used to adjust all layers’ weights. Moreover, for the simple concepts like edges and curves. On the contrary, classification task, an extra softmax layer can be applied features in deeper layers of networks are more abstract to the final layer. Fig. 2, depicts the training procedure than the ones in shallower layers [49]. However, it is of a stacked autoencoder. worth mentioning that visualizing the extracted features in the middle layers of a network does not always lead to meaningful concepts like what has been observed in the 3.2 Convolutional Neural Network face recognition task. For example in one-dimensional CNN (1D-CNN) which we use to classify network traffic, The convolutional neural networks (CNN) are another the feature vectors extracted in shallow layers are just types of deep learning models in which feature extrac- some real numbers which make no sense at all for a tion from the input data is done using layers comprised human observer. Deep Packet: A Novel Approach For Encrypted Traffic Classification Using Deep Learning 5
We believe 1D-CNNs are an ideal choice for the activity the application was engaged during the capture network traffic classification task. This is true since 1D- session (e.g., voice call, chat, file transfer, or video call). CNNs can capture spatial dependencies between adja- For more details on the captured traffic and the traffic cent bytes in network packets that leads to find discrimi- generation process, refer to [16]. native patterns for every class of protocols/applications, The dataset also contains packets captured over Vir- and consequently, an accurate classification of the traffic. tual Private Network (VPN) sessions. A VPN is a private Our classification results confirm this claim and prove overlay network among distributed sites which operates that CNNs performs very well in feature extraction of by tunneling traffic over public communication networks network traffic data. (e.g., the Internet). Tunneling IP packets, guarantee- ing secure remote access to servers and services, is the most prominent aspect of VPNs [10]. Similar to regular 4 Methodology (non-VPN) traffic, VPN traffic is captured for different applications, such as Skype, while performing different In this work, we develop a framework, called Deep activities, like voice call, video call, and chat. Packet, that comprises of two deep learning methods, Furthermore, this dataset contains captured traffic namely, convolutional NN and stacked autoencoder NN, of Tor software. This traffic is presumably generated for both “application identification” and “traffic char- while using Tor browser and it has labels such as Twit- acterization” tasks. Before training the NNs, we have ter, Google, Facebook, etc. Tor is a free, open source to prepare the network traffic data so that it can be software developed for anonymous communications. Tor fed into NNs properly. To this end, we perform a pre- forwards users traffic through its own free, worldwide, processing phase on the dataset. Fig. 3 demonstrates overlay network which consists of volunteer-operated the general structure of Deep Packet. At the test phase, servers. Tor was proposed to protect users against Inter- a pre-trained neural network corresponding to the type net surveillance known as “traffic analysis.” To create of classification, application identification or traffic char- a private network pathway, Tor builds a circuit of en- acterization, is used to predict the class of traffic the crypted connections through relays on the network in a packet belongs to. The dataset, implementation and way that no individual relay ever knows the complete design details of the pre-processing phase and the ar- path that a data packet has taken [14]. Finally, Tor uses chitecture of proposed NNs will be explained in the complex port obfuscation algorithm to improve privacy following. and anonymity.