Markov Chain Fingerprinting to Classify Encrypted Traffic
Maciej Korczynski´ ∗¶ and Andrzej Duda¶ ∗EENR & DIMACS, Rutgers University, New Jersey, USA ¶Grenoble Institute of Technology, CNRS Grenoble Informatics Laboratory UMR 5217, France Email: [email protected], [email protected]
Abstract—In this paper, we propose stochastic fingerprints fingerprints of sessions to classify application traffic. We call for application traffic flows conveyed in Secure Socket a fingerprint any distinctive feature allowing identification of a Layer/Transport Layer Security (SSL/TLS) sessions. The fin- given traffic class. In this work, a fingerprint corresponds to a gerprints are based on first-order homogeneous Markov chains for which we identify the parameters from observed training first-order homogeneous Markov chain reflecting the dynamics application traces. As the fingerprint parameters of chosen of an SSL/TLS session. The Markov chain states model a applications considerably differ, the method results in a very good sequence of SSL/TLS message types appearing in a single accuracy of application discrimination and provides a possibility direction flow of a given application from a server to a client. of detecting abnormal SSL/TLS sessions. Our analysis of the We have studied the Markov chain fingerprints for twelve results reveals that obtaining application discrimination mainly comes from incorrect implementation practice, the misuse of representative applications that make use of SSL/TLS: PayPal the SSL/TLS protocol, various server configurations, and the (an electronic service allowing online payments and money application nature. transfers), Twitter (an online social networking and micro- blogging service), Dropbox (a file hosting service), Gadu- I.INTRODUCTION Gadu (a popular Polish instant messenger), Mozilla (a part The importance of appropriate traffic classification methods of Mozilla add-ons service responsible for verification of the continues to grow. They are essential for effective network software version), MBank and PKO (two popular European planning, policy-based traffic management, application priori- online banking services), Dziekanat (student online service), tization, and security control. However, traditional port-based Poczta (student online mail service), Amazon S3 (a Simple [1] and payload-based [2], [3] classification methods become Storage Service) and EC2 (an Elastic Compute Cloud), and less effective, because new applications can hide their nature Skype (a VoIP service). The resulting models exhibit a specific by dynamically assigning ports, by using tunneling, or by ap- structure allowing to classify encrypted application flows by plying proprietary payload encryption methods. This situation comparing its message sequences with fingerprints. They can has led to the development of new identification methods based also serve to reveal intrusions trying to exploit the SSL/TLS on flow features [4], [5] and host behavior [6], [7]. They are protocol by establishing abnormal communications with a useful for classification of application layer protocols (e.g. server. HTTP, DNS, BitTorrent, etc.) or traffic categories (e.g. P2P content sharing, games, multimedia, WWW, etc.). However, II.SSL/TLSOVERVIEW they are less effective for reliable identification of application Secure Sockets Layer (SSL) [11] and its successor Transport flows on top of a given protocol. Moreover, apart from some Layer Security (TLS) [10] are cryptographic protocols that notable exceptions, these methods are not appropriate for provide secure communication between two parties over the classifying traffic where only one direction is observed due Internet by encapsulating and encrypting application layer to routing asymmetries [8]. data. Many WWW portals and servers, especially those pro- The past research on traffic analysis and classification viding commercial services, use SSL/TLS for guaranteeing showed that once we are able to generate a unique signature security of all operations. based on the packet or message payload (e.g. HTTP request Figure 1 illustrates the structure of SSL/TLS and its com- headers), we can classify applications with high accuracy [3], ponents: [8]. Unfortunately, such approaches fail in case of encrypted • Record Protocol: compresses and encrypts upper-layer traffic [9]. In this work, we propose a payload-based method data using the security parameters configured by the to identify application flows encrypted with the Secure Socket Handshake Protocol. Layer/Transport Layer Security (SSL/TLS) protocol, which is • Application Data Protocol: provides application layer a fundamental cryptographic protocol suite supporting secure data to the Record Protocol. communication over the Internet [10]. • Handshake Protocol: negotiates parameters of an Our approach consists of taking advantage of the infor- SSL/TLS session. Two communicating parties agree on mation embedded in the SSL/TLS header to create statistical the protocol version to use, they optionally authenticate 978-1-4799-3360-0/14/$31.00 c 2014 IEEE each other, exchange information on the session ID, select Change Cipher Spec k server sends its own message and the Finished message under the new cipher specification, k which completes the SSL/TLS handshake and the two parties can exchange application layer data. The server terminates k k the session with an Alert message. The exchange is an example—a session can be shortened k k by resuming previous sessions using the Session ID or it can be significantly modified depending on server configuration