Revealing Skype Traffic

Revealing Skype Traffic: When Randomness Plays with You Dario Bonfiglio Marco Mellia Michela Meo Politecnico di Torino Politecnico di Torino Politecnico di Torino Dipartimento di Elettronica Dipartimento di Elettronica Dipartimento di Elettronica dario.bonfi[email protected] [email protected] [email protected] Dario Rossi Paolo Tofanelli ENST Télécom Paris Motorola Inc. Informatique et Réseaux Torino - Italy [email protected] [email protected] ABSTRACT General Terms Skype is a very popular VoIP software which has recently attracted Experimentation, Measurement the attention of the research community and network operators. Following a closed source and proprietary design, Skype proto- cols and algorithms are unknown. Moreover, strong encryption Keywords mechanisms are adopted by Skype, making it very difficult to even Traffic Identification, Passive Measurement, Na¨ıve Bayesian Clas- glimpse its presence from a traffic aggregate. In this paper, we sification, Pearson Chi-Square Test, Deep Packet Inspection propose a framework based on two complementary techniques to reveal Skype traffic in real time. The first approach, based on Pear- son’s Chi-Square test and agnostic to VoIP-related traffic character- 1. INTRODUCTION istics, is used to detect Skype’s fingerprint from the packet framing The last few years witnessed VoIP telephony gaining a tremen- structure, exploiting the randomness introduced at the bit level by dous popularity, as also testified by the increasing number of op- the encryption process. Conversely, the second approach is based erators that are offering VoIP-based phone services to residential on a stochastic characterization of Skype traffic in terms of packet users. Skype [1] is beyond doubt the most amazing example of this arrival rate and packet length, which are used as features of a deci- new phenomenon: developed in 2003 by the creators of KaZaa, it sion process based on Naive Bayesian Classifiers. recently reached over 100 millions of users, becoming so popular In order to assess the effectiveness of the above techniques, we that people indicate Skype IDs in their visiting cards. develop an off-line cross-checking heuristic based on deep-packet A number of reasons for such a success can be acknowledged. inspection and flow correlation, which is interesting per se. This First, today Internet (in terms of capacity, responsiveness, robust- heuristic allows us to quantify the amount of false negatives and ness) makes it possible to provide new and demanding services, false positives gathered by means of the two proposed approaches: including real-time interactive applications such as telephony. Sec- results obtained from measurements in different networks show ond, the users attitude toward technology has deeply changed in that the technique is very effective in identifying Skype traffic. the last few years: users are willing to accept a good service for While both Bayesian classifier and packet inspection techniques free, even though service continuity and quality is not guaranteed; are commonly used, the idea of leveraging on randomness to reveal they (we?) like to have access from the same terminal and even the traffic is novel. We adopt this to identify Skype traffic, but the same application environment to a number of different communi- same methodology can be applied to other classification problems cation facilities; new ways and tools to be connected to each other as well. are easily accepted and experienced by people. Last but not least, Skype is an extremely good piece of software, carefully engineered, user friendly and efficient at the same time. The importance of Skype traffic identification –besides being in- Categories and Subject Descriptors strumental to traffic analysis and characterization for network design and provisioning– is clear when considering the interest of C.4 [Computer Communication]: Measurement Techniques; C.2.5 network operators, ranging from traffic and performance monitor- [Computer Communication Network]: Internet ing, to the design of tariff policies and traffic differentiation strate- gies. To date however, despite the interest recently exhibited by the research community, reliable identification of Skype traffic re- mains a challenging task, given that the software is proprietary and Permission to make digital or hard copies of all or part of this work for the traffic is obfuscated. The objective of this paper is to define a personal or classroom use is granted without fee provided that copies are framework, based on two different and complementary techniques, not made or distributed for profit or commercial advantage and that copies for revealing and classifying Skype traffic from a traffic aggregate, bear this notice and the full citation on the first page. To copy otherwise, to irrespectively of the transport layer protocol that is being used (i.e., republish, to post on servers or to redistribute to lists, requires prior specific TCP or UDP). Both techniques are scalable, can be performed on- permission and/or a fee. SIGCOMM’07, August 27–31, 2007, Kyoto, Japan. line, and are applicable to a more general extent than the context Copyright 2007 ACM 978-1-59593-713-1/07/0008 ...$5.00. of Skype traffic identification. The first approach, based on Pear- son’s Chi Square test, is used to detect Skype’s fingerprint from the that almost everything is cyphered, that data can be fragmented, packet framing structure but is agnostic to VoIP-related traffic char- and that an extensive use of data compression (based on arithmetic acteristics. To the best of our knowledge, this work is the first to in- compression) is made as well. The work in [5] presents an ex- troduce this methodology for the purpose of traffic identification: in perimental study of Skype, where results are collected by means the following, we refer to this novel identification approach as Chi- of measurements over a five months period. Authors analyze user Square Classifier (CSC). Conversely, the second approach is based behavior only for relayed 1, rather than direct, sessions. Results on a stochastic characterization of Skype traffic in terms of packet pertain the population of on-line clients and their usage pattern, the arrival rate and packet length, which are employed as features of a number of super-nodes and bandwidth usage: thus, the identifica- decision process based on Naive Bayesian Classifiers (NBC): how- tion problem is no longer related to Skype traffic but, rather, to ever, while the above features successfully allow to identify VoIP Skype users. traffic, they are not representative of the application that generated The works closest to ours are [6, 7]. [6] deals with the evalua- it. tion of the QoS level provided by Skype calls. However, the paper To proof-check the correctness of the statistical techniques, we focus is more on the QoS evaluation rather than on the identifica- develop a Payload Based Classifier (PBC), that relies on tradi- tion of the flows. Specifically, authors consider a valid VoIP ses- tional technique of deep-packet inspection, combined with a per- sion whenever i) flow duration is longer than a threshold (namely, host analysis that allows us to identify Skype clients and their gen- 10 seconds), ii) average packet rate is within a reasonable range erated traffic. We use the PBC to cross-check the results obtained (between 10 and 100 pkt/sec), iii) average packet size is small (be- from the statistical approaches. In particular, from the controlled tween 30 and 300 bytes), and iv) Exponentially Weighted Moving- testbed experiments and from real traffic traces as well, the PBC Average of the packet size falls in a given range (between 35 and is used to create a benchmark dataset in which we classify Skype 500 bytes) for the whole flow duration. However, these charac- flows with a very high confidence level. Moreover, the benchmark teristics are typical of all VoIP traffic, and not only of Skype traf- dataset is used to tune parameters of the above classifiers, as well fic. Therefore, authors propose a complex algorithm to identify the as to quantify the number of NBC/CSC false-positives and false- UDP port used by Skype. All traffic originated from, sinked by that negatives. Indeed, by running the NBC and CSC onto the bench- (IP address - UDP port) will be labeled as Skype traffic. Besides mark dataset, we can assess the effectiveness of the two classifiers, being very complex, this approach applies only when Skype uses when they are either separately or jointly used. We anticipate that UDP at the transport layer and it needs the Skype login phase to be the combination of NBC and CSC yields to astonishingly good re- monitored: it is, thus, likely to fail on backbone links. Moreover, sults. The joint NBC+CSC classification method effectively limits any modification during the login phase in future Skype releases the number of false positives, yielding to conservative results, in the will make the algorithm useless. We, on the contrary, would like sense that the number of non-Skype flows erroneously classified as to pin out all VoIP traffic generated by Skype only, possibly with such is negligible. simple algorithms and in any scenario. Finally, in [7] authors focus A mythological analogy fits to the above framework if we per- on the identification of relayed traffic, and present an application sonify the problem of Skype traffic identification with Khaos,which to Skype. The adopted approach is to correlate, at the relay node, typically refers to unpredictability and in Greek mythology [2] was the incoming and outgoing packet time series and bandwidth usage. referred to as the primeval state of existence, from which the pro- More on details, authors focus on what they call “burst” of traffic. togenoi (i.e., the first gods) appeared. The payload based heuristic Then, for every pair of packet bursts, an analysis of the correlation allows us to bring light to the problem solution, and it can thus of the packet arrival series within the burst is performed.

Load more