
DIFFERENTIATING INTERNET APPLICATIONS USING PRINCIPAL COMPONENT ANALYSIS

Roberto Nogueira*, António Nogueira**, Paulo Salvador**, Rui Valadas**
*Portugal Telecom Inovação, Aveiro, Portugal, e-mail: [email protected]
**University of Aveiro/Instituto de Telecomunicações, Aveiro, Portugal, e-mail: {nogueira, salvador, rv}@ua.pt

ABSTRACT

The number and variety of IP applications have increased tremendously in the last few years. In addition, the Internet applications run by end users are changing with the widespread availability of high-performance PCs connected through broadband links. An accurate mapping of traffic to applications is important for a wide range of network management tasks, such as traffic engineering, service differentiation, performance/failure monitoring and security. Since traditional mapping approaches have become increasingly inaccurate, this paper presents a new approach, based on Principal Component Analysis, that is able to identify differentiating characteristics of different Internet applications, including several P2P file sharing protocols. The accuracy of the proposed approach was evaluated through a set of intensive tests, and the results show that it is a valuable tool for identifying peculiar characteristics of Internet applications while being, at the same time, immune to the most important disadvantages of other identification methods. We believe this methodology can form the basis for the development of an efficient application identification tool.

Keywords: flow identification, peer-to-peer, principal component analysis.

1 INTRODUCTION

The introduction of Peer-to-Peer (P2P) file sharing applications triggered a paradigm shift in Internet data exchange. Since the emergence of Napster, the first popular P2P application, a number of new P2P-based multimedia file sharing systems have been developed (FastTrack, eDonkey, Direct Connect, etc.). The traffic generated by these applications consumes the major portion of the bandwidth in campus networks, largely overtaking the traffic of the WWW [1, 2]. However, P2P applications can harm network traffic generated by businesses, governments, education and the Internet infrastructure itself, preventing mission-critical applications from accessing the network. These applications can also represent serious security vulnerabilities to systems and networks, since hackers can exploit them to access and attack campus networks. Finally, P2P applications also pose serious legal issues, as users can download copyrighted material, thus placing access providers in a difficult legal situation. So, these applications create logistic, security and legal troubles for network administrators on high-speed networks.

Having the ability to accurately identify Internet applications, and particularly P2P ones, can be crucial for several network management and measurement tasks, including traffic engineering, service differentiation, performance/failure monitoring and security. Once applications are correctly identified, the network manager can take the most appropriate action regarding each application that is running in a particular scenario.

The identification of IP applications has traditionally been based on different techniques, each one having its own advantages but also important drawbacks that limit or discourage its use in certain identification scenarios: (i) port-based analysis presents some obvious limitations, since most applications allow users to change the default port numbers by manually selecting whatever port(s) they like; many newer applications are more inclined to use random ports, thus making ports unpredictable, and there is also a trend for applications to masquerade their function ports within well-known application ports;

Ubiquitous Computing and Communication Journal 1

(ii) protocol analysis is ineffective since IP applications are continuously evolving and therefore their signatures can change; application developers can encrypt traffic, making protocol analysis more difficult; signature-based identification can affect network stability because it has to read and process all network traffic and, finally, protocol analysis is not able to deal with confidentiality requirements; (iii) syntactic and semantic analysis of the data flows can be a burden to network stability due to its high processing requirements and is not appropriate when dealing with confidentiality requirements because, in these situations, it is not possible to have access to the packet contents. Section 2 briefly describes the most important related work on this subject, pointing out the main advantages and disadvantages of the different proposed methodologies.

This paper proposes an approach, based on Principal Component Analysis (PCA), to identify the most differentiating characteristics of Internet applications. PCA involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components, thus reducing the dimensionality of the data set by ignoring the dimensions that contribute less to the data variability. Several applications will be used to test the proposed methodology: some of the most popular P2P applications (Gnutella, BitTorrent and eMule), Skype, YouTube, web-based file sharing and Internet browsing. The results obtained show that the proposed methodology can achieve very good results, has low computational requirements and, when used in an application identification framework, avoids some of the most important drawbacks of existing identification approaches.

This paper is an extended version of the work published in [3]: it now includes an extended related work section, a description of the different modules of an integrated identification framework that will be based on the PCA identification methodology, and a more detailed explanation of the different steps of the proposed PCA identification methodology, including more graphical information to help explain the proposed approach.

The rest of the paper is organized as follows: section 2 describes some related work on methodologies for the identification of IP applications; section 3 presents an overview of the application identification framework we are planning to propose, where the PCA-based identification module is one of the main building blocks; section 4 gives an overview of the applications that were selected for this study and the measurements that were made; section 5 gives a brief review of the most important topics on PCA; section 6 describes the methodology that is proposed to identify the differentiating characteristics of Internet applications and discusses the main results obtained and, finally, section 7 presents the main conclusions and some topics for future research.

2 RELATED WORK

Traditionally, the identification of IP applications has been based on different techniques. Port-based identification was first suggested by [4, 5] and is the most basic and straightforward method to detect applications and users based on network traffic. It is based on the simple concept that many applications have default ports on which they function. When these applications are run, they use these ports to communicate with the outside. To perform port-based analysis, administrators just need to observe network traffic and check whether there are connection records using these ports. If a match is found, it may indicate a particular application activity. Port matching is very simple in practice, but its limitations are obvious. Most applications allow users to change the default port numbers by manually selecting whatever port(s) they like. Additionally, many newer applications, like WinMX [6] and [7], are more inclined to use random ports, thus making ports unpredictable. Besides, since the closure of Napster, more and more P2P applications have begun to masquerade their function ports within well-known application ports [2, 8, 4, 9, 10]. In [11], the authors have shown that this technique achieves an accuracy no better than 50 to 70% using the official IANA list.

Another identification approach is payload or protocol analysis: in this case, traffic is monitored and the data payload of the packets is inspected according to some previously defined application signatures [12, 13, 14, 8, 11, 15, 9]. This traffic identification method is widely applied in Intrusion Detection Systems (IDS) to manage traffic [16, 17]. Application-layer analysis of packet contents is also employed by some commercial bandwidth management tools [18, 19]. This approach has been shown to work very well for Internet traffic, including P2P applications. However, this technique also has some drawbacks: first, payload analysis poses privacy and security concerns; second, the technique typically requires increased processing and storage capacity [20, 5, 21, 22]; third, it is unable to cope with encrypted transmissions and, finally, this approach only identifies traffic for which signatures are available and is unable to classify previously unknown traffic.
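The port-matching technique described earlier in this section, together with its main weakness, can be illustrated with a minimal Python sketch. The port table and the function name `classify_by_port` are illustrative assumptions, not part of any tool used in this study; the table lists only a few commonly cited default ports.

```python
# Illustrative table of commonly cited default ports (an assumption for this
# sketch; a real deployment would use the full IANA list).
DEFAULT_PORTS = {
    80: "HTTP",
    443: "HTTPS",
    4662: "eMule",
    6346: "Gnutella",
    6881: "BitTorrent",
}


def classify_by_port(src_port, dst_port):
    """Return the application whose default port matches the flow, if any."""
    for port in (dst_port, src_port):
        if port in DEFAULT_PORTS:
            return DEFAULT_PORTS[port]
    return "unknown"


# A flow towards the default BitTorrent port is labelled correctly ...
print(classify_by_port(51000, 6881))
# ... but a P2P client configured to listen on port 80 is labelled as HTTP,
# and a client using random ports is not recognized at all.
print(classify_by_port(51000, 80))
print(classify_by_port(51000, 51001))
```

The last two calls show why port matching degrades once users change default ports or applications masquerade behind well-known ports.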

Syntactic and semantic analysis of the data flow avoids some of the disadvantages of port-based analysis and protocol analysis. This approach can perform protocol recognition regardless of any encapsulation and is able to extract data specific to each protocol, involving stateful reconstruction of session and application information from the packet content [23]. This technique provides very accurate and reliable application identification, but imposes significant complexity and processing load on the traffic identification device. It must be kept up-to-date with extensive knowledge of application semantics and network-level syntax, and must be powerful enough to perform concurrent analysis of a potentially large number of flows.

Karagiannis et al. [8] proposed a new algorithm which is based on the behavior characteristic of the transport layer: by using little information of transport layer packets, this method can accurately identify 99% of the P2P traffic, but the algorithm can only be used offline. In [24] a new identification method is proposed, relying on patterns of host behavior at the transport layer. Instead of studying TCP (or UDP) flows individually, this scheme pays attention to all flows generated by specific hosts and can accurately associate each host with the services it provides or uses (application server, web, etc.). However, this method has to gather information from several flows of each host before it can decide on the host role, which makes it very time-consuming.

The diminished effectiveness of the aforementioned techniques motivated the use of flow statistics for classifying network traffic. There are at least three reasons why this approach is recommended: first, different applications manifest dissimilar behaviors and thus exhibit different flow statistics (for instance, a large file transfer using FTP would have higher average packet size and smaller mean packet interarrival time than an instant messaging client sending short, occasional messages to other clients); second, although obfuscation of flow statistics is also possible, it is generally much harder to implement; third, classification based on flow statistics can benefit from the large body of work on scalable flow sampling/estimation techniques [25, 26, 27, 28, 29].

Several methods have been proposed to classify traffic based on summarized flow information such as duration, number of packets and mean inter-arrival time [30, 31], but they are all offline algorithms and are not sufficiently mature. In [32], the authors use some fundamental characteristics of P2P protocols to identify P2P applications, such as the huge network diameter and the presence of many hosts acting as both servers and clients. It utilizes only the transport layer header of every packet, and can identify unknown P2P protocols. However, it is also time-consuming. In [33, 34] the authors proposed a technique that relies on the observation of the first five packets of a TCP connection to identify the application. In [35] an algorithm is proposed to identify P2P traffic based on machine learning techniques: by investigating the ratio between the upload and download traffic volume of several P2P applications, a characteristic library is constructed; then, the unknown network traffic can be recognized online using this library. In references [36, 37] the authors also use machine learning techniques for traffic classification and [38] proposes a traffic classifier using supervised machine learning based on a Bayesian trained neural network. In reference [39] a back propagation neural network model is used to distinguish between P2P and non-P2P applications; finally, in [40] neural networks were successfully used to identify several Internet applications, although none of them was of the P2P type.

3 IDENTIFICATION FRAMEWORK BASED ON PCA

The proposed identification methodology will constitute the basis for an integrated identification framework, whose functioning principles are depicted in Figure 1. The central element of the identification tool is the PCA-based identification methodology: first, the predominance areas corresponding to each IP application are calculated using a set of known traffic values associated with each application; after this pre-identification phase, the PCA methodology can be used to identify IP applications based on new traffic values that are presented as inputs. Obviously, the pre-calculation of the predominance areas relies on a pre-classification of the various IP applications that is based on offline measurements that were previously made and stored. The pre-classification can rely on conventional application mapping approaches or can derive from known traffic generated and measured in a controlled environment. So, this kind of training phase, although it can be computationally demanding, is an offline phase.

The online classification phase relies on online measurements that are continuously made on the network infrastructure and are also stored to become essential historic data for further refinements of the predominance areas pre-calculation phase (training phase). The result of the online classification phase, that is, the correct identification of the different applications, must finally be validated. This process involves human intervention and must also take into account the pre-classification of the different applications. The cycle that constitutes this framework is a continuously evolving structure, so the different blocks of the flowchart are continuously updated in such a way that reflects the best possible classification of the different Internet applications.

[Figure 1 block labels: Offline measurements; Pre-identification; PCA-based pre-calculation of the predominance areas for each IP application; Online measurements; PCA-based identification; Application Identification; Validation]

Figure 1: PCA-based integrated framework for identifying IP applications.

4 SELECTED APPLICATIONS AND MEASURED TRAFFIC TRACES

In order to cover the most popular P2P protocols, the following applications were selected for this study: Shareaza 2.2.5, that was used to connect to the Gnutella (version 1) network (although it allows connection to several P2P networks), eMule 0.48A to connect to the eMule network and BitTorrent 5.0.5 to connect to the BitTorrent network. Besides these P2P applications, we have also included other important applications in terms of their contribution to the current Internet usage and exchanged traffic amount: Skype 3.5.0, that enables connection to the Skype network, YouTube as an example of a centralized file sharing application, web-based file sharing and web browsing.

Skype is the most popular Voice over IP (VoIP) and instant messaging application. The Skype protocol defines the direct exchange of packets between peers: whenever direct exchange is not possible, Skype relies on routing mechanisms that use other peers of the Skype network. The decentralized Skype infrastructure makes it scalable without implying additional costs. YouTube is a centralized file sharing service that enables the upload and download of video files to/from network servers. Although this application is not P2P, it is included in this study due to its current popularity/importance. Web-based file sharing and web browsing are well known and very common Internet applications.

Our study resorts to data traces that were measured from March to July, 2007, on a 4 Mbps ADSL access link. In order to keep the same hardware and software configurations and the same application settings, the following parameterizations were adopted: (i) maximum download rate of 96 Kbps; (ii) maximum upload rate of 10 Kbps and (iii) maximum number of connections equal to 100. In order to compare the behavior of similar applications, the same configurations, capture durations, queries and transferred files were considered for all measurement sessions. Each measurement session registered the packet headers of all packets flowing in both directions (upload/download); no packet drops were reported. The traffic analyzer was an 800 MHz Intel Celeron laptop having 256 Mbytes of SDRAM and running WinDump.

As an illustrative example, the bitTorrent session that was planned in order to capture this type of traffic consisted of the following sequence of actions:

1. start of the traffic capture (windump -i 2 -C 700 -w bittorrentCapture.dmp);
2. connection to the mininova.org website:
   (a) Search for Vista Transformation Pack;
   (b) Order the transfer of the Vista Transformation Pack 6.0.exe.torrent (9.8KB) torrent;
3. start the bitTorrent application;
4. wait for approximately 2 minutes such that the host connected to the bitTorrent network can stabilize (flow of the application initialization messages);
5. start the transfer of the Vista Transformation Pack 6.0.exe (30.27MB) file, using the torrent;
6. two minutes later, restore the connection to the mininova.org website:

   (a) Search for Skype;
   (b) Order the transfer of the Skype V. 3.1.0.152 Final.exe.torrent (8.4KB) torrent;
7. start the transfer of the Skype V. 3.1.0.152 Final.exe (19.97MB) file, using the torrent;
8. close the application, after 10 minutes of inactivity;
9. open the application; wait for approximately 2 minutes such that the host connected to the bitTorrent network can stabilize;
10. restore the connection to the mininova.org website:
   (a) Search for AVG;
   (b) Order the transfer of the AVG Professional Internet Security Suite.rar.torrent (10.6KB) torrent;
11. start the transfer of the AVG Professional Internet Security Suite.rar (39.23MB) file, using the torrent;
12. after a five-minute period, restore the connection to the mininova.org website:
   (a) Search for Ubuntu;
   (b) Order the transfer of the Ubuntu-6.10-Desktop-i386.iso.torrent (27.5KB) torrent;
13. start the transfer of the Ubuntu-6.10-Desktop-i386.iso (698.36MB) file, using the torrent;
14. close the application, after one hour of activity;
15. end of the capture.

Other utilization sessions were also defined for the measurement scenarios corresponding to the remaining selected applications.

5 PRINCIPAL COMPONENT ANALYSIS

Principal component analysis involves a mathematical procedure that transforms a number of (possibly) correlated variables into a (smaller) number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

Given the random variables $X_1, X_2, \ldots, X_p$, the k-th principal component (PC k) is defined as the linear combination

$$Z_k = \alpha_{k1} X_1 + \alpha_{k2} X_2 + \ldots + \alpha_{kp} X_p \qquad (1)$$

such that the loadings of $Z_k$, $\alpha_k = (\alpha_{k1}, \alpha_{k2}, \ldots, \alpha_{kp})^t$, have unitary Euclidean norm, $Z_k$ has maximum variance and PC k, $k \geq 2$, is uncorrelated with the previous PCs, which in fact means that $\alpha_k^t \alpha_j = 0$, $j = 1, \ldots, k-1$, and $\alpha_k^t \alpha_k = 1$. Thus, the first principal component is the linear combination of the observed variables with maximum variance. The second principal component satisfies a similar optimality criterion and is uncorrelated with PC 1, and so on. As a result, the principal components are indexed by decreasing variance, i.e., $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p$, where $\lambda_r$ denotes the variance of PC r and p is the maximum number of PCs (n > p).

It can be proved [41] that the vector of loadings of the k-th principal component, $\alpha_k$, is the eigenvector associated with the k-th highest eigenvalue, $\lambda_k$, of the covariance matrix of the observed variables. Therefore, the k-th highest eigenvalue of the covariance matrix is the variance of PC k, i.e. $\lambda_k = \mathrm{Var}(Z_k)$.

The proportion of the total variance explained by the first r principal components is

$$\frac{\lambda_1 + \ldots + \lambda_r}{\lambda_1 + \ldots + \lambda_p}. \qquad (2)$$

If this proportion is close to one, then there is almost as much information in the first r principal components as in the original p variables. In practice, the number r of considered principal components should be chosen as small as possible, taking into account that the proportion of the explained variance, equation (2), should be large enough.

Once the loadings of the principal components are obtained, the score of object i on PC j is given by

$$z_{ij} = \alpha_{j1} x_{i1} + \alpha_{j2} x_{i2} + \ldots + \alpha_{jp} x_{ip} \qquad (3)$$

where $x_i = (x_{i1}, \ldots, x_{ip})^t$ is the data corresponding to object i.

6 IDENTIFICATION METHODOLOGY AND RESULTS

This section describes in detail the proposed identification methodology and the main results obtained from its application to the measurement traces presented in section 4.

Each capture file was processed using the TSTAT application [42], that correlates forward and backward packet streams in order to obtain detailed statistical information about each packet flow. This information will be used by the proposed identification methodology to identify the traffic characteristic patterns associated to each IP application. Table 1 presents the most relevant upload and download statistics that were outputted by TSTAT for each one of the measured traces: in fact, the table only presents a small set of all the parameters that are outputted

by TSTAT, the ones that were chosen based on their relevance for the purposes of the current study; note that the meaning of each column can be found in the Appendix. Table 2 presents the average session durations per application. As can be seen, file sharing applications are predominant in terms of generated traffic. The upload traffic corresponding to Shareaza is slightly lower than the upload traffic of other P2P applications (and HTTP file sharing), since Gnutella only shares complete files. Regarding download, eMule is the file sharing application with the worst performance, while Shareaza and web-based file sharing are the most efficient applications.

Table 1: Some of the upload and download statistics outputted by TSTAT.

CLIENT      #Flows  Packets  RST  ACK     Pure ACK  KBytes      Data packets  KBytes (w/ retrans)  rexmit packets
Shareaza    1075    58893    65   56077   53225     145,072     2761          157,292              1847
eMule       998     116631   16   115376  21930     107780,784  92451         108433,256           1394
bitTorrent  1166    205082   174  203373  85392     124903,684  117242        126209,951           2675
HTTP        697     79923    281  79192   76818     1120,075    1723          1126,331             42
Browsing    824     11358    317  10450   7364      1225,208    2351          1296,95              143
Skype       51      318      3    226     59        4,491       150           4,499                46
youTube     278     40670    170  40383   38206     1369,82     1912          1401,508             61

CLIENT      rexmit KBytes  SYN count  FIN count  SACK sent  rtx RTO  rtx FR  reordering  unknown  unnece rtx RTO
Shareaza    13,965         2815       27         5463       1824     14      0           21       0
eMule       652,833        1251       1002       1918       1146     21      1694        1563     128
bitTorrent  1306,845       1679       646        4925       1503     117     3735        2159     861
HTTP        6,29           730        371        383        42       0       0           0        0
Browsing    71,826         908        418        249        133      1       0           9        0
Skype       0,052          92         14         7          43       0       0           3        0
youTube     31,698         287        95         1153       33       0       0           28       0

SERVER      #Flows  Packets  RST  ACK     Pure ACK  KBytes      Data packets  KBytes (w/ retrans)  rexmit packets
Shareaza    1075    93273    17   93269   2068      115106,694  91029         115244,894           257
eMule       998     91451    21   91446   55717     34366,424   33826         35017,843            954
bitTorrent  1166    216601   270  216574  80998     129338,573  133337        131278,927           3164
HTTP        697     151435   8    151430  1315      205121,848  149113        205184,553           65
Browsing    824     14133    22   14126   1443      11368,405   11538         11377,214            39
Skype       51      343      0    343     138       37,696      184           37,696               1
youTube     278     72746    7    72740   445       96745,717   71934         97004,513            194

SERVER      rexmit KBytes  SYN count  FIN count  SACK sent  rtx RTO  rtx FR  reordering  unknown  unnece rtx RTO
Shareaza    138,212        115        54         4          86       4       858         440      99
eMule       651,495        959        935        4087       660      15      187         978      224
bitTorrent  1940,692       1138       931        6717       2416     213     1478        2616     514
HTTP        62,717         674        573        0          55       32      176         34       6
Browsing    8,838          787        577        1          68       32      136         44       1
Skype       0,001          11         11         0          1        0       9           1        0
youTube     258,8          272        163        0          30       30      776         123      4

Table 2: Average session durations per application.

                     Shareaza  eMule  bitTorrent  HTTP   Browsing  Skype  youTube
Mean Duration (sec)  31.83     79.25  108.84      24.65  28.25     70.89  67.11

Based on the collected statistics, a series of bidimensional graphics were made representing the relationships between the parameters outputted by TSTAT (each graph relates to a pair of parameters). As an example, Figure 2 presents different plots, one per application, of the number of downloaded packets as a function of the flow duration, while Figure 3 presents the plots corresponding to the number of uploaded bytes as a function of the number of valid round trip times for each application. This kind of graph provides a qualitative overview of the relative importance of each parameter to the traffic identification objective. Looking at Figure 2, we can see that for all applications there are basically two kinds of flows: flows related to TCP connection establishment and termination phases and flows related to data transfer. For each one of these types there is a roughly linear growth of the number of packets as a function of the flow duration. A quite clear linear growth of the number of uploaded bytes as a function of the number of valid round trip time values can also be observed from Figure 3.

The final set of parameters selected for traffic characterization are the ones that present the most significant (graphical) differences between protocols. For the conducted study, the following parameters were identified as good candidates for this set: flow duration, total number of packets, total number of ACK messages, total number of payload bytes, total number of SYN messages, valid round trip time and average round trip time.

Having agreed on a relevant set of parameters (that can be somewhat extensive) selected from all parameters outputted by TSTAT, the next step is to apply PCA in order to obtain a

Fig. 1. Number of downloaded packets versus completion time, for each application. 4 x 10 2 2000

1.5 1500

1 1000

0.5 500 download packets [Number] download packets [Number]

0 0 0 500 1000 1500 2000 2500 0 50 100 150 200 250 Completion time [sec] Completion time [sec]

4 4 10000 x 10 x 10 2.5 5 8000 2 4

6000 1.5 3

4000 1 2

2000 0.5 1 download packets [Number] download packets [Number] download packets [Number] 0 0 0 0 1000 2000 3000 4000 0 500 1000 1500 2000 0 1000 2000 3000 4000 Completion time [sec] Completion time [sec] Completion time [sec] 300 10000

8000 200 6000

100 4000

Figure 2: Number of downloaded packets versus completion time for: (from left to right and top to bottom) bitTorrent, Browsing, eMule, HTTP, Shareaza, Skype and youTube. [Each panel plots download packets [Number] against completion time [sec].]

quantitative feedback on the most important parameters that have to be considered for identification. Principal components are calculated for each possible combination of parameters (taken from the selected set), allowing for the identification of the combinations whose first two principal components account for more than a certain threshold (empirically set to 75%) of the data variability. In this way, ignoring the other principal components does not lead to a significant loss of relevant information. The combination having the largest number of parameters that was able to fulfil the imposed requirement, for both upload and download, is shown in Table 3.

The next step of the identification methodology is the bidimensional identification of characteristic traffic patterns. In order to accomplish this, the following steps are executed:

1. calculation of the minimum and maximum values of the first two principal components, considering all applications (Figure 4);

2. division of the bidimensional space into a pre-defined number of rectangular areas (this is an input parameter of the algorithm - integer values between 2 and 5 were considered, because a number of areas higher than 5 leads to very small areas that obviously contain too few points);

3. for each area, verify if the number of points corresponding to a certain application (the one that is being analyzed) is significantly higher than the number of points corresponding to the other applications. This operation is performed for each elementary area:

   (a) areas where the number of points corresponding to the analyzed application is not significant (lower than a pre-established threshold, which was taken as 1.5%) are automatically discarded;

   (b) areas where the number of points corresponding to the analyzed application is significant, but at least one of the other applications has a non-negligible number of points (higher than another pre-established threshold, which was taken as 1%), are further subdivided into a number of areas equal to the pre-defined number of areas, and step 3 is repeated.
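The steps above can be sketched in Python as follows. This is only an illustrative reading of the procedure, not the authors' implementation: it assumes that the "pre-defined number of areas" is applied per axis (an n_div x n_div grid), that the 1.5% and 1% thresholds are shares of each application's total number of points, and it adds a hypothetical max_depth limit so that the recursion always terminates. The function and parameter names are invented for the example.

```python
from collections import Counter


def characteristic_areas(points, target, n_div=3, min_share=0.015,
                         max_other=0.01, max_depth=3):
    """Return rectangles (x0, x1, y0, y1) dominated by `target`.

    points: iterable of (pc1, pc2, application) tuples.
    Thresholds are interpreted as shares of each application's total
    point count (an assumption, see the text above).
    """
    points = list(points)
    totals = Counter(app for _, _, app in points)
    if totals[target] == 0:
        return []
    accepted = []

    def cell_index(value, low, high):
        # Map a coordinate to one of n_div equal-width slots.
        if high == low:
            return 0
        return min(int((value - low) / (high - low) * n_div), n_div - 1)

    def split(box, members, depth):
        x0, x1, y0, y1 = box
        dx, dy = (x1 - x0) / n_div, (y1 - y0) / n_div
        cells = {}
        # Step 2: distribute the points over the rectangular areas.
        for x, y, app in members:
            key = (cell_index(x, x0, x1), cell_index(y, y0, y1))
            cells.setdefault(key, []).append((x, y, app))
        for (i, j), inside in cells.items():
            counts = Counter(app for _, _, app in inside)
            # Step 3(a): too few points of the analysed application.
            if counts[target] / totals[target] < min_share:
                continue
            cell = (x0 + i * dx, x0 + (i + 1) * dx,
                    y0 + j * dy, y0 + (j + 1) * dy)
            mixed = any(counts[app] / totals[app] > max_other
                        for app in counts if app != target)
            if not mixed:
                accepted.append(cell)
            elif depth < max_depth:
                # Step 3(b): subdivide the area and repeat step 3.
                split(cell, inside, depth + 1)

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Step 1: minimum and maximum of the first two principal components.
    split((min(xs), max(xs), min(ys), max(ys)), points, 1)
    return accepted
```

Areas that are still mixed when max_depth is reached are simply dropped in this sketch; the paper does not state how such cases are handled.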

Ubiquitous Computing and Communication Journal 7

Figure 3: Number of uploaded bytes versus round trip time count for: (from left to right and top to bottom) bitTorrent, Browsing, eMule, HTTP, Shareaza, Skype and youTube. [Each panel is titled "<application> Upload data bytes per rtt count" and plots data bytes [KBytes] against rtt count [Number].]

Figure 5 graphically illustrates the application of this algorithm to some example statistical data. The above mentioned percentages were empirically chosen and the proposed methodology is independent of these values. Obviously, results depend on them, so they must be carefully chosen.

This approach allows the identification of the predominance areas for each application. For P2P file sharing applications, it was possible to identify traffic areas (for both upload and download) that contain more than 75% of their points. Besides, other applications have no expression in these areas: less than 1% of their points are located there. These results offer a high degree of certainty in the identification of this type of application. Note that the number of areas depends on the traffic type (upload/download) and on the application, but in this case fewer than 6 distinct areas are necessary to identify more than 75% of the points corresponding to each P2P file sharing application. Table 4 presents the results obtained per application, in terms of the total number of points that fall inside the defined areas, for download traffic.

In order to assess the efficiency of the proposed identification methodology, the algorithm explained above was applied to two data sets: a training set and a testing set of approximately the same size. For the same combination of parameters that was previously identified, the results obtained for the download traffic are illustrated in Table 5 for both data sets. Similar results were also obtained for the upload data. As can be seen, for both data sets it was possible to identify graphical areas that contain a high percentage of points corresponding to a certain application and include an insignificant percentage of points from the other applications. So, the obtained results confirm the efficiency of the proposed methodology.

Figures 6 and 7 show the first two principal components for the BitTorrent and youTube upload and download traffic, respectively. Similar graphs were also obtained for the other applications but are not presented here due to lack of space. As can be seen, the plots corresponding to the training and testing sets are very similar to each other.
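As a rough illustration of how the per-application percentages for the training and testing sets could be obtained, the sketch below computes the share of points (in the plane of the first two principal components) that fall inside a set of rectangular areas. The function name, the rectangle representation and the sample points are hypothetical and are not taken from the paper's data.

```python
def coverage(points, areas):
    """Fraction of (pc1, pc2) points lying inside at least one rectangle.

    areas: list of (x0, x1, y0, y1) rectangles, edges included.
    """
    points = list(points)
    if not points:
        return 0.0
    hits = sum(
        any(x0 <= x <= x1 and y0 <= y <= y1 for x0, x1, y0, y1 in areas)
        for x, y in points
    )
    return hits / len(points)


# Hypothetical area identified on the training set.
areas = [(0.0, 1.0, 0.0, 1.0)]
training = [(0.2, 0.3), (0.9, 0.1), (0.5, 0.5), (2.0, 2.0)]
testing = [(0.1, 0.8), (0.4, 0.4), (1.5, 0.2), (0.7, 0.6)]
print(coverage(training, areas), coverage(testing, areas))  # 0.75 0.75
```

Similar coverage values on both sets would correspond to the situation described above, where the areas found on one set remain representative on the other.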

Table 3: Combination with the largest number of parameters that was able to fulfil the imposed requirement.

Components: Completion time, rtt count, data bytes, ACK sent, packets

Application    Upload    Download
bitTorrent     93.04%    96.57%
browsing       75.18%    96.94%
eMule          99.29%    90.44%
HTTP           97.07%    99.99%
Shareaza       91.78%    99.59%
Skype          99.77%    99.93%
youTube        89.86%    99.99%
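The selection of the parameter combination can be sketched as below. This is a minimal example under stated assumptions rather than the procedure actually used to produce Table 3: it handles a single data matrix (the paper requires the threshold to hold for both upload and download), it standardises each parameter before computing the principal components, and the function names are invented for illustration. It relies on NumPy (`numpy.cov`, `numpy.linalg.eigvalsh`).

```python
from itertools import combinations

import numpy as np


def first_two_share(data):
    """Share of total variance captured by the first two principal components."""
    # Standardise columns so parameters with different units are comparable
    # (an assumption; the paper does not state the scaling used).
    std = data.std(axis=0)
    std[std == 0] = 1.0
    z = (data - data.mean(axis=0)) / std
    # Eigenvalues of the covariance matrix, in ascending order.
    eigvals = np.linalg.eigvalsh(np.cov(z, rowvar=False))
    return eigvals[-2:].sum() / eigvals.sum()


def largest_combination(data, names, threshold=0.75):
    """Largest parameter subset whose first two PCs reach the threshold.

    Returns (parameter names, variance share) or None. Any pair of
    parameters trivially reaches 100%, so the search normally succeeds.
    """
    data = np.asarray(data, dtype=float)
    for size in range(len(names), 1, -1):
        best = None
        for cols in combinations(range(len(names)), size):
            share = first_two_share(data[:, list(cols)])
            if share >= threshold and (best is None or share > best[1]):
                best = (tuple(names[c] for c in cols), share)
        if best is not None:
            return best
    return None
```

Trying every combination is exponential in the number of parameters, which is acceptable for a small selected set but would need pruning for larger ones.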

Figure 4: Calculation of the minimum and maximum values, for all applications.

7 CONCLUSIONS AND FURTHER RESEARCH

As the number and diversity of IP applications increase, it becomes more and more important to accurately map Internet traffic to the corresponding applications. Network management and measurement tasks like traffic engineering, service differentiation, performance/failure monitoring, and security can greatly benefit from this mapping ability. Since traditional mapping approaches have important limitations when applied to some specific identification scenarios, this paper proposed a methodology, based on Principal Component Analysis, to identify the differentiating characteristics of Internet applications. The results obtained by applying the proposed methodology to several IP applications, including P2P applications, show that this method can be efficiently used to identify characteristic traffic patterns of IP applications and can constitute the basis for an efficient traffic identification tool.

APPENDIX

The columns of Table 1 have the following meaning: (i) #Flows - number of identified flows; (ii) packets - number of packets sent; (iii) RST - number of sessions where an RST message was sent; (iv) ACK - total number of ACK messages sent; (v) Pure ACK - total number of ACK messages sent without payload; (vi) KBytes - total number of KBytes sent in message payloads; (vii) Data packets - total number of messages sent with payload; (viii) KBytes (w/ retrans) - total number of KBytes sent in message payloads, including retransmissions; (ix and x) rexmit packets/KBytes - total number of messages/KBytes retransmitted; (xi and xii) SYN/FYN count - total number of SYN/FIN messages sent; (xiii) SACK sent - total number of SACK messages sent; (xiv and xv) rtx RTO/FR - total number of messages retransmitted due to Timeout/Fast Retransmit; (xvi) reordering - total number of messages for sequence reordering; (xvii) unknown - total number of messages out of sequence or duplicated, without classification; (xviii) unnece rtx RTO - total number of messages unnecessarily transmitted due to timeout.

ACKNOWLEDGEMENTS

This work was done under the scope of the Euro-FGI and Euro-NF Networks of Excellence, funded by the European Union.

Figure 5: Algorithm for the bidimensional identification of characteristic traffic patterns.

Table 4: Total number of points contained in the defined areas for download traffic.

Table 5: Results obtained for the training and the testing sets.

Components: Completion time, rtt count, data bytes, ACK sent, packets

Application    Upload    Download
bitTorrent     98.10%    96.30%
browsing       96.90%    98.44%
eMule          93.36%    90.27%
HTTP           99.99%    99.99%
Shareaza       95.45%    99.61%
Skype          97.52%    100.00%
youTube        99.99%    99.99%

Figure 6: PCA for bitTorrent traffic: (top left) upload training set; (top right) upload testing set; (bottom left) download training set; (bottom right) download testing set. [Each panel plots the 2nd Principal Component against the 1st Principal Component.]

Figure 7: PCA for youTube traffic: (top left) upload training set; (top right) upload testing set; (bottom left) download training set; (bottom right) download testing set. [Each panel plots the 2nd Principal Component against the 1st Principal Component.]

References

[1] K. Gummadi, R. Dunn, S. Saroiu, S. Gribble, H. Levy, and J. Zahorjan, “Measurement, modeling, and analysis of a peer-to-peer file-sharing workload,” in Proceedings of the 19th ACM Symposium on Operating Systems Principles 2003, 2003.

[2] T. Karagiannis, A. Broido, N. Brownlee, K. C. Claffy, and M. Faloutsos, “Is P2P dying or just hiding?,” in IEEE Global Telecommunications Conference 2004, 2004.

[3] P. Salvador, R. Valadas, R. Nogueira, and A. Nogueira, “Identifying differentiating characteristics of internet applications using principal component analysis,” in Proceedings of the 6th Symposium on Communication Systems, Networks and Digital Signal Processing (CSNDSP’08), 2008.

[4] S. Sen and J. Wang, “Analyzing peer-to-peer traffic across large networks,” IEEE/ACM Transactions on Networking, vol. 12, no. 2, pp. 219–232, 2004.

[5] D. Moore, K. Keys, R. Koga, E. Lagache, and K. C. Claffy, “The coralreef software suite as a tool for system and network administrators,” in Proceedings of the 15th USENIX Conference on Systems Administration (LISA’01), 2001, pp. 133–144.

[6] http://www.winmx.com, “Winmx”.

[7] http://www.nynode.info, “Winny”.

[8] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, “Transport layer identification of P2P traffic,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference, 2004, pp. 121–134.

[9] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of p2p traffic using application signatures,” in Proceedings of the WWW Conference, 2004.

[10] A. Madhukar and C. Williamson, “A longitudinal study of p2p traffic classification,” in Proceedings of the MASCOTS Conference, 2006.

[11] A. W. Moore and D. Papagiannaki, “Toward the accurate identification of network applications,” in Proceedings of the 6th Passive Active Measurements Workshop, 2005, vol. 3431, p. 41.

[12] T. Choi, C. Kim, S. Yoon, J. Park, H. Kim, H. Chung, and T. Jesong, “Content-aware internet application traffic measurement and analysis,” in Proceedings of the IEEE/IFIP NOMS Conference, 2004.

[13] M. Roesch, “Snort: Lightweight intrusion detection for networks,” in Proceedings of the 13th USENIX Conference on Systems Administration (LISA’99), 1999, pp. 229–238.

[14] V. Paxson, “Bro: a system for detecting network intruders in real-time,” Computer Networks, no. 31.

[15] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “Acas: Automated construction of application signatures,” in Proceedings of the SIGCOMM05 Workshops, 2005.

[16] http://www.snort.org, “Snort”.

[17] P. Barford, J. Kline, D. Plonka, and A. Ron, “A signal analysis of network traffic anomalies,” in Proceedings of the ACM IMW Conference, 2002.

[18] http://www.cachelogic.com, “Cache logic”.

[19] http://www.packeteer.com, “Packeteer”.

[20] S. Dharmapurikar, P. Krishnamurthy, T.S. Sproull, and J.W. Lockwood, “Deep packet inspection using parallel bloom filters,” IEEE/Micro, vol. 24, no. 1.

[21] T. Kocak and I. Kaya, “Low-power bloom filter architecture for deep packet inspection,” IEEE/Communications Letters, vol. 10, no. 3.

[22] A. Broder and M. Mitzenmacher, “Network applications of bloom filters: a survey,” Internet Mathematics, vol. 1, no. 4.

[23] Cisco IOS Documentation, “Network-based application recognition and distributed network-based application recognition,” 2006.

[24] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: multilevel traffic classification in the dark,” in Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2005.

[25] http://www.cisco.com, “Cisco netflow”.

[26] N. Duffield, C. Lund, and M. Thorup, “Properties and prediction of flow statistics from sampled packet streams,” in Proceedings of the IMW Conference, 2002.

[27] N. Duffield, C. Lund, and M. Thorup, “Flow sampling under hard resource constraints,” in Proceedings of the SIGMETRICS Conference, 2004.

[28] C. Estan, K. Keys, D. Moore, and G. Varghese, “Building a better netflow,” in Proceedings of the SIGCOMM Conference, 2004.

[29] R. Kompella and C. Estan, “The power of slicing in internet flow measurement,” in Proceedings of the IMC Conference, 2005.

[30] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-service mapping for QoS: A statistical signature-based approach to IP traffic classification,” in Proceedings of the ACM SIGCOMM Internet Measurement Conference, 2004, pp. 135–148.

[31] A. Moore and D. Zuev, “Internet traffic classification using bayesian analysis,” in Proceedings of International Conference on Measurement and Modeling of Computer Systems, 2005, pp. 50–60.

[32] F. Constantinou and P. Mavrommatis, “Identifying known and unknown peer-to-peer traffic,” in Proceedings of Fifth IEEE International Symposium on Network Computing and Applications, 2006, pp. 93–102.

[33] L. Bernaille, R. Teixeira, and I. Akodkenou, “Traffic classification on the fly,” Computer Communication Review, vol. 36, no. 2, pp. 239–26, 2006.

[34] L. Bernaille and R. Teixeira, “Early recognition of encrypted applications,” in Proceedings of the 8th Passive and Active Measurement Conference (PAM 2007), 2007.

[35] H. Liu, W. Feng, Y. Huang, and X. Li, “A peer-to-peer traffic identification method using machine learning,” in Proceedings of the International Conference on Networking, Architecture and Storage, 2007.

[36] R. Yuan, Z. Li, and X. Guan, “Accurate classification of the internet traffic based on the SVM method,” in Proceedings of the 42nd IEEE International Conference on Communications (ICC 2007), 2007.

[37] N. Williams, S. Zander, and G. Armitage, “A preliminary performance comparison of five machine learning algorithms for practical IP traffic flow classification,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 5.

[38] T. Auld, A. W. Moore, and S. F. Gull, “Bayesian neural networks for internet traffic classification,” IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 223–239, 2007.

[39] F. Shen, C. Pan, and X. Ren, “Research of P2P traffic identification based on BP neural network,” in Proceedings of the Third International Conference on International Information Hiding and Multimedia Signal Processing (IIH-MSP 2007), 2007.

[40] A. Ali and R. Tervo, “Traffic identification using artificial neural network,” in Canadian Conference on Electrical and Computer Engineering, vol. 1, pp. 667–672.

[41] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.

[42] http://tstat.tlc.polito.it/index.shtml, “Tstat - tcp statistic and analysis tool”.