Methods and Algorithms for Network Traffic Classification
Marcin Pietrzyk
A thesis presented to obtain the degree of Doctor from Télécom ParisTech

Marcin Pietrzyk
April 2011

Thesis advisors:
Professor Dr. Guillaume Urvoy-Keller, CNRS/UNSA
Dr. Jean-Laurent Costeux, Orange Labs

Abstract

Traffic classification consists of associating network flows with the application that generated them. This subject, of crucial importance for service providers, network managers and companies in general, has already received substantial attention in the research community. Despite these efforts, a number of issues remain unsolved. The present work consists of three parts dealing with various aspects of the challenging task of traffic classification and some of its use cases.

The first part presents an in-depth study of state-of-the-art statistical classification methods, using passive traces collected at the access network of an ISP offering ADSL access to residential users. We critically evaluate their performance, including the portability aspect, which has so far been overlooked by the community. Portability is defined as the ability of a classifier to perform well on sites (networks) different from the one it was trained on. We uncover important, previously unknown issues and analyze their root causes using numerous statistical tools. A last contribution of the first part is a method that allows us to mine the unknown traffic (unknown from the perspective of a deep packet inspection tool).

The second part aims at providing a remedy for some of the problems uncovered in part one, mainly those concerning portability. We propose a self-learning hybrid classification method that enables synergy between a priori heterogeneous sources of information (e.g. statistical flow features and the presence of a signature). We first extensively evaluate its performance using the data sets from part one. We then report the main findings from the deployment of the tool on an operational ADSL platform, where a hybrid classifier monitored the network for more than half a year.

The last part presents a practical use case of traffic classification and focuses on profiling the customers of an ISP at the application level. The traffic classifier is here the input tool for the study. We propose to rely on hierarchical clustering to group the users into a limited number of sets. We put emphasis both on heavy hitters and on less active users, as the profile of the former (p2p) differs from that of the latter. Our work complements recent findings of other studies reporting an increasing trend of HTTP-driven traffic becoming dominant at the expense of p2p traffic.
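The hybrid idea summarized above, developed in Part II as HTI, combines statistical flow features with the presence of payload signatures in one supervised model trained per application; the table of contents indicates logistic regression as the underlying learner. The snippet below is a minimal sketch of that combination, assuming scikit-learn and entirely made-up feature names, data and labels; it is illustrative only and does not reproduce the HTI implementation.

```python
# Minimal sketch of a hybrid classifier: statistical flow features
# (sizes of the first data packets) are combined with a binary
# payload-signature flag and fed to a logistic regression model.
# All data here is synthetic and hypothetical; this is NOT the HTI tool.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_flows = 1000

# Statistical flow features: sizes (in bytes) of the first three data packets.
pkt_sizes = rng.integers(40, 1500, size=(n_flows, 3))
# Heterogeneous source of information: presence (1) or absence (0) of a signature.
sig_flag = rng.integers(0, 2, size=(n_flows, 1))
X = np.hstack([pkt_sizes, sig_flag])

# Hypothetical per-flow labels: 1 = flow belongs to the target application.
y = (sig_flag.ravel() | (pkt_sizes[:, 0] < 200)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("hold-out accuracy:", model.score(X_test, y_test))
```

Training one such binary model per target application mirrors the per-application sub-model structure listed in Chapter 11.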
Acknowledgments

First of all, I would like to express my deep gratitude to my thesis supervisor, Prof. Dr. Guillaume Urvoy-Keller, who suggested that I pursue this research direction and who convinced me it was possible, despite my doubts. You have continuously supported all my efforts, both in research and in my personal projects, not to mention your invaluable support and countless contributions to this work. Thank you for everything; it was a truly great experience working with you!

I would also like to thank my co-advisor at Orange, Dr. Jean-Laurent Costeux, who was my guide in the jungle of large corporations. Thanks for all your valuable input into this thesis.

Big thanks are also due to all my collaborators, whose experience and knowledge contributed greatly to this work, especially Dr. Taoufik En-Najjary, whose brilliant ideas and incredible theoretical knowledge made many parts of this work possible; it was a great pleasure to have you on my side! Thanks to Louis Plissonneau for sharing the good and the difficult parts of a PhD student's life. I would also like to thank Prof. Dr. Rui Valadas, Prof. Dr. Rosario de Oliveira and Denis Collange for their input on the feature stability evaluation and the fruitful discussions.

I would like to thank all my colleagues from Orange Labs, especially Frederic and Mike, my office mates, Chantal, Patrick, Olivier, Paul and Seb, as well as all the non-permanent people I had the chance to meet during the three years spent at the Lab. Last but not least, I would like to thank my interns, especially Michal, for his great input into the development of the monitoring tools.

Finally, I would like to express my deepest gratitude to my parents and my girlfriend for their support in all my life choices. I owe you everything!

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Organization of the Thesis

Part I: Statistical Methods Evaluation and Application

2 Introduction
  2.1 Contributions - Part I
  2.2 Relevant Publications for Part I

3 Background on Traffic Classification
  3.1 Definition of the Traffic Classification Problem
  3.2 Performance Metrics
  3.3 Ground Truth
  3.4 Methods and Related Work

4 Methods Evaluation - Background
  4.1 Traffic Data
    4.1.1 Dataset
    4.1.2 Reference Point
    4.1.3 Traffic Breakdown
  4.2 Methodology
    4.2.1 Classification Algorithms
    4.2.2 Features
    4.2.3 Training Set
    4.2.4 Flow Definition - Data Cleaning
  4.3 Summary

5 Method Evaluation
  5.1 Classification - Static Case
    5.1.1 Number of Packets - Set A
    5.1.2 Static Results
    5.1.3 Static Results - Discussion
  5.2 Classification - Cross Site
    5.2.1 Overall Results
    5.2.2 Detailed Results for Set A (packet sizes)
    5.2.3 Detailed Results for Set B (advanced statistics)
  5.3 Impact of the Port Numbers
    5.3.1 Impact of the Port Numbers - Static Case
    5.3.2 Impact of the Port Numbers - Cross Site Case
  5.4 Impact of the Algorithm
  5.5 Discussion

6 Methods Evaluation - Principal Components
  6.1 PCA
  6.2 Description of the Flow Features
  6.3 Variable Logging
  6.4 Interpretation of the Principal Components
  6.5 Classification Results
    6.5.1 Static Case
    6.5.2 Cross Site Case
  6.6 Conclusions

7 The Root Cause of the Cross-site Issue
  7.1 Portability Problem
  7.2 Statistical Test and Effect Size
    7.2.1 Effect Size
  7.3 Stability Analysis
    7.3.1 EDONKEY and WEB
    7.3.2 BITTORRENT
    7.3.3 FTP
    7.3.4 CHAT and MAIL
  7.4 Discussion

8 Mining the Unknown Class
  8.1 Unknown Mining Method
  8.2 Predictions
  8.3 Validation
    8.3.1 Peer-to-peer Predictions
    8.3.2 Web Predictions
    8.3.3 Throughput Distribution Comparison
  8.4 The Unknown Class - Discussion

9 Conclusions of Part I

Part II: Hybrid Classifier

10 Introduction
  10.1 Contributions - Part II
  10.2 Relevant Publications for Part II

11 Design and Evaluation
  11.1 HTI - Hybrid Traffic Identification
    11.1.1 Features
  11.2 Machine Learning Algorithm
    11.2.1 Logistic Regression
    11.2.2 Training Phase
    11.2.3 Selection of Relevant Features
  11.3 Per Application Sub-models
  11.4 Off-line Evaluation: Warming-up
  11.5 Off-line Evaluation: Performance Results
    11.5.1 Overall Results
    11.5.2 Multiple Classifications
    11.5.3 Which Method for Which Application?
  11.6 HTI vs. State-of-the-Art Methods
    11.6.1 Legacy Ports Method
    11.6.2 Statistical Methods
    11.6.3 Impact of the Classification Algorithm

12 HTI in the Wild - Production Deployment
  12.1 Implementation Details
  12.2 Platform Details
  12.3 Live Experiment Results
    12.3.1 One Week of Traffic
    12.3.2 Six Months of Traffic
  12.4 Model Validity
  12.5 Efficiency Issues
    12.5.1 CPU
    12.5.2 Memory
    12.5.3 Time to Classify
  12.6 Discussion and Limitations
  12.7 Conclusion

Part III: User Profiling

13 User Profiles
  13.1 Introduction
  13.2 Relevant Publications for Part III
  13.3 Related Work
  13.4 Data Set
  13.5 Reliable Users Tracking
  13.6 Application Classes and Traffic Breakdown
    13.6.1 Traffic Breakdown
  13.7 Distribution of Volumes Per User
  13.8 User Profiling
    13.8.1 Users' Dominating Application
    13.8.2 Top Ten Heavy Hitters
  13.9 Users' Application Mix
    13.9.1 "Real" vs. "Fake" Usage
    13.9.2 Choice of Clustering
    13.9.3 Application Mix
    13.9.4 Profile Agreement
    13.9.5 Application Mix - Discussion
  13.10 Conclusions
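Part III (Sections 13.8-13.9) groups customers by their application-level usage with hierarchical clustering, as summarized in the abstract. The snippet below is a purely illustrative sketch of that kind of clustering step, assuming SciPy with Ward linkage and made-up application classes, per-user traffic shares and cluster count; none of these values are taken from the thesis.

```python
# Illustrative sketch: group users into a few profiles by clustering
# their per-user application mix (cf. Part III). All data is synthetic.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
apps = ["WEB", "P2P", "STREAMING", "MAIL", "OTHER"]  # hypothetical classes

# Hypothetical per-user traffic shares: one row per user, rows sum to 1.
raw = rng.random((50, len(apps)))
shares = raw / raw.sum(axis=1, keepdims=True)

# Agglomerative (hierarchical) clustering with Ward linkage on the share vectors.
Z = linkage(shares, method="ward")

# Cut the dendrogram into a small number of user profiles.
labels = fcluster(Z, t=4, criterion="maxclust")
for profile in sorted(set(labels)):
    members = shares[labels == profile]
    dominant = apps[int(np.argmax(members.mean(axis=0)))]
    print(f"profile {profile}: {len(members)} users, dominant class ~ {dominant}")
```

In the thesis the clustering operates on real per-user application mixes measured on the ADSL platform; here the shares are random and serve only to show the mechanics of cutting the dendrogram into a limited number of profiles.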