Statistical, Real-Time Classification of IP Traffic in Linux Operating System
SILESIAN UNIVERSITY OF TECHNOLOGY
FACULTY OF AUTOMATIC CONTROL, ELECTRONICS AND COMPUTER SCIENCE

Master thesis

Statistical, real-time classification of IP traffic in Linux operating system

Author: Paweł Foremski
Supervisor: dr inż. Arkadiusz Biernacki

Gliwice, September 2011

Table of Contents

1. INTRODUCTION
  1.1. The problem of Internet traffic classification
  1.2. Thesis goals
  1.3. Review of existing solutions
    1.3.1 Simple methods
    1.3.2 Deep Packet Inspection
    1.3.3 Modern approaches
  1.4. Thesis contents
2. SYSTEM DESCRIPTION
  2.1. Main algorithm
    2.1.1 Feature extraction
    2.1.2 Decision process
    2.1.3 Modifications
  2.2. System architecture
    2.2.1 Signature database
    2.2.2 Training signatures and the SVM model
    2.2.3 Traffic sources
    2.2.4 Endpoint table
    2.2.5 Feature extraction and the decision process
    2.2.6 Classification results
  2.3. Methodology
3. IMPLEMENTATION
  3.1. Architecture
  3.2. External libraries and facilities
  3.3. Main program: the libspi library
    3.3.1 File list
    3.3.2 Data structures and variables
    3.3.3 Control flow and events
    3.3.4 Application Programming Interface
  3.4. Front-end: the spid program
    3.4.1 File list
    3.4.2 Data structures
    3.4.3 Control flow and communication with libspi
    3.4.4 User interface
4. EVALUATION
  4.1. Datasets
  4.2. Results
    4.2.1 Test 1: performance vs. training set size
    4.2.2 Test 2: overall system performance
    4.2.3 Test 3: unknown protocol detection
    4.2.4 Test 4: processing speed
  4.3. Discussion
    4.3.1 Test 1
    4.3.2 Test 2
    4.3.3 Test 3
    4.3.4 Test 4
5. CONCLUSIONS
6. SUMMARY
7. APPENDIX: IMPLEMENTATION DETAILS
  7.1. libspi data structures
    7.1.1 Main structure: struct spi
    7.1.2 Internal events: struct spi_subscribers, spi_event_cb_t and struct spi_event
    7.1.3 IP traffic: struct spi_source and struct spi_pkt
    7.1.4 Endpoints: struct spi_ep
    7.1.5 Signatures: struct spi_signature
    7.1.6 Classification results: struct spi_classresult
    7.1.7 Performance evaluation: struct spi_stats
    7.1.8 KISS algorithm: struct kissp
    7.1.9 Complex decision process: struct verdict and struct ewma_verdict
  7.2. libspi Application Programming Interface
  7.3. spid data structures
  7.4. spid data formats
    7.4.1 Command-line source specification format
    7.4.2 Packet trace index file format
    7.4.3 Signature database file format
    7.4.4 Endpoint classification output format
    7.4.5 Performance metrics output format
8. LITERATURE
9. SUMMARY IN POLISH

1. INTRODUCTION

The Internet has been constantly evolving since its inception. For more than a decade it has been growing in capacity and versatility at a great pace, often requiring Internet Service Providers to update and extend their infrastructure in a timely manner. These changes are connected with the invention of new kinds of computer software, which in turn generate new types of network traffic. However, the fundamental protocol of the Internet, the IP protocol, does not provide a robust and universal means to differentiate one traffic type from another. Thus, identifying a particular application in Internet transmissions is not a trivial task, yet it is very important.
For instance, a typical Internet end-user demands safe and fast Internet access. An Internet Service Provider that is to fulfil such a requirement must be able to monitor the traffic for potential threats and to prioritize it properly. Moreover, there are political and research organizations which monitor the global Internet. Observing the share of P2P traffic in the Internet transmissions of a particular country could reveal trends in its society. Work in these areas