GPU Accelerated Protocol Analysis for Large and Long-Term Traffic Traces
GPU ACCELERATED PROTOCOL ANALYSIS FOR LARGE AND LONG-TERM TRAFFIC TRACES

Submitted in fulfilment of the requirements of the degree of
DOCTOR OF PHILOSOPHY
of Rhodes University

Alastair Timothy Nottingham

Grahamstown, South Africa
March 2016

Abstract

This thesis describes the design and implementation of GPF+, a complete general packet classification system developed using Nvidia CUDA for Compute Capability 3.5+ GPUs. The system was developed with the aim of accelerating the analysis of arbitrary network protocols within network traffic traces using inexpensive, massively parallel commodity hardware. GPF+ and its supporting components are specifically intended to support the processing of large, long-term network packet traces, such as those produced by network telescopes, which are currently difficult and time-consuming to analyse.

The GPF+ classifier is based on prior research in the field, which produced a prototype classifier called GPF, targeted at Compute Capability 1.3 GPUs. GPF+ greatly extends the GPF model, improving runtime flexibility and scalability whilst maintaining high execution efficiency. GPF+ incorporates a compact, lightweight register-based state machine that supports massively parallel, multi-match filter predicate evaluation, as well as efficient arbitrary field extraction. GPF+ tracks packet composition during execution and uses warp voting to adjust processing at runtime, avoiding redundant memory transactions and unnecessary computation. GPF+ additionally incorporates a 128-bit in-thread cache, accelerated through register shuffling, to speed access to packet data held in slow GPU global memory. GPF+ uses a high-level DSL to simplify protocol and filter creation, whilst better facilitating protocol reuse. The system is supported by a pipeline of multi-threaded, high-performance host components, which communicate asynchronously through 0MQ messaging middleware to buffer, index, and dispatch packet data on the host system.

The system was evaluated using high-end Kepler (Nvidia GTX Titan) and entry-level Maxwell (Nvidia GTX 750) GPUs. The results of this evaluation showed high system performance, limited only by device-side I/O (600 MB/s) in all tests. GPF+ maintained high occupancy and device utilisation in all tests, without significant serialisation, and showed improved scaling to more complex filter sets. Results were used to visualise captures of up to 160 GB in seconds, and to extract and pre-filter captures small enough to be easily analysed in applications such as Wireshark.
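The following minimal CUDA sketch, which is not taken from the thesis, illustrates the general shape of two device-side mechanisms the abstract names: a 128-bit register-resident window over packet data, and a warp vote that lets an entire warp skip a parsing step that no thread in the warp needs. The packet layout, the names load_window and byte_at, and all offsets are illustrative assumptions, and the register-shuffle acceleration of the cache is omitted for brevity.

// A minimal CUDA sketch, not GPF+ source: one thread classifies one
// fixed-stride packet record. The pre-Volta __any() intrinsic matches
// the Compute Capability 3.5 Kepler target of the thesis; CUDA 9+
// code would use __any_sync() instead.

#include <cstdint>

// Refill the in-thread cache: one coalesced 128-bit read over a
// 16-byte window of the packet. Assumes 16-byte-aligned records.
__device__ uint4 load_window(const uint32_t *pkt, int word_offset)
{
    return *reinterpret_cast<const uint4 *>(pkt + word_offset);
}

// Extract byte i (0..15) of the cached window. Wire bytes land in
// little-endian words, so byte i is the (i % 4)-th byte of word i / 4.
__device__ uint32_t byte_at(uint4 w, int i)
{
    const uint32_t words[4] = { w.x, w.y, w.z, w.w };
    return (words[i >> 2] >> ((i & 3) * 8)) & 0xFFu;
}

// Flag TCP packets: is_tcp[i] receives 1 for Ethernet/IPv4/TCP
// packets and 0 otherwise.
__global__ void classify(const uint32_t *packets, int stride_words,
                         int num_packets, uint8_t *is_tcp)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_packets)
        return;
    const uint32_t *pkt = packets + (size_t)tid * stride_words;

    uint4 win = load_window(pkt, 0);

    // Ethernet EtherType occupies packet bytes 12 and 13 (big-endian
    // on the wire); 0x0800 identifies IPv4.
    const uint32_t ethertype = (byte_at(win, 12) << 8) | byte_at(win, 13);
    const bool ipv4 = (ethertype == 0x0800);

    // Warp vote: if no thread in this warp holds an IPv4 packet, the
    // whole warp skips the IPv4 step, avoiding divergent execution and
    // the redundant memory transaction of the refill below.
    uint8_t tcp = 0;
    if (__any(ipv4)) {
        win = load_window(pkt, 4);  // refill the cache over bytes 16-31
        // The IPv4 protocol field sits at packet byte 23 (14-byte
        // Ethernet header plus offset 9 into the IP header), which is
        // byte 7 of this window; protocol number 6 means TCP.
        tcp = (ipv4 && byte_at(win, 7) == 6) ? 1 : 0;
    }
    is_tcp[tid] = tcp;
}

In GPF+ itself, offsets and protocol structure come from programs compiled from the high-level DSL rather than hard-coded constants, and the in-thread cache is further accelerated with warp shuffle; the sketch only shows why a warp-wide vote removes divergent work and wasted reads.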
Acknowledgements

This research would not have been possible without the support provided by my friends and family. I would like to thank my supervisor, Prof. Barry Irwin, who provided invaluable assistance and guidance for over half a decade in shaping and developing this project through its various iterations. I would like to thank my family, who have shown tremendous support, both emotionally and financially, to help me complete this research. In particular, I would like to thank my wife, Dr Spalding Lewis, for her monumental effort in editing this thesis and supporting me throughout the development and writing process. I would also like to thank my father, Jeremy Nottingham, who provided significant financial support and invaluable proofreading as the document was nearing completion. I would additionally like to thank Caro Watkins, as well as the staff of the Hamilton Building, for all of their help during my many, many years in the department.

Finally, I would like to thank Joey Van Vuuren and the Defence, Peace, Safety and Security (DPSS) unit of the Council for Scientific and Industrial Research (CSIR), and Armscor, for their support, and for generously funding this research.

Contents

I Introduction

1 Introduction
1.1 Research Context
1.1.1 IBR and Network Telescopes
1.1.2 Packet Classification
1.1.3 General Processing on Graphics Processing Units
1.2 Problem Statement
1.3 Research Overview
1.3.1 Research Scope
1.3.2 Research Methodology
1.3.3 Research Goals
1.4 Document Overview

II Background

2 General Processing on Graphics Processing Units
2.1 Introduction to GPGPU
2.1.1 Brief History of GPGPU
2.1.2 Compute Unified Device Architecture (CUDA)
2.1.3 Benefits and Drawbacks
2.2 CUDA Micro-architectures
2.2.1 Tesla (CC 1.x)
2.2.2 Fermi (CC 2.x)
2.2.3 Kepler (CC 3.x)
2.2.4 Maxwell (CC 5.x)
2.2.5 Future Architectures
2.3 CUDA Programming Model
2.3.1 CUDA Kernels and Functions
2.3.2 Expressing Parallelism
2.3.3 Thread Warps
2.3.4 Thread Block Configurations
2.3.5 Preparing and Launching Kernels
2.4 Kepler Multiprocessor
2.5 Global Memory
2.5.1 Coalescing and Caches
2.5.2 Texture References and Objects
2.5.3 CUDA Arrays
2.5.4 Read-Only Cache
2.5.5 Performance Comparison
2.6 Other Memory Regions
2.6.1 Registers
2.6.2 Constant Memory
2.6.3 Shared Memory
2.7 Inter-warp Communication
2.7.1 Warp Vote Functions
2.7.2 Warp Shuffle Functions
2.8 Performance Considerations
2.8.1 Transferring Data
2.8.2 Streams and Concurrency
2.8.3 Occupancy
2.8.4 Instruction Throughput
2.8.5 Device Memory Allocation
2.9 Summary

3 Packets and Captures
3.1 Packets
3.1.1 Protocol Headers
3.1.2 Packet Size and Fragmentation
3.2 The TCP/IP Model
3.2.1 Link Layer
3.2.2 Internet Layer
3.2.3 Transport Layer
3.2.4 Application Layer
3.2.5 The OSI Model
3.3 Pcap Capture Format
3.3.1 Global Header
3.3.2 Record Header
3.3.3 Format Limitations
3.3.4 Storage Considerations
3.3.5 Indexing
3.4 PcapNg Format
3.4.1 Blocks
3.4.2 Mandatory Blocks
3.4.3 Optional Blocks
3.4.4 Comparison to Pcap Format
3.5 Libpcap and WinPcap
3.5.1 Interfacing with Capture Files
3.5.2 WinPcap Performance
3.6 Wireshark
3.6.1 Capture Analysis Process
3.6.2 Scaling to Large Captures
3.7 Summary

4 Packet Classification
4.1 Classification Overview
4.1.1 Process
4.1.2 Algorithm Types
4.2 Filtering Algorithms
4.2.1 BPF
4.2.2 Mach Packet Filter
4.2.3 Pathfinder
4.2.4 Dynamic Packet Filter
4.2.5 BPF+
4.2.6 Recent Algorithms
4.3 Routing Algorithms
4.3.1 Approaches
4.3.2 Trie Algorithms
4.3.3 Cutting Algorithms
4.3.4 Bit-Vectors
4.3.5 Crossproducting
4.3.6 Parallel Packet Classification (P²C)
4.4 GPF
4.4.1 Approach
4.4.2 Filter Grammar
4.4.3 Classifying Packets
4.4.4 Architecture Limitations
4.5 Related GPGPU Research
4.6 Summary

III Implementation

5