
APUNet: Revitalizing GPU as Packet Processing Accelerator

Younghwan Go, Muhammad Asim Jamshed, YoungGyoun Moon, Changho Hwang, and KyoungSoo Park

School of Electrical Engineering, KAIST

GPU-accelerated Networked Systems

• Execute same/similar operations on each packet in parallel
• High parallelization power
• Large memory bandwidth

[Diagram: CPU processes packets one at a time; GPU processes many packets in parallel]

• Improvements shown in a number of research works
  • PacketShader [SIGCOMM'10], SSLShader [NSDI'11], Kargus [CCS'12], NBA [EuroSys'15], MIDeA [CCS'11], DoubleClick [APSys'12], …

Source of GPU Benefits

• GPU acceleration mainly comes from hiding memory access latency
  • Memory I/O → switch to another thread for continuous execution

[Diagram: two GPU threads interleave via quick context switches; when Thread 1 stalls on a memory load (v = mem[a].val), the GPU switches to Thread 2 while the memory fetch completes in the background]

Memory Access Hiding in CPU vs. GPU

• Re-order CPU code to mask memory access (G-Opt)*
  • Group prefetching, software pipelining (see the sketch below)
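As a rough illustration of group prefetching (my own sketch in C, not G-Opt's actual code; __builtin_prefetch assumes GCC/Clang), the loop issues software prefetches for a whole batch of table entries before consuming any of them, hiding DRAM latency behind useful work:

    /* Hypothetical group-prefetching sketch. Phase 1 prefetches a batch of
     * table entries; phase 2 consumes them, by which time the loads have
     * (hopefully) landed in cache. */
    #include <stddef.h>

    #define BATCH 8

    struct entry { unsigned next_hop; };

    void lookup_batch(const struct entry *table, const unsigned *idx,
                      unsigned *out, size_t n)
    {
        for (size_t base = 0; base + BATCH <= n; base += BATCH) {
            /* Phase 1: issue prefetches for the whole batch. */
            for (int i = 0; i < BATCH; i++)
                __builtin_prefetch(&table[idx[base + i]], 0 /* read */, 1);
            /* Phase 2: dependent accesses now mostly hit the cache. */
            for (int i = 0; i < BATCH; i++)
                out[base + i] = table[idx[base + i]].next_hop;
        }
        /* Tail (n % BATCH entries) handled without prefetching for brevity. */
        for (size_t i = n - n % BATCH; i < n; i++)
            out[i] = table[idx[i]].next_hop;
    }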

Questions:
Can CPU code optimization be generalized to all network applications?
Which processor is more beneficial for packet processing?

*Borrowed from the G-Opt slides: Raising the Bar for Using GPUs in Software Packet Processing [NSDI'15], Anuj Kalia, Dong Zhou, Michael Kaminsky, and David G. Andersen

Contributions

• Demystify processor-level effectiveness on packet processing algorithms
  • CPU optimization benefits light-weight memory-bound workloads
  • CPU optimization often does not help large-memory workloads
  • GPU is more beneficial for compute-bound workloads
  • GPU's data transfer overhead is the main bottleneck, not its capacity

• Packet processing system with an integrated GPU, free of DMA overhead
  • Addresses GPU kernel setup / data sync overhead and memory contention
  • Up to 4x performance over CPU-only approaches!

Discrete GPU

• Peripheral device communicating with the CPU over PCIe lanes

[Diagram: host (CPU + DRAM) connects over PCIe lanes to the GPU: N streaming multiprocessors (SMs, grouped into graphics processing clusters), each with scheduler, registers, instruction cache, L1 cache, and shared memory, backed by an L2 cache and GDDR device memory. High computation power, high memory bandwidth, fast instruction/data access, fast context switch, but requires CPU-GPU DMA transfer!]

Integrated GPU

• Place GPU on the same die as the CPU → share DRAM
  • AMD Accelerated Processing Unit (APU), Intel HD Graphics

[Diagram: APU with CPU and GPU on one die, sharing DRAM through a unified northbridge (no DMA transfer!). The GPU has N compute units, each with scheduler, registers, and L1 cache, plus a graphics northbridge and L2 cache. High computation power, fast instruction/data access, fast context switch, low power & cost]

CPU vs. GPU: Cost Efficiency Analysis

• Performance-per-dollar on 8 popular packet processing algorithms
  • Memory- or compute-intensive
  • IPv4, IPv6, Aho-Corasick pattern matching, ChaCha20, Poly1305, SHA-1, SHA-2, RSA
• Test configurations
  • CPU baseline, G-Opt (optimized CPU), dGPU w/ copy, dGPU w/o copy, iGPU

CPU / Discrete GPU platform:
  CPU: Intel Xeon E5-2650 v2 (8 cores @ 2.6 GHz)
  GPU: NVIDIA GTX980 (2048 cores @ 1.2 GHz)
  RAM: 64 GB (DIMM DDR3 @ 1333 MHz)
  Cost: CPU $1143.9, dGPU $840

APU / Integrated GPU platform:
  CPU: AMD RX-421BD (4 cores @ 3.4 GHz)
  GPU: AMD R7 Graphics (512 cores @ 800 MHz)
  RAM: 16 GB (DIMM DDR3 @ 2133 MHz)
  Cost: APU (iGPU) $67.5

Cost Effectiveness of CPU-based Optimization

• G-Opt helps memory-intensive, but not compute-intensive, algorithms
  • Computation capacity becomes the bottleneck as computation grows

[Charts: normalized performance per dollar for IPv6 table lookup (CPU, G-Opt), AC pattern matching (CPU, G-Opt, dGPU w/ copy), and SHA-2 (CPU, G-Opt); detailed analysis of CPU-based optimization in the paper]

Cost Effectiveness of Discrete/Integrated GPUs

• Discrete GPU suffers from DMA transfer overhead
• Integrated GPU is the most cost-efficient!

[Charts: normalized performance per dollar — IPv4 table lookup: G-Opt 1.00, dGPU w/ copy 1.08, dGPU w/o copy 3.08; ChaCha20: G-Opt 1.0, dGPU w/o copy 4.8, iGPU 17.0; 2048-bit RSA decryption: G-Opt 1.0, dGPU w/o copy 4.3, iGPU 14.4]

Our approach: use the integrated GPU to accelerate packet processing!

Contents

• Introduction and motivation
• Background on GPU
• CPU vs. GPU: cost efficiency analysis
• Research challenges
• APUNet design
• Evaluation
• Conclusion

Research Challenges

• Frequent GPU kernel setup overhead
  • Redundant cycle: set input → launch kernel → teardown kernel → retrieve result
  • Fully exposed once the DMA transfer overhead is gone!

• High data synchronization overhead
  • CPU-GPU cache coherency requires explicit sync (bottleneck!)
• More contention on shared DRAM
  • Reduced effective memory bandwidth (up to 10x lower)

APUNet: a high-performance APU-accelerated network packet processor

Persistent Thread Execution Architecture

• Persistently run GPU threads without kernel teardown
• Master passes packet pointer addresses to GPU threads (see the sketch after the diagram)

[Diagram: workers perform packet I/O from the NIC into a packet pool in shared virtual memory (SVM); the master places packet pointer addresses into a pointer array that persistent GPU threads (Thread 0..3) poll]
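A minimal sketch of a persistent kernel (hypothetical OpenCL C; process_packet, the exit flag, and the integer-encoded pointer array are illustrative assumptions, not APUNet's actual code):

    // Hypothetical persistent-thread sketch. The kernel is launched once;
    // each thread then loops forever, polling its slot in an SVM-resident
    // pointer array, so no per-batch kernel launch/teardown is needed.
    void process_packet(__global uchar *pkt) { pkt[0] ^= 1; }  // stub app logic

    __kernel void persistent_worker(__global volatile ulong *pkt_ptrs,
                                    __global volatile int *exit_flag)
    {
        int tid = get_global_id(0);
        while (!*exit_flag) {                        // master sets this to stop
            ulong p = pkt_ptrs[tid];                 // SVM pointer stored as integer
            if (p != 0) {                            // master posted a packet
                process_packet((__global uchar *)p);
                pkt_ptrs[tid] = 0;                   // signal completion to master
            }
        }
    }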

Data Synchronization Overhead

• Synchronization point for GPU threads: L2 cache
• Requires explicit synchronization to main memory

[Diagram: GPU thread results sit in the GPU L2 cache behind the graphics northbridge; the CPU-side master cannot see an update without an explicit sync, and the northbridge can process only one such request at a time]

Solution: Group Synchronization

• Implicitly synchronize the packet memory processed by a group of GPU threads (sketched below)
• Exploit the LRU cache replacement policy

[Diagram: the master verifies correctness on the CPU side while GPU thread groups (e.g., 32 threads each) poll, process packets, issue dummy memory I/O to evict processed packet data from the L2 cache via LRU, and wait on a barrier. For more details and tuning/optimizations, please refer to the paper]
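A heavily simplified sketch of the idea (hypothetical OpenCL C, extending the persistent loop above; group size, dummy-buffer sizing, and the completion flag are all assumptions):

    // Hypothetical group-synchronization sketch. After processing, each thread
    // streams through a dummy buffer so that LRU replacement evicts the freshly
    // written packet data from L2, implicitly pushing it to DRAM where the CPU
    // master can see it without an explicit sync request per packet.
    #define DUMMY_WORDS 64             // sized to displace the group's L2 footprint

    void process_packet(__global uchar *pkt) { pkt[0] ^= 1; }  // stub app logic

    __kernel void group_sync_worker(__global volatile ulong *pkt_ptrs,
                                    __global const uint *dummy,
                                    __global uint *sink,
                                    __global volatile int *group_done)
    {
        int tid = get_global_id(0);
        if (pkt_ptrs[tid] != 0)
            process_packet((__global uchar *)pkt_ptrs[tid]);

        // Dummy memory I/O: reads that evict the packet's cache lines (LRU).
        uint acc = 0;
        for (int i = 0; i < DUMMY_WORDS; i++)
            acc += dummy[tid * DUMMY_WORDS + i];
        sink[tid] = acc;               // keep the compiler from dropping the reads

        barrier(CLK_GLOBAL_MEM_FENCE); // group-wide synchronization point
        if (get_local_id(0) == 0)      // one thread flags the group as done
            group_done[get_group_id(0)] = 1;
    }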

Zero-copy Based Packet Processing

Option 1. Traditional method with discrete GPU: NIC → CPU (standard memory, e.g., mmap) → copy → GPU (GDDR, e.g., cudaMalloc). High overhead!

Option 2. Zero-copy between CPU and GPU: NIC (standard memory, e.g., mmap) → copy → CPU/GPU (shared memory, e.g., clSVMAlloc). High overhead!

• Our approach: integrate memory allocation for NIC, CPU, and GPU

NIC, CPU, and GPU all use shared memory (e.g., clSVMAlloc) → no copy overhead! (see the sketch below)
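A host-side sketch of the shared allocation (C with OpenCL 2.0; the pool size is arbitrary, error handling is omitted, and creating the context and registering the region with the NIC driver are assumed to happen elsewhere):

    // Hypothetical sketch: carve NIC RX buffers, CPU metadata, and GPU work
    // areas out of one fine-grained SVM region so no CPU-GPU copy is needed.
    #include <CL/cl.h>

    #define PKT_POOL_SIZE (64 * 1024 * 1024)   // 64 MB packet pool (arbitrary)

    void *alloc_shared_pool(cl_context ctx)
    {
        void *pool = clSVMAlloc(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                PKT_POOL_SIZE, 0 /* default alignment */);
        // The same virtual addresses are now valid on both CPU and GPU: pass
        // the pool to a kernel via clSetKernelArgSVMPointer() and hand
        // sub-buffers to the NIC driver so packets are DMA'd straight into
        // memory the GPU threads can read.
        return pool;
    }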

Evaluation

APUNet (AMD Carrizo APU):
  CPU: RX-421BD (4 cores @ 3.4 GHz)
  GPU: R7 Graphics (512 cores @ 800 MHz)
  RAM: 16 GB DRAM
NIC: 40 Gbps Mellanox ConnectX-4
Client (packet/flow generator):
  CPU: Xeon E3-1285 v4 (8 cores @ 3.5 GHz)
  RAM: 32 GB DRAM

• How well does APUNet reduce latency and improve throughput?
• How practical is APUNet in real-world network applications?

Benefits of APUNet Design

• Workload: IPsec (128-bit AES-CBC + HMAC-SHA1)

[Charts: (1) packet processing latency (us) vs. packet size (64-1451 bytes) for GPU-Copy, GPU-ZC, and GPU-ZC-PERSIST, with annotated gains of 5.4x and 1.5x; (2) synchronization throughput with 64B packets: 0.93 Gbps with atomics vs. 5.31 Gbps with group sync, a 5.7x improvement]

Real-world Network Applications

• 5 real-world network applications
  • IPv4/IPv6 packet forwarding, IPsec gateway, SSL proxy, network IDS

[Charts: IPsec gateway throughput (Gbps) for CPU baseline / G-Opt / APUNet — 2.6 / 2.8 / 5.3 at 64B packets and 7.7 / 8.2 / 16.4 at 1451B (~2x); SSL proxy HTTP transactions/sec — 1540 / 1801 / 4241 at 256 concurrent connections and 1539 / 1791 / 3583 at 8192 (~2.75x)]

Real-world Network Applications

• Snort-based network IDS
  • Aho-Corasick pattern matching
• No benefit from CPU optimization!
  • Accesses many data structures
  • Evicts already-cached data
• DFC* outperforms AC-APUNet
  • CPU-based algorithm
  • Cache-friendly & reduces memory accesses

[Chart: network IDS throughput (Gbps) for CPU baseline, G-Opt, APUNet, and DFC at 64B and 1514B packets; APUNet delivers up to a 4x gain over the CPU baseline, and DFC performs best, with top bars reaching 9.7 and 10.4 Gbps]

*DFC: Accelerating String Pattern Matching for Network Applications [NSDI'16], Byungkwon Choi, Jongwook Chae, Muhammad Jamshed, KyoungSoo Park, and Dongsu Han

Conclusion

• Re-examine the efficacy of GPU-based packet processors
  • GPU is bottlenecked by PCIe data transfer overhead
  • Integrated GPU is the most cost-effective processor
• APUNet: APU-accelerated networked system
  • Persistent thread execution: eliminates kernel setup overhead
  • Group synchronization: minimizes data synchronization overhead
  • Zero-copy packet processing: reduces memory contention
  • Up to 4x performance improvement over CPU baseline & G-Opt

APUNet: a high-performance, cost-effective platform for real-world network applications

Thank you.

Q & A