Performance Benchmarking and Tuning for Container Networking on Arm

Trevor Tao, Jianlin Lv, Jingzhao Ni, Song Zhu Sep/2020

Agenda

• Introduction
• Container Networking Interfaces (CNIs) on Arm
• Benchmarking metrics, environment and tools
• Benchmarking results
• Initial performance analysis with tools
• Future work (provisional)

Introduction

Introduction

What is CNI?
• CNI (Container Network Interface), a Cloud Native Computing Foundation project, consists of a specification and libraries for writing plugins to configure network interfaces in containers, along with a number of supported plugins.
• CNI concerns itself only with the network connectivity of containers and with removing allocated resources when the container is deleted.
• CNI has a wide range of support, and the specification is simple to implement, although the implementations themselves and their extensions are not necessarily simple.
• CNI plugins are the de facto Kubernetes networking support.
• We need to know how they perform on the Arm platform.

Kubernetes Networking Model
• Kubernetes makes opinionated choices about how Pods are networked:
  - All Pods can communicate with all other Pods without using network address translation (NAT).
  - All Nodes can communicate with all Pods without NAT.
  - The IP that a Pod sees itself as is the same IP that others see it as.
• Networking objects:
  - Container-to-Container networking
  - Pod-to-Pod networking
  - Pod-to-Service networking
  - Internet-to-Service networking
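For illustration only (not from the original slides): a CNI plugin is driven by a JSON network configuration placed on each node. Below is a minimal sketch using the reference "bridge" plugin; the file name, network name and subnet are hypothetical placeholders.

# Example CNI network config for the reference "bridge" plugin
# (illustrative only; real CNIs such as Calico/Cilium install their own configs)
sudo mkdir -p /etc/cni/net.d
sudo tee /etc/cni/net.d/10-mynet.conf <<'EOF'
{
  "cniVersion": "0.3.1",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16",
    "routes": [ { "dst": "0.0.0.0/0" } ]
  }
}
EOF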

Container Networking Interfaces (CNIs) on Arm

High Performance CNIs available for Arm Edge Stack

CNIs now available in the Akraino IEC Arm edge stack, as a reference:

IEC Arm Edge Stack

Calico Cilium Contiv-VPP OVN-K8s SRIOV Flannel

Calico
• Pure IP networking fabric
• High-level network policy management by iptables
• Rather good performance with both direct (non-overlay) and overlay (IPinIP, VxLAN) network access between hosts
• Good scalability
• Calico-VPP is appearing too

Cilium
• Linux-Native, API-Aware Networking and Security for Containers
• Linux eBPF based network policy, load balancing and security, which is believed to deliver incredible performance

Contiv-VPP
• Uses FD.io VPP to provide network connectivity between Pods
• Native DPDK interface support for the physical NIC
• Native VPP-based ACL/NAT network policy and connection
• Good performance but with rather complex configuration
• Hard to debug

OVN-K8s
• OVS/OVN-controller based K8s networking solution
• High performance with a direct Linux kernel eth driver or DPDK PMD driver
• Good scalability inherited from OVS
• Uses OVN logical switches/routers to connect Pods, and L3 networking for outside access
• No OVS-DPDK support now

SRIOV
• Direct physical interfaces (PF/VFs) support for Pods
• Usually co-works with other CNIs, such as Flannel or Calico, via Multus or another glue CNI
• Needs a resource description or annotation when configuring the CNI and Pod setup

Flannel
• Widely used and almost the easiest deployment for a simple K8s networking
• Linux network bridge for pod connection and overlay-based communication for inter-host access
• Easy to be integrated into other container networking solutions, e.g., Cilium
• No good network policy support
• Easy deployment

Repo: https://gerrit.akraino.org/r/admin/repos/iec
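As a side note (not on the original slide), each of these CNIs is typically installed from a manifest or Helm chart; a rough sketch of checking which CNI and daemons a cluster node is running, assuming a standard kubeadm-style setup:

# CNI configuration files dropped on the node by the installed plugin
ls /etc/cni/net.d/
# CNI daemonsets/pods running in the cluster (names vary per CNI)
kubectl -n kube-system get daemonsets,pods -o wide | grep -Ei 'calico|cilium|flannel|sriov|ovn|contiv'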

CNI Networking Models: Flannel and Cilium

Diagrams referenced and modified from web sources.
Flannel: Backends IPIP, VXLAN; tested version v0.11.0
Cilium: Backends VXLAN, Direct Routing (not tested now); tested version: master branch compiled at 2020-09-09

CNI Networking Models

Calico | Kubernetes Service Implementation

Quote from web source

Tested version: v3.13.2

Benchmarking metrics, environment and tools

Benchmarking metrics, topology and tools

Benchmarking Metrics
• Protocols: TCP, UDP, HTTP(S)
• TCP, UDP metrics: bandwidth in Mbits/sec or Gbits/sec, round-trip delay in ms
• HTTP(S) metrics: bandwidth in Mbits/sec or Gbits/sec, CPS (Connections per Second), RPS (Requests per Second)

Tools
• IPerf, WRK

Network connection
• 10Gbps connection: Ethernet Controller XXV710 → 82599ES 10-Gigabit SFI/SFP+ Network Connection (10fb)

Server Platform
• Architecture: aarch64; Byte Order: Little Endian
• CPU(s): xxx; On-line CPU(s) list: 0-xxx; Thread(s) per core: 4
• CPU max MHz: 2500.0000; CPU min MHz: 1000.0000; BogoMIPS: 400.00
• NUMA node0 CPU(s): 0-xxx; NUMA node1 CPU(s): xxx-yyy
• L1d cache: xxK; L1i cache: xxK; L2 cache: xxxK; L3 cache: xxxxK
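The platform summary above is the kind of output produced by standard Linux tools; a sketch of how to collect it (the interface name eth0 is a placeholder):

# CPU topology, frequencies, NUMA layout and cache sizes
lscpu
# Locate the 10G NICs (XXV710 / 82599ES) and their PCI IDs
lspci -nn | grep -i ethernet
# Link speed and MTU of the test interface
ethtool eth0 | grep -i speed
ip link show eth0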

Benchmarking metrics, environment and tools

IPerf (v2) test topology | Wrk (HTTP performance) test topology

Nginx

IPerf test commands:
• Client: iperf -c ${SERVER_IP} -t ${time} -i 1 -w 100K -P 4
• Server: iperf -s

Wrk test command:
• wrk -t12 -c1000 -d30s http://$IP/files/$file
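When both endpoints are pods, the same commands are typically wrapped in kubectl exec; a sketch assuming hypothetical pod names iperf-server, iperf-client, nginx-server and wrk-client:

# Start the iperf server inside one pod and look up its pod IP
kubectl exec iperf-server -- iperf -s &
SERVER_IP=$(kubectl get pod iperf-server -o jsonpath='{.status.podIP}')
# Drive TCP traffic from the client pod with the same options as above
kubectl exec iperf-client -- iperf -c ${SERVER_IP} -t 60 -i 1 -w 100K -P 4
# HTTP test from a wrk pod against the nginx pod (file path is illustrative)
NGINX_IP=$(kubectl get pod nginx-server -o jsonpath='{.status.podIP}')
kubectl exec wrk-client -- wrk -t12 -c1000 -d30s http://${NGINX_IP}/files/10MB.bin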

Benchmarking Results

Benchmarking Results of TCP Throughput for CNIs with Different Backends

[Chart] Node to Pod TCP performance for IPIP (Calico), IPIP (Flannel), VXLAN (Flannel), VXLAN (Cilium) and Direct Routing (no tunnel, Calico); inter-host 10Gb/s Ethernet connection; BW (Gbps) vs MTU size (1500 to 9000 bytes)
[Chart] Pod to Pod TCP performance for the same backends and connection
Legend: Calico IPIP Tunnel, Flannel IPIP Tunnel, Flannel VXLAN Tunnel, Cilium VXLAN Tunnel, Calico Direct Routing / Native Routing (no tunnel)

Observations for TCP performance over CNIs:
• The performance gap between the CNIs is not so explicit when an overlay tunnel is used.
• Calico and Flannel show a little bit better performance than Cilium for most MTUs here.
• With the IPIP/VXLAN overlay tunnel enabled, the larger the MTU size, the better the throughput (BW).
• With direct routing (here by Calico; Cilium also supports this mode), the throughput is not significantly affected by the MTU size.
• The performance of direct routing (here by Calico; Cilium also supports this mode) is better than with IPIP enabled.
• The IPIP tunnel is a little better than the VXLAN tunnel.
• In general, the node-to-pod TCP performance is better than pod-to-pod, which flows through one more step (a veth connection to the Linux kernel).

Finally, comparing the different scenarios, it shows that the IPIP/VXLAN overlay tunnels, which are implemented in the Linux kernel, are the key factor affecting the performance of CNIs on Arm.

Question: Why is the node-to-pod performance no better than the pod-to-pod case for Cilium?
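The deck does not show how the MTU was varied; one hedged way to check and adjust it, assuming a manifest-installed Calico and an underlay NIC named eth0:

# Underlay NIC MTU on each node (must cover the pod MTU plus tunnel overhead)
sudo ip link set dev eth0 mtu 9000
# For manifest-installed Calico, the veth/tunnel MTU is set in the calico-config ConfigMap
kubectl -n kube-system get configmap calico-config -o yaml | grep -i mtu
# Verify the MTU actually seen inside a pod ($POD is a placeholder pod name)
kubectl exec $POD -- ip link show eth0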

HTTP Performance Benchmarking for Calico CNI

Pod2Pod HTTP Performance with Calico IPIP Overlay for Cross-Host Communication
[Chart] BW (MB/s) vs file size accessed by Wrk (600B, 10KB, 100KB, 1MB, 10MB, 100MB, 500MB), for MTUs 1480, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 8980

Initial observations:
• MTU has a rather bigger effect on the performance when accessing large files, but when the accessed file size is small it has little effect.
• The accessed file size is a major factor in the HTTP performance when there is only a small number of parallel threads.
• When the file size is big enough, the performance cannot be improved much even with bigger MTUs.

HTTP Performance Benchmarking for Calico CNI

Pod2Pod HTTP Performance with Calico non-IPIP Overlay for Cross-Host Communication
[Chart] BW (MB/s) vs file size accessed by Wrk (600B to 500MB)

Initial observations:
• Almost the same trend as that of IPIP.
• The file size has a much more significant performance impact than the MTU.
• For file sizes >= 10MB, the MTU has little effect on the final performance.
• The performance is much higher than that of IPIP when the file size >= 100KB (see next page).

MTUs: 1500, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000
Question: Why, for small file sizes, is the performance with smaller MTUs even better than with large MTUs?
Wrk settings: 5 threads, 10 connections
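The nginx test files and the wrk invocation (5 threads, 10 connections) can be reproduced roughly as below; the pod name, document root and pod IP are assumptions:

# Create fixed-size files under the nginx document root in the server pod
kubectl exec nginx-server -- mkdir -p /usr/share/nginx/html/files
for SIZE in 600 10K 100K 1M 10M 100M 500M; do
  kubectl exec nginx-server -- sh -c \
    "head -c ${SIZE} /dev/urandom > /usr/share/nginx/html/files/${SIZE}.bin"
done
# Fetch one of the files with the wrk settings used for these charts
wrk -t5 -c10 -d30s http://${POD_IP}/files/10M.bin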

HTTP Performance Benchmarking for Calico CNI

Pod2Pod HTTP Performance with Calico IPIP vs non-IPIP for Cross-Host Communication

[Chart] BW (MB/s) vs file size accessed by Wrk (600B, 10KB, 100KB, 1MB, 10MB, 100MB, 500MB)

Initial observations:
• For file sizes >= 10MB, the MTU has little effect on the final performance.
• The performance is much higher than that of IPIP when the file size >= 100KB.
• When the MTU is small, the performance gap between IPIP and non-IPIP is higher.

Legend: IPIP-MTU-1480, non-IPIP-MTU-1500, IPIP-MTU-5000, non-IPIP-MTU-5000, IPIP-MTU-8980, non-IPIP-MTU-9000
Wrk settings: 5 threads, 10 connections
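Switching Calico between the IPIP and non-IPIP (direct routing) modes compared here is normally done on the IP pool; a sketch using calicoctl, assuming the default pool name default-ipv4-ippool:

# Show the current encapsulation mode of the default IPv4 pool
calicoctl get ippool default-ipv4-ippool -o yaml | grep -i ipipMode
# Disable IPIP so cross-host pod traffic is routed directly (non-IPIP)
calicoctl patch ippool default-ipv4-ippool --patch '{"spec":{"ipipMode":"Never"}}'
# Re-enable IPIP encapsulation
calicoctl patch ippool default-ipv4-ippool --patch '{"spec":{"ipipMode":"Always"}}'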

HTTP Performance Benchmarking for Calico CNI

Host2Pod vs Host2Service HTTP Performance with Calico IPIP and non-IPIP for Cross-Host Communication
[Chart] BW (MB/s) vs file size accessed by Wrk (600B, 10KB, 100KB, 1MB, 10MB, 100MB, 500MB)
Legend: IPIP-MTU1480-Host2Svc, non-IPIP-MTU1500-Host2Svc, IPIP-MTU1480-Host2Pod, non-IPIP-MTU1500-Host2Pod, IPIP-MTU3000-Host2Svc, non-IPIP-MTU3000-Host2Svc, IPIP-MTU3000-Host2Pod, non-IPIP-MTU3000-Host2Pod

Observations:
• The performance gap is minor when accessing small files.
• For small file sizes, the host2pod and host2service performance is almost the same, which means the service access (via iptables rules configured by kube-proxy) is not the bottleneck for the HTTP service (see the sketch below).
• The performance of non-IPIP is much higher than that of IPIP when the file size >= 100KB.
• For large MTUs and large file sizes, the host2pod performance is better than host2svc.
• For non-IPIP, the performance gap between different MTUs is not so explicit, so it is believed that IPIP is actually the bottleneck, the same as before.
Wrk settings: 5 threads, 10 connections
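A hedged sketch of inspecting the host2service path through kube-proxy's iptables rules, assuming a service named nginx-svc:

# The ClusterIP that the host2svc tests hit
kubectl get svc nginx-svc
# kube-proxy's NAT rules that translate the ClusterIP to pod endpoints
sudo iptables -t nat -L KUBE-SERVICES -n | grep -i nginx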

HTTP Performance Benchmarking for CNIs with various backends

Pod2Pod HTTP Performance of CNIs for inter-host communication
[Chart] BW (MB/s) vs file size accessed by Wrk (600B, 10KB, 100KB, 1MB, 10MB, 100MB, 500MB)
Legend: Calico IPIP-MTU-1480/5000/8980, Calico non-IPIP-MTU-1500/5000/9000, Cilium VXLAN MTU 1500/5000/9000, Flannel VXLAN MTU 1450/4950/8950, Flannel IPIP MTU 1480/4980/8980

Initial observations:
• For file sizes >= 10MB, the MTU has little effect on the final performance for the different CNIs.
• When the file size is small, the different CNIs show little performance gap.
• When the file size is large (>= 100KB), Calico and Cilium perform much better than Flannel, especially for large MTUs.
• The performance is much higher than that of IPIP when the file size >= 100KB.
• When the MTU is small, the performance gap between overlay and non-overlay is higher.

HTTP Performance Benchmarking for CNIs with various backends

Host2Service HTTP Performance for CNIs for Cross-Host Communication
[Chart] BW (MB/s) vs file size accessed by Wrk (600B to 500MB); same legend as above

Observations:
• For the 3 CNIs, the performance gap is minor when accessing small files.
• As before, the direct routing (no tunnel) mode shows the best performance compared with any of the overlay based approaches.
• For file sizes >= 100KB, Calico explicitly shows the best performance over the other 2 CNIs.
• Flannel shows the worst host2service performance of the 3 CNIs, even with large MTUs, for either the IPIP tunnel or the VXLAN tunnel.
• For large MTUs and large file sizes, Cilium shows similar performance to the Calico CNI.
• For non-IPIP, the performance gap between different MTUs is not so explicit, so it is believed that the tunnel communication is actually the bottleneck, the same as before.

Initial Performance Analysis with perf tools

Initial Performance Analysis with perf tools

Possible performance analysis tools:
• Perf
• DTrace

The flame graphs are generated with the following commands:
• #perf record -e cpu-clock -F 1000 -a -g -C 2 -- sleep 20
• #perf script | FlameGraph/stackcollapse-perf.pl > out.perf-folded
• #cat out.perf-folded | FlameGraph/flamegraph.pl > perf-kernel.svg

The FlameGraph script package is obtained by:
• git clone https://github.com/brendangregg/FlameGraph.git
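Since perf record -C 2 samples CPU 2 only, it helps to pin the workload under test to that CPU; a small addition (not in the original commands):

# Pin the iperf server to CPU 2 so the samples above capture its stacks
taskset -c 2 iperf -s &
# Optional: a quick counter-level view of the same CPU while traffic runs
perf stat -C 2 -- sleep 20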

Issues: Performance Analysis for IPerf with Flame Graph
Two flame graphs for the performance test, with and without the IPIP tunnel:

IPIP Tunnel | No tunnel

Summary and Future Work

Brief Summary

With the performance tests for CNIs on the arm64 platform, our initial findings are:
• All 3 CNIs (Calico, Cilium, Flannel) utilize the Linux kernel overlay tunnel implementation to enable their cross-host pod and service communication.
• The TCP throughput performance gap between the CNIs is not so explicit when an overlay tunnel is used.
• For TCP throughput, Calico and Flannel show a little bit better performance than Cilium for most MTUs here.
• With the IPIP/VXLAN overlay tunnel enabled, the larger the MTU size, the better the throughput (BW).
• The overlay tunnel approaches (IPIP, VXLAN) significantly affect both TCP and HTTP performance compared with direct routing.
• For HTTP performance, Calico and Cilium show much better performance than the Flannel CNI.

Future Work (Provisional)
• Performance testing for advanced features supported by the CNIs (a rough enablement sketch follows below):
  - Kube-proxy replacement with eBPF for the Cilium CNI
  - Encryption added for pod2pod communication with the Cilium CNI
  - eBPF introduced for the Calico CNI
  - HTTP performance testing with network policies configured (Cilium, Calico)
• Further performance tracing, analysis and optimization for known performance issues
• Performance testing for other CNIs on Arm: Ovn-Kubernetes, Contiv/VPP
• More backend types tested for Cilium, Calico or other CNIs
• Comparison with other platforms (x86_64, …)
• Investigation of the performance differences between CNIs
• …
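For the Cilium kube-proxy replacement and the Calico eBPF items above, the enablement steps are roughly as follows; flag and value names vary by release, so treat this as a sketch rather than the tested configuration:

# Cilium: install with its eBPF-based kube-proxy replacement (Helm values
# shown here are from the 1.8-era charts; the API server address is a placeholder)
helm install cilium cilium/cilium --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=${API_SERVER_IP} --set k8sServicePort=6443
# Calico: enable the eBPF dataplane via Felix configuration
calicoctl patch felixconfiguration default --patch '{"spec":{"bpfEnabled":true}}'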

Thank You   Danke   Merci   谢谢   ありがとう   Gracias   Kiitos   감사합니다   धन्यवाद   شكرًا   ধন্যবাদ   תודה

© 2020 Arm Limited (or its affiliates)