Replacing iptables with eBPF in Kubernetes with Cilium: Cilium, eBPF, Envoy, Istio, Hubble
Michal Rostecki, Software Engineer, [email protected]
Swaminathan Vasudevan, Software Engineer, [email protected]

What’s wrong with iptables?

What’s wrong with legacy iptables?
iptables runs into a couple of significant problems:
● Updates must be made by recreating and updating all rules in a single transaction.
● Chains of rules are implemented as a linked list, so all operations are O(n).
● The standard practice for access control lists (ACLs) in iptables is a sequential list of rules.
● Matching is based on IPs and ports; iptables is not aware of L7 protocols.
● Every time there is a new IP or port to match, rules need to be added and the chain changed.
● Resource consumption on Kubernetes is high.
Because of these issues, performance degrades under heavy traffic or in a system with a large number of changes to iptables rules. Measurements show unpredictable latency and reduced performance as the number of services grows.

Kubernetes uses iptables for...
● kube-proxy, the component which implements Services and load balancing by DNAT iptables rules
● most CNI plugins, which use iptables for Network Policies

And it ends up like that
(Figure: a wall of generated iptables rules)

What is BPF?

Linux Network Stack
● The Linux kernel network stack is split into multiple abstraction layers.
● Linux has provided strong userspace API compatibility for years.
● This shows how complex the kernel is and how many years of evolution it carries; it cannot be replaced in the short term.
● It is very hard to bypass the layers.
● The netfilter module has been supported in Linux for more than two decades; packet filtering is applied to packets as they move up and down the stack.
(Diagram: processes above the system call interface; below it sockets, TCP/UDP/raw, netfilter between the socket and IPv4/IPv6 layers, ethernet, traffic shaping, and netdevices/drivers down to hardware, bridges, and OVS)
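The O(n) chain problem can be made concrete with a small sketch. The following Python fragment is purely illustrative (the rule set and function names are made up; this is not how iptables or BPF are invoked): it contrasts walking a sequential rule chain with the single hash-map lookup that BPF maps provide.

```python
# Illustrative sketch: a sequential rule chain is O(n), while a
# hash-based lookup (as used by BPF maps) is O(1) on average.

def chain_lookup(rules, packet):
    """Walk rules in order, like an iptables chain (O(n))."""
    for i, (ip, port, verdict) in enumerate(rules):
        if packet == (ip, port):
            return verdict, i + 1  # verdict plus number of rules inspected
    return "DROP", len(rules)

def map_lookup(rule_map, packet):
    """Single hash lookup, like a BPF hash map (O(1) average)."""
    return rule_map.get(packet, "DROP")

rules = [("10.0.0.%d" % i, 80, "ACCEPT") for i in range(1000)]
rule_map = {(ip, port): verdict for ip, port, verdict in rules}

packet = ("10.0.0.999", 80)
print(chain_lookup(rules, packet))   # accepted, after inspecting all 1000 rules
print(map_lookup(rule_map, packet))  # accepted, after a single lookup
```

The difference is invisible with a handful of rules, but a large cluster generates tens of thousands of iptables rules, which is exactly where the linear walk hurts.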
BPF kernel hooks
(Diagram: BPF attach points throughout the stack: BPF system calls, BPF sockmap and sockops at the socket layer, BPF cgroup hooks, BPF TC hooks at traffic shaping, and BPF XDP at the network device driver)

(Figure: packet-processing throughput in Mpps at the different hook points)

BPF replaces iptables
(Diagram: the netfilter path for local processes, with PREROUTING, INPUT, FORWARD, OUTPUT, and POSTROUTING hooks and FILTER/NAT stages around routing decisions, compared with eBPF code attached at the TC and XDP hooks of the physical or virtual netdevices)

BPF based filtering architecture
(Diagram: TC/XDP ingress and TC egress hooks alongside the netfilter hooks; connection tracking updates and stores sessions; packets are labelled and run through ingress, forward, and output chain selectors for local and remote sources and destinations)

BPF based tail calls
LBVS is implemented with a chain of eBPF programs connected through tail calls:
● Each eBPF program is injected only if there are rules operating on that header field.
● Each eBPF program can exploit a different matching algorithm (e.g., exact match, longest prefix match).
● A per-CPU array shared across the entire program chain carries packet header offsets and the temporary bit-vector result between programs.
● Header parsing is done once and its results are kept in a shared map for performance reasons.
● The per-field bit-vectors are combined with bitwise AND; the matching rule’s counters are updated and its action (drop/accept) is applied.
(Diagram: program #1 parses headers, then tail-calls per-field programs, e.g., IP.dst and IP.proto lookups, each producing a bit-vector of candidate rules; the final program combines the bit-vectors and applies the action)
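A minimal Python sketch of the bit-vector idea (illustrative only: plain function calls stand in for tail calls, and the rule set is made up). Each "program" matches one header field and returns a bit-vector with one bit per rule; ANDing the vectors yields the rules that match on all fields.

```python
# Sketch of bit-vector packet classification (the LBVS scheme above).
RULES = [
    {"dst": "10.0.0.1", "proto": "tcp", "action": "accept"},  # rule 0
    {"dst": "10.0.0.1", "proto": "udp", "action": "drop"},    # rule 1
    {"dst": None,       "proto": "tcp", "action": "drop"},    # rule 2 (any dst)
]

def match_field(packet, field):
    """One stage of the chain: bit i is set if rule i matches this field."""
    bv = 0
    for i, rule in enumerate(RULES):
        if rule[field] is None or rule[field] == packet[field]:
            bv |= 1 << i
    return bv

def classify(packet):
    # In the datapath each lookup is a separate eBPF program; the
    # intermediate bit-vector travels in a shared per-CPU array.
    bv = match_field(packet, "dst") & match_field(packet, "proto")
    if bv == 0:
        return "drop"  # default action
    first = (bv & -bv).bit_length() - 1  # lowest set bit = first matching rule
    return RULES[first]["action"]

print(classify({"dst": "10.0.0.1", "proto": "tcp"}))  # accept (rule 0)
print(classify({"dst": "10.0.0.2", "proto": "udp"}))  # drop (no rule matches)
```

Note how adding a new field only adds one stage to the chain, instead of multiplying the number of sequential rules.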
BPF goes into...
● Load balancers - katran
● perf
● systemd
● Suricata
● Open vSwitch - AF_XDP
● And many, many others

BPF is used by...
(Figure: logos of projects and companies using BPF)

Cilium

What is Cilium?

CNI Functionality
CNI is a CNCF (Cloud Native Computing Foundation) project for Linux containers. It consists of a specification and libraries for writing plugins. CNI cares only about the network connectivity of containers, through two operations: ADD and DEL.
General container runtime considerations for CNI: the container runtime must
● create a new network namespace for the container before invoking any plugins,
● determine the networks the container belongs to and add the container to each network by calling the corresponding plugin,
● not invoke parallel operations for the same container,
● order ADD and DEL operations for a container, such that ADD is always eventually followed by a corresponding DEL,
● not call ADD twice (without a corresponding DEL) for the same (network name, container id, name of the interface inside the container).
When the CNI ADD call is invoked, the plugin adds the network to the container, creating the veth pair and assigning an IP address from the respective IPAM plugin or using the host scope. When the CNI DEL call is invoked, it removes the container network, releases the IP address to the IPAM manager, and cleans up the veth pair.
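The ADD/DEL contract can be sketched as a toy CNI plugin. This is a minimal sketch of the protocol shape only: network configuration arrives as JSON on stdin, the verb arrives in the CNI_COMMAND environment variable, and the plugin prints a JSON result. The IPAM here is faked with a fixed address, and no namespace or veth work is actually performed.

```python
import json
import os
import sys

def cmd_add(conf):
    """ADD: a real plugin would create the veth pair and attach the
    container netns; here we only return a result with a fake IPAM address."""
    return {
        "cniVersion": conf.get("cniVersion", "0.4.0"),
        "interfaces": [{"name": os.environ.get("CNI_IFNAME", "eth0")}],
        "ips": [{"address": "10.0.0.2/24"}],  # fake IPAM allocation
    }

def cmd_del(conf):
    """DEL: a real plugin would release the IP to IPAM and clean up
    the veth pair; nothing to report back."""
    return {}

def main():
    conf = json.load(sys.stdin)
    handlers = {"ADD": cmd_add, "DEL": cmd_del}
    print(json.dumps(handlers[os.environ["CNI_COMMAND"]](conf)))

# Only act when invoked by a runtime that set the CNI environment.
if __name__ == "__main__" and "CNI_COMMAND" in os.environ:
    main()
```

The runtime guarantees listed above (no parallel calls, ADD eventually paired with DEL) are what let a plugin this simple stay stateless.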
Cilium CNI Plugin control flow
(Diagram: kubectl calls the Kubernetes API server; the kubelet drives CRI-containerd, which invokes cni-add(); the CNI plugin (Cilium) talks to the Cilium agent, which installs BPF programs at the kernel network stack hook and populates BPF maps via bpf() syscalls)

Cilium components, with BPF hook points and BPF maps in the Linux stack
(Diagram: in userspace, the Cilium pod (control plane) runs the Cilium agent daemon, CLI, monitor, health, and operator components alongside application VMs and containers; in the kernel, BPF programs attach at the socket layer (sockmap and sockops via SO_ATTACH_BPF), at TC, and at XDP, and the agent manages per-container BPF programs and maps through bpf() calls such as map create and element lookup)

Cilium as CNI Plugin
(Diagram: a K8s cluster of nodes; in each pod, the container’s eth0 is paired with an lxc interface connected to Cilium networking, which forwards traffic through the node’s eth0)

Networking modes
● Encapsulation: Cilium handles routing between nodes, tunnelling traffic over VXLAN.
● Direct routing: routing is left to the network, e.g., cloud provider routers or a BGP routing daemon.

Pod IP Routing - Overlay Routing (Tunneling mode)
(Diagram)

Pod IP Routing - Direct Routing Mode
(Diagram)

L3 filtering – label based, ingress
(Diagram: pods labelled role=frontend are allowed ingress to the pod labelled role=backend; a pod without that label (IP 10.0.0.5) is denied)

L3 filtering – label based, ingress

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow frontends to access backends"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
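The selector semantics of the policy above can be sketched in a few lines of Python. This is an illustration of the matching logic only, not Cilium code: label sets are modelled as plain dicts, and the policy is a hand-written dict mirroring the YAML.

```python
def selects(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key/value must be present."""
    return all(labels.get(k) == v for k, v in selector.items())

def ingress_allowed(policy: dict, src_labels: dict, dst_labels: dict) -> bool:
    """Traffic is allowed if the destination matches endpointSelector
    and the source matches any fromEndpoints selector."""
    if not selects(policy["endpointSelector"], dst_labels):
        return True  # policy does not apply to this endpoint
    return any(selects(s, src_labels) for s in policy["fromEndpoints"])

policy = {
    "endpointSelector": {"role": "backend"},
    "fromEndpoints": [{"role": "frontend"}],
}

print(ingress_allowed(policy, {"role": "frontend"}, {"role": "backend"}))  # True
print(ingress_allowed(policy, {}, {"role": "backend"}))                    # False
```

In Cilium the same decision happens in the datapath: labels are resolved to numeric security identities once, so the per-packet check is a map lookup rather than a label comparison.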
L3 filtering – CIDR based, egress
(Diagram: the role=backend pod (IP 10.0.0.1) is allowed egress to 10.0.1.1 in cluster A’s subnet 10.0.1.0/24; egress to 10.0.2.1 in subnet 10.0.2.0/24, and to any IP not belonging to 10.0.1.0/24, is denied)

L3 filtering – CIDR based, egress

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow backends to access 10.0.1.0/24"
  endpointSelector:
    matchLabels:
      role: backend
  egress:
  - toCIDR:
    - "10.0.1.0/24"

L4 filtering

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "Allow access to backends only on TCP/80"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"

L4 filtering
(Diagram: traffic to the role=backend pod (IP 10.0.0.1) is allowed on TCP/80; any other port is denied)

L7 filtering – API Aware Security
(Diagram: a client pod (IP 10.0.0.5) may issue GET /articles/{id} against the role=api pod (IP 10.0.0.1), while GET /private is denied)

L7 filtering – API Aware Security

apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "frontend-backend"
spec:
  description: "L7 policy to restrict access to specific HTTP endpoints"
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - toPorts:
    - ports:
      - port: "80"
        protocol: "TCP"
      rules:
        http:
        - method: "GET"
          path: "/article/$"

Standalone proxy, L7 filtering
(Diagram: on each node an Envoy proxy, generating BPF programs for L7 filtering through libcilium.so, sits next to Cilium, which generates BPF programs for L3/L4 filtering; the nodes are connected over VXLAN)

Features

Cluster Mesh
(Diagram: pods in cluster A and cluster B reach each other through the BPF datapath on each node; cluster state is shared through an external etcd)

Istio (transparent sidecar injection) without Cilium
(Diagram: each hop between a service and its sidecar traverses a full socket, TCP/IP, iptables, and ethernet stack over loopback, and again over eth0 between nodes)
Istio with Cilium and sockmap
(Diagram: with the Cilium CNI, the service and its sidecar talk socket-to-socket through sockmap, skipping the pod-internal TCP/IP, iptables, and ethernet layers)

Istio
(Diagram: Pilot, Mixer, and Citadel in the K8s cluster; Services A, B, and C run across nodes on the Cilium networking CNI)

Istio - Mutual TLS
(Diagram: mutual TLS between Service A and Service B across nodes, on the Cilium networking CNI)

Istio - Deferred kTLS
(Diagram: deferred kTLS encryption from Service A towards external services, e.g., GitHub or an external cloud service, on the Cilium networking CNI)

Kubernetes Services
BPF and Cilium vs. iptables and kube-proxy:
● Hash table.
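The comparison above is truncated in the source, but its first point stands on its own: Cilium implements Services as a BPF hash-table lookup, rather than the linear chain of DNAT rules kube-proxy generates. A minimal sketch of a map-based service lookup (the VIP and backend addresses are made-up examples, and a Python dict stands in for the BPF map):

```python
import random

# Sketch: resolving a Service VIP to a backend with one hash-table
# operation, instead of walking per-service iptables DNAT chains.
SERVICES = {
    ("10.96.0.10", 80): ["10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"],
}

def pick_backend(vip: str, port: int):
    backends = SERVICES.get((vip, port))  # single O(1) lookup
    if backends is None:
        return None                       # not a Service VIP: pass through
    return random.choice(backends)        # pick one backend for DNAT

backend = pick_backend("10.96.0.10", 80)
print(backend in SERVICES[("10.96.0.10", 80)])  # True
```

Because the lookup cost does not grow with the number of Services, this is one of the main reasons the BPF datapath scales where the iptables one degrades.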