Container Networking Deep Dive

Ivan Kovačević, Technical Leader

Cisco Webex Teams

Questions? Use Cisco Webex Teams to chat with the speaker after the session.

How:
1. Find this session in the Cisco Events Mobile App
2. Click "Join the Discussion"
3. Install Webex Teams or go directly to the team space
4. Enter messages/questions in the team space

The Box

What is in the box?
• A very heavy technical session
• In-depth info and analysis
• CLI, logs, outputs…
• Reference and refresher slides

What is NOT in the box?
• A basics session
• Basic networking concepts
• Container philosophy / 101

• Kuber-what?! Learn about Kubernetes - DEVNET-1999
• Deploying Kubernetes in the Enterprise with Cisco ACI - BRKACI-2505
• Kubernetes made easy with Cisco Container Platform (CCP) - LABCLD-2099
• Cisco Container Platform for Infrastructure Teams - BRKCLD-2005

Agenda

• Intro
• Docker Networking
• iptables
• Kubernetes Networking
• Kube-proxy
• ClusterIP and NodePort implementation
• Ingress Controller
• CNI
• Calico
• ACI CNI
• Open vSwitch
• OpenFlow
• Contiv-VPP
• Key Takeaways

Changing landscape: From Monolith to Microservices

Bare metal / VMs -> Containers

Containers Refresher

• Lighter form of virtualisation based on (Linux) isolation features: namespaces (pid, network…)

• An abstraction layer is required for VMs, but not for containers

• Apps think they are alone on the server

• Easier app packaging via Container images

(Diagram: Virtual Machine vs. Container)
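The same (Linux) namespace isolation can be demonstrated by hand, without any container runtime; a minimal sketch using iproute2 (the namespace and interface names are arbitrary):

# ip netns add demo                                   # create a new network namespace
# ip netns exec demo ip link show                     # only a loopback interface - isolated from the host
# ip link add veth-host type veth peer name veth-demo
# ip link set veth-demo netns demo                    # move one end of the veth pair into the namespace
# ip netns exec demo ip addr add 10.200.0.2/24 dev veth-demo
# ip netns exec demo ip link set veth-demo up

This is essentially what a container runtime does for every container it starts.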

What does this mean for networking?

• In traditional networking, services used to run on dedicated appliances – this is all now moving inside the nodes

• Running microservices means A LOT of east-west traffic – a lot more action is happening inside the nodes as well

• Linux networking stack and virtual switches are essential

• Nothing is static – containers spin up and down all the time

• The number of endpoints in a container environment is massive

How to deal with all this???

Docker Networking

Docker Network Modes Refresher

• Bridge: the default network driver. Containers are connected to the docker0 Linux bridge. Used by Kubernetes

• Host: containers use the host networking directly (i.e. share the host network namespace)

• None: container not connected, or connectivity is provided by a 3rd party

• Container: borrow connectivity from another container

• MacVLAN: allows you to assign a MAC address to a container, making it appear as a physical device on your network. The Docker daemon routes traffic to containers by their MAC addresses

• Overlay and Third-party Docker network plugins.
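For reference, the mode is selected per network or per container at run time; a minimal sketch (image names, the subnet and the parent interface are placeholders):

$ docker network create --driver bridge mybridge             # user-defined bridge network
$ docker run -d --network mybridge nginx                     # bridge mode
$ docker run -d --network host nginx                         # host mode: shares the host network namespace
$ docker run -d --network none alpine sleep 3600             # none: no connectivity provided by Docker
$ docker run -d --network container:web alpine sleep 3600    # borrow the namespace of container "web"
$ docker network create -d macvlan --subnet 192.168.1.0/24 --gateway 192.168.1.1 -o parent=eth0 mymacvlan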

Basic Docker Networking with Bridge Mode

• By default Docker configures 172.17.0.1/16 on the docker0 Linux bridge.

• When a container is created, the runtime engine connects its virtual Ethernet interface to the docker0 bridge.

• An IP address is assigned to the container, and the docker0 bridge is the container's default gateway.

(Diagram: host eth0 10.100.0.2, docker0 172.17.0.1, container 1 veth0 172.17.0.2)

# docker inspect 925984cfda27 | grep NetworkMode
    "NetworkMode": "default",

Basic Docker Networking with Bridge Mode

• Containers on the same host communicate within the same subnet via the docker0 Linux bridge.

• External communication uses the host TCP/IP stack and Docker automatically configures NAT.

• Containers can also consume host ports to expose services from the container.

(Diagram: container 1 and container 2 with veth0 172.17.0.2 / 172.17.0.3 on docker0 172.17.0.1, host eth0 10.100.0.2)

Bridge mode: what is connected to what?

# docker inspect 925984cfda27 | grep Pid.:
    "Pid": 5079,
# docker inspect 5d1fe8d6c14f | grep Pid.:
    "Pid": 1617,

# nsenter -t 5079 -n ip add show eth0 | grep eth0@
59: eth0@if60: mtu 1500 qdisc noqueue state UP group default
# nsenter -t 1617 -n ip add show eth0 | grep eth0@
53: eth0@if54: mtu 1500 qdisc noqueue state UP group default

# ip add | egrep "60|54"
54: vethba1ecf2@if53: mtu 1500 qdisc noqueue master docker0 state UP group default
60: veth944dfc0@if59: mtu 1500 qdisc noqueue master docker0 state UP group default

# brctl show docker0
bridge name     bridge id           STP enabled     interfaces
docker0         8000.02424bcc9785   no              veth944dfc0
                                                    vethba1ecf2

Containers can also share the same namespace

• It is also possible for two or more containers to share the same net namespace.

• This allows running apps that need to talk to other services via localhost.

# docker inspect 925984cfda27 | grep NetworkMode
    "NetworkMode": "default",
# docker inspect 7357b02dcb40 | grep NetworkMode
    "NetworkMode": "container: 925984cfda27…

Basic Docker Networking with Host mode

• The containerized process has no dedicated network namespace

• Sees the same network interface as a non-containerized process

• Is the same as running the process on the host directly from a networking point of view.

# docker inspect cd786f17acab | grep NetworkMode
    "NetworkMode": "host",

Exposing a port

$ docker run -d -p 8080:8080 containers.cisco.com/ikovacev/hellocl:v1.0
3241ad886ebd6fc348bbc2f64f24281635b7a162900a42370bcf6831a19f51b2

$ curl localhost:8080
{"message":"Hello CiscoLive!","platform":"Linux","release":"4.9.125-linuxkit","hostName":"3241ad886ebd"}

• Docker runtime pushes rules to local host iptables to achieve PAT • Linux kernel netfilter then does all the work in the data-path

iptables

iptables Refresher

• A user-space app to configure the Linux kernel firewall

• Tables: there are four built-in tables; each contains several chains
  • Filter table (default): packet filtering
  • NAT table: network address translation
  • Mangle table: packet manipulation
  • Raw table: configuring exemptions (e.g. from connection tracking)

• Chains are lists of rules
  • Default chains: PREROUTING, INPUT, OUTPUT, FORWARD, POSTROUTING (not all tables have all of them)
  • Chains can be stitched together (i.e. daisy-chained) to enforce complex logic

• Rules: each rule specifies the matching criteria for an IP packet and the action to take
  • Match: source/destination IP, source/destination port, connection state, etc.
  • Actions: ACCEPT, REJECT, DROP, MARK, DNAT, SNAT, LOG, etc.
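As a syntax refresher, a few hand-written rules (the interfaces and subnets are illustrative values, not taken from the lab setup):

# iptables -t filter -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j ACCEPT            # filter: allow new SSH
# iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE                    # nat: SNAT for containers
# iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:8080    # nat: DNAT to a container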

For your iptables flow diagram reference

(Diagram: netfilter/iptables packet flow through the tables and chains)

Displaying and understanding iptables

# iptables -nvL
Chain INPUT (policy ACCEPT 1990K packets, 419M bytes)            # default policy
 pkts bytes target     prot opt in       out      source       destination

Chain FORWARD (policy ACCEPT 161M packets, 93G bytes)            # default chain
 pkts bytes target                    prot opt in       out      source       destination
 161M   93G DOCKER-USER               all  --  *        *        0.0.0.0/0    0.0.0.0/0
 161M   93G DOCKER-ISOLATION-STAGE-1  all  --  *        *        0.0.0.0/0    0.0.0.0/0
71896  175M ACCEPT                    all  --  *        docker0  0.0.0.0/0    0.0.0.0/0    ctstate RELATED,ESTABLISHED
    0     0 DOCKER                    all  --  *        docker0  0.0.0.0/0    0.0.0.0/0
36954 1983K ACCEPT                    all  --  docker0  !docker0 0.0.0.0/0    0.0.0.0/0
    0     0 ACCEPT                    all  --  docker0  docker0  0.0.0.0/0    0.0.0.0/0

Chain OUTPUT (policy ACCEPT 717K packets, 153M bytes)
 pkts bytes target     prot opt in       out      source       destination

Chain DOCKER (1 references)                                      # configured (non-default) chain
 pkts bytes target     prot opt in       out      source       destination

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
 pkts bytes target                    prot opt in       out      source       destination
36954 1983K DOCKER-ISOLATION-STAGE-2  all  --  docker0  !docker0 0.0.0.0/0    0.0.0.0/0
 161M   93G RETURN                    all  --  *        *        0.0.0.0/0    0.0.0.0/0

Chain DOCKER-ISOLATION-STAGE-2 (1 references)
 pkts bytes target     prot opt in       out      source       destination
    0     0 DROP       all  --  *        docker0  0.0.0.0/0    0.0.0.0/0
36954 1983K RETURN     all  --  *        *        0.0.0.0/0    0.0.0.0/0
...

Column anatomy: "pkts"/"bytes" are the packet hits, "target" is the action (or the chain to jump to), and the rest of the line ("prot", "in"/"out", "source"/"destination", extra match modules) is the matching condition.

Understanding what iptables is doing with packets (container -> external)

• Can get quite messy as chains can be referenced and nested -> iptables spaghetti

• A very rudimentary way is to reset the packet counters and see which lines get hit ("Yes, but this is a controlled environment…"):

# iptables -Z
# nsenter -t 5079 -n ping -f -c 1234 10.48.37.151
PING 10.48.37.151 (10.48.37.151) 56(84) bytes of data.
--- 10.48.37.151 ping statistics ---
1234 packets transmitted, 1234 received, 0% packet loss, time 228ms
rtt min/avg/max/mdev = 0.099/0.166/8.104/0.231 ms, ipg/ewma 0.185/0.181 ms

# iptables -nvL | egrep "Chain|pkts|^$|1234"
Chain INPUT (policy ACCEPT 93 packets, 6308 bytes)
 pkts bytes target     prot opt in       out      source       destination

Chain FORWARD (policy ACCEPT 1717 packets, 456K bytes)
 pkts bytes target                    prot opt in       out      source       destination
 1234  104K ACCEPT                    all  --  *        docker0  0.0.0.0/0    0.0.0.0/0    ctstate RELATED,ESTABLISHED
 1234  104K ACCEPT                    all  --  docker0  !docker0 0.0.0.0/0    0.0.0.0/0

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
 pkts bytes target                    prot opt in       out      source       destination
 1234  104K DOCKER-ISOLATION-STAGE-2  all  --  docker0  !docker0 0.0.0.0/0    0.0.0.0/0
…

iptables TRACE to the rescue!

• iptables TRACE lists the chains and rules that were matched for a specific packet

• To enable it we need to:
  • Load the "nf_log_ipv4" kernel module
  • Set the kernel parameter "net.netfilter.nf_log.2=nf_log_ipv4"
  • Set the TRACE iptables rule where we want to start tracing (usually in the raw table)

• Now just start dumping journalctl to a file and generate some traffic

• Here is an example:

# modprobe nf_log_ipv4
# sysctl net.netfilter.nf_log.2=nf_log_ipv4
net.netfilter.nf_log.2 = nf_log_ipv4

# nsenter -t 5079 -n ip add show eth0 | grep inet
inet 172.17.0.3/16 brd 172.17.255.255 scope global eth0
# iptables -t raw -j TRACE -s 172.17.0.3 -p tcp --dport 80 --tcp-flags SYN SYN -I PREROUTING 1
# journalctl -f > trace &
# nsenter -t 5079 -n curl -s http://10.48.37.151/ > /dev/null

For your iptables TRACE log line anatomy reference

type - can be:
• "rule" if a specific rule was applied
• "policy" if no rule was matched and the default policy was applied in a default chain
• "return" if no rule was matched in other chains

Nov 20 02:06:09 docker-host kernel: TRACE: raw:PREROUTING:policy:2 IN=docker0 OUT= MAC=02:42:4b:cc:97:85:02:42:ac:11:00:03:08:00 SRC=172.17.0.3 DST=10.48.37.151 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=44463 DF PROTO=TCP SPT=38662 DPT=80 SEQ=1535265152 ACK=0 WINDOW=29200 RES=0x00 SYN URGP=0 OPT

Field anatomy:
• "Nov 20 02:06:09 docker-host" - timestamp and hostname
• "raw:PREROUTING:policy:2" - table, chain, type and rule number
• "IN=docker0 OUT=" - ingress and egress interfaces
• "MAC=…" - DST MAC, SRC MAC and ethertype
• "SRC= DST=" - SRC and DST IPs
• "LEN= TOS= PREC= TTL= ID= DF" - IP header options
• "SPT= DPT=" - SRC and DST ports
• "SEQ= ACK= WINDOW= RES= SYN URGP= OPT" - TCP flags and options

* In the subsequent trace outputs the timestamp, hostname, MAC field and IP options will be filtered out, and the output will be tabulated for easier reading

Analyzing iptables TRACE output (cont. -> ext.)

# grep TRACE trace | sed 's/IN=/\; IN=/' | column -t -s $';' | sed 's/.*TRACE/TRACE/' | sed -E 's/MAC=.{42}//' | sed 's/LEN=.*DF //' | cut -c -125
TRACE: raw:PREROUTING:policy:2                   IN=docker0 OUT=        SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662 DPT=80
TRACE: mangle:PREROUTING:policy:1                IN=docker0 OUT=        SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662 DPT=80
TRACE: nat:PREROUTING:policy:3                   IN=docker0 OUT=        SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662 DPT=80
TRACE: mangle:FORWARD:policy:1                   IN=docker0 OUT=ens160  SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662
TRACE: filter:FORWARD:rule:1                     IN=docker0 OUT=ens160  SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662
TRACE: filter:DOCKER-USER:return:1               IN=docker0 OUT=ens160  SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662
TRACE: filter:FORWARD:rule:2                     IN=docker0 OUT=ens160  SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662
TRACE: filter:DOCKER-ISOLATION-STAGE-1:rule:1    IN=docker0 OUT=ens160  SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662
TRACE: filter:DOCKER-ISOLATION-STAGE-2:return:2  IN=docker0 OUT=ens160  SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662
TRACE: filter:DOCKER-ISOLATION-STAGE-1:return:2  IN=docker0 OUT=ens160  SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662
TRACE: filter:FORWARD:rule:5                     IN=docker0 OUT=ens160  SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662
TRACE: mangle:POSTROUTING:policy:1               IN= OUT=ens160         SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662 DPT=80
TRACE: nat:POSTROUTING:rule:1                    IN= OUT=ens160         SRC=172.17.0.3 DST=10.48.37.151 PROTO=TCP SPT=38662 DPT=80

# iptables -nvL FORWARD -t filter --line-numbers
Chain FORWARD (policy ACCEPT 3600K packets, 1440M bytes)
num  pkts  bytes  target                    prot opt in       out      source     destination
1    3603K 1440M  DOCKER-USER               all  --  *        *        0.0.0.0/0  0.0.0.0/0
2    3603K 1440M  DOCKER-ISOLATION-STAGE-1  all  --  *        *        0.0.0.0/0  0.0.0.0/0
3     1331  208K  ACCEPT                    all  --  *        docker0  0.0.0.0/0  0.0.0.0/0   ctstate RELATED,ESTABLISHED
4        0     0  DOCKER                    all  --  *        docker0  0.0.0.0/0  0.0.0.0/0
5     1331  110K  ACCEPT                    all  --  docker0  !docker0 0.0.0.0/0  0.0.0.0/0

# tcpdump -ni ens160 host 10.48.37.151
19:57:25.833611 IP 10.48.31.158.38612 > 10.48.37.151.http: Flags [S]…

# iptables -nvL POSTROUTING -t nat --line-numbers
Chain POSTROUTING (policy ACCEPT 35884 packets, 2160K bytes)
num  pkts  bytes  target      prot opt in  out      source         destination
1      71   4504  MASQUERADE  all  --  *   !docker0 172.17.0.0/16  0.0.0.0/0
2   2059K   126M  MASQUERADE  all  --  *   ens160   0.0.0.0/0      0.0.0.0/0

…back to exposing a port in Docker

# docker run -d -p 8080:8080 containers.cisco.com/ikovacev/hellocl:v1.0
e8e6ad133e79bf90f75e541719abd93ceac409228ce47ce1733068a74311c925

Docker pushes NAT rules to iptables

# iptables -nvL -t nat --line-numbers
Chain PREROUTING (policy ACCEPT 411 packets, 32926 bytes)
num  pkts  bytes  target  prot opt in  out  source     destination
1    9422   579K  DNAT    tcp  --  *   *    0.0.0.0/0  0.0.0.0/0   tcp dpt:3389 to:192.168.1.2:3389
2    352K    83M  DOCKER  all  --  *   *    0.0.0.0/0  0.0.0.0/0   ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT 107 packets, 6432 bytes)
num  pkts  bytes  target      prot opt in  out      source         destination
1      71   4504  MASQUERADE  all  --  *   !docker0 172.17.0.0/16  0.0.0.0/0
2   2063K   126M  MASQUERADE  all  --  *   ens160   0.0.0.0/0      0.0.0.0/0
3       0      0  MASQUERADE  tcp  --  *   *        172.17.0.4     172.17.0.4   tcp dpt:8080

Chain DOCKER (1 references)
num  pkts  bytes  target  prot opt in        out  source     destination
1       0      0  RETURN  all  --  docker0   *    0.0.0.0/0  0.0.0.0/0
2       3    192  DNAT    tcp  --  !docker0  *    0.0.0.0/0  0.0.0.0/0   tcp dpt:8080 to:172.17.0.4:8080

TIP: grep for the port# to quickly verify the iptables config

iptables trace of a packet arriving at the exposed port

Before DNAT: IP SRC = 144.254.9.156, IP DST = 10.48.31.158, PORT DST = 8080, iface = ens160
After DNAT:  IP SRC = 144.254.9.156, IP DST = 172.17.0.4,   PORT DST = 8080, iface = docker0

TRACE: raw:PREROUTING:policy:2 IN=ens160 OUT= SRC=144.254.9.156 DST=10.48.31.158 PROTO=TCP SPT=60330 DPT=808 TRACE: mangle:PREROUTING:policy:1 IN=ens160 OUT= SRC=144.254.9.156 DST=10.48.31.158 PROTO=TCP SPT=60330 DPT=808 TRACE: nat:PREROUTING:rule:2 IN=ens160 OUT= SRC=144.254.9.156 DST=10.48.31.158 PROTO=TCP SPT=60330 DPT=808 TRACE: nat:DOCKER:rule:2 IN=ens160 OUT= SRC=144.254.9.156 DST=10.48.31.158 PROTO=TCP SPT=60330 DPT=808 TRACE: mangle:FORWARD:policy:1 IN=ens160 OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DP TRACE: filter:FORWARD:rule:1 IN=ens160 OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DP TRACE: filter:DOCKER-USER:return:1 IN=ens160 OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DP TRACE: filter:FORWARD:rule:2 IN=ens160 OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DP TRACE: filter:DOCKER-ISOLATION-STAGE-1:return:2 IN=ens160 OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DP TRACE: filter:FORWARD:rule:4 IN=ens160 OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DP TRACE: filter:DOCKER:rule:1 IN=ens160 OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DP TRACE: mangle:POSTROUTING:policy:1 IN= OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DPT=8080 TRACE: nat:POSTROUTING:policy:4 IN= OUT=docker0 SRC=144.254.9.156 DST=172.17.0.4 PROTO=TCP SPT=60330 DPT=8080

Kubernetes Networking

Key components

Kube-proxy:
• K8s native iptables/ipvs proxy
• Implements K8s services by programming rules on the local cluster node
• Runs as a pod on each node (DaemonSet)

CNI:
• The Kubernetes networking plugin
• Provides connectivity for pods and more
• Can take over the K8s services implementation from kube-proxy
• Runs as a binary on each node, called by kubelet
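Both components are easy to locate on a running cluster; a minimal sketch (the kube-proxy pod name is a placeholder, and the grep pattern depends on which CNI is installed):

$ kubectl -n kube-system get pods -o wide | grep -E 'kube-proxy|calico|aci|contiv'
$ kubectl -n kube-system logs kube-proxy-xxxxx | head     # kube-proxy logs show the proxy mode (iptables/ipvs) in use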

Kubernetes Pods = unit of deployment Refresher

• A Pod is a group of one or more containers with shared storage and network

• A pod models an application-specific "logical host" – an analogue to a physical or virtual host

• Containers within a pod share an IP address and port space, and can find each other via localhost

(Diagram: a pod with two containers, 'nginx' and 'confd', sharing a network namespace (pod vEth and IP address) and hostname, each with its own cgroup (cpu, mem))
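The shared network namespace is easy to verify: every container in the pod reports the same interface and pod IP. A sketch assuming a pod named 'web' with the two containers from the figure:

$ kubectl exec web -c nginx -- ip addr show eth0
$ kubectl exec web -c confd -- ip addr show eth0     # same eth0 and same pod IP as the nginx container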

Kubernetes Service types Refresher

• There are currently four types of Kubernetes Services:

• ClusterIP – exposes a service internally, using a virtual IP address that is permanent but only visible in the cluster (for POD to POD conversations)

• NodePort – exposes a service externally by mapping the service port to a port on the Node IP.

• LoadBalancer – uses a cloud provider to expose a service externally via a dedicated external IP address (statically or dynamically assigned)

• ExternalName – used to reference endpoints that are external to the cluster.
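The first two types can be created with a one-liner; a sketch assuming an existing 'nginx' deployment:

$ kubectl expose deployment nginx --port=80 --target-port=80 --name=nginx              # ClusterIP (the default type)
$ kubectl expose deployment nginx --port=80 --target-port=80 --type=NodePort --name=nginx-np
$ kubectl get svc nginx nginx-np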

Kubernetes Service implementation

• ClusterIP and NodePort:
  • By default implemented by kube-proxy
  • However, some CNIs will take over ClusterIP
  • NodePort typically stays in iptables since traffic lands on the node IP

• LoadBalancer:
  • By default implemented outside of Kubernetes and provisioned by the kube-cloud-controller
  • Deviations: in-cluster (quasi-)LoadBalancer (MetalLB) or bundled with the CNI

ClusterIP – East-West Load Balancing Refresher

• Kubernetes assigns a persistent virtual IP to the service (can be dynamically or statically assigned)

• PODs can communicate to the service ClusterIP, which in turn load balances and NATs to the POD IP addresses

• By default this is implemented by kube-proxy using iptables or, lately, IPVS

(Diagram: an nginx Deployment (pods 10.2.0.171, 10.2.0.172) and a busybox pod (10.2.0.5) on the pod subnet 10.2.0.0/24, connected via a Linux bridge on Kubernetes Node 1, 2, …N; kube-proxy (iptables) programs the ServiceIP mapping 10.96.176.68 -> 10.2.0.171 / 10.2.0.172; kubelet and kube-proxy talk to the kube-apiserver on the Master Node)

Service ClusterIP becomes a distributed L4 load balancer

(Diagram: a busybox pod (10.2.0.5) on Kubernetes Node 01 runs "wget http://10.96.176.68". The nginx Deployment endpoints are 10.2.0.171 and 10.2.0.172 on Node 01 and 10.2.1.41 and 10.2.1.42 on Node 02. On every node, kube-proxy programs the mapping 10.96.176.68 -> 10.2.0.171 / 10.2.0.172 / 10.2.1.41 / 10.2.1.42 next to the CNI virtual switch. The nodes connect via ens160, 10.128.0.3 and 10.128.0.4, on 10.128.0.0/16.)

iptables rules will DNAT to one of the endpoint IPs.

Obviously, we need to deal with iptables again

• Let's look up a ClusterIP service:

$ kubectl get svc nginx NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE nginx ClusterIP 10.96.176.68 8080/TCP 79s

$ kubectl get ep nginx NAME ENDPOINTS AGE nginx 10.2.1.41:80,10.2.1.42:80,10.2.1.171:80 + 1 more... 90s

# iptables -t nat -nvL | grep 10.96.176.68 0 0 KUBE-MARK-MASQ tcp -- * * !10.2.0.0/16 10.96.176.68 /* default/nginx: cluster IP */ tcp dpt:8080 0 0 KUBE-SVC-TWVLBX4WCEZSIVWL tcp -- * * 0.0.0.0/0 10.96.176.68 /* default/nginx: cluster IP */ tcp dpt:8080

# iptables -t nat -nvL KUBE-SVC-TWVLBX4WCEZSIVWL Chain KUBE-SVC-TWVLBX4WCEZSIVWL (1 references) pkts bytes target prot opt in out source destination 0 0 KUBE-SEP-K2D2A2RTG5KDZHX4 all -- * * 0.0.0.0/0 0.0.0.0/0 statistic mode random probability 0.25000000000 0 0 KUBE-SEP-GC2Q6KSNR3UXZM5O all -- * * 0.0.0.0/0 0.0.0.0/0 statistic mode random probability 0.33332999982 0 0 KUBE-SEP-YY4FVRTVVWKZYPQ3 all -- * * 0.0.0.0/0 0.0.0.0/0 statistic mode random probability 0.50000000000 0 0 KUBE-SEP-CBG2GDZ4D544KBTQ all -- * * 0.0.0.0/0 0.0.0.0/0

# iptables -t nat -nvL KUBE-SEP-K2D2A2RTG5KDZHX4 iptables load-balancing! Chain KUBE-SEP-K2D2A2RTG5KDZHX4 (1 references) pkts bytes target prot opt in out source destination 0 0 KUBE-MARK-MASQ all -- * * 10.12.3.10 0.0.0.0/0 0 0 DNAT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp to:10.2.1.41:80

Verifying ClusterIP packet processing

• The node that runs the pod accessing the ClusterIP is doing all the heavy lifting

• Two distinct cases:
  • Source and Destination pods are on the same node
  • Source and Destination pods are on different nodes (traced below; the topology is the same as on the previous slides)

# iptables -t raw -j TRACE -d 10.96.176.68 -p tcp --dport 8080 --tcp-flags SYN SYN -I PREROUTING 1
# journalctl -f > trace &
# grep TRACE trace | sed 's/IN=/\; IN=/' | column -t -s $';' | sed 's/.*TRACE/TRACE/' | sed -E 's/MAC=.{42}//' | sed 's/LEN=.*DF //' | grep -v cali- | cut -c -140
TRACE: raw:PREROUTING:rule:2                 IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: raw:PREROUTING:policy:3               IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: mangle:PREROUTING:rule:1              IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: mangle:PREROUTING:policy:2            IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: nat:PREROUTING:rule:1                 IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: nat:PREROUTING:rule:2                 IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: nat:KUBE-SERVICES:rule:14             IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: nat:KUBE-SVC-TWVLBX4WCEZSIVWL:rule:1  IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: nat:KUBE-SEP-K2D2A2RTG5KDZHX4:rule:2  IN=cali6867d3c3750 OUT=       SRC=10.2.0.5 DST=10.96.176.68 PROTO=TCP SPT=40185 DPT=8080 SEQ=
TRACE: mangle:FORWARD:policy:1               IN=cali6867d3c3750 OUT=tunl0  SRC=10.2.0.5 DST=10.2.1.41 PROTO=TCP SPT=40185 DPT=80 SEQ=
TRACE: filter:FORWARD:rule:1                 IN=cali6867d3c3750 OUT=tunl0  SRC=10.2.0.5 DST=10.2.1.41 PROTO=TCP SPT=40185 DPT=80 SEQ=
TRACE: mangle:POSTROUTING:policy:1           IN= OUT=tunl0                 SRC=10.2.0.5 DST=10.2.1.41 PROTO=TCP SPT=40185 DPT=80 SEQ=1078874980 ACK=
TRACE: nat:POSTROUTING:rule:1                IN= OUT=tunl0                 SRC=10.2.0.5 DST=10.2.1.41 PROTO=TCP SPT=40185 DPT=80 SEQ=1078874980 ACK=
TRACE: nat:POSTROUTING:rule:2                IN= OUT=tunl0                 SRC=10.2.0.5 DST=10.2.1.41 PROTO=TCP SPT=40185 DPT=80 SEQ=1078874980 ACK=
TRACE: nat:KUBE-POSTROUTING:return:2         IN= OUT=tunl0                 SRC=10.2.0.5 DST=10.2.1.41 PROTO=TCP SPT=40185 DPT=80 SEQ=1078874980 ACK=
TRACE: nat:POSTROUTING:policy:4              IN= OUT=tunl0                 SRC=10.2.0.5 DST=10.2.1.41 PROTO=TCP SPT=40185 DPT=80 SEQ=1078874980 ACK=

# iptables -t nat -nvL POSTROUTING | grep Chain
Chain POSTROUTING (policy ACCEPT 9 packets, 540 bytes)

NodePort? Practically the same thing!

• All the action is on the node where the traffic lands first
(Diagram: the same two-node nginx topology as on the previous slides)

$ kubectl get svc nginx
NAME    TYPE      CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
nginx   NodePort  10.96.176.220                 80:31140/TCP   4m56s

# iptables -t nat -nvL | grep 31140
    0     0 KUBE-MARK-MASQ             tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  /* default/nginx: */ tcp dpt:31140
    0     0 KUBE-SVC-TFDSBUMN6I7R7DUY  tcp  --  *  *  0.0.0.0/0  0.0.0.0/0  /* default/nginx: */ tcp dpt:31140

# iptables -t raw -j TRACE -p tcp --dport 31140 --tcp-flags SYN SYN -I PREROUTING 1

TRACE: nat:KUBE-SERVICES:rule:26             IN=ens192 OUT=       SRC=192.168.1.1 DST=10.128.0.3 PROTO=TCP SPT=41618 DPT=31140 SEQ=24907083
TRACE: nat:KUBE-NODEPORTS:rule:3             IN=ens192 OUT=       SRC=192.168.1.1 DST=10.128.0.3 PROTO=TCP SPT=41618 DPT=31140 SEQ=24907083
TRACE: nat:KUBE-MARK-MASQ:rule:1             IN=ens192 OUT=       SRC=192.168.1.1 DST=10.128.0.3 PROTO=TCP SPT=41618 DPT=31140 SEQ=24907083
TRACE: nat:KUBE-MARK-MASQ:return:2           IN=ens192 OUT=       SRC=192.168.1.1 DST=10.128.0.3 PROTO=TCP SPT=41618 DPT=31140 SEQ=24907083
TRACE: nat:KUBE-NODEPORTS:rule:4             IN=ens192 OUT=       SRC=192.168.1.1 DST=10.128.0.3 PROTO=TCP SPT=41618 DPT=31140 SEQ=24907083
TRACE: nat:KUBE-SVC-TFDSBUMN6I7R7DUY:rule:4  IN=ens192 OUT=       SRC=192.168.1.1 DST=10.128.0.3 PROTO=TCP SPT=41618 DPT=31140 SEQ=24907083
TRACE: nat:KUBE-SEP-SQNRY57BXQNVGBT4:rule:2  IN=ens192 OUT=       SRC=192.168.1.1 DST=10.128.0.3 PROTO=TCP SPT=41618 DPT=31140 SEQ=24907083
TRACE: mangle:FORWARD:policy:1               IN=ens192 OUT=tunl0  SRC=192.168.1.1 DST=10.2.1.41 PROTO=TCP SPT=41618 DPT=80 SEQ=2490708
...
TRACE: nat:POSTROUTING:rule:1                IN= OUT=tunl0        SRC=192.168.1.1 DST=10.2.1.41 PROTO=TCP SPT=41618 DPT=80 SEQ=2490708399 AC
TRACE: nat:POSTROUTING:rule:2                IN= OUT=tunl0        SRC=192.168.1.1 DST=10.2.1.41 PROTO=TCP SPT=41618 DPT=80 SEQ=2490708399 AC
TRACE: nat:KUBE-POSTROUTING:rule:1           IN= OUT=tunl0        SRC=192.168.1.1 DST=10.2.1.41 PROTO=TCP SPT=41618 DPT=80 SEQ=2490708399 AC

# ip add show tunl0 | grep inet
inet 10.2.26.128/32 brd 10.2.26.128 scope global tunl0

# iptables -t nat -nvL KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target      prot opt in  out  source     destination
    0     0 MASQUERADE  all  --  *   *    0.0.0.0/0  0.0.0.0/0   /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000

# tcpdump -ennali tunl0
15:17:07.758001 ip: 10.2.26.128.41618 > 10.2.2.41.80: Flags [S],

LoadBalancer

• The action is mostly outside of the Kubernetes cluster

• Main purpose is to evenly distribute traffic between cluster nodes

• Once the traffic lands, it is the same story as with ClusterIP but this time with externally routable IPs

• There are many different solutions and implementations; each has its own inner workings and specifics

IPVS Considerations

• IP Virtual Server implements transport-layer load balancing as part of the Linux kernel

• Built on top of netfilter

• Used by kube-proxy to replace iptables load-balancing

• Solves scaling issues

• Does not completely get rid of iptables!

• Not that much adoption
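To check which mode kube-proxy runs in and what IPVS programs, a sketch (on kubeadm-based clusters the mode lives in the kube-proxy ConfigMap; ipvsadm must be installed on the node):

$ kubectl -n kube-system get configmap kube-proxy -o yaml | grep mode
# ipvsadm -Ln                       # virtual servers (ClusterIPs/NodePorts) and their real-server endpoints
# iptables -t nat -nvL | grep KUBE  # iptables is still used for SNAT/masquerade marking even in IPVS mode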

Ingress Controller Refresher

• LoadBalancers burn a lot of IPs (one per exposed object)

• An ingress controller is another method of exposing a cluster service to the outside world that will not require a Load Balancer for each service.

• These are L7 HTTP(S) load balancers that can offer also SSL termination, name- based virtual hosting, etc.

• The Ingress Controller is deployed as a POD that can run on one or more nodes on the cluster

• Orchestrated and configured via Kubernetes standard API object "ingress"

• There are multiple options: NGINX, HAProxy, Envoy, Traefik, etc.
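A minimal ingress object, using the networking.k8s.io/v1beta1 API that was current in Kubernetes releases of this session's era (host and service names are illustrative; they match the topology on the next slides):

$ kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: cafe
spec:
  rules:
  - host: cafe.test.com
    http:
      paths:
      - path: /tea
        backend:
          serviceName: tea-svc
          servicePort: 80
      - path: /coffee
        backend:
          serviceName: coffee-svc
          servicePort: 80
EOF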

NGINX Ingress Controller

• Based on NGINX webserver/proxy/load-balancer

• Two parts: IC + NGINX (same pod/container)

• Ingress Controller (control-plane):
  • Monitors for additions and changes of ingress objects from the kube-apiserver
  • Generates a config file for NGINX based on the ingress object config and state
  • Manages the NGINX process

• NGINX (data-plane):
  • Does all the work by rewriting HTTP headers, etc.
  • Can also terminate SSL

• Needs to be exposed via LoadBalancer or NodePort

NGINX Ingress topology

http://guestbook.com https://cafe.test.com/tea https://cafe.test.com/coffee

Virtual IP Address (VIP / External IP)

Container Load Balancing As a Service - K8s + NGINX Ingress Controller

(Diagram: the Guestbook app/service (guest pods with Redis master/slave on persistent storage) and the tea-svc / coffee-svc pods, all reached through container networking behind the ingress controller VIP)

Verifying NGINX Ingress operation

• Compare the ingress manifests with the NGINX config inside the pod:

$ kubectl describe ing kaas-dashboard
Name:             kaas-dashboard
Namespace:        default
Address:          192.168.1.13,192.168.1.14,192.168.1.15
Default backend:  default-http-backend:80 ()
TLS:
  kaas-dashboard terminates kaas-dashboard
Rules:
  Host  Path  Backends
  ----  ----  --------
  *
        /     kaas-dashboard:8000 (10.11.1.13:8000)
...

$ kubectl exec -ti nginx-ingress-controller-6vjhl -- cat /etc/nginx/nginx.conf
...
location ~* "^/" {
  set $namespace      "default";
  set $ingress_name   "kaas-dashboard";
  set $service_name   "";
  set $service_port   "8000";
  set $location_path  "/";
  rewrite_by_lua_block {
    lua_ingress.rewrite({
      force_ssl_redirect = true,
      use_port_in_redirects = false,
...

• The rest is identical to any app running on top of Kubernetes:
  • Capture traffic in/out of the pod using tcpdump
  • Check pod health using kubectl describe pod
  • Check application logs using kubectl logs:

10.11.42.128 - [10.11.42.128] - - [17/Dec/2019:11:49:35 +0000] "GET /static/LoginView-2829a085157d1c8a4ca9.js HTTP/2.0" 200 1055 "https://192.168.1.11/auth/login" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" 47 0.332 [default-kaas-dashboard-8000] [] 10.11.1.13:8000 2129 0.332 200 ef44723a7a0e582a26a2edbc531adb7d
I0105 23:27:01.334125       6 store.go:449] secret default/ccp-ingress was updated and it is used in ingress annotations. Parsing…
I0105 23:27:01.594068       6 backend_ssl.go:58] Updating Secret "default/ccp-ingress" in the local store

CNIs

CNI: solving the Kubernetes network puzzle

• Kubernetes makes individual Docker nodes come together

• CNI makes them play together nicely

• The main issue that CNI solves is how to bridge the networking gap between Docker nodes

• All CNIs do it differently, but all of them play with the following functionality:
  • Transporting the packets between the nodes, usually via some sort of overlay
  • Employing a data-path
  • Knowing which pod is running where
  • Handling IPAM
  • Implementing a layer of security within the cluster (optional)
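On each node, kubelet finds the CNI plugin through a network config file and a plugin binary; a quick way to see both (a sketch - the exact file names depend on the CNI that is installed):

$ ls /etc/cni/net.d/             # network config read by kubelet, e.g. 10-calico.conflist
$ ls /opt/cni/bin/               # plugin binaries kubelet executes (CNI plugin, IPAM plugin, loopback, …)
$ cat /etc/cni/net.d/*conflist   # shows the plugin type, the IPAM section, MTU, etc.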

For your CNI Functions and Flow reference

(Diagram: kubelet passes the CNI commands and network config to the CNI plugin; the CNI plugin configures network connectivity on the local cluster node, passes the same parameters and network config to the IPAM plugin, and the IPAM plugin assigns IP addresses)

CNIs that will be covered in this session

Calico ACI CNI Contiv-VPP

All 3 CNIs are available within Cisco Container Platform (CCP)!

• Native Kubernetes (100% upstream): direct updates and best practices from the open source community
• Hybrid cloud optimized (e.g. Google, AWS, …)
• Integrated networking | management | security | analytics
• Turnkey solution for production-grade container environments
• Flexible deployment model: VM | bare metal → HX, UCS, ACI | public cloud

Calico

Calico CNI properties

https://docs.projectcalico.org/v3.10/getting-started/kubernetes/

• One of the most popular CNIs

• Leverages Linux’s built-in efficient network stack for data-path • Linux IP routing • iptables (for security policy)

• Implements lightweight IP-in-IP tunneling between the nodes

• Uses BGP daemons in the control-plane for distributing pod reachability info

Calico Architecture

(Diagram: Calico architecture)

Calico-node pod

3 components:

• Felix
  • Programs routes and iptables on the host to provide the desired connectivity
  • Programs interface information to the kernel for outgoing endpoint traffic
  • Instructs the host to respond to ARPs for workloads with the MAC address of the host

• BIRD
  • BGP client that is used to exchange routing information between hosts

• confd
  • Manages and applies the BIRD config
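Besides birdcl (shown on the "Verifying Calico operation" slide), calicoctl gives a quick view of the BGP mesh and the IP pools; a sketch assuming calicoctl is installed on the node:

$ calicoctl node status          # BIRD status and the node-to-node BGP mesh (peer address, state, Established/…)
$ calicoctl get ippool -o wide   # pod CIDR pools and whether IP-in-IP is enabled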

Calico data plane plumbing

(Diagram: node2 (ens160 10.48.20.12/24, POD CIDR 192.168.210.128/26, tunl0 192.168.210.128/32) runs podC (alpine, 192.168.210.137), which issues "curl http://192.168.215.65"; node1 (ens160 10.48.20.11/24, POD CIDR 192.168.215.64/26, tunl0 192.168.215.64/32) runs podA (nginx, 192.168.215.65) and podB (192.168.215.66). Both nodes sit on 10.48.20.0/24 and the pod traffic crosses the wire IP-in-IP encapsulated, e.g. 10.48.20.12 > 10.48.20.11: 192.168.210.137.44227 > 192.168.215.65.80)

Kernel routing on node2:
default via 10.48.20.1 dev ens160 onlink
10.48.20.0/24 dev ens160 proto kernel scope link src 10.48.20.12
blackhole 192.168.210.128/26 proto bird
192.168.210.137 dev cali4034a2c6434 scope link
192.168.215.64/26 via 10.48.20.11 dev tunl0 proto bird onlink

Kernel routing on node1:
default via 10.48.20.1 dev ens160 onlink
10.48.20.0/24 dev ens160 proto kernel scope link src 10.48.20.11
192.168.210.128/26 via 10.48.20.12 dev tunl0 proto bird onlink
blackhole 192.168.215.64/26 proto bird
192.168.215.65 dev cali5b3c0df79aa scope link
192.168.215.66 dev cali6c4b8fa43bd scope link

Verifying Calico operation

• Again, treat it like a K8s app: check the pod health, events and logs

• Since this is a CNI checking kubelet logs is useful

• iptables trace is applicable for data-path

• Check node ip routing table for reachability info

• Check the BIRD client for BGP control-plane state

$ ip route get 192.168.0.3
192.168.0.3 dev cali1cbef5c1a2e src 10.37.7.243 uid 1001               # local pod
$ ip route get 192.168.1.7
192.168.1.7 via 10.37.7.244 dev tunl0 src 192.168.68.128 uid 1001      # remote pod

$ kubectl exec -n kube-system -ti calico-node-h85bb -- bash
root@podx-tenant1-master5e88b8a07c:/# birdcl -s /var/run/calico/bird.ctl
BIRD v0.3.3+birdv1.6.3 ready.
bird> show route 192.168.0.3/32
192.168.0.3/32     dev cali1cbef5c1a2e [kernel1 2019-12-17] * (10)
bird> show route 192.168.1.7/32
192.168.1.7/32     via 10.37.7.244 on ens192 [Mesh_10_37_7_244 2019-12-17] * (100/0) [i]
bird> show protocols all Mesh_10_37_7_244
name              proto  table   state  since       info
Mesh_10_37_7_244  BGP    master  up     2019-12-17  Established
  Description:    Connection to BGP peer
...

ACI CNI

ACI CNI Solution Overview

Technical Description:
• Implements ACI policy in container networking
• Open vSwitch used in the data-path
• Managed by ACI using the OpFlex and OpenFlow 1.3 protocols
• Behaves like a virtual leaf
• Embedded fabric and virtual switch load balancing
• PBR in the fabric for external service load balancing
• OVS used for internal service load balancing
• SNAT – friendlier to firewalls
• VMM Domain for Kubernetes
• Stats per namespace, deployment, service, pod
• Physical to container correlation

(Diagram: Kubernetes Network Policy mapped to ACI policies; OpFlex and OVS running on each node)

ACI CNI Plugin for Kubernetes: Native Security Policy Support

Container Team:
1. Install Kubernetes and the ACI plugin
2. Build service definitions and define network policy
3. Deploy and scale clusters
4. Monitor and observe network telemetry

Network Administrator:
1. Fabric bring-up
2. Create Kubernetes system resources in ACI
3. (Optional) Create EPGs and contracts for use in Kubernetes
4. (Optional) Annotate deployments to move between EPGs

(Diagram: OpFlex and OVS on each node, connected to the ACI Fabric)

Using Network Policy and EPGs

• Cluster Isolation (default behavior): a single EPG for the entire cluster; no need for any internal contracts
• Namespace Isolation: each namespace is mapped to its own EPG; contracts for inter-namespace traffic
• Deployment Isolation: each deployment is mapped to an EPG; contracts tightly control service traffic

(Diagram legend: EPG, Network Policy / Contract)

Fabric Administrator has an inventory of Kubernetes objects – simplify operations

Fabric admin can search APIC for k8s nodes, masters, pods, services …

• View pods per node; map them to the encapsulation and the physical point in the fabric.
• APIC keeps an inventory of pods and their metadata (labels, annotations), deployments, replicasets, etc.

For your ACI CNI Components reference

$ kubectl get pod -n kube-system -o wide | grep aci
aci-containers-controller-84d5bb9bd7-zdc4n   2/2   Running   0   24d   10.10.1.14   ccp1-tenant1-0-master-0
aci-containers-host-7p6z2                    3/3   Running   0   24d   10.10.1.12   ccp1-tenant1-1-node-gr-0
aci-containers-host-sq8f9                    3/3   Running   0   24d   10.10.1.11   ccp1-tenant1-1-node-gr-1
aci-containers-host-zpwwv                    3/3   Running   0   24d   10.10.1.14   ccp1-tenant1-0-master-0
aci-containers-openvswitch-dpll6             1/1   Running   1   24d   10.10.1.12   ccp1-tenant1-1-node-gr-0
aci-containers-openvswitch-g7pmx             1/1   Running   1   24d   10.10.1.14   ccp1-tenant1-0-master-0
aci-containers-openvswitch-kv5fx             1/1   Running   1   24d   10.10.1.11   ccp1-tenant1-1-node-gr-1

$ kubectl get pod -n kube-system aci-containers-controller-84d5bb9bd7-zdc4n -o=jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'
snat-operator
aci-containers-controller

$ kubectl get pod -n kube-system aci-containers-host-7p6z2 -o=jsonpath='{range .spec.containers[*]}{.name}{"\n"}{end}'
aci-containers-host
opflex-agent
mcast-daemon

• aci-containers-controller: handles IPAM, performs policy mapping, pushes config to APIC (load-balancing implementation, etc.)
• snat-operator: performs tasks related to SNAT against the K8s API
• opflex-agent and aci-containers-host (aka host-agent): see next slide
• mcast-daemon: subscribes the node to multicast groups for BUM traffic
• aci-containers-openvswitch: OVS in a container that implements the data-path – routing and switching between the pods and K8s services; enforces ACI and K8s policy

For your ACI CNI Architecture reference

host-agent container:
• Receives updates from the kube-apiserver (via the "kubernetes" svc) about new endpoints and services and writes them to files

EP/SVC files:
• Located in /var/lib/opflex-agent-ovs/endpoints/
• Storage for host-local endpoint and service information

opflex-agent container:
• Reads the EP/SVC files
• Downloads ACI policy (EPGs, contracts) using the OpFlex protocol from the ACI fabric via the opflex-proxy on the ACI leaf switch
• Programs Open vSwitch via OpenFlow

(Diagram: kube-apiserver -> host-agent -> EP/SVC files -> opflex-agent -> Open vSwitch on the K8s node; OpFlex policy is carried on the ACI infra VLAN from the leaf's opflex-proxy, tenant traffic on VLAN or VxLAN)

OVS data-path details

(Diagram: podX and podY eth0 interfaces connect via vethxxx/vethyyy to br-access, which enforces network policy; pa-veth/pi-veth patch ports connect br-access to br-int, which does NAT, load balancing, routing and encapsulation towards br-int_vxlan0 and the K8s node bond)

Open vSwitch
http://openvswitch.org

• Open source implementation of a distributed virtual switch

• Main purpose is to provide a switching stack for hardware virtualisation environments

• Merged into the Linux kernel mainline in kernel version 3.3

• Features:
  • VLANs with trunking, LACP, port-channels, STP, BFD
  • Virtual partitions (bridges) on a single data-plane (like N7k VDCs)
  • QoS, traffic policing, NetFlow, SPAN, RSPAN
  • Traffic tunneling via GRE, VXLAN, IPsec
  • Kernel-space and user-space forwarding
  • OpenFlow

For your OVS utilities reference

• ovs-vsctl: Open vSwitch configuration management

• ovs-ofctl: OpenFlow switch management utility

• ovs-appctl: Utility for configuring and querying Open vSwitch daemon

• ovs-dpctl: datapath management utility

Useful OVS commands:
  ovs-vsctl show
  ovs-ofctl show
  ovs-ofctl dump-flows [-OOpenFlow13]
  ovs-ofctl diff-flows [-OOpenFlow13] .
  ovs-appctl ofproto/trace
  ovs-appctl fdb/show
  ovs-dpctl show
  ovs-dpctl dump-flows --names

Show the list of bridges and their interfaces:
# ovs-vsctl show
bfd5a830-1591-4334-a7f8-117ba4b4b30f
    Bridge "br-eth1"
        Port "eth1"
            Interface "eth1"
        Port "vnet1"
            trunks: [3666]
            Interface "vnet1"
        Port "br-eth1"
            Interface "br-eth1"
                type: internal
        Port "vnet0"
            tag: 3707
            Interface "vnet0"
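The same utilities work against a hand-built test bridge, which is a convenient way to get familiar with them; a minimal sketch (br0 and eth1 are placeholders):

# ovs-vsctl add-br br0
# ovs-vsctl add-port br0 eth1
# ovs-vsctl list-ports br0
# ovs-ofctl dump-flows br0     # by default a single table-0 rule with actions=NORMAL
# ovs-appctl fdb/show br0      # learned MAC address table
# ovs-dpctl dump-flows         # cached kernel datapath flows for live traffic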

OVS packet forwarding

"config mgr" • The decision on how to process packets is made in userspace • The decision is based on learned MAC address table and/or programmed "supervisor" OpenFlow rules • First packet of new flow goes to ovs- vswitchd, following packets hit cached entry in kernel module

• OpenFlow rules can be programmed by "line card" an external controller

OpenFlow
http://www.opennetworking.org

• Standardized protocol that gives access to the forwarding plane of a network switch or router over the network

• Development and standardization managed by ONF

• Implemented as a list of rules in the OpenFlow table of a HW/SW switch

• Rules are placed in tables and packet processing starts from the default table 0 (additional tables may exist for more complex packet processing via table chaining)

• Rules are executed in order of configured priority

• Each rule is evaluated for match condition and if matched the rule action is executed

Displaying and understanding OpenFlow rules

• By default table 0 contains a single rule that does "normal" processing (MAC address lookup) for all packets (no match condition = match all):

# ovs-ofctl dump-flows br0
NXST_FLOW reply (xid=0x4):
 cookie=0x0, duration=4.05s, table=0, n_packets=8, n_bytes=784, idle_age=0, priority=0 actions=NORMAL

• Other rule examples:
table=0, in_port=8 actions=output:1
table=0, in_port=1 actions=output:8
table=0, priority=4,in_port=7,dl_vlan=177 actions=mod_vlan_vid:223,NORMAL
table=0, priority=100,dl_src=ff:ea:bb:ad:ab:ba actions=CONTROLLER:128
table=0, priority=900,nw_src=10.0.100.12 actions=mod_nw_src:192.168.19.1
table=0, priority=90,nw_dst=192.168.100.13 actions=resubmit(,1)
table=1, priority=1,tp_dst=8080 actions=drop

Quickly checking ACI CNI OpenFlow rules sanity

• The idea is to check whether the ACI CNI control-plane did its part and whether traffic is hitting the rules

$ kubectl get pod -o wide -l run=alpine
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE                       NOMINATED NODE   READINESS GATES
alpine-6487d4ff57-p98ls   1/1     Running   0          13d   10.11.0.146   ccp1-tenant1-1-node-gr-1
$ kubectl get pod -o wide -n kube-system | grep aci-containers-openvswitch | grep ccp1-tenant1-1-node-gr-1
aci-containers-openvswitch-kv5fx   1/1   Running   1   25d   10.10.1.11   ccp1-tenant1-1-node-gr-1

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-ofctl dump-flows -OOpenflow13 br-int | grep 10.11.0.146
cookie=0x0, duration=1178858.198s, table=0, n_packets=3954, n_bytes=166068, priority=40,arp,in_port=19,dl_src=fa:43:d9:79:0c:ae,arp_spa=10.11.0.146 actions=goto_table:1
cookie=0x0, duration=1178858.198s, table=0, n_packets=1203, n_bytes=84825, priority=30,ip,in_port=19,dl_src=fa:43:d9:79:0c:ae,nw_src=10.11.0.146 actions=goto_table:1
cookie=0x0, duration=27565.847s, table=2, n_packets=348, n_bytes=1306982, priority=10,ct_state=-new+est-inv+trk,ip,nw_dst=10.11.0.146 actions=set_field:fa:43:d9:79:0c:ae->eth_dst,output:19
cookie=0x0, duration=1178858.184s, table=4, n_packets=3915, n_bytes=234900, priority=20,arp,reg4=0x1,reg6=0x1,dl_dst=ff:ff:ff:ff:ff:ff,arp_tpa=10.11.0.146,arp_op=1 actions=load:0x750002->NXM_NX_REG2[],load:0x13->NXM_NX_REG7[],set_field:fa:43:d9:79:0c:ae->eth_dst,goto_table:11
cookie=0x0, duration=1178858.184s, table=6, n_packets=0, n_bytes=0, priority=500,ip,reg6=0x1,dl_dst=00:22:bd:f8:19:ff,nw_dst=10.11.0.146 actions=load:0x750002->NXM_NX_REG2[],load:0x13->NXM_NX_REG7[],set_field:00:22:bd:f8:19:ff->eth_src,set_field:fa:43:d9:79:0c:ae->eth_dst,dec_ttl,write_metadata:0x400/0x400,goto_table:11
...
("You have got to be kidding with us…")

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-ofctl diff-flows -OOpenflow13 br-int . | grep 10.11.0.146 -priority=40,arp,in_port=19,dl_src=fa:43:d9:79:0c:ae,arp_spa=10.11.0.146 actions=goto_table:1 -priority=30,ip,in_port=19,dl_src=fa:43:d9:79:0c:ae,nw_src=10.11.0.146 actions=goto_table:1 -table=2 priority=10,ct_state=-new+est-inv+trk,ip,nw_dst=10.11.0.146 actions=set_field:fa:43:d9:79:0c:ae->eth_dst,output:19 -table=4 priority=20,arp,reg4=0x1,reg6=0x1,dl_dst=ff:ff:ff:ff:ff:ff,arp_tpa=10.11.0.146,arp_op=1 actions=load:0x750002- >NXM_NX_REG2[],load:0x13->NXM_NX_REG7[],set_field:fa:43:d9:79:0c:ae->eth_dst,goto_table:11 ...

OVS OpenFlow trace (Yes, MORE tracing!)

ovs-appctl ofproto/trace

• Tool for simulating packet processing by the OpenFlow rules

• Like "ELAM" capture on OVS but without real traffic

• Some command examples:

ovs-appctl ofproto/trace br-int ip,in_port=19,dl_src=fa:16:3e:b0:cc:c8,dl_dst=00:22:bd:f8:19:ff,nw_src=10.20.20.103,nw_dst=10.10.10.101

ovs-appctl ofproto/trace br-int icmp,in_port=6,dl_src=fa:16:3e:94:f7:dd,dl_dst=fa:16:3e:20:2c:fe,nw_src=10.10.10.101,nw_dst=10.10.10.100

But first we need to collect some facts

• We want to see what OVS will do when one pod sends traffic to another pod running on a different node

• We first need to figure out:
  • MAC/IP address for source and destination
  • The OVS container name where the source pod is connected
  • The correct OVS input port for the source pod

$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE    IP            NODE
alpine-6487d4ff57-p98ls   1/1     Running   0          13d    10.11.0.146   ccp1-tenant1-1-node-gr-1
hello-68df895546-kkvt9    1/1     Running   0          93m    10.11.0.147   ccp1-tenant1-1-node-gr-1
hello-68df895546-l6cnq    1/1     Running   0          8m4s   10.11.0.115   ccp1-tenant1-1-node-gr-0
$ kubectl exec -ti alpine-6487d4ff57-p98ls -- ip add show eth0 | grep ether
3: eth0@if29: mtu 1500 qdisc noqueue state UP
    link/ether fa:43:d9:79:0c:ae brd ff:ff:ff:ff:ff:ff
$ kubectl exec -ti hello-68df895546-l6cnq -- ip add show eth0 | grep ether
3: eth0@if30: mtu 1500 qdisc noqueue state UP
    link/ether a2:a2:33:23:2c:cf brd ff:ff:ff:ff:ff:ff
$ kubectl get pod -n kube-system -o wide | grep aci-containers-open | grep ccp1-tenant1-1-node-gr-1
aci-containers-openvswitch-kv5fx   1/1   Running   1   27d   10.10.1.11   ccp1-tenant1-1-node-gr-1

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ip add | grep 29: 29: veth499da779@docker0: mtu 1500 qdisc noqueue master ovs-system state UP $ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-ofctl show br-access | grep veth499da779 33(pa-veth499da779): addr:7a:35:3c:e6:53:e3 34(veth499da779): addr:ca:2a:ad:39:92:7e

ACI CNI OpenFlow rules trace, pod->pod (part 1)

• Now we can construct the flow expression and run the trace:

tcp,in_port=34,dl_src=fa:43:d9:79:0c:ae,dl_dst=a2:a2:33:23:2c:cf,nw_src=10.11.0.146,nw_dst=10.11.0.115,tp_src=56000,tp_dst=8080
(packet type; input OVS port; source and destination MAC addresses; source and destination IP addresses; source and destination TCP ports)

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-appctl ofproto/trace br-access tcp,in_port=34,dl_src=fa:43:d9:79:0c:ae,dl_dst=a2:a2:33:23:2c:cf,nw_src=10.11.0.146,nw_dst=10.11.0.115,tp_src=56000,tp_dst=8080
Flow: tcp,in_port=34,vlan_tci=0x0000,dl_src=fa:43:d9:79:0c:ae,dl_dst=a2:a2:33:23:2c:cf,nw_src=10.11.0.146,nw_dst=10.11.0.115,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=56000,tp_dst=8080,tcp_flags=0

bridge("br-access")                                     # the OVS bridge that is processing the packet
-------------------
 0. in_port=34,vlan_tci=0x0000/0x1fff, priority 100     # matched rule condition
    load:0x5->NXM_NX_REG6[]                             # executed rule actions
    load:0x1->NXM_NX_REG0[]
    load:0x21->NXM_NX_REG7[]
    goto_table:2
 2. reg0=0x1, priority 8192
    goto_table:3
 3. metadata=0/0xff, priority 1
    output:NXM_NX_REG7[]
     -> output port is 33
...

Result of packet processing on br-access: the packet is sent out on port 33, "pa-veth499da779" (see the previous slide), which is a patch interface to the br-int bridge.

ACI CNI OpenFlow rules trace, pod->pod (part 2)

...
bridge("br-int")                                        # processing continues in OVS bridge br-int
----------------
 0. ip,in_port=19,dl_src=fa:43:d9:79:0c:ae,nw_src=10.11.0.146, priority 30
    goto_table:1
 1. in_port=19,dl_src=fa:43:d9:79:0c:ae, priority 140
    load:0x750002->NXM_NX_REG0[]
    load:0x1->NXM_NX_REG4[]
    load:0x1->NXM_NX_REG5[]
    load:0x1->NXM_NX_REG6[]
    goto_table:3
 3. priority 1
    goto_table:4
 4. reg4=0x1,dl_dst=a2:a2:33:23:2c:cf, priority 10
    load:0x750002->NXM_NX_REG2[]
    load:0xa001046->NXM_NX_REG7[]
    write_metadata:0x7/0xff
    goto_table:11
11. reg0=0x750002,reg2=0x750002, priority 8292
    goto_table:12
12. metadata=0x7/0xff, priority 15
    move:NXM_NX_REG2[]->NXM_NX_TUN_ID[0..31]
     -> NXM_NX_TUN_ID[0..31] is now 0x750002            # load VNI and VTEP IP from the OF registers…
    move:NXM_NX_REG7[]->NXM_NX_TUN_IPV4_DST[]
     -> NXM_NX_TUN_IPV4_DST[] is now 10.0.16.70
    output:2                                            # …and send traffic via the VxLAN tunnel using port 2
     -> output to kernel tunnel

Final flow: tcp,reg0=0x1,reg6=0x5,reg7=0x21,in_port=34,vlan_tci=0x0000,dl_src=fa:43:d9:79:0c:ae,dl_dst=a2:a2:33:23:2c:cf,nw_src=10.11.0.146,nw_dst=10.11.0.115,nw_tos=0,nw_ecn=0,nw_ttl=0,tp_src=56000,tp_dst=8080,tcp_flags=0
Megaflow: recirc_id=0,ct_state=-new-est-inv-trk,ct_mark=0,eth,tcp,in_port=34,vlan_tci=0x0000/0x1fff,dl_src=fa:43:d9:79:0c:ae,dl_dst=a2:a2:33:23:2c:cf,nw_src=10.11.0.146,nw_dst=10.11.0.115,nw_ecn=0,nw_frag=no
Datapath actions: set(tunnel(tun_id=0x750002,dst=10.0.16.70,ttl=64,tp_dst=8472,flags(df|key))),4   # kernel flow that is installed for subsequent packets

Cross-checking the output port and the kernel tunnel interface:

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-ofctl show br-int | grep 2\(
 2(vxlan0): addr:12:0c:22:cb:02:21
$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-dpctl show | grep "port 4"
  port 4: vxlan_sys_8472 (vxlan: packet_type=ptap)

ACI CNI OpenFlow rules trace, pod->svc

$ kubectl get svc hello
NAME    TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
hello   ClusterIP  10.104.102.240                 80/TCP    47h

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-appctl ofproto/trace br-access tcp,in_port=34,dl_src=fa:43:d9:79:0c:ae,dl_dst=00:22:bd:f8:19:ff,nw_src=10.11.0.146,nw_dst=10.104.102.240,tp_src=56000,tp_dst=80
...
bridge("br-access")
...
bridge("br-int")
----------------
...
 4. tcp,reg6=0x1,nw_dst=10.104.102.240,tp_dst=80, priority 50
    set_field:00:22:bd:f8:19:ff->eth_dst
    multipath(symmetric_l3l4+udp,1024,iter_hash,2,32,NXM_NX_REG7[])      # load-balancing action
     -> NXM_NX_REG7[] is now 0
    goto_table:5
 5. tcp,reg6=0x1,metadata=0/0x200,nw_dst=10.104.102.240,tp_dst=80, priority 99
    set_field:8080->tcp_dst                                              # DNAT: set new dst IP and port
    set_field:10.11.0.115->ip_dst
    dec_ttl
     >> IPv4 decrement TTL exception
...
Megaflow: recirc_id=0,ct_state=-new-est-inv-trk,ct_mark=0,eth,tcp,in_port=34,vlan_tci=0x0000/0x1fff,dl_src=fa:43:d9:79:0c:ae,dl_dst=00:22:bd:f8:19:ff,nw_src=10.11.0.146,nw_dst=10.104.102.240,nw_ttl=0,nw_frag=no,tp_src=56000,tp_dst=80
Datapath actions: set(ipv4(src=10.11.0.146,dst=10.11.0.115,ttl=0)),set(tcp(src=56000,dst=8080)),userspace(pid=2519377860,controller(reason=2,dont_send=1,continuation=0,recirc_id=163924,rule_cookie=0,controller_id=0,max_len=65535))

The packet is hitting a conntrack action, which skews the trace output due to a bug/limitation in OVS.

Displaying and understanding OVS kernel flows

• Flow entries in the OVS kernel module expire very fast – by default after 10 sec.

• These signify live traffic going through the OVS (not a simulation)

$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE
alpine-6487d4ff57-p98ls   1/1     Running   0          13d   10.11.0.146   ccp1-tenant1-1-node-gr-1
hello-68df895546-kkvt9    1/1     Running   0          10s   10.11.0.147   ccp1-tenant1-1-node-gr-1
$ kubectl get pod -n kube-system -o wide | grep aci-containers-open | grep ccp1-tenant1-1-node-gr-1
aci-containers-openvswitch-kv5fx   1/1   Running   1   25d   10.10.1.11   ccp1-tenant1-1-node-gr-1

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-dpctl dump-flows --names | grep 10.11.0.146 | grep 10.11.0.147

Flow in the OVS kernel module on the K8s node, hello -> alpine direction:

recirc_id(0),in_port(vetha82d17a5),ct_state(-new-est-inv-trk),ct_mark(0),eth(src=d2:1e:e9:e9:4a:d3,dst=fa:43:d9:79:0c:ae),eth_type(0x0800),ipv4(src=10.11.0.147,dst=10.11.0.146,proto=6,frag=no), packets:3, bytes:613, used:2.568s, flags:FP., actions:veth499da779

Symmetric flow, alpine -> hello direction:

recirc_id(0),in_port(veth499da779),ct_state(-new-est-inv-trk),ct_mark(0),eth(src=fa:43:d9:79:0c:ae,dst=d2:1e:e9:e9:4a:d3),eth_type(0x0800),ipv4(src=10.11.0.146,dst=10.11.0.147,proto=6,frag=no), packets:5, bytes:410, used:2.568s, flags:FP., actions:vetha82d17a5

$ kubectl exec -ti alpine-6487d4ff57-p98ls -- ip add show eth0
3: eth0@if29: mtu 1500
    link/ether fa:43:d9:79:0c:ae brd ff:ff:ff:ff:ff:ff
    inet 10.11.0.146/16 brd 10.11.255.255 scope global eth0

$ kubectl exec -ti hello-68df895546-kkvt9 -- ip add show eth0
3: eth0@if30: mtu 1500
    link/ether d2:1e:e9:e9:4a:d3 brd ff:ff:ff:ff:ff:ff
    inet 10.11.0.147/16 brd 10.11.255.255 scope global eth0

$ ip link show | egrep "29:|30:" | cut -c -105
29: veth499da779@if3: mtu 1500 qdisc noqueue master ovs-system state UP
30: vetha82d17a5@if3: mtu 1500 qdisc noqueue master ovs-system state UP
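Because the kernel flows age out after roughly 10 seconds, it helps to keep traffic running while you dump them. A minimal sketch, reusing the alpine and hello pods from this slide (wget is available in the alpine image):

$ kubectl exec -ti alpine-6487d4ff57-p98ls -- sh -c 'while true; do wget -q -O- http://10.11.0.147:8080 >/dev/null; sleep 1; done' &
$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-dpctl dump-flows --names | grep 10.11.0.146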

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 75 ACI CNI OVS kernel flows pod->svc, load-balanced on different node

$ kubectl get pod -o wide
NAME                      READY  STATUS   RESTARTS  AGE   IP           NODE
alpine-6487d4ff57-p98ls   1/1    Running  0         13d   10.11.0.146  ccp1-tenant1-1-node-gr-1
hello-68df895546-kkvt9    1/1    Running  0         93m   10.11.0.147  ccp1-tenant1-1-node-gr-1
hello-68df895546-l6cnq    1/1    Running  0         8m4s  10.11.0.115  ccp1-tenant1-1-node-gr-0

$ kubectl get svc hello
NAME   TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
hello  ClusterIP  10.104.102.240                80/TCP    7m47s

$ kubectl get ep hello
NAME   ENDPOINTS                           AGE
hello  10.11.0.115:8080,10.11.0.147:8080   7m52s

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-dpctl dump-flows --names | grep 10.11.0.146

Outbound flow in the OVS kernel module (DNAT to the pod on the other node, then VxLAN encapsulation towards the ACI leaf):

recirc_id(0),in_port(veth499da779),ct_state(-new-est-inv-trk),ct_mark(0),eth(src=fa:43:d9:79:0c:ae,dst=00:22:bd:f8:19:ff),eth_type(0x0800),ipv4(src=10.11.0.146,dst=10.104.102.240,proto=6,tos=0/0x3,ttl=64,frag=no),tcp(src=56416,dst=80), packets:5, bytes:408, used:2.281s, flags:FP., actions:set(ipv4(src=10.11.0.146,dst=10.11.0.115,ttl=63)),set(tcp(src=56416,dst=8080)),ct(commit,zone=2,mark=0xc/0xffffffff),set(tunnel(tun_id=0x750002,dst=10.0.16.70,ttl=64,tp_dst=8472,flags(df|key))),vxlan_sys_8472

Symmetric return flow (decapsulated from VxLAN, translated back to the service IP, delivered to the alpine pod veth):

recirc_id(0x227b4),tunnel(tun_id=0x750002,src=10.0.16.70,dst=10.0.16.71,flags(-df-csum+key)),in_port(vxlan_sys_8472),ct_state(-new+est-inv+trk),ct_mark(0xc),eth(src=a2:a2:33:23:2c:cf,dst=fa:43:d9:79:0c:ae),eth_type(0x0800),ipv4(src=10.11.0.115,dst=10.11.0.146,proto=6,ttl=64,frag=no),tcp(src=8080), packets:3, bytes:613, used:2.281s, flags:FP., actions:set(eth(src=00:22:bd:f8:19:ff,dst=fa:43:d9:79:0c:ae)),set(ipv4(src=10.104.102.240,dst=10.11.0.146,ttl=63)),set(tcp(src=80)),veth499da779

tun_id=0x750002: ACI fabric assigned VNI | dst=10.0.16.70: ACI leaf VTEP | vxlan_sys_8472: output interface

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 76 For your ACI CNI LoadBalancer reference • LoadBalancer service is implemented by External EPG-Contract-Service Graph- PBR Policy construct in the ACI fabric

• The traffic is delivered to one of the nodes via a special VLAN expected by OVS

• OVS then does another layer of load-balancing to determine which pod behind the service gets the traffic (this is standard K8s modus operandi) $ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-appctl ofproto/trace br-int tcp,in_port=1,dl_vlan=3311,dl_src=e4:d3:f1:67:17:7c,dl_dst=00:50:56:a7:5a:bf,nw_src=10.48.31.158,nw_dst=10.10.1.194,tp_src=42219,tp_dst=443 ... bridge("br-int") ------0. in_port=1,dl_vlan=3311, priority 90 pop_vlan load:0xcef->NXM_NX_REG0[] load:0x1->NXM_NX_REG6[] write_metadata:0x300/0x300 goto_table:4 4. tcp,reg6=0x1,dl_dst=00:50:56:a7:5a:bf,nw_dst=10.10.1.194,tp_dst=443, priority 50 set_field:00:22:bd:f8:19:ff->eth_dst multipath(symmetric_l3l4+udp,1024,iter_hash,1,32,NXM_NX_REG7[]) -> NXM_NX_REG7[] is now 0 goto_table:5 5. tcp,reg6=0x1,metadata=0x200/0x200,nw_dst=10.10.1.194,tp_dst=443, priority 99 set_field:10.11.0.130->ip_dst dec_ttl >> IPv4 decrement TTL exception

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 77 NodePort + ACI CNI

• In this case the destination is the node IP, so the traffic is handled by iptables (same as before)

• iptables performs load-balancing (if needed), DNAT and SNAT

• The packet is then routed directly to the pod IP

• This means the packet is sent back to the ACI fabric (the default GW), which then delivers it via VxLAN to the target node running the pod

• OVS then takes it further and delivers the packet to the pod
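A quick way to verify the iptables part of this path on the node. A sketch only; substitute your service's actual NodePort for <nodeport>:

$ sudo iptables -t nat -nL KUBE-NODEPORTS | grep <nodeport>
$ sudo conntrack -L -p tcp --dport <nodeport>

The first command shows the kube-proxy rule that jumps into the per-service KUBE-SVC-… chain (which does the load-balancing and DNAT), the second shows the resulting NAT sessions.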

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 78 Contiv-VPP contivpp.io Contiv-VPP

100% Open Source https://github.com/contiv/vpp

K8s Container Network Interface (CNI) Plugin for Network Connectivity and Security | CNI for Production Grade Container Environments
Optimized for Performance and Scale – uses https://fd.io/technology/ Vector Packet Processing (VPP)
Supports Any Networking Underlay (ACI-CNI is better suited for an ACI fabric)
Available within Cisco Container Platform

Easy installation | User space; No kernel tax | Provides container traffic operational visibility / monitoring / debugging | World-class advisory & support

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 80 Container Networking Control: Contiv-VPP

• Contiv is a networking plugin for Kubernetes that: • Allocates IP addresses to Pods (IPAM) • Programs the underlying infrastructure it uses (Linux TCP/IP stack, OVS, VPP, …) to provide pod network connectivity • Implements K8s network policies • Implements K8s services

• Contiv is a user-space based, high-performance, high-density networking plugin for Kubernetes

• Contiv-VPP leverages FD.io/VPP as the industry’s highest performance data plane
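The K8s objects Contiv consumes for all of this are the standard ones, so a quick sanity check of what it has to render can be done with plain kubectl. A sketch, nothing Contiv-specific assumed:

$ kubectl get svc,ep,networkpolicies --all-namespaces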

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 81 Contiv-VPP Network VxLAN Overlay

Node 2

Pod 4 Pod 5 Pod 6

VPP

Node 1 BVI

Pod 1 Pod 2 Pod 3 BD Node 3

Pod 7 Pod 8 Pod 9 VPP

BVI VXLAN tunnel mesh with single VNI VPP BD BVI

BD

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 82 VPP The Universal Dataplane - FD.io Vector Packet Processor (VPP)

• VPP is a rapid packet-processing development platform for high-performance network applications • 18+ Mpps, single core • Multimillion-entry FIBs • 480 Gbps bi-dir on 24 cores

• Runs on commodity CPUs and leverages DPDK

• Creates a vector of packet indices and processes them using a directed graph of nodes – resulting in a highly performant solution.

• Runs as a Linux user-space application

• Ships as part of both embedded & server products, in volume; active development since 2002 • FD.io (The Fast Data Project)
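To see the packet-processing graph in action, the per-node vector and call counters can be displayed from the VPP CLI. A sketch, assuming the contiv-vswitch pod used later in this section:

$ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 -- /usr/bin/vppctl show runtime | head -30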

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 84 vppctl Quick Guide https://wiki.fd.io/view/VPP/Command-line_Interface_(CLI)_Guide • Utility for displaying and managing VPP config

• Like using a real switch: show int, auto-complete…

• Access locally or remotely via contiv-netctl

$ sudo vppctl
vpp# show version
vpp v19.08-17~g623a1b705 built by root on 2c5b598eddff at Fri Sep 20 08:25:22 UTC 2019

vpp# show bridge-domain 1 detail
BD-ID  Index  BSN  Age(min)  Learning  U-Forwrd  UU-Flood  Flooding  ARP-Term  arp-ufwd  BVI-Intf
1      1      0    off       off       on        drop      off       off       off       loop1

Interface      If-idx  ISN  SHG  BVI  TxFlood  VLAN-Tag-Rewrite
loop1          4       1    1    *    *        none
vxlan_tunnel0  6       1    1    -    *        none
vxlan_tunnel1  7       1    1    -    *        none

vpp# show vxlan tunnel
[0] instance 0 src 192.168.1.49 dst 192.168.1.51 vni 10 fib-idx 0 sw-if-idx 6 encap-dpo-idx 8
[1] instance 1 src 192.168.1.49 dst 192.168.1.53 vni 10 fib-idx 0 sw-if-idx 7 encap-dpo-idx 10

Useful commands:
show int
show int addr
show int tag
show hardware [brief]
show bridge [X detail]
show vxlan tunnel
show adj
show ip arp
show ip fib [table 0/1]
show l2fib verbose
show nat44 static mapping
show load-balance
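For remote access the same commands can be driven through contiv-netctl (shown in detail on a later slide). A sketch, reusing the contiv-crd pod and node name that appear there:

$ kubectl exec -ti -n kube-system contiv-crd-th95x -- ./contiv-netctl vppcli ccp2-tenant1-master7677945a00 show int addr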

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 85 Kubernetes Services implementation in Contiv-VPP

Control plane:
• Service Processor – matches K8s service metadata (K8s API) with endpoints (labels to PODs)
• Service Renderers – render K8s service data into VPP & Linux (POD) configuration

Data plane:
• VPP NAT Plugin, an implementation of NAT44
• Kubernetes leverages two modes of the NAT plugin:
  • Load-balance and forward IPv4 traffic between services and endpoints (i.e. 1-to-many NAT)
  • Dynamically SNAT node-outbound IPv4 traffic, to enable Internet access for pods with private addresses
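Both NAT44 modes can be inspected from the VPP CLI. A sketch, assuming the contiv-vswitch pod from the next slide (command names as of VPP 19.08):

$ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 -- /usr/bin/vppctl show nat44 static mapping
$ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 -- /usr/bin/vppctl show nat44 addresses

The static mappings implement the service load-balancing/DNAT; the address pool is what node-outbound traffic gets SNATed to.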

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 86 VPP ClusterIP implementation

$ kubectl get svc hello NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE hello ClusterIP 10.100.233.57 80/TCP 11h $ kubectl get ep hello NAME ENDPOINTS AGE hello 10.0.1.17:8080 11h

$ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 -- /usr/bin/vppctl show nat44 static mapping | grep 10.100.233.57 tcp local 10.0.1.17:8080 external 10.100.233.57:80 vrf 1 self-twice-nat out2in-only

$ kubectl scale deploy hello --replicas=2 deployment.extensions/hello scaled $ kubectl get ep hello NAME ENDPOINTS AGE hello 10.0.1.146:8080,10.0.1.17:8080 11h

$ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 -- /usr/bin/vppctl show nat44 static mapping | grep -A 2 10.100.233.57 tcp external 10.100.233.57:80 self-twice-nat out2in-only local 10.0.1.146:8080 vrf 1 probability 1 local 10.0.1.17:8080 vrf 1 probability 1
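To exercise the two endpoints through the ClusterIP, you can simply loop over the service from a client pod. A sketch, reusing the alpine pod that appears in the trace slides below:

$ kubectl exec -ti alpine-6487d4ff57-2whmx -- sh -c 'for i in 1 2 3 4; do wget -q -O- http://10.100.233.57; done'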

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 87 Contiv-VPP node internal data path

Pod1 Pod2 PodN PodM pod<->pod pod<->svc iptables NAT tap-1 tap-2 tap-n

lo1 LB->pod BVI tap-0 vpp1 NodePort->pod BD1 VPP (VPP NAT) Host stack

enp0s9 Gige0/8/0

NIC NIC

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 88 Challenges with VPP

• The config is quite big and seems difficult to understand at the start
• Figuring out what is connected to what takes effort
• All interfaces except the container taps are invisible to the Linux kernel – this creates discomfort
• tcpdump only works on eth0 in the container namespace
• Packet processing using a directed graph of nodes is super fast – but difficult to understand

(The slide shows long excerpts of "vpp# show nat44 static mappings" – pages of self-twice-nat out2in-only mappings for ClusterIP and NodePort services – and of "vpp# show ip fib table 1" – the podVRF with per-tap /32 routes and dpo-load-balance/dpo-drop entries.)

So, what can we do?

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 89 MORE TRACING!!!! VPP trace https://wiki.fd.io/view/VPP/Command-line_Interface_(CLI)_Guide#packet_tracer • VPP's built in packet capture/trace tool

• Good news: Prints all packet processing stages and decisions (very useful!)

• Bad news: A bit difficult to use raw and filter out interesting traffic

• Contiv-VPP comes with a bash script that makes it easy to use! $ kubectl exec -ti -n kube-system contiv-vswitch-7qqvf -- bash root@ccp2-tenant1-master7677945a00:~# ./vpptrace.sh -h Usage: ./vpptrace.sh [-i ]... [-a ] [-r] [-f / ] -i : VPP interface *type* to run the packet capture on (e.g. dpdk-input, virtio-input, etc.) - available aliases: - af-packet-input: afpacket, af-packet, veth - virtio-input: tap (version determined from the VPP runtime config), tap2, tapv2 - tapcli-rx: tap (version determined from the VPP config), tap1, tapv1 - dpdk-input: dpdk, gbe, phys* - encryption: ipsec, encrypt* - ipsec decryption: ipsec, decrypt* - memif-input: mem* - multiple interfaces can be watched at the same time - the option can be repeated with different values - default = dpdk + tap -a : IP address or hostname of the VPP to capture packets from - not supported if VPP listens on a UNIX domain socket - default = 127.0.0.1 -r : apply filter string (passed with -f) as a regexp expression - by default the filter is NOT treated as regexp -f : filter string that packet must contain (without -r) or match as regexp (with -r) to be printed - default is no filtering

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 91 VPP trace output (pod->pod, same node) $ kubectl get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES alpine-6487d4ff57-2whmx 1/1 Running 0 87m 10.0.1.18 ccp2-tenant1-worker5afd21cebf hello-68df895546-nsbtb 1/1 Running 0 101m 10.0.1.17 ccp2-tenant1-worker5afd21cebf $ kubectl exec -ti -n kube-system contiv-crd-th95x -- ./contiv-netctl pods ccp2-tenant1-worker5afd21cebf | egrep "POD-|default" POD-NAME NAMESPACE POD-IP IF-IDX IF-NAME alpine-6487d4ff57-2whmx default 10.0.1.18 13 tap4 hello-68df895546-nsbtb default 10.0.1.17 11 tap6 $ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 -- ./vpptrace.sh -i tap -f 10.0.1.18 Packet 1: 03:03:43:074507: virtio-input $ kubectl exec -ti alpine-6487d4ff57-2whmx -- curl http://10.0.1.17:8080 virtio: hw_if_index 7 next-index 4 vring 0 len 74 hdr: flags 0x01 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 34 csum_offset 16 num_buffers 1 03:03:43:074515: ethernet-input IP4: 02:fe:75:ba:2a:e9 -> 02:fe:8a:45:d5:16 $ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 03:03:43:074520: ip4-input -- /usr/bin/vppctl show hard brief tap4 TCP: 10.0.1.18 -> 10.0.1.17 Stages Name Idx Link Hardware tos 0x00, ttl 64, length 60, checksum 0xd72e tap4 7 up tap4 fragment id 0x496b, flags DONT_FRAGMENT (nodes in VPP forwarding graph) TCP: 46444 -> 8080 seq. 0xb32e684e ack 0x00000000 flags 0x02 SYN, tcp header: 40 bytes window 27640, checksum 0x0000 ... 03:03:43:074558: tap6-output tap5 ip4 offload-ip-cksum offload-tcp-cksum l2_hdr_offset_valid l3_hdr_offset_valid l4_hdr_offset_valid IP4: 02:fe:ea:12:8f:df -> 02:fe:15:ed:70:20 TCP: 10.0.1.18 -> 10.0.1.17 tos 0x00, ttl 63, length 60, checksum 0xd82e fragment id 0x496b, flags DONT_FRAGMENT TCP: 46444 -> 8080 ...

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 92 VPP trace (pod->svc, same node)

$ kubectl get svc hello NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE hello ClusterIP 10.100.233.57 80/TCP 3h4m $ kubectl get ep hello NAME ENDPOINTS AGE 01:15:23:764357: ip4-rewrite hello 10.0.1.17:8080 3h4m tx_sw_if_index 9 dpo-idx 11 : ipv4 via 10.0.1.17 tap6: mtu:1422… $ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 -- ./vpptrace.sh -i 00000000: 02fe15ed702002feea128fdf08004500003c672c40003f06ba6d00… tap -f 10.0.1.18 00000020: 0111e28a1f905b9fe15300000000a0026bf8c74c0000020405660402 Packet 1: 01:15:23:764360: tap6-output 01:15:23:764313: virtio-input tap6 ip4 offload-ip-cksum offload-tcp-cksum l2_hdr_offset_valid virtio: hw_if_index 7 next-index 4 vring 0 len 74 l3_hdr_offset_valid l4_hdr_offset_valid hdr: flags 0x01 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 34 IP4: 02:fe:ea:12:8f:df -> 02:fe:15:ed:70:20 csum_offset 16 num_buffers 1 TCP: 10.0.1.18 -> 10.0.1.17 01:15:23:764320: ethernet-input tos 0x00, ttl 63, length 60, checksum 0xba6d IP4: 02:fe:75:ba:2a:e9 -> 02:fe:8a:45:d5:16 fragment id 0x672c, flags DONT_FRAGMENT 01:15:23:764324: ip4-input TCP: 57994 -> 8080 TCP: 10.0.1.18 -> 10.100.233.57 seq. 0x5b9fe153 ack 0x00000000 tos 0x00, ttl 64, length 60, checksum 0xd2e0 flags 0x02 SYN, tcp header: 40 bytes fragment id 0x672c, flags DONT_FRAGMENT window 27640, checksum 0xc74c TCP: 57994 -> 80 seq. 0x5b9fe153 ack 0x00000000 flags 0x02 SYN, tcp header: 40 bytes window 27640, checksum 0x0000 DNAT 01:15:23:764328: nat44-ed-out2in NAT44_OUT2IN_ED_FAST_PATH: sw_if_index 7, next index 4, session -1 01:15:23:764331: nat44-ed-out2in-slowpath NAT44_OUT2IN_ED_SLOW_PATH: sw_if_index 7, next index 1, session 79 01:15:23:764354: ip4-lookup fib 1 dpo-idx 11 flow hash: 0x00000000 TCP: 10.0.1.18 -> 10.0.1.17 tos 0x00, ttl 64, length 60, checksum 0xb96d fragment id 0x672c, flags DONT_FRAGMENT TCP: 57994 -> 8080 seq. 0x5b9fe153 ack 0x00000000 flags 0x02 SYN, tcp header: 40 bytes window 27640, checksum 0xc74c

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 93 VPP trace (pod->svc, different node) $ kubectl exec -ti alpine-6487d4ff57-2whmx -- nslookup kubernetes Packet 1: 01:26:20:497664: virtio-input 01:26:20:497778: l2-input virtio: hw_if_index 7 next-index 4 vring 0 len 96 l2-input: sw_if_index 16 dst 12:2b:00:00:00:01 src 12:2b:00:00:00:02 hdr: flags 0x01 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 34 … 01:26:20:497782: l2-fwd 01:26:20:497673: ethernet-input l2-fwd: sw_if_index 16 dst 12:2b:00:00:00:01 src 12:2b:00:00:00:02 bd_index… IP4: 02:fe:75:ba:2a:e9 -> 02:fe:8a:45:d5:16 01:26:20:497786: l2-output 01:26:20:497688: ip4-input l2-output: sw_if_index 14 dst 12:2b:00:00:00:01 src 12:2b:00:00:00:02 data … UDP: 10.0.1.18 -> 10.96.0.10 01:26:20:497789: vxlan4-encap tos 0x00, ttl 64, length 82, checksum 0x68e6 VXLAN encap to vxlan_tunnel0 vni 10 fragment id 0xba39, flags DONT_FRAGMENT 01:26:20:497791: ip4-rewrite VxLAN UDP: 45664 -> 53 tx_sw_if_index 1 dpo-idx 3 : ipv4 via 192.168.1.49 GigabitEthernet13/0/0: … length 62, checksum 0x0000 00000000: 005056a7339d005056a74ec708004500008400000000fd1139b4c0a80133c0a8 01:26:20:497696: nat44-ed-out2in 00000020: 01310d5812b5007000000800000000000a00122b00000001122b0000 NAT44_OUT2IN_ED_FAST_PATH: sw_if_index 7, next index 4, session -1 01:26:20:497793: nat44-ed-in2out-output 01:26:20:497701: nat44-ed-out2in-slowpath NAT44_IN2OUT_ED_FAST_PATH: sw_if_index 16, next index 3, session -1 NAT44_OUT2IN_ED_SLOW_PATH: sw_if_index 7, next index 1, session 65 01:26:20:497796: nat44-ed-in2out-output-slowpath 01:26:20:497761: ip4-lookup NAT44_IN2OUT_ED_SLOW_PATH: sw_if_index 16, next index 0, session -1 fib 1 dpo-idx 16 flow hash: 0x00000000 01:26:20:497801: GigabitEthernet13/0/0-output UDP: 10.0.1.18 -> 10.0.0.132 GigabitEthernet13/0/0 ip4 l2_hdr_offset_valid l3_hdr_offset_valid l4_ … tos 0x00, ttl 64, length 82, checksum 0x66cc IP4: 00:50:56:a7:4e:c7 -> 00:50:56:a7:33:9d fragment id 0xba39, flags DONT_FRAGMENT UDP: 192.168.1.51 -> 192.168.1.49 UDP: 45664 -> 53 tos 0x00, ttl 253, length 132, checksum 0x39b4 length 62, checksum 0x0000 fragment id 0x0000 01:26:20:497766: ip4-rewrite UDP: 3416 -> 4789 tx_sw_if_index 16 dpo-idx 16 : ipv4 via 10.2.2.1 loop1: mtu:9000 length 112, checksum 0x0000 122b00000001122b000000020800 flow hash: 0x00000000 01:26:20:497804: GigabitEthernet13/0/0-tx 00000000: 122b00000001122b00000002080045000052ba3940003f1167cc0c00011 … GigabitEthernet13/0/0 tx queue 0 00000020: 0084b2600035003e0000728f010000010000000000000a6b75626572 buffer 0x82a5d: current data -50, length 146, buffer-pool 0, ref-count 1, … 01:26:20:497771: loop1-output ip4 l2-hdr-offset 0 l3-hdr-offset 14 l4-hdr-offset 34 loop1 ip4 offload-ip-cksum offload-udp-cksum l2_hdr_offset_valid PKT MBUF: port 65535, nb_segs 1, pkt_len 146 l3_hdr_offset_valid l4_hdr_offset_valid buf_len 2176, data_len 146, ol_flags 0x0, data_off 78, phys_addr 0x8eaa97c0 IP4: 12:2b:00:00:00:02 -> 12:2b:00:00:00:01 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 UDP: 10.0.1.18 -> 10.0.0.132 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 tos 0x00, ttl 63, length 82, checksum 0x67cc IP4: 00:50:56:a7:4e:c7 -> 00:50:56:a7:33:9d fragment id 0xba39, flags DONT_FRAGMENT UDP: 192.168.1.51 -> 192.168.1.49 UDP: 45664 -> 53 tos 0x00, ttl 253, length 132, checksum 0x39b4 length 62, checksum 0x0000 fragment id 0x0000 UDP: 3416 -> 4789 length 112, checksum 0x0000

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 94 For your Tracing NodePort+Contiv-VPP like a pro (part1) reference

Pod1 Pod2 hello PodM Pod1 Pod2 PodN PodM

tap1 tap2 tap6 tap1 tap2 tapn lo1 lo1 BVI tap0 vpp1 BVI tap0 vpp1 BD1 BD1 Host stack Host stack ens192 ens192 VPP Gig13/0/0 VPP Gig13/0/0

NIC NIC VxLAN NIC NIC

$ sudo iptables -t raw -j TRACE -p tcp --dport 30752 --tcp-flags SYN SYN -I PREROUTING 1 $ grep TRACE trace | sed 's/IN=/\; IN=/' | column -t -s $';' | sed 's/.*TRACE/TRACE/' | sed -E 's/MAC=.{42}//' | sed 's/LEN=.*DF //' | cut -c -125 TRACE: raw:PREROUTING:policy:2 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: nat:PREROUTING:rule:1 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: nat:KUBE-SERVICES:rule:39 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: nat:KUBE-NODEPORTS:rule:3 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: nat:KUBE-MARK-MASQ:rule:1 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: nat:KUBE-MARK-MASQ:return:2 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: nat:KUBE-NODEPORTS:rule:4 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: nat:KUBE-SVC-TFDSBUMN6I7R7DUY:rule:1 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: nat:KUBE-SEP-OND2O4JIN7EQGJ7I:rule:2 IN=ens192 OUT= SRC=192.168.1.1 DST=192.168.1.48 PROTO=TCP SPT=60996 DPT=30752 TRACE: filter:FORWARD:rule:1 IN=ens192 OUT=vpp1 SRC=192.168.1.1 DST=10.0.1.17 PROTO=TCP SPT=60996 DPT=8080 TRACE: filter:KUBE-FORWARD:rule:1 IN=ens192 OUT=vpp1 SRC=192.168.1.1 DST=10.0.1.17 PROTO=TCP SPT=60996 DPT=8080 TRACE: nat:POSTROUTING:rule:1 IN= OUT=vpp1 SRC=192.168.1.1 DST=10.0.1.17 PROTO=TCP SPT=60996 DPT=8080 SEQ=12 TRACE: nat:KUBE-POSTROUTING:rule:1 IN= OUT=vpp1 SRC=192.168.1.1 DST=10.0.1.17 PROTO=TCP SPT=60996 DPT=8080 SEQ=12 $ sudo iptables -t nat -nvL KUBE-POSTROUTING 1 0 0 MASQUERADE all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service traffic requiring SNAT */ mark match… $ ip add show vpp1 | grep inet inet 10.1.0.130/25 brd 10.1.0.255 scope global vpp1

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 95 For your Tracing NodePort+Contiv-VPP like a pro (part2) reference $ kubectl exec -ti -n kube-system contiv-vswitch-7qqvf -- ./vpptrace.sh -i tap -f 10.0.1.17 Packet 1: 00:02:29:745785: virtio-input Pod1 Pod2 PodN PodM virtio: hw_if_index 3 next-index 4 vring 0 len 74 hdr: flags 0x02 gso_type 0x00 hdr_len 0 gso_size 0 csum_start … 00:02:29:745798: ethernet-input tap1 tap2 tapn IP4: 4e:8b:60:8c:a2:85 -> 34:3c:00:00:00:01 lo1 00:02:29:745802: ip4-input BVI tap0 vpp1 TCP: 10.1.0.130 -> 10.0.1.17 BD1 tos 0x00, ttl 63, length 60, checksum 0xa06d Host stack ens192 fragment id 0x81bb, flags DONT_FRAGMENT VPP Gig13/0/0 TCP: 60996 -> 8080 seq. 0x491dbb1f ack 0x00000000 VxLAN NIC NIC flags 0x02 SYN, tcp header: 40 bytes window 29200, checksum 0xb68f ... 00:02:29:745845: vxlan4-encap VXLAN encap to vxlan_tunnel0 vni 10 00:02:29:745848: ip4-rewrite 00:02:29:745858: GigabitEthernet13/0/0-tx tx_sw_if_index 1 dpo-idx 2 : ipv4 via 192.168.1.51 GigabitEthernet13/0/0… GigabitEthernet13/0/0 tx queue 0 ... buffer 0x9a619: current data -50, length 124, buffer-pool 0… 00:02:29:745856: GigabitEthernet13/0/0-output l2-hdr-offset 0 l3-hdr-offset 14 loop-counter 1 GigabitEthernet13/0/0 loop-counter-valid l2_hdr_offset_valid l3_hdr … PKT MBUF: port 65535, nb_segs 1, pkt_len 124 IP4: 00:50:56:a7:33:9d -> 00:50:56:a7:4e:c7 buf_len 2176, data_len 124, ol_flags 0x0, data_off 78, … UDP: 192.168.1.49 -> 192.168.1.51 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 tos 0x00, ttl 253, length 110, checksum 0x39ca rss 0x0 fdir.hi 0x0 fdir.lo 0x0 fragment id 0x0000 IP4: 00:50:56:a7:33:9d -> 00:50:56:a7:4e:c7 UDP: 49744 -> 4789 UDP: 192.168.1.49 -> 192.168.1.51 length 90, checksum 0x0000 tos 0x00, ttl 253, length 110, checksum 0x39ca fragment id 0x0000 UDP: 49744 -> 4789 length 90, checksum 0x0000

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 96 For your Tracing NodePort+Contiv-VPP like a pro (part3) reference $ kubectl exec -ti -n kube-system contiv-vswitch-tbkk9 -- ./vpptrace.sh -i dpdk -f 10.0.1.17 Packet 1: Pod1 Pod2 hello PodM 06:29:30:240651: dpdk-input GigabitEthernet13/0/0 rx queue 0 buffer 0x6fbde: current data 0, length 124, buffer-pool 0, ref-count 1… tap1 tap2 tap6 ext-hdr-valid lo1 l4-cksum-computed l4-cksum-correct BVI tap0 vpp1 PKT MBUF: port 0, nb_segs 1, pkt_len 124 BD1 buf_len 2176, data_len 124, ol_flags 0x180, data_off 128, … Host stack ens192 packet_type 0x291 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 VPP Gig13/0/0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 Packet Offload Flags NIC VxLANNIC PKT_RX_IP_CKSUM_GOOD (0x0080) IP cksum of RX pkt. is valid PKT_RX_L4_CKSUM_GOOD (0x0100) L4 cksum of RX pkt. is valid Packet Types RTE_PTYPE_L2_ETHER (0x0001) Ethernet packet RTE_PTYPE_L3_IPV4_EXT_UNKNOWN (0x0090) IPv4 packet with or without… RTE_PTYPE_L4_UDP (0x0200) UDP packet IP4: 00:50:56:a7:33:9d -> 00:50:56:a7:4e:c7 ... UDP: 192.168.1.49 -> 192.168.1.51 06:29:30:240711: ip4-rewrite tos 0x00, ttl 253, length 110, checksum 0x39ca tx_sw_if_index 9 dpo-idx 11 : ipv4 via 10.0.1.17 tap6: mtu:1422… fragment id 0x0000 00000000: 02fe15ed702002feea128fdf08004500003c81bb40003d06a26d0c… UDP: 49744 -> 4789 00000020: 0111ee441f90491dbb1f00000000a0027210b68f0000020405b40… length 90, checksum 0x0000 06:29:30:240729: tap6-output 06:29:30:240666: ethernet-input tap6 l4-cksum-computed l4-cksum-correct l2_hdr_offset_valid … frame: flags 0x3, hw-if-index 1, sw-if-index 1 IP4: 02:fe:ea:12:8f:df -> 02:fe:15:ed:70:20 IP4: 00:50:56:a7:33:9d -> 00:50:56:a7:4e:c7 TCP: 10.1.0.130 -> 10.0.1.17 ... tos 0x00, ttl 61, length 60, checksum 0xa26d 06:29:30:240690: vxlan4-input fragment id 0x81bb, flags DONT_FRAGMENT VXLAN decap from vxlan_tunnel0 vni 10 next 1 error 0 TCP: 60996 -> 8080 ... seq. 0x491dbb1f ack 0x00000000 flags 0x02 SYN, tcp header: 40 bytes window 29200, checksum 0xb68f

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 97 All good things must come to an end… Key Takeaways

• Embrace the change and ride the dragon! • Learn to love iptables • The tale of 3 CNIs • Trace, trace, trace! • It's Easy! KEEP CALM AND PUT A LID ON THAT CONTAINER

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 99 Complete your online session survey • Please complete your session survey after each session. Your feedback is very important.

• Complete a minimum of 4 session surveys and the Overall Conference survey (starting on Thursday) to receive your Cisco Live t-shirt.

• All surveys can be taken in the Cisco Events Mobile App or by logging in to the Content Catalog on ciscolive.com/emea.

Cisco Live sessions will be available for viewing on demand after the event at ciscolive.com.

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 100 Continue your education

Demos in the Walk-In Labs Cisco Showcase

Meet the Engineer Related sessions 1:1 meetings

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 101 Thank you

Linux Kernel Foundational Blocks for Containers

chroot: change of root directory for the current running process and its children. A program that is run in chroot cannot access files outside the designated directory tree.

Namespaces: lightweight process isolation. Enable a process (or several processes) to have different views of the system than other processes.

cgroups (Control Groups): Resource Management solution providing a generic process-grouping framework. Cgroups limit, account for, and isolate the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes.
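A quick way to see namespace isolation in action – a minimal sketch using the iproute2 tooling:

$ sudo ip netns add demo
$ sudo ip netns exec demo ip link show     # only a loopback interface is visible inside the new namespace
$ sudo ip netns delete demo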

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 104 After installing Docker, a new Linux bridge called docker0 gets installed on the host.

$ ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:1a:64:21:bb:68 brd ff:ff:ff:ff:ff:ff
    inet 5.0.7.249/24 brd 5.0.7.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::21a:64ff:fe21:bb68/64 scope link
       valid_lft forever preferred_lft forever
3: docker0: mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:11:5d:f5:c5 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:11ff:fe5d:f5c5/64 scope link
       valid_lft forever preferred_lft forever
$
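The same subnet and gateway can also be read from the Docker side – a sketch:

$ docker network inspect bridge | grep -E '"Subnet"|"Gateway"'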

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 105 tcpdump can help a bit…

• Capturing packets can give insight into how packets are processed, but only if you know what to look for and how to clean up the "noise"

• Usually start by capturing inside the container and on the physical NIC

• Then proceed with "internal" points, but this is possible only on Linux kernel devices (taps, veth pairs, NICs…)

• Preferred way is:

$ [nsenter -t 1234 -n] tcpdump -ennali
    host 172.17.0.2
    port 80
    host 172.17.0.2 and port 80
    port 80 and not port 22
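A concrete version of the nsenter approach – a sketch, where container1 is a placeholder for your container name:

$ PID=$(docker inspect -f '{{.State.Pid}}' container1)
$ sudo nsenter -t $PID -n tcpdump -enn -i eth0 'port 80 and not port 22'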

…but what do you do when you need to know more?

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 106 Kube-proxy flow

$ kubectl run …   $ kubectl expose …   $ kubectl scale …   ->  new service / endpoints!

kube-proxy (watch / update) kube-apiserver | Node X: wget http://10.96.176.68 | configure

10.96.176.68 iptables schedule

busybox -> 10.2.0.171 -> 10.2.1.41 10.2.0.5 -> 10.2.1.42 ClusterIP

nginx nginx nginx 10.2.0.171 10.2.1.41 10.2.1.42
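The iptables rules that kube-proxy programs for this ClusterIP can be inspected directly on any node – a sketch using the service IP from the diagram:

$ sudo iptables -t nat -nL KUBE-SERVICES | grep 10.96.176.68

The matching KUBE-SVC-… chain then load-balances across per-endpoint KUBE-SEP-… chains, each of which DNATs to one nginx pod IP.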

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 107 For your CNI config sample in /etc/cni/net.d/ reference

{
  "cniVersion": "0.3.0",          <- Tells the plugin what version the caller is using
  "name": "mynet",
  "plugins": [
    {
      "type": "my-plugin",        <- The caller should look for a plugin with this name in /opt/cni/bin/
      "some-parameter": "foo",
      "ipam": {
        "type": "host-local",     <- First plugin will call the IPAM plugin with this name
        "subnet": "10.42.0.0/24"
      }
    }
  ]
}
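On a node, the pieces referenced here live in two well-known locations – a quick check:

$ ls /etc/cni/net.d/
$ ls /opt/cni/bin/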

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 108 For your Calico deployment modes reference

• Deploying with the K8s API datastore — 50 nodes or less • Only calico-node pods are deployed • Direct comms to K8s API

• Deploying with the K8s API datastore — more than 50 nodes* • Calico-node + Typha pods • Typha sits between the datastore (K8s API server) and many instances of Felix for scaling

• Deploying with the etcd datastore • Calico-node + Calico-etcd pods • etcd can already handle high scale *Mode used by CCP
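A quick way to tell which mode a cluster is running – a sketch; label and object names can vary slightly between Calico install manifests:

$ kubectl get deployment -n kube-system calico-typha
$ kubectl get pod -n kube-system -l k8s-app=calico-node -o wide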

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 109 For your Calico health reference • Check pod status and logs:

$ kubectl get pod -n kube-system | grep calico calico-node-4pddf 1/1 Running 0 2d18h calico-node-7znqf 1/1 Running 0 2d18h calico-node-j4h94 1/1 Running 0 2d18h calico-node-xqrtb 1/1 Running 0 2d18h calico-typha-85897bbbbf-7vjrh 1/1 Running 0 2d18h calico-typha-85897bbbbf-rnxww 1/1 Running 0 2d18h calico-typha-85897bbbbf-xdvtj 1/1 Running 0 2d18h

$ kubectl logs -n kube-system calico-node-4pddf bird: Mesh_192_168_1_20: Initializing bird: Mesh_192_168_1_20: State changed to startbird: device1: State changed to up 2019-05-07 12:37:59.799 [INFO][38] int_dataplane.go 645: Received interface update msg=&intdataplane.ifaceUpdate{Name:"tunl0", State:"up"} 2019-05-07 12:37:59.799 [INFO][38] int_dataplane.go 504: Linux interface addrs changed. addrs=set.mapSet{"10.11.3.1":set.empty{}} ifaceName="tunl0" 2019-05-07 12:37:59.799 [INFO][38] int_dataplane.go 751: Applying dataplane updates 2019-05-07 12:37:59.799 [INFO][38] hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"tunl0", Addrs:set.mapSet{"10.11.3.1":set.empty{}}}

• Check the CNI logs (part of kubelet): $ journalctl -u kubelet --no-pager | egrep -i "cni|calico" May 04 15:06:10 ccp1-320-cp-mastera6e2625acc kubelet[2852]: 2019-05-04 15:06:10.912 [INFO][5907] k8s.go 361: Populated endpoint ContainerID="5e1a3825d2388f58b67b343cac975f2baafe023a5708f93e4e390d3d1f992445" Namespace="kube-system" Pod="tiller-deploy-5f96fb458f-sznk8" WorkloadEndpoint="ccp1--320--cp--mastera6e2625acc-k8s-tiller--deploy--5f96fb458f--sznk8-eth0" endpoint=&v3.WorkloadEndpoint{TypeMeta:v1.TypeMeta{Kind:"WorkloadEndpoint", APIVersion:"projectcalico.org/v3"}, ObjectMeta:v1.ObjectMeta{Name:"ccp1- -320--cp--mastera6e2625a... May 04 15:06:10 ccp1-320-cp-mastera6e2625acc kubelet[2852]: Calico CNI using IPs: [10.11.0.6/32] May 04 15:06:10 ccp1-320-cp-mastera6e2625acc kubelet[2852]: 2019-05-04 15:06:10.936 [INFO][5907] k8s.go 388: Added Mac, interface name, and active container ID to endpoint ContainerID="5e1a3825d2388f58b67b343cac975f2baafe023a5708f93e4e390d3d1f992445" Namespace="kube- system" Pod="tiller-deploy-5f96fb458f-sznk8...

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 110 ACI CNI Solution Overview

Network Policy Kubernetes Technical Description

• Network policies of Kubernetes supported using standard upstream format but enforced through OpFlex / OVS using APIC Host Protection Profiles • Kubernetes app configurations can be moved without modification to/from ACI and non-ACI environments • Embedded fabric and virtual switch load balancing • PBR in fabric for external service load balancing • OVS used for internal service load balancing • VMM Domain for Kubernetes • Stats per namespace, deployment, service, pod • Physical to container correlation

OVS in the data-path, K8s nodes become part of the packet-transport fabric

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 111 Why Use ACI CNI?

• Fast, easy, secure and scalable networking for your Application Container Platform
• Turnkey solution for node and container connectivity
• Flexible policy: Native platform policy API and ACI policies
• Hardware-accelerated: Integrated load balancing

• Visibility: Live statistics in APIC per container and health metrics
• Enhanced Multitenancy and unified networking for containers, VMs, bare metal

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 112 Container to Non-Container Communications

• In production environments certain services like high performance databases will be running as VMs or Bare Metal Servers • This calls for the ability to easily provide and control communication between Kubernetes PODs and VMs/Bare Metal endpoints • Simply deploy a contract between your EPGs, ACI will do the rest! • This works for any VMM domain and Physical Domains, for example you can have a Container Domain using VxLAN speaking with a Microsoft SCVMM Domain using VLAN. • ACI CNI also provides flexible SNAT functionality – more friendly for firewalls

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 113 ACI CNI Architecture

host-agent container
• Receives updates from kube-apiserver (via the "kubernetes" svc) about new endpoints and services and writes them to files

EP/SVC files
• Located in /var/lib/opflex-agent-ovs/endpoints/
• Storage for host local endpoint and service information

opflex-agent container
• Reads EP/SVC files
• Downloads ACI policy (EPGs, contracts) using the OpFlex protocol from the ACI fabric via the opflex-proxy on the ACI leaf switch (over the ACI infra VLAN)
• Programs Open vSwitch via OpenFlow

(Diagram: kube-apiserver -> host-agent -> EP/SVC files -> opflex-agent -> Open vSwitch on the K8s node; tenant traffic is carried in a VLAN or VxLAN, OpFlex policy over the ACI infra VLAN.)
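The EP/SVC files are plain JSON and can be inspected directly on the K8s node (the directory is typically host-mounted; file naming can vary by version) – a sketch:

$ ls /var/lib/opflex-agent-ovs/endpoints/
$ cat /var/lib/opflex-agent-ovs/endpoints/<one-of-the-listed-files>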

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 114 OVS data-path details

• ACI CNI creates two OVS bridges: br-access and br-int

• The K8s pods are attached to br-access via a dedicated veth pair

• For each pod a dedicated pa-pi link is created using the OVS patch feature

• br-access does what is outside of the ACI model (Network Policy enforcement)

• br-int does everything else – regular ACI leaf switch functions (routing, encap, policy, LB) + NAT

• If the packet destination is outside of the host, br-int will take care to encapsulate the packet into a VLAN or VxLAN (based on the user definition) and forward it to the ACI fabric.
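Both bridges and their patch/veth/VxLAN ports can be listed from the openvswitch pod – a sketch, reusing the pod name from the earlier trace slides:

$ kubectl exec -n kube-system aci-containers-openvswitch-kv5fx -- ovs-vsctl show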

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 115 For your Contiv-VPP Architecture reference

✓ Complete container networking solution entirely from user space. - enables rapid upgrades, highly available, no kernel tax ✓ Option to replace all eth/kernel interfaces with high performance/scale memif/users space interfaces - high shared memory performance / scale ✓ VPP TCP stack for higher performance (bypass kernel host stack and use VPP TCP stack) - Istio envoy sidecar high performance / scale ✓ Fully supports legacy apps that use the kernel host stack in the same architecture - enhanced monitoring / debugging for existing apps

K8s Master

High Performance Legacy Cloud- Contiv Cloud- Legacy High Performance Apps Apps Native VNFs Netmaster Native VNFs Apps Apps

PodPod PodPod PodPod PodPod PodPod PodPod Pod Pod Pod Kubelet Kubelet Pod Pod Pod Istio Envoy App App VNF Contiv VNF App App Istio Envoy Etcd

VPP VPP Kernel Host stack CNI K8s policy & state CNI Kernel Host stack TCP CRI CRI TCP memif memif Stack tapv2 distribution tapv2 Stack

VPP Agent Agent VPP Contiv vSwitch … Contiv vSwitch IPv4/IPv6/SRv6 Network

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 116 Contiv-VPP Components K8s State Reflector: • Proxy between K8s API objects and Contiv's own ETCD

Steal The NIC: • VPP usually uses its own HW NIC, host too • On single NIC hosts • STN “steals” NIC before startup and configures it with the same IP address as the previously established host stack

Contiv vSwitch (VPP+Agent): • Core network component • Kernel interfaces to pods and host stack; user space i/f hardware NIC • Contiv-VPP agent receives updates from Contiv ETCD and renders VPP config

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 117 For your Contiv-VPP components in K8s pods reference

$ kubectl get pod -n kube-system -o wide | grep contiv contiv-crd-th95x 1/1 Running 0 5d2h 192.168.1.54 ccp2-tenant1-master7677945a00 contiv-etcd-0 1/1 Running 0 5d2h 192.168.1.54 ccp2-tenant1-master7677945a00 contiv-ksr-2hl7k 1/1 Running 0 5d2h 192.168.1.54 ccp2-tenant1-master7677945a00 contiv-vswitch-tbkk9 1/1 Running 0 5d2h 192.168.1.50 ccp2-tenant1-worker5afd21cebf contiv-vswitch-wqbrz 1/1 Running 0 5d2h 192.168.1.52 ccp2-tenant1-workerc166f1be4a

$ sudo docker ps | grep contiv-vswitch 31b3167170d3 1640a3c3c4b9 "/usr/bin/contiv-init" 5 days ago Up 5 days k8s_contiv-vswitch_contiv-vswitch-7qqvf_kube-system_353bcc8e-0f74-11ea-ad8e-005056a76232_0 0cb546792149 registry.ci.ciscolabs.com/cpsg_ccp/k8s.gcr.io/pause:3.1 "/pause" 5 days ago Up 5 days k8s_POD_contiv-vswitch-7qqvf_kube-system_353bcc8e-0f74-11ea-ad8e-005056a76232_0

$ sudo docker inspect 31b3167170d3 | grep Pid "Pid": 10130, "PidMode": "host", "PidsLimit": 0,

$ pstree 10130
contiv-init─┬─contiv-agent───12*[{contiv-agent}]
            ├─vpp_main───{vpp_main}
            └─10*[{contiv-init}]

In case of control-plane issues check Contiv-VPP pod logs!
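For example – a sketch, reusing the pod names listed above:

$ kubectl logs -n kube-system contiv-vswitch-tbkk9 --tail=50
$ kubectl logs -n kube-system contiv-etcd-0 --tail=50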

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 118 For your Contiv-VPP Detail View reference

Pod 1 Pod 2 … Pod n Pod n+1 … Pod n+m 10.2.1.2/32 10.2.1.3/32 10.2.1.n/32

tap1 tap2 tapn PodSubnetCIDR: unnumbered unnumbered unnumbered 10.2.1.0/24 loop0 10.2.1.1/24 Host Stack Default IPAM Config: VxlanCIDR: 192.168.30.1/24 • PodSubnetCIDR: 10.2.1.0/24 192.168.30.0/24 loop1 VRF1 • PodNetworkPrefixLen: 24 BVI • VPPHostSubnetCIDR: 172.30.0.0/16 Bridge Domain #1 • VPPHostNetworkPrefixLen: 24 172.30.1.1/24 172.30.1.2/24 • NodeInterconnectDHCP: False VPP tap0 vpp1 • NodeInterconnectCIDR: 192.168.16.0/24 192.168.16.1/24 • VxlanCIDR: 192.168.30.0/24 NodeInterconnectCIDR: VRF0 10.20.0.2/24 192.168.16.0/24 GigabitEthernet0/8/0 enp0s9

Host NIC NIC

VPPHostSubnetCIDR: 172.30.0.0/16

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 119 contiv-netctl $ kubectl exec -ti -n kube-system contiv-crd-th95x -- sh / # ./contiv-netctl ipam INFO[0000] Connected to Etcd (took 12.225246ms) endpoints="[127.0.0.1:32379]" loc="etcd/bytes_broker_impl.go(60)" logger=defaultLogger ID NODE-NAME VPP-IP BVI-IP POD-CIDR VPP-2-HOST-CIDR POD-CLUSTER-CIDR 1 ccp2-tenant1-master7677945a00 192.168.1.49 10.2.2.1 10.0.0.128/25 10.1.0.128/25 10.0.0.0/16 2 ccp2-tenant1-worker5afd21cebf 192.168.1.51 10.2.2.2 10.0.1.0/25 10.1.1.0/25 10.0.0.0/16 3 ccp2-tenant1-workerc166f1be4a 192.168.1.53 10.2.2.3 10.0.1.128/25 10.1.1.128/25 10.0.0.0/16 / # ./contiv-netctl nodes INFO[0000] Connected to Etcd (took 16.703382ms) endpoints="[127.0.0.1:32379]" loc="etcd/bytes_broker_impl.go(60)" logger=defaultLogger ID NODE-NAME VPP-IP HOST-IP START-TIME STATE BUILD-VERSION BUILD-DATE 1 ccp2-tenant1-master7677945a00 192.168.1.49 192.168.1.54 Mon Nov 25 11:11:21 2019 OK v3.3.1 Fri Sep 20 12:50:00 2019 2 ccp2-tenant1-worker5afd21cebf 192.168.1.51 192.168.1.50 Mon Nov 25 11:11:46 2019 OK v3.3.1 Fri Sep 20 12:50:00 2019 3 ccp2-tenant1-workerc166f1be4a 192.168.1.53 192.168.1.52 Mon Nov 25 11:11:47 2019 OK v3.3.1 Fri Sep 20 12:50:00 2019 / # ./contiv-netctl pods ccp2-tenant1-master7677945a00 INFO[0000] Connected to Etcd (took 13.737773ms) endpoints="[127.0.0.1:32379]" loc="etcd/bytes_broker_impl.go(60)" logger=defaultLogger POD-NAME NAMESPACE POD-IP IF-IDX IF-NAME contiv-crd-th95x kube-system 192.168.1.54 contiv-etcd-0 kube-system 192.168.1.54 contiv-ksr-2hl7k kube-system 192.168.1.54 contiv-vswitch-7qqvf kube-system 192.168.1.54 coredns-77799f956f-d8hr7 kube-system 10.0.0.132 5 tap1 coredns-77799f956f-l96vs kube-system 10.0.0.133 8 tap2 etcd-ccp2-tenant1-master7677945a00 kube-system 192.168.1.54 ... # ./contiv-netctl vppcli ccp2-tenant1-master7677945a00 show ver INFO[0000] Connected to Etcd (took 14.566961ms) endpoints="[127.0.0.1:32379]" loc="etcd/bytes_broker_impl.go(60)" logger=defaultLogger vppcli ccp2-tenant1-master7677945a00 show ver "vpp v19.08-17~g623a1b705 built by root on 2c5b598eddff at Fri Sep 20 08:25:22 UTC 2019

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 120 ClusterIP load balancing data path Service: 10.103.233.222:80

Client Replica Replica Replica 10.1.1.3 10.1.1.4:8080 10.1.1.5:8080 10.1.1.6:8080

src: 10.1.1.3:40000 dst: 10.103.233.222:80 src: 10.1.1.5:8080 dst: 10.1.1.3:40000

tap tap tap tap

src: 10.103.233.222:80 src: 10.1.1.3:40000 dst: 10.1.1.3:40000 dst: 10.1.1.5:8080

NAT plugin NAT plugin REQ: LB+DNAT REQ: FW RESP: SNAT RESP: FW

src: 10.1.1.3:40000 src: 10.1.1.5:8080 dst: 10.1.1.5:8080 dst: 10.1.1.3:40000 GbE VPP GbE VPP

network

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 121 NodePort/LB on the same node Service: 10.103.233.222:80 Host 172.30.1.2 Replica Replica Replica src: 172.30.1.2:40000 dst: 10.103.233.222:80 src: 10.103.233.222:80 10.1.1.4:8080 10.1.1.5:8080 10.1.1.6:8080 dst: 172.30.1.2:40000

iptables (kube-proxy) REQ: LB+DNAT RESP: SNAT src: 10.1.1.4:8080 dst: 172.30.1.2:40000 src: 172.30.1.2:40000 tap dst: 10.1.1.4:8080

tap tap tap tap

src: 10.1.1.4:8080 dst: 172.30.1.2:40000

NAT plugin REQ: FW NAT plugin RESP: FW src: 172.30.1.2:40000 dst: 10.1.1.4:8080

GbE VPP GbE VPP

network

BRKCLD-3181 © 2020 Cisco and/or its affiliates. All rights reserved. Cisco Public 122