June 15, 2020 Informatica — Universiteit van Amsterdam
Supervisor(s): MSc Sara Shakeri, dr. Paola Grosso

Abstract

Containers are very popular these days and there are several software tools by which a container can be created. To connect containers, the most popular method is an overlay network. A container overlay network is easy to deploy; the drawback is that an overlay network adds an extra layer of encapsulation to each packet. We try to find an alternative to the overlay solution by creating connectivity between Docker containers using a p4 switch. A TCP connection is created between local containers. Currently, no TCP connection is possible between two containers on different hosts. We narrowed down what the problem could be: our findings show that it occurs at L4. L3 and L4 (TCP) connections between Docker containers on the same host are possible. For containers on different hosts (after NAT is applied), L3 connectivity is possible but the connection at L4 is refused.
Contents
1 Introduction
  1.1 Introduction
2 Theoretical background
  2.1 Process isolation
    2.1.1 Namespaces
  2.2 Containers
    2.2.1 Linux container
    2.2.2 Docker containers
  2.3 p4 switches
    2.3.1 Definition of a simple p4 program
    2.3.2 Deployment of a p4 program
  2.4 Overlay network
3 Related work
  3.1 Connectivity of docker containers
  3.2 Usage of p4 in SDN
4 Method
  4.1 Establishing connectivity between containers using a p4 switch
    4.1.1 Connecting a Docker container to a p4 switch
  4.2 TCP connection between containers using a p4 switch
    4.2.1 Network topologies
    4.2.2 TCP connection between containers in a single host
    4.2.3 TCP connection between containers in multiple hosts
5 Discussion
  5.1 Multiple containers on the same host
  5.2 Single container, multiple hosts
  5.3 Firewall settings
1 Introduction

Containers are very popular these days and there are several software tools, such as Docker, by which a container can be created. A container is another form of virtualization. Unlike virtual machines, which run on separated (virtualized) hardware with their own operating system, containers are virtualized at OS level and share the host's kernel. Therefore, compared with Virtual Machines (VMs), containers utilize far fewer resources [8]. Another advantage of containers is that they are quick to set up and easy to deploy [8]. Containers can also run on any distribution, independent of the underlying hardware. To connect containers, the most popular method is an overlay network. A container overlay network is easy to deploy. The drawback is that an overlay network adds an extra layer of encapsulation to each packet, which decreases performance. Since more and more data is sent through networks nowadays, it is important to minimize overhead where possible. The challenge is to create a network of containers with the benefits of containers but without, or at least with reduced, overhead. We will try to accomplish this by implementing a p4 switch. A p4 switch is a switch whose data plane we can program. With a p4 switch, we have more control over incoming packets and how to forward them; p4 allows building more intelligent control mechanisms for packets. To implement the p4 switch we have to find a way to change the Docker default settings, making a container more flexible and giving us more control over it. In doing so, we have to be careful not to lose the advantages that Docker provides. To answer the research question
• How can we minimize the overhead in Docker container networks by using modern technologies such as p4 switches?
First, we will focus on establishing connectivity between containers. In Chapter 2 (background) and Chapter 3 (related work), we will show how a container can be connected to a p4 switch. In Chapter 4 (method), we will examine several implementations, with two network settings for a Docker container, to set up a TCP connection between containers. The method chapter focuses on the question: – How can we create connectivity between containers by utilizing a p4 switch?
In Chapter 5 (Discussion) we will discuss and evaluate the results, after which we will draw some conclusions and propose topics for future work in Chapter 6 (Conclusion).
2 Theoretical background

2.1 Process isolation

Around 1979 the chroot command was introduced in Unix systems. The chroot command on Unix operating systems is an operation that changes the apparent root directory for the current running process and its children. After changing root to another directory, you cannot access files and commands outside that directory by absolute path. With the chroot command we could create the impression of an isolated process. Chroot simply modifies pathname lookups for a process and its children, prepending the new root path to any name starting with /. The current directory is not modified, and relative paths can still reach any location outside of the new root. This is seen as the start of process isolation, which has led to the containers of today.
2.1.1 Namespaces

Namespaces are used to isolate different sets of global resources and restrict an inner process's access to its sandbox. There are seven types of namespaces [9], of which three are of special importance concerning a container:

• Network namespace (netns)
• Control group namespace (cgroup)
• Process ID namespace (pid)
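The namespaces of the current process can be inspected without any tooling; a quick check that can be run on any Linux system:

```shell
# Every namespace a process belongs to is exposed as a symlink in /proc/<pid>/ns.
# Two processes share a namespace exactly when these links point to the same inode.
for ns in net pid cgroup; do
  printf '%-6s -> %s\n' "$ns" "$(readlink /proc/self/ns/$ns)"
done
```

The printed inode numbers are what the kernel compares when deciding whether two processes see the same network stack, process tree, or cgroup hierarchy.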
Network namespace

Network namespaces virtualize the network stack. On creation, a network namespace contains only a loopback interface. Each network interface (physical or virtual) is present in exactly one namespace and can be moved between namespaces. A network namespace isolates network devices and ports, and has its own private set of ip addresses, routing table, ARP table, and other network-related resources. Destroying a network namespace destroys any virtual interfaces within it and moves any physical interfaces within it back to the initial network namespace.
Initially, all processes share the default network namespace of the init process. By convention, when a named network namespace is created, an object appears at /var/run/netns/name that can be opened, and the namespaces can be listed with the command ip netns. Such a network namespace does not belong to any process yet. By attaching it to the network namespace of a process we isolate the network resources of that process.
Cgroups

A control group (cgroup) is a collection of processes that are bound by the same criteria and associated with a set of parameters or limits. These groups can be hierarchical, meaning that each group inherits limits from its parent group. The kernel provides access to multiple controllers (also called subsystems) through the cgroup interface. A cgroup namespace virtualizes the contents of the /proc/self/cgroup file. Processes inside a cgroup namespace are only able to view paths relative to their namespace root.
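A quick way to look at this file (the output differs between cgroup v1 and cgroup v2 systems):

```shell
# Each line has the form hierarchy-id:controller-list:path. Inside a cgroup
# namespace the path is shown relative to the namespace root (often just "/").
cat /proc/self/cgroup
```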
PID namespace

PID namespaces are nested, meaning that when a new process is created it will have a PID for each namespace from its current namespace up to the initial PID namespace. Hence the initial PID namespace can see all processes, though under different PIDs than those seen in other namespaces.
2.2 Containers
2.2.1 Linux container

Linux Containers (LXC) is a virtualization method for running multiple isolated Linux systems (containers) on a host using a single (Linux) kernel. The Linux kernel provides the cgroups functionality, which allows limitation and prioritization of resources (CPU, memory, block I/O, network, etc.) without the need to start any virtual machines, and namespace isolation functionality, which allows complete isolation of an application's view of the operating environment, including process trees, networking, user IDs and mounted file systems. LXC combines the kernel's cgroups and support for isolated namespaces to provide an isolated environment for applications.

A container is a process that is isolated by attaching different customized namespaces to the namespaces of that process. Each process has its own idea of its root. By default, the root of a process is set to the system root. We can change the root of a process with the chroot command. When we create a container, we create it from the context of the process from which the command is executed. This means that by default the root of the container is set to the root of the process from which the container was created.

Today's containers make use of two features of the kernel: cgroups and namespaces. Before cgroups, every process's resources were managed individually: applications with more running processes used more system resources. With cgroups we can create a subgroup and assign a certain amount of system resources to that subgroup. All processes belonging to that subgroup can only use the resources that were assigned to it. If we list the cgroups (command systemd-cgls cpu) we see that for each container a subgroup is created. To manage the resources manually we can go to the directory /sys/fs/cgroup/. In this directory, we find the three main resources: cpu, memory and block I/O.
We can enter one of these directories and find user.slice, which manages the resources for user tasks, and system.slice, which manages the resources for system services. If Docker is installed we will also find a docker directory that manages the resources of Docker containers.
2.2.2 Docker containers

Docker containers look a lot like network namespaces. But when we install Docker, a new cgroup is also created: all containers created by Docker will belong to the docker subgroup/cgroup. Docker sets up some default configuration files, but we can find the id of a container, go to /sys/fs/cgroup/cpu/containerID, and manage the cpu resources for that container manually. We can do the same for the other system resources. Besides creating a cgroup, Docker also creates a network namespace each time we create a new container. Usually, if a new network namespace is created, its name appears when we list the network namespaces with the command ip netns. The network namespace of a Docker container, however, will not appear in this list. This is because Docker removes the network namespaces of its containers from the /var/run/netns/ folder and manages them through the /proc/<container pid>/ns/net file.
Figure 2.1: Pipeline of a p4 switch
By default, all containers have the PID namespace enabled. The PID namespace provides separation of processes: it removes the view of the system's processes and allows process ids, including pid 1, to be reused. In certain cases you want your container to share the host's process namespace, basically allowing processes within the container to see all of the processes on the system. For example, you could build a container with debugging tools like strace or gdb, and use these tools when debugging processes within the container [5].
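In Docker, sharing the host's PID namespace is done with the --pid flag; a minimal sketch (the image is illustrative):

```
# Run a throwaway container that shares the host's PID namespace; listing /proc
# inside it shows a numeric directory for every process on the host.
sudo docker run --rm --pid=host ubuntu:latest ls /proc
```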
2.3 p4 switches
In this section, we will explain the basics of a p4 program. A p4 switch is a switch whose data plane we can program with the domain-specific language p4. The key elements of a p4 switch/program are: the parser/deparser, ingress/egress control, match tables, actions and the CLI. These are software elements; for an element to exist it must be defined in code. P4 is the language that is used to define these elements. The elements are regular code constructs that can be interpreted by the p4 compiler. A programmer is bound to the set of elements that can be interpreted by the p4c-bmv2 compiler, just as in any other programming language. One main difference with other programming languages is that p4 has no repetition construct. Also, the if statement cannot be used everywhere. P4 does have the select construct, which can be thought of as the equivalent of an if statement.
A p4 switch can be considered as a pipeline (as shown in figure 2.1). The pipeline has three stages in which a packet can be controlled by the switch: the parser, ingress and egress. The parser defines which layers of a packet must be parsed. Notice that the switch can only access the parameters of the headers/layers that have been parsed. Traditionally, an L2 device would only look at L2 of a packet; with p4 all parsed layers can be accessed. When the parser is finished the packet is sent to ingress. The ingress construct specifies which match tables will be applied to the packets. A table definition contains the characteristics of a packet and the action(s) to apply to it. When the table sees a packet with this set of characteristics, the action(s) are applied to this packet.
The action construct specifies what has to happen to a packet once it has met the criteria of a table. Once the switch has applied the action(s), the packet is sent to egress. At egress, tables and their sets of actions can be applied to packets as well. Once a packet is finished at egress, the deparser reassembles the packet and sends it out through a specified port. In p4, new header types can be specified, as well as metadata objects. Metadata can be used to store information about a packet for as long as the packet is in the pipeline. The standard metadata object is defined by default and can be accessed by the switch without parsing the packet; it contains information such as the port at which the packet arrived. To forward a packet, an action must be specified which sets the egress parameter of the standard metadata object. Not setting the egress port will lead to packet loss. The whole standard metadata object can be found in Appendix A.1. The values on which a table has to match packets can be set through rules added via the CLI. The format of a rule is:

table_add <table name> <action name> <match value(s)> => <action parameter(s)>
2.3.1 Definition of a simple p4 program

A p4 program starts with a definition of the headers we want to parse. After parsing we have access to each header and its parameters.
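Although the header definitions themselves are not listed here, in P4_14 such a definition might look as follows (a sketch; the field names match the ether_hdr used in the parser below):

```
header_type ethernet_t {
    fields {
        dst_addr  : 48;
        src_addr  : 48;
        ethertype : 16;
    }
}
header ethernet_t ether_hdr;
```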
The packet will enter the pipeline in the parser start block. We have two options: we extract the ether_hdr, or we send the packet to the ingress control construct. If we don't specify anything, no op (no operation) is applied, which results in the packet not being forwarded. In this example, we check the type of the next header and parse it accordingly. After we have parsed a header we can access its parameters in the subsequent code.
1  // Here we parse the packet headers. We start with extracting the first header.
2  // Since we have defined the ether_hdr we can use its parameters to perform a
3  // select, in this case.
4
5  parser start {
6      // after line 7 we can access all the parameters of the defined ether_hdr.
7      extract(ether_hdr);
8      return select(ether_hdr.ethertype) {
9          0x0800 : ipv4;
10         0x86dd : ipv6;
11     }
12 }
13
14 parser ipv4 {
15     // after line 17 we can access all the parameters of the defined ipv4_hdr.
16     // We can use these parameters in the ingress, table, action and egress constructs.
17     extract(ipv4_hdr);
18     return ingress;
19 }
20
21 parser ipv6 {
22     extract(ipv6_hdr);
23     return ingress;
24 }

Listing 2.2: Code block parser. Represented by the Parser stage in the pipeline shown in figure 2.1
After a packet has been parsed, it is sent to the ingress control block. In the control ingress block, we define which table(s) we want to apply to a packet, and in which order. If a packet does not match (miss) a table, the next table (if there is one) is applied. If none of the tables results in a hit, the packet is dropped by default.
1 control ingress {
2     apply(forwarding_table);
3     apply(forwarding_router_table);
4 }

Listing 2.3: Ingress control block
In the table construct, we have to state which data we want to match on and how we want to match this data. In the example of Listing 2.4, we match on the destination ip address and it has to be an exact match.
1 table forwarding_table {
2     reads {
3         ipv4_hdr.dst : exact;
4     }
5     actions {
6         set_egress;
7         drop_packet;
8     }
9 }

Listing 2.4: Match-action table. In the pipeline a packet is matched against this table and (in case of a hit) sent to the buffer (see figure 2.1)
Other types of matches are lpm (longest prefix match) and ternary. The type of matching determines how we have to pass the argument to the CLI: in case of an exact match we have to pass an exact ip address, and in case of lpm/ternary we pass an ip address and its mask (ip address/mask). The actions defined in the actions construct of the table can be triggered by a match. Only the stated actions can be applied.
In this example, we have two actions. This means that (in case of a hit) we can only trigger set_egress or drop_packet.
1 action set_egress(port) {
2     // set the egress port through which the packet has to leave the switch.
3     modify_field(standard_metadata.egress_spec, port);
4 }
5
6 action drop_packet() {
7     // drop() is a built-in function which drops the packet.
8     drop();
9 }

Listing 2.5: Definition of an action
Through the CLI we pass the arguments for the match tables and actions. E.g., for a table forwarding_table that matches on the source ip with action set_egress, we would set the arguments in the CLI like:
1 table_add forwarding_table set_egress ip_address => egress_port
2 # if we want to forward all traffic from ip address 192.168.174.106 to port 3
3 table_add forwarding_table set_egress 192.168.174.106 => 3

Listing 2.6: Adding a rule through the CLI
In the above example, 3 will be the argument (port) for the set_egress action.
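For the lpm and ternary match types, the rules carry a mask. Assuming the bmv2 runtime CLI (simple_switch_CLI) syntax, such rules might look as follows (a sketch, reusing the table from Listing 2.6):

```
# lpm: pass the address together with its prefix length
table_add forwarding_table set_egress 192.168.174.0/24 => 3
# ternary: pass value&&&mask; a ternary rule also takes a priority
# after the action parameters
table_add forwarding_table set_egress 192.168.174.0&&&255.255.255.0 => 3 10
```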
2.3.2 Deployment of a p4 program

The process of deploying a new p4 program is shown in figure 2.2. First, a p4 file must be defined. Then, the file has to be compiled, which results in a JSON file that can be run by the switch. Depending on the topology, we have to move the file to the VM(s) where it must be deployed. CLI rules have to be written and the interfaces that have to be bound to the switch have to be found (the code showing how to find a Docker veth is in Appendix A.2), after which the switch can be set up. The CLI rules (as discussed in the previous section) are implementation dependent: the arguments passed through the rules depend on the L2, L3 and L4 addresses used by the containers. If an address changes, all CLI rules that rely on that address have to be changed; the invalid rules must be deleted manually, after which the new rules can be (manually) added through the CLI. Tshark/wireshark queries must be written to analyze whether the implementation behaves as we expect. Once all queries are written, the switch can be set up, after which the rules can be added through the CLI. The same steps have to be repeated for all hosts in the network. The experiments can start after all hosts are up. When the implementation has to change, the deployment process starts again from step one. This deployment graph does not consider the container configuration that is needed to bind a container to the switch.
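As a sketch, assuming the tool names from the bmv2 repository (p4c-bmv2, simple_switch, simple_switch_CLI) and illustrative interface and file names, the compile-and-run steps might look like:

```
# 1. compile the p4 program to the JSON format the behavioral model executes
p4c-bmv2 --json switch.json switch.p4
# 2. start the switch, binding host interfaces to numbered switch ports
sudo simple_switch -i 0@veth_sw0 -i 1@veth_sw1 switch.json &
# 3. once the switch is up, load the rules through the runtime CLI
simple_switch_CLI < rules.txt
```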
2.4 Overlay network

An overlay network runs on top of another network to build customized virtual links between nodes. Common forms of overlay network include IPIP, virtual extensible LAN (VXLAN), virtual private network (VPN), etc. There exist many overlay networks for Docker containers. Although they differ in implementation, the key idea is similar. Containers save the mapping between their private IP addresses and their host IP in a key-value (KV) store, which is accessible from all hosts. Containers use private IP addresses in a virtual subnet to communicate with each other. The overlay inserts an additional layer in the host network stack. When a packet is sent by a container, the overlay layer looks up the destination host IP address in the KV store using the private IP address of the destination container in the original packet. It then creates a new packet with the destination host IP address and uses the original packet sent by the container as the new packet's payload. This process is called packet encapsulation. Once the encapsulated packet arrives at the destination host, the host network stack decapsulates the wrapped packet to recover the original packet and delivers it to the destination container using the container's private IP address.
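The encapsulation overhead can be made concrete. Assuming the common VXLAN framing (outer Ethernet, IPv4 and UDP headers plus the VXLAN header), a small sketch of the per-packet cost:

```shell
# Per-packet bytes added by VXLAN encapsulation: a second Ethernet, IPv4 and
# UDP header plus the VXLAN header itself.
outer_eth=14; outer_ip4=20; outer_udp=8; vxlan_hdr=8
overhead=$(( outer_eth + outer_ip4 + outer_udp + vxlan_hdr ))
echo "VXLAN overhead: ${overhead} bytes per packet"

# On a standard 1500-byte MTU the inner (encapsulated) frame must shrink:
mtu=1500
inner_budget=$(( mtu - outer_ip4 - outer_udp - vxlan_hdr ))
echo "room left for the inner frame: ${inner_budget} bytes"
```

This fixed cost is paid on every packet, which is the overhead this thesis tries to avoid.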
3 Related work
3.1 Connectivity of docker containers
In the research of [12] we see that we can set the network driver of a container. The driver determines the visibility and connectivity of a container. In the Docker documentation, three options are listed: host, bridge and none. The host option means that a container shares the host's networking namespace and does not get its own IP address allocated [5]. A container created in bridge mode will use a bridge network. The bridge network uses a software bridge that allows containers connected to the same bridge network to communicate, while providing isolation from containers which are not connected to that bridge network [5]. Containers created in none mode have no access to the external network nor to other containers; this mode does not configure any IP for the container, which then only has the loopback address.
Another way to configure the connectivity of containers is to expose ports and add rules to the firewall. From [5] we read that Docker adds three chains to the firewall. The documentation recommends altering only the DOCKER-USER chain in case there is a need to filter, change or redirect packets. To expose ports, a container can be started with the -p flag. This gives the user the opportunity to specify the port of the container which must be exposed to the host. E.g. docker run -p 127.0.0.1:80:8080/tcp ubuntu bash will expose port 8080 of the container on port 80 of the host address 127.0.0.1 and accepts TCP traffic [5]. Ports can also be exposed by changing the Dockerfile. Docker also provides an option to manually add connectivity by using the ip netns command. Ip netns finds the container by using namespace pseudo-files. The namespaces of a container are materialized under /proc/<pid>/ns; e.g. the network namespace of a container can be found at /proc/<pid>/ns/net. When we run ip netns exec containerID, it expects /var/run/netns/containerID to be one of those pseudo-files [5]. To be able to use the ip netns command we have to find the pid of the container. We can find the pid with sudo docker inspect -f '{{.State.Pid}}' containerID. Then we have to link the pseudo-file found at /proc/<pid>/ns/net to /var/run/netns/containerID [5]. From then on we can treat the network namespace of the container containerID like a Linux network namespace created with the ip netns command. The Docker overlay network causes extra overhead because the packets get an extra layer of encapsulation in the form of a VXLAN header. A script was produced to avoid repeating all the above steps by hand whenever a new container needs to be created. The script can be found in Appendix A.3. The changes to the containers are not persistent: in case the host is restarted the containers still exist but have to be configured again.
The configuration, after the host was restarted, is done by the script in Appendix A.3
3.2 Usage of p4 in SDN
P4 is a relatively new language, and not a lot of research has been done in this area in general. Especially the performance of p4 has not been widely examined by the scientific community. Currently, p4 is mostly used to replace or improve SDN functionalities implemented through predecessors of p4, such as OpenFlow/Open vSwitch, or to create functionality that was not possible with earlier techniques [3][11][6]. P4 was used to program the data plane of an SDN switch to achieve multiple routing configurations: in case a node in the network goes down, the switches in the network are updated and another routing configuration is deployed by the switches themselves [6]. The switches were incorporated in a mininet network. Although the paper does not state which type of switch (i.e. BMV2 or Tofino™ switch) was used, this does show that p4 can be used to program an SDN switch and forward internet traffic. On the Github repository of the p4-bmv2 development team [2] a bmv2 switch was implemented in a mininet network. In both studies [2][6] NAT was applied only at L2. Nodes in mininet are processes with their own network namespace, isolating the network stack of the process to create the illusion of a separate node. This research uses the same network namespaces to isolate the network stack of a container. P4 gives the user a powerful tool to control packets, but it should be used with care: the number of field writes and the number of tables in a p4 program increase the latency [4]. The number of tables can increase latency significantly, especially when the p4 program uses the bmv2 model [4]. While designing our program we should be aware of the negative impact that header alterations, field writes and the number of tables have on the latency of the network. The performance of an SDN switch can also be improved by increasing the TCP window [1]. In this study, the private container addresses will be translated into the public addresses of the container's host; this involves NAT at L4. Therefore, to find an alternative for the Docker overlay solution it is important to examine whether a connection can be made between Docker containers when NAT at L4 is applied. None of the researches found combine a p4 switch with Docker containers, nor do they apply NAT at L4.
4 Method
4.1 Establishing connectivity between containers using a p4 switch
In our experiments, we used two virtual machines functioning as container hosts. Each virtual machine runs Ubuntu 18.04 with Linux kernel 4.15.0, and for running containers we used Docker Community Edition 18.09. We used the p4 behavioral model (bmv2) for running p4 switches. P4 programs were compiled with the p4c-bmv2 compiler. In section 4.2 we will present the files that were used in the experiments. We run a p4 switch on each host, connect the containers of each host to the p4 switch on the same host, and then check the connectivity between them.
4.1.1 Connecting a Docker container to a p4 switch

By default, a Docker container is connected to the docker0 bridge. To be able to connect a container to the switch instead of docker0, the container should be created with the --network flag set to none, like:
1 sudo docker run --name container_name --network=none ubuntu:latest /bin/sleep 1000000 &

Listing 4.1: Create a container without a network driver
The above command tells Docker to create a container but not to connect it to the bridge. In the next step, we create a veth pair and set one end in the network namespace of the container. The other end of the veth pair will be connected to the p4 switch when we bring the switch up. To add a veth to the container's network namespace, the container object has to be visible to the ip netns command. Therefore, we have to add the network namespace of the container to the host's list of network namespaces, making the container visible to the ip netns command. This can be done by executing the commands of Listing 4.2:
1 pid=$(sudo docker inspect -f '{{.State.Pid}}' container_name)
2 sudo mkdir -p /var/run/netns/
3 sudo ln -sfT /proc/${pid}/ns/net /var/run/netns/container_name

Listing 4.2: Make a container visible in the ip netns directory.
From this point, veths can be moved into the network namespace of the container. Any ip address can be assigned to the veth end that is placed in the container's network namespace. The other side of the veth pair can be bound to the switch with the flag
1 -i <port>@<interface>

Listing 4.3: Bind an interface to a port of the switch
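The veth steps described above can be sketched as follows (interface names and the 10.0.0.0/24 address are illustrative; the container must already be visible to ip netns as in Listing 4.2):

```
# create a veth pair; veth_c will go to the container, veth_sw stays on the host
sudo ip link add veth_sw type veth peer name veth_c
# move one end into the container's network namespace
sudo ip link set veth_c netns container_name
# give the container end an address and bring both ends up
sudo ip netns exec container_name ip addr add 10.0.0.2/24 dev veth_c
sudo ip netns exec container_name ip link set veth_c up
sudo ip link set veth_sw up
```

The host-side end veth_sw is then the interface passed to the switch with the -i flag.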
Now, it’s possible to connect a container to the switch. The next step is to set up a network and to establish L3 and L4 connectivity. This will be discussed in the next sections.
4.2 TCP connection between containers using a p4 switch
The two main types of internet traffic are traffic with an established connection before sending data (TCP) and traffic without an established connection before sending data (UDP). We will first focus on establishing a TCP connection, because most internet traffic is TCP. Since the main focus of this research is to establish a TCP connection, UDP will not be discussed further.
Three different p4 files were used during this research to establish a TCP connection between containers. From now on we will refer to these files as implementations. Below we will briefly discuss their use and their characteristics, supported by code snippets. The full implementations can be found in the Appendix A.3
• switch1.p4 makes a forwarding decision based on the ingress port. This implementation does not:
  – extract the headers
  – apply NAT
  – update the checksums

switch1.p4 contains one table, which matches on the ingress port. Based on the results of this implementation the problem can be narrowed down: no connectivity at L3 nor L4 implies that there might be a problem with the switch itself or that the problem is related to the container; connectivity at only L3 implies that the problem is L4 related. The switch1.p4 implementation relies on a network with two containers on the same host, as shown in section 4.2.1.

• switch2.p4 parses each packet up to L4. NAT is applied at L2, L3 and L4, and the checksums at L3 and L4 are updated. To be able to make a forwarding decision we need four tables: local_v4_forwarding_table, map_ip_to_port, check_dst_port and check_src_port.
1 2 control ingress { 3 if (valid(ipv4_hdr)) { 4 apply(local_v4_forwarding_table); 5 } 6 if (valid(TCP_hdr)) { 7 apply(map_ip_to_port); 8 apply(check_dst_port); 9 apply(check_src_port); 10 } 11 } Listing 4.4: Ingress
The table local_v4_forwarding_table only forwards packets to containers on the same host. If local_v4_forwarding_table results in a hit, no NAT has to be applied and the packet can be forwarded to the given egress port. The map_ip_to_port table matches/forwards packets that are destined for the containers on the remote host. The remaining two tables match/forward incoming packets from the remote host. When a packet matches one of the last three tables, NAT is applied to the packet. After a hit on one of the last three tables shown in Listing 4.4, the action nat (as shown in Listing 4.5) will translate the current addresses into the addresses given through the CLI (see Listing 4.10):
When the L3 or L4 addresses of a packet change, the checksum(s) of that packet become invalid. When the L3 checksum is invalid, the packet will be discarded by the underlay network. When the L4 checksum is invalid, the packet will be considered invalid by the L4 device receiving it, and the packet is lost. To avoid packet loss, the checksum of each packet will be updated. The checksums of a packet are updated in the deparser [10]. This is done by the switch itself; we only have to call update, as is shown in Listings 4.6 and 4.7.
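The checksum arithmetic itself can be illustrated. The sketch below builds a hypothetical IPv4 header, rewrites its source address as NAT would, and shows that the incremental update of RFC 1624 agrees with recomputing the checksum from scratch; all addresses are illustrative:

```shell
# 16-bit words of a hypothetical IPv4 header, checksum field (index 5) zeroed.
# src = 192.168.174.106 (words 6-7), dst = 192.168.174.1 (words 8-9).
words=(0x4500 0x0054 0x1c46 0x4000 0x4006 0x0000 0xc0a8 0xae6a 0xc0a8 0xae01)

cksum() {  # one's-complement sum over all arguments, then invert
  local sum=0 w
  for w in "$@"; do
    sum=$(( sum + w ))
    sum=$(( (sum & 0xffff) + (sum >> 16) ))  # fold the carry back in
  done
  printf '0x%04x' $(( ~sum & 0xffff ))
}

upd() {  # incremental update for one rewritten word (RFC 1624, eqn. 3):
         # HC' = ~( ~HC + ~m + m' )
  local hc=$1 m=$2 mp=$3
  local s=$(( (~hc & 0xffff) + (~m & 0xffff) + mp ))
  s=$(( (s & 0xffff) + (s >> 16) ))
  s=$(( (s & 0xffff) + (s >> 16) ))
  printf '0x%04x' $(( ~s & 0xffff ))
}

old=$(cksum "${words[@]}")

# NAT rewrites the source address to 10.0.0.2 (0x0a00 0x0002); update the
# checksum incrementally, word by word, as the deparser effectively does.
new_inc=$(upd "$old" "${words[6]}" 0x0a00)
new_inc=$(upd "$new_inc" "${words[7]}" 0x0002)

# recompute from scratch for comparison
words[6]=0x0a00; words[7]=0x0002
new_full=$(cksum "${words[@]}")

echo "old=$old incremental=$new_inc full=$new_full"
```

The incremental form is what makes in-pipeline checksum updates cheap: only the rewritten words are touched, not the whole header.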
With the switch2.p4 implementation we can use the destination port as an identifier for a connection between two containers on different hosts. We need the last two tables because, in general, we don't know which host will act as the server and which as the client. Thus, if we set the destination port to 5201, the server has to check the destination port; when the packet is returned by the server, the client will find 5201 as the source port of the packet.
• switch3.p4 is a simplified version of switch2.p4. The if statements are removed, as well as the local_v4_forwarding_table. The map_ip_to_port table forwards all traffic to eth0 of the VM. The NAT process is the same as in switch2.p4, as are the computation and update of the checksums. We use this implementation to check whether a connection at the L3 level is possible between containers on two different hosts. To avoid too many alterations (which can lead to mistakes), all table and action names are kept the same. This implementation is suited for the two hosts, single container topology.
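The role of the check_dst_port and check_src_port tables can be modelled with a small Python sketch. The function name and return labels are our own, purely illustrative:

```python
# Model of the two port tables: the iperf3 default port 5201 acts as the
# connection identifier. Traffic towards the server matches on the
# destination port; the returning traffic carries 5201 as source port.
IPERF3_PORT = 5201

def classify(src_port: int, dst_port: int) -> str:
    if dst_port == IPERF3_PORT:
        return "to-server"      # check_dst_port hit -> apply nat
    if src_port == IPERF3_PORT:
        return "from-server"    # check_src_port hit -> apply nat
    return "no-match"           # no table hit, no NAT applied
```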
We separate TCP traffic into two categories. The first category is TCP traffic within the same host. The second category is TCP traffic destined for remote containers. We differentiate these two types of TCP traffic because the second category involves NAT, as shown in Listing 4.5, whereas traffic belonging to the first category is only forwarded by the switch.
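The split between the two categories amounts to a subnet check, sketched here in Python. The 172.15.0.0/24 subnet is taken from the container setup script in Appendix A.2; for any other deployment it is an assumption:

```python
import ipaddress

LOCAL_SUBNET = ipaddress.ip_network("172.15.0.0/24")  # subnet from Appendix A.2

def handle(dst_ip: str) -> str:
    """Category 1: same-host traffic is only forwarded.
    Category 2: traffic for a remote container also needs NAT."""
    if ipaddress.ip_address(dst_ip) in LOCAL_SUBNET:
        return "forward"
    return "nat-then-forward"
```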
4.2.1 Network topologies
This subsection describes the topologies used during the experiments. Different topologies were used to test for connectivity in a local environment (multiple containers in one host) and a remote environment (a single container on each of two hosts). The different topologies are used to examine whether there is a difference in connectivity between the containers. If there is a difference, the different implementations (as discussed in section 4.2) are used to analyze what could be the cause of the problem.
Topology 1: is the simplest. There are two containers connected to one switch. The switch is not connected to the interface of the VM (in this case, the interface of the host). This is done for simplicity and to minimize mistakes, since we have written and applied the rules manually. In this topology we only have to deal with local traffic; as discussed in the previous section, we don't have to apply NAT to the packets.
Figure 4.1: Single node setup, with multiple containers in host
Topology 2: is an extension of the first topology. There are two hosts (VMs). On each host there is a container and a switch. The container is connected to the switch, and the switch is connected to the interface of the VM. The VMs are connected by the underlay.
Figure 4.2: Multi node setup, with one container per host
There are two options to switch between topologies. The first is to change the implementation. An implementation can be written to support one specific topology, e.g. switch1.p4. Since switch1.p4 can only look at the ingress port, there can only be one path between two containers, and since it does not apply NAT, it can only support the topology shown in figure 4.1. The second option is to write an implementation like switch2.p4 and connect two containers by adding new rules. With this method, the topology can be changed while the switch is up. The disadvantage is that the rules that have to be added are more complex, and this method increases the use of tables. As discussed in chapter 3, an increase in tables affects the performance of the switch.
4.2.2 TCP connection between containers in a single host
• setup 1: In this setup we use the switch1.p4 implementation and create Docker containers with the network driver set to none. We used the two containers, single host topology.
Figure 4.3: Results of the switch1.p4 implementation with --network=none for two containers on a single host
The results show that there is connectivity at L3 but not at L4.
• setup 2: We use the switch1.p4 implementation and create the containers with the default network driver. Both containers are connected to docker0. We then pinged the containers, after which we started an iperf3 session from within the containers. At L3 there was connectivity, but at L4 there was no connection. The CLI rules were the same as in setup 1.
Figure 4.4: Results of the switch1.p4 implementation with the default network driver for two containers on a single host
• setup 3: For this setup we used the switch2.p4 implementation with the two containers, single host topology. We created the containers with the --network=none flag. The CLI rules used were:
Figure 4.5: Results setup 3
From figure 4.5 we can see that there is connectivity between the containers at L3 and L4.
• setup 4: We use the same implementation, but create new containers with the default network option, so the containers are connected to the docker0 bridge. We remove each interface from docker0 and find (with find veth.sh, shown in Appendix A.2) its corresponding container. The veths are connected to the switch, after which we run the commands to ping the other container and set up an iperf3 session. The results show that there is connectivity at L3 and L4:
Figure 4.6: Results setup 4
Figure 4.7: Packets inside server container.
Figure 4.8: Packet flow of a SYN packet before and after the switch.
The table below summarizes all our setups and their results.
setup  implementation  nat  checksum  --network  L3 connection  L4 connection
1      switch1.p4      no   no        none       yes            no
2      switch1.p4      no   no        default    yes            no
3      switch2.p4      yes  yes       none       yes            yes
4      switch2.p4      yes  yes       default    yes            yes
Table 4.1: Results of connectivity between containers on a single host, using the topology from figure 4.1
4.2.3 TCP connection between containers in multiple hosts
Figure 4.9 shows the topology that was used to set up a TCP connection between containers on multiple hosts. VMs were used to set up the container network. The hosts of the VMs are considered part of the underlay network; from the VM there is no or limited access to the underlay. In figure 4.9 the underlay is represented by the black line connecting the two eth0 interfaces of the VMs. When a packet leaves a container, its addresses are private network addresses. The packet will be forwarded (in this scenario) to the public network. When the packet arrives at the first hop of the public network, that hop does not know the private network addresses and will not be able to forward the packet; the packet will be dropped. Therefore, the addresses of a packet have to be translated (NAT) into the public addresses, as well as the address of the next hop. With the public addresses the packet can traverse the public network and reach its destination. At the remote host, the public addresses have to be translated back into the private network addresses. The destination port (dport) of a packet can be set by a p4 switch. Once the dport is set, the underlay network will not alter it. Therefore the dport can be used as the identifier for a connection between two containers. The format of the rules used during this research is shown in Listing 4.6. Adding the rules from Listing 4.6 creates a mapping from the specified dport to the source (src) and destination (dst):
The last two rules of Listing 4.6 are almost the same. To initiate the nat action at the server, the first and second rules have to be added to the CLI. On the client, rules one and three have to be added to initiate the nat action (as shown in Listing 4.5). Iperf3 uses port 5201 by default. In the case of the single container, multiple hosts network topology, there is no need to use the destination port as an identifier for a connection between two containers.
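The address translation described above can be summarized in a Python sketch. The addresses and field names are illustrative; the real nat action rewrites header fields in P4, and the checksums are recomputed afterwards in the deparser:

```python
def nat(pkt: dict, public_src: str, public_dst: str) -> dict:
    """Replace the private addresses by the public ones supplied via the
    CLI rules. The L3 and L4 checksums become stale and must be
    recomputed before the packet leaves the switch."""
    out = dict(pkt)  # leave the original packet untouched
    out["src_ip"], out["dst_ip"] = public_src, public_dst
    out["l3_checksum_valid"] = out["l4_checksum_valid"] = False
    return out

# Example: a packet leaving a container on the private network
pkt = {"src_ip": "172.15.0.2", "dst_ip": "172.15.0.4",
       "l3_checksum_valid": True, "l4_checksum_valid": True}
translated = nat(pkt, "192.168.56.101", "192.168.56.102")
```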
When there are multiple containers on a single host, the rule from Listing 4.9 has to be extended as shown in Listing 4.10:
# Rule for the client
table_add map_ip_to_port set_test_TCP2 => < egress port>

# Rule for the server
table_add map_ip_to_port set_test_TCP2 => < egress port>

Listing 4.10: CLI rules for the switch3.p4 implementation
For now, the emphasis is on establishing a TCP connection in a single container, multiple hosts network, as discussed in section 4.2.1. All setups described below use the same topology.
• setup 1: Uses the switch2.p4 implementation and the --network=none flag to create the containers. Because of the implementation, it is only possible to test for L4 connectivity using the topology of figure 4.9. The iperf3 session timed out; thus, there is no connectivity at L3 (due to the implementation) nor at L4.
• setup 2: Uses the same setup as setup 1, except that the containers were created with the default network driver. The iperf3 session timed out; thus, there is no connectivity at L3 (see setup 1) nor at L4.
• setup 3: Uses the switch3.p4 implementation. The containers were created with the network driver set to none. First, we checked whether we could ping a remote container. In this setup it is possible to ping the remote container from the client and to ping from the server to the client. A connection at the L3 level is therefore possible.
Figure 4.9: Results setup 3
• setup 4: Uses the same setup as setup 3, except that the containers were created with the default network driver. The results were the same as in setup 3.
Figure 4.10: Results setup 4
The summarized results:
setup  implementation  nat  checksum  --network  L3 connection  L4 connection
1      switch2.p4      yes  yes       none       no             no
2      switch2.p4      yes  yes       default    no             no
3      switch3.p4      yes  yes       none       yes            no
4      switch3.p4      yes  yes       default    yes            no
Table 4.2: Results of connectivity between containers on multiple hosts. The topology used is shown in figure 4.9
5 Discussion
5.1 Multiple containers on the same host
The results show that, even for traffic without NAT, it is important to update the checksum to get connectivity at L4, even when the addresses are not translated. The switch2.p4 implementation has connectivity at L3 and L4, and combining the default network driver with the switch2.p4 implementation also results in connectivity at L3 and L4. The switch1.p4 implementation only has connectivity at L3. The main difference between switch1.p4 and switch2.p4 is that switch2.p4 supports a TCP checksum update and switch1.p4 does not. This implies that even though no NAT is involved, the checksum at L4 must be updated. We did not expect that updating the (TCP) checksum would be necessary to establish L4 connectivity between local containers, since the addresses are not translated. During testing we noticed that the switch considered the checksum of the TCP header invalid at ingress, as is shown in figure 5.1.
Figure 5.1: Two containers single host
At this point the packet headers have only been parsed; they have not been changed. The checksum of the IP header is always considered valid at ingress. If we compare the TCP checksums of a packet before and after it has been processed by the switch, we notice that the checksum has changed, as shown in the packet trace in figure 5.2.
Figure 5.2: Packet before and after traversing the switch. The checksum is marked by the red rectangle
Figure 5.3: Two containers single host
Figure 5.2 confirms the log of figure 5.1: the checksum is not correct when the packet arrives at the switch. This implies that the container sends the packet with an invalid checksum, which is in itself unexpected. The IP checksum does not change while the packet traverses the switch. Although we currently have no explanation for this behavior, it could explain why the checksum does not have to be updated to create L3 connectivity (switch1.p4 has L3 connectivity) but must be updated before we can create connectivity at L4. In the deparser the checksum is updated, as shown in figure 5.3. Setups 3 and 4 both achieve connectivity at L4, which suggests that the checksum at L4 is updated correctly. We know from the logs that the switch (even though we did not expect it) updates the checksums. If the packets leaving the switch had an invalid TCP checksum, the container would have refused the connection. The connection at L4 was established; therefore we assume that the checksum at L4 was computed correctly.
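The validity check the switch performs at ingress (figure 5.1) follows the standard Internet-checksum rule: the ones'-complement sum over the data, checksum field included, folds to 0xFFFF for a valid packet. A Python sketch of that verification:

```python
import struct

def checksum_ok(data: bytes) -> bool:
    """Verify an Internet checksum: the ones'-complement sum over the
    data, with the checksum field included, folds to 0xFFFF when the
    packet is valid."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF
```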
5.2 Single container, multiple hosts
Setup 3 from table 4.2 shows that there is connectivity at L3. The implementation used in setup 3 applies NAT to packets; therefore, we can assume that NAT is not the cause of the failing connection. NAT only changes parameters at L2 and L3, so if there were a problem in translating the addresses, it would have exposed itself in the form of a failed L3 connection. Therefore, we assume that the problem lies in layer 4. The switch touches the TCP layer at one point, which is the checksum. Because the addresses at L3 are changed, the checksum of the TCP layer also has to change to avoid an invalid checksum. To verify whether the TCP checksum is updated correctly, the checksum was examined when the packet left the client container and when the packet arrived at the server container on the remote host. The packet's TCP checksum had been changed. The previous section showed that the checksum may change even though all the parameters in the headers stay the same, while there still is L4 connectivity. Thus, a changing TCP checksum need not be a symptom of a failed connection. Currently, it is unknown why the connection at L4 fails.
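Why a change of L3 addresses forces a new TCP checksum follows from the pseudo-header: the source and destination IP addresses are part of the data the TCP checksum covers (RFC 793). A Python sketch with illustrative addresses:

```python
import socket
import struct

def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """TCP checksum over the IPv4 pseudo-header (src, dst, protocol 6,
    segment length) plus the TCP segment. Because the IP addresses are
    covered, NAT at L3 invalidates the L4 checksum even when the TCP
    segment itself is untouched."""
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, len(segment)))
    data = pseudo + segment
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```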
With setups 1 and 2 of table 4.2 there was no connectivity between the containers. Concerning L3, this was caused by the implementation of switch2.p4: the packets were never matched by the tables, because the tables were applied at L4. The result was that no-op was applied to all packets, after which the packets were lost. If we applied the same tables at the L3 level, we would expect connectivity at L3; due to a lack of time, we could not verify this. None of the results discussed above explains why the switch updates the checksums or why the checksum is considered invalid at ingress. The results also do not explain why the switch updates the TCP checksum correctly in a local network but computes an invalid checksum in a single container, multiple hosts network. In both situations the checksum is updated in the same way; the only difference is that in the second scenario the addresses have been translated. At the beginning of this section we assumed that NAT could not be the cause of the problem. Follow-up research should be done to clarify this behavior.
5.3 Firewall settings
We also examined whether firewall settings could be the cause of the refused connection. The ufw status of the containers was inactive, all the default chain policies were set to ACCEPT, and the tables and chains did not contain any rules. It is therefore highly unlikely that the firewall settings of a container were the cause of the refused connection. Docker adds some firewall rules to the firewall of the host (VM) during its installation. These rules are applied to the containers that belong to the subnet of the docker0 bridge. Setups 3 and 4 of the multiple containers, single host topology both show connectivity at L3 and L4, and in setup 3 the containers do not belong to the subnet of the docker0 bridge. This suggests that a container does not have to be in the same network as the docker0 bridge, and that the firewall rules Docker installs during its setup only apply to containers within the docker0 subnet. It therefore seems unlikely that the firewall of the host is the cause of the refused connection between containers on multiple hosts.
6 Conclusions

The amount of data sent through networks is still increasing, while at the same time the speedup of hardware resources is limited by the power wall. New methods must be explored to use resources more efficiently. Containers use fewer resources than VMs [14], and setting up an overlay network with a p4 switch as the forwarding mechanism offers the possibility to reduce the overhead caused by the Docker overlay solution. Combining Docker containers and p4 switches could yield the advantages of an overlay network without its overhead. This research focused on connecting containers to a switch and establishing a TCP connection between containers in two scenarios. In the first scenario the containers resided on the same host and were connected to the same switch. In the second scenario each container was connected to a p4 switch that was connected to the underlay, and the underlay connected the hosts. The results show that packets can be sent by one container and arrive at the other container while the containers are connected by a p4 switch. From the results we can also conclude that there is connectivity at L3 and L4 between containers that are connected to the same p4 switch and share the same host. For containers connected by p4 switches on multiple hosts, we must conclude that there is connectivity at L3 but not at L4. With these conclusions, the question how can we create connectivity between containers by utilizing a p4 switch? is only partially answered. From the results we can also conclude that the problem is the updating of the L4 checksum. Another conclusion is that there is no difference in connectivity between the two container configurations (default and none), which gives more flexibility to adjust a container to one's own needs.
The research question how can we minimize the overhead in Docker container networks by using modern technologies such as p4 switches? cannot yet be answered. No TCP connection is possible in a multiple nodes, one container per host network, so a comparison between the two networks could not be made. The problem has been narrowed down to an invalid checksum update. To be able to answer the research question, the checksum must be updated correctly.
6.2 Future work
Several options qualify for follow-up research. First, future research should focus on why the TCP connection is refused by the server container. Once a TCP connection is possible, the performance of a p4 switch network can be compared to the performance of different overlay networks. Second, to make the research more organized, follow-up work could focus on automating the deployment of an implementation file. As shown in figure 2.2, there are many steps from creating the p4 file to deploying it on the switch. All the steps have to be executed manually, which is cumbersome and leads to many easy-to-avoid mistakes. Debugging with respect to the p4 switch is
cumbersome because of the lack of breakpoints and print statements. Having to manually collect the arguments in order to create and execute CLI rules makes it almost impossible not to make mistakes, especially when the network becomes more complex. To be able to scale up the number of nodes and the number of containers, the deployment process should be automated. The attention of the programmer is currently divided between writing a program and creating the CLI rules. This division is not desirable, can have a severe negative impact on development time and, above all, can be avoided. When the implementation is written and the containers are known, the CLI rules are fixed. By parsing the p4 file, the tables, actions and the number of arguments are known; parsing the containers fills in the values of the arguments of a rule. How to parse the p4 file and combine the results with the arguments obtained by parsing the containers could be addressed in follow-up research.
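A first step towards such automation could look like the sketch below: extract the table names from the P4 source with a regular expression and emit skeleton table_add rules, leaving the match keys and action arguments as placeholders. The embedded P4 fragment and the placeholder format are our own, illustrative assumptions:

```python
import re

# Illustrative P4_14 fragment; a real tool would read the switch2.p4 file.
P4_SRC = """
table map_ip_to_port  { actions { nat; } }
table check_dst_port  { actions { nat; } }
"""

def generate_rules(p4_src: str) -> list:
    """Emit one skeleton CLI rule per table found in the source. The
    match key and action arguments would be filled in by parsing the
    container configuration."""
    tables = re.findall(r"\btable\s+(\w+)", p4_src)
    return ["table_add %s nat <match key> => <action args>" % t
            for t in tables]
```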
Third, although there is no L4 connectivity between two containers on different hosts at this point, there is L4 connectivity between containers on the same host (VM). Follow-up research could compare the performance of a local network with the Docker overlay network to get an impression of how a p4 switch influences the throughput. What should be taken into account is the performance of the simple switch: the performance of the bmv2 model is significantly lower than the performance of a production-grade Open vSwitch [13]. To get the best performance, the switch should be built with the following flags: ./configure 'CXXFLAGS=-g -O3' 'CFLAGS=-g -O3' --disable-logging-macros --disable-elogger. There is no difference in connectivity between containers created with the default network driver and containers created with the network driver set to none; after configuring, both achieved the same connectivity. This gives the user more options to adjust a container to their needs. For example, in the research of [15] macvlan outperforms Linux bridges. Follow-up work could set up a macvlan inside the container, connect it to the switch and test whether this improves the performance in a local network.
6.2.1 Ethics
SDN gives the companies that maintain networks better means to control them, and p4 switches give these companies finer-grained control over the data sent through their networks. In the past it was the chip manufacturers who co-decided how packets were forwarded; they are bound to protocols, and deploying new protocols takes a lot of resources and time. As p4 evolves to a more mature state, it will become less expensive and less time-consuming to accomplish such control. This means that the pressure on network neutrality will increase. The manufacturers did not have a direct interest in the data, since their profit comes from selling hardware; network companies do have a direct interest, since the data embodies their profit. Currently a network company has limited means to distinguish traffic, and network neutrality is already under pressure. Technological determinism states that if the technique exists, it will be used. Since SDN and p4 switches provide the means and resources to discriminate traffic, from a determinist point of view it is only a matter of time before network companies are allowed to discriminate traffic and bill consumers accordingly.
The research of [7] discusses science with respect to ethics. It is difficult to place the technologies used in this research into one specific area; this depends on the application they are used in. We can categorize our research using figure 6.1. All data used during this research was neutral, since we generated it ourselves, and the research does not cross the line of human values. Therefore we would place this research in the safe area.
Figure 6.1: The different areas of science with respect to ethics
Bibliography
[1] Idris Zoher Bholebawa, Rakesh Kumar Jha, and Upena D Dalal. "Performance analysis of proposed OpenFlow-based network architecture using mininet". In: Wireless Personal Communications 86.2 (2016), pp. 943–958.
[2] P4 bmv2 dev team. Incorporation of bmv2 switch in mininet. 2019. URL: https://github.com/p4lang/behavioral-model/tree/master/mininet (accessed: 01.04.2020).
[3] Yan-Wei Chen et al. "P4-Enabled Bandwidth Management". In: 2019 20th Asia-Pacific Network Operations and Management Symposium (APNOMS). IEEE. 2019, pp. 1–5.
[4] Huynh Tu Dang et al. "Whippersnapper: A P4 language benchmark suite". In: Proceedings of the Symposium on SDN Research. 2017, pp. 95–101.
[5] Docker. Docker containers. URL: https://docs.docker.com/engine/reference/run/ (accessed: 01.05.2020).
[6] Kouji Hirata and Takuji Tachibana. "Implementation of multiple routing configurations on software-defined networks with P4". In: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE. 2019, pp. 13–16.
[7] Elektra Liangoridi, Anthony van Inge, and Mary Beth Key. "Ethics in science: creating a conscious ethical ideology". In: (2014).
[8] Sumit Maheshwari et al. "Comparative Study of Virtual Machines and Containers for DevOps Developers". In: arXiv preprint arXiv:1808.08192 (2018).
[9] Linux Programmer's Manual. namespaces(7). URL: https://www.man7.org/linux/man-pages/man7/namespaces.7.html (accessed: 01.04.2020).
[10] P4.org. P4: Language specification. URL: https://p4.org/p4-spec/p4-14/v1.0.5/tex/p4.pdf (accessed: 01.04.2020).
[11] F. Paolucci et al. "P4 edge node enabling stateful traffic engineering and cyber security". In: Journal of Optical Communications and Networking 11.1 (2019), A84–A95.
[12] Kun Suo et al. "An analysis and empirical study of container networks". In: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. IEEE. 2018, pp. 189–197.
[13] p4 dev team. bmv2 model performance documentation. URL: https://github.com/p4lang/behavioral-model/blob/master/docs/performance.md (accessed: 01.04.2020).
[14] Qi Zhang et al. "A comparative study of containers and virtual machines in big data environment". In: 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE. 2018, pp. 178–185.
[15] Yang Zhao et al. "Performance of container networking technologies". In: Proceedings of the Workshop on Hot Topics in Container Networking and Networked Systems. 2017, pp. 1–6.
Appendices
Appendix A
A.1 P4 language information
Figure A.1: Parameters of standard_metadata
A.2 Support programs
read conName

# iflink of eth0 inside the container points at the peer veth on the host
iflIndex=$(sudo ip netns exec $conName cat /sys/class/net/eth0/iflink)
x=$(($iflIndex-1))
veths=$(ifconfig | grep veth | awk '{print $1}')

for veth in $veths
do
    iflnk=$(cat /sys/class/net/$veth/iflink)

    if [ $x -eq $iflnk ]
    then
        echo "veth connected to eth0 of $conName is: $veth"
        break
    fi
done

# remove the matched veth from the docker0 bridge
sudo brctl delif docker0 $veth

Listing A.1: find veth.sh
#!/usr/bin/env bash

echo "Enter (even) container nr:"
read conNr
conName=container$conNr

sudo ip link add vethA-con$conNr type veth peer name vethB-con$conNr
sudo ip link add vethA-br0 type veth peer name vethB-br0
sudo ip link set up vethA-con$conNr
sudo ip link set up vethB-con$conNr

sudo docker run --name $conName --network=none ubuntu:latest /bin/sleep 1000000 &
sleep 5

sudo docker exec -ti $conName cat /proc/net/route
sudo nsenter -t `sudo docker inspect -f '{{.State.Pid}}' $conName` -n ifconfig

c4pid=$(sudo docker inspect -f '{{.State.Pid}}' $conName)
sudo mkdir -p /var/run/netns/
sudo ln -sfT /proc/${c4pid}/ns/net /var/run/netns/$conName

sudo ip link set vethB-con$conNr netns $conName

sudo ip netns exec $conName ip link set up vethB-con$conNr
sudo ip netns exec $conName ip addr add 172.15.0.$conNr/24 dev vethB-con$conNr
sudo ip netns exec $conName ip route add default via 172.15.0.$conNr dev vethB-con$conNr

Listing A.2: create configure container.sh