Improving management of virtual container networks on DC/OS

Willem Jan Glerum - 11317493 [email protected]

July 28, 2018

Supervisor: dr. Paola Grosso [email protected]
Host supervisor: dr. Adrien Haxaire [email protected]
Host organisation: Lunatech Labs B.V. https://lunatech.com

Universiteit van Amsterdam
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Master Software Engineering
http://www.software-engineering-amsterdam.nl

Contents

Abstract

1 Introduction
  1.1 Problem statement
  1.2 Motivation
  1.3 Research questions
    1.3.1 RQ1
    1.3.2 RQ2
  1.4 Organisation

2 Background and context
  2.1 Containers
    2.1.1 Docker
    2.1.2 Kubernetes
    2.1.3 DC/OS
  2.2 CNI
    2.2.1 Project Calico
    2.2.2 Cilium

3 Problems with Virtual Container Networks
  3.1 Required features
  3.2 Current state of virtual networks in DC/OS
    3.2.1 DC/OS virtual networks
    3.2.2 CNI plugins on DC/OS
  3.3 Virtual networks in Kubernetes
  3.4 Problems with virtual networks on DC/OS

4 Proof of Concept
  4.1 Extending CNI
  4.2 Network Object API
    4.2.1 Network Object Model
    4.2.2 Overall System Design
    4.2.3 PoC Architecture

5 Validation
  5.1 PoC
  5.2 Concrete use case

6 Conclusion

7 Future work

Bibliography

Acronyms

Abstract

Developers use containers to deploy their applications with high availability on container orchestration platforms. On Mesosphere DC/OS it is possible to isolate services with virtual container networks and network policies provided by CNI plugins. Isolation is often needed for privacy and security concerns and for separating different tenants on the platform. At the time of writing, there is no method in place to centrally manage virtual networks and network policies on DC/OS. The goal of this thesis is to understand the current problems in managing virtual container networks on DC/OS and to propose solutions to these problems. To do this, the study first gives background information on the technologies used in order to identify the current problems. Based on this study, a proof of concept is presented that is able to centrally manage network policies for different vendors. The solution proposed in this thesis shows that it is possible to improve the management of virtual networks without a radical change in the architecture of DC/OS. Operators are now able to retrieve the current state of applied network policies and adjust them as needed.

Chapter 1

Introduction

1.1 Problem statement

At the moment applications are being moved from traditional deployments to container deployments. Instead of manually logging into a server, downloading, installing and configuring an application, we can now download and run a container image with a single command. A container is a set of isolated processes and application dependencies that are packaged together in an image. This container image can run locally on a developer's machine or on a container orchestration platform in production. Containers are considered more flexible than Virtual Machines (VMs) as you do not need to run and boot an entire Operating System (OS). The most well known system for running containers locally is Docker[18]. To make applications highly available in production, people make use of a container orchestration platform like Kubernetes[25] or The Datacenter Operating System (DC/OS)[16] to allow scaling of those containers. Cloud providers such as Amazon Web Services (AWS)[2] and Google Cloud Platform (GCP)[21] even enable you to run containers without worrying about the underlying infrastructure.

In production workloads operators want to isolate applications. The traditional way is to use firewall policies and different physical networks between applications in a datacenter. Container operators need to be more flexible as workloads change all the time. That is what led to the introduction of Software Defined Networking (SDN) to create virtual networks within a container orchestration platform. There are different implementations and vendors providing virtual networks and policies, such as Project Calico[35] and Cilium[9]. These plugins are standardised with the Container Network Interface (CNI)[12] standard, a set of specifications and plugins for configuring virtual networks. CNI allows you to chain different plugins together, for example for IP Address Management (IPAM), virtual networks and network policies by Calico.

The focus of this thesis will be on DC/OS and Calico, as one of Lunatech's clients uses those technologies for their container orchestration to build an Internet of Things (IoT) platform for cars. Collecting data from cars forces us to take privacy seriously and we have to make sure that the data is only accessible by the right services. Moreover, the client wants to isolate workloads from different tenants within their organisation and logically split workloads in the data processing pipeline.

Currently it is difficult for an operator of a cluster to see and manage the virtual networks and policies in DC/OS, as each CNI plugin uses its own control plane. An operator must reach down to the host machines to inspect the network configuration. To allow easy administration there is a need for a solution that at least gives an overview of the networks and policies so that they can be inspected. At the moment there is no easy way for the client to check if workloads are really isolated.

1.2 Motivation

When there is visibility on the implemented networks and policies we can empower administrators with better means to manage the virtual networks on a container orchestration platform, enabling a developer to simply allow or deny traffic between different applications or namespaces. This can be achieved by erasing the differences between the CNI plugins and hiding their complexity from developers. As an extension we could also automatically provision new CNI plugins to help operators create new network services in a flexible manner, without worrying about the concrete implementation details. Furthermore, when an operator is aware of the networks present, he can use existing tools to get more insight into the network traffic, for example for measuring network flows. We could also provide basic network monitoring solutions to the operator.

On a container orchestration platform we also need an ingress point to allow traffic to the container workloads. This involves for example configuring load balancers on your cloud platform. However, when something changes on the container orchestration level, an operator must also update the relevant load balancers to redirect the traffic to the correct containers. Incoming and outgoing traffic to your cluster is called North-South traffic. East-West traffic, on the other hand, is within the cluster, when for example a pool of web servers talks to a pool of application servers. The application servers could be on a different virtual network protected by network policies and need to be load balanced to be highly available for the web servers.

Furthermore, when this is in place we can also look at automating the tasks of an administrator as much as possible. Here we could think of self-adapting networks, where the system detects a sub-optimal solution and applies the new configuration in place, with or without the confirmation of the operator. It is often the case that a firewall rule is blocking traffic, which can be hard to debug for an application developer who wants to deploy his application, as it works fine on his local machine. A visualisation can be created to help an operator gain insight into the applied network policies, allowing the problematic policy to be found faster in an interactive process. Self-adapting networks and visualisations are outside the scope of this thesis, as we first need a way of exposing the current network infrastructure. Once everything is in place it becomes easier to think of new use cases and ways to help operators and developers to debug the cluster and the applied policies.

1.3 Research questions

The following research questions were raised with the help of the motivation above during the formulation of this thesis:

• RQ1 What are the current problems in managing virtual container networks and how do different vendors address this issue?

• RQ2 How can the management of virtual container networks and network policies be improved?

1.3.1 RQ1
First we need to know what the problems are in the current solutions for managing virtual networks, before we can propose a solution. This can be based on the direct experience of people within the client's project and/or information found online or in previous research. Furthermore, we could look at what other providers are doing to address this issue. At the moment Kubernetes is very popular and adopted by a lot of companies and developers. We could for example learn from their experience and mistakes.

1.3.2 RQ2
This question aims to find a solution for the concrete problems found with the first question. To prevent reinventing the wheel we should critically assess if we can reuse some of the solutions from other providers or standards, and see if we can improve the current situation by suggesting a new solution in the form of a Proof of Concept (PoC).

1.4 Organisation

This thesis is organised as follows: Chapter 2 gives an overview of the relevant technologies, and Chapter 3 lists the results of the research for RQ1. In Chapter 4 we explain how the PoC built for this thesis works and in Chapter 5 we see how this answers RQ2. To finish we discuss the results and future work in Chapters 6 and 7.

Chapter 2

Background and context

In this chapter we give more background information about the technology stack relevant to this thesis. Not all technologies are in use at Lunatech's automotive client at the moment; nevertheless we include them to provide more information about similar technologies or new technologies that we could use in the future. We start with containers and container orchestration platforms, which form the basics of running containerised workloads. Next we explore the different virtual network technologies around CNI.

2.1 Containers

A Linux container is an isolated set of resources on a compute host. Containers work with namespacing: a container can only access resources in the same namespace and not any other namespaces. This means that a container can only see and modify its own processes and not other processes running in another container or on the host itself. The exception is privileged containers, which can use any resource and have access to every process on the host. Furthermore, every container gets its own isolated network stack, preventing privileged access to sockets and interfaces on other containers. One can create links and bridges to allow IP connectivity to and from a container. Resource utilisation for containers is limited with Control Groups (cgroups), giving each container a restricted set of resources and preventing a container from bringing down the host by using too much RAM or CPU[19].
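As a minimal illustration of the namespacing described above (our own sketch, not code from any of the platforms discussed later), the Go program below starts a shell in its own PID, mount, UTS and network namespaces. It must be run as root on a Linux host, and real container runtimes additionally set up cgroups, image layers and security profiles.

package main

import (
	"os"
	"os/exec"
	"syscall"
)

// Start /bin/sh in new PID, mount, UTS and network namespaces.
// Inside the shell the host's network interfaces are no longer visible
// (only an isolated loopback remains) and changing the hostname does
// not affect the host, mirroring the isolation a container gets.
func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWNS |
			syscall.CLONE_NEWUTS | syscall.CLONE_NEWNET,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}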

2.1.1 Docker
Docker is the industry standard for running Linux containers. The Docker Engine is a container platform that makes it easy to run your container. You can run a container on your local developer machine and ship the same image to run in a production environment. It works with a Dockerfile that describes how to build your image. Every step creates a new file system layer and all layers together form your Docker image. You can reuse these layers between containers so that you can make a generic application container and add a layer on top for the specific application. This allows for reusability and saves storage and bandwidth when sharing images. The most popular way to share images is the Docker Hub, however one can also host a private Docker Registry to store and serve your own images.

2.1.2 Kubernetes
Kubernetes is an open-source container orchestration platform built from scratch, based on the ideas from the internal container engine used at Google. The basic scheduling unit in Kubernetes is a pod. A pod is a collection of one or more containers that are co-located on a host and share a unique IP address, preventing port conflicts in the cluster. Pods allow you to run tightly coupled applications together.

Figure 2.1: Kubernetes architecture diagram[23]

Kubernetes consists of different loosely coupled components, as can be seen in Figure 2.1. The masters run a key-value store called etcd[17], the API server and a scheduler component. An operator gives the cluster its desired state by issuing commands to the API server, which stores that state in etcd. The scheduler makes sure that the desired state is achieved, for example by choosing and instructing the nodes to run or delete pods. The Kubernetes nodes, also called workers or minions, run the actual workloads, which are Linux containers using the Docker Engine. The kubelet component manages the state of the node and accepts resource offerings from the scheduler. The kube-proxy is responsible for managing the network on the node. It routes incoming traffic to the pod addresses and it can expose pods bundled as a service to the cluster, serving as a load balancer to the pods or services.

2.1.3 DC/OS
DC/OS describes itself in the documentation[44]:

“As a datacenter operating system, DC/OS is itself a distributed system, a cluster manager, a container platform, and an operating system.”

Unlike a traditional operating system, DC/OS runs as a distributed operating system on different nodes and each node has its own host operating system. The master nodes accept tasks and schedule them on the agent nodes. Apache Mesos[5], a distributed systems kernel, is responsible for the lifecycle of tasks. It launches schedulers on the slaves based on the type of tasks. Figure 2.2 shows an overview of all the components in a DC/OS cluster.

DC/OS has two built-in schedulers for scheduling Linux containers and two different container runtimes, which can be seen in the container orchestration box on the master nodes in Figure 2.2. Marathon[29] runs the actual container orchestration engine, allowing tasks and frameworks to be scheduled. A task could be an application inside a Docker image and a framework could be Kafka[4] or Cassandra[3], which manage their own life cycle using Mesos. These frameworks and other services can be installed as a package from the Mesosphere Universe[31]. To periodically schedule jobs, operators can use Metronome[32], which allows jobs to run once or on a regular schedule. On DC/OS operators are not only limited to running containers, as binary executables can be launched and containerised at runtime. The Docker Engine is present on every slave to run Docker images, or operators can choose to use the Mesos Universal Container Runtime (UCR)[37] to run Docker images and binary executables.

The technologies and components mentioned above make it possible to run different kinds of workloads, not just containerised workloads. We can even run Kubernetes as a framework on DC/OS; this way the Kubernetes scheduler asks Mesos for resource offerings.

Figure 2.2: DC/OS architecture diagram[13]

Figure 2.3: IP connectivity in Calico[34]

2.2 CNI

To deal with networking and containers the CNI specification was created, describing how to interact with network interfaces in Linux containers. This allows providers to write plugins to configure SDN for container workloads. Mesos, DC/OS, Kubernetes and even AWS have support for CNI plugins, allowing operators to create virtual networks on container orchestration platforms. The CNI specification only deals with creating and deleting network resources; it does not cover network policies, for example. Network policies need to be managed by the plugin itself, as every plugin has a different goal and different ways of enforcing policies. A similar project exists for the Docker daemon, called the Container Network Model (CNM)[27, 20], implemented as libnetwork, for which developers can also develop new plugins. For example, Project Calico provides plugins for both CNI and CNM and both are installed when using the universe package from DC/OS. There is a set of standard provided plugins[36] that allow for basic operations like creating network interfaces and IPAM. Different kinds of interfaces can be created, such as bridges, loopback interfaces and Virtual LANs (VLANs). IP addresses can either be requested from a Dynamic Host Configuration Protocol (DHCP) server or maintained in a local database of allocated IPs. Other plugin providers can provide their own implementations of assigning IP addresses to containers.
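To make the plugin contract concrete: a CNI plugin is an ordinary executable that receives the operation and container details in environment variables and the network configuration as JSON on standard input, and prints a JSON result. The sketch below is our own minimal illustration of an ADD call; the plugin path and the network configuration are made-up examples using the standard bridge and host-local plugins.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Example network configuration as defined by the CNI specification.
	netConf := `{
	  "cniVersion": "0.3.1",
	  "name": "example-net",
	  "type": "bridge",
	  "bridge": "cni0",
	  "ipam": { "type": "host-local", "subnet": "10.22.0.0/16" }
	}`

	// The operation and container details are passed via environment variables.
	cmd := exec.Command("/opt/cni/bin/bridge") // hypothetical plugin location
	cmd.Env = append(os.Environ(),
		"CNI_COMMAND=ADD",
		"CNI_CONTAINERID=example-container",
		"CNI_NETNS=/var/run/netns/example",
		"CNI_IFNAME=eth0",
		"CNI_PATH=/opt/cni/bin",
	)
	cmd.Stdin = strings.NewReader(netConf)

	// The plugin replies with a JSON result describing interfaces, IPs and routes.
	out, err := cmd.CombinedOutput()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
	fmt.Println(string(out))
}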

2.2.1 Project Calico
Project Calico uses a pure Layer 3 approach to enable virtual networking between container workloads, as opposed to traditional solutions, which in most cases use overlay networks. Overlay networks have the extra burden of encapsulating and decapsulating packets between workloads when the traffic crosses different machines. By avoiding this, Calico saves CPU cycles and makes it easier to understand packets on the wire. Performance tests[40, 8, 26] show that Calico is able to achieve nearly identical throughput to directly connected workloads. Overlay networks, and other plugins such as Weave[43], are not able to reach this throughput.

Instead of using a vSwitch, Calico uses a vRouter to route traffic between containers. This vRouter uses the Layer 3 forwarding technique found in the Linux kernel. A local agent, called Felix, programs the routes from the containers to the host in the node's route table. Furthermore, a BIRD Internet Routing Daemon (BIRD) is running on every node to advertise routes to containers on other nodes in the cluster. Figure 2.3 shows the hops made between two different containers or workloads. State information is exchanged using Border Gateway Protocol (BGP) or route reflectors.

By default all containers can talk to each other in a Calico network. Operators want to separate traffic between different tenants of the compute cluster by assigning policies. A policy is translated to a rule in iptables[39] on the host, leveraging the default Linux firewall. This can even be extended to specific rules for different workloads within a tenancy. The policies are shared using a distributed key-value store called etcd[17], as can be seen in Figure 2.4a.

2.2.2 Cilium
Another plugin that takes a whole different approach is Cilium[9], which allows for Application Programming Interface (API) aware security policies. Instead of only operating at the network level with host and port based firewall rules, Cilium allows filtering of different protocols like Hypertext Transfer Protocol (HTTP), gRPC Remote Procedure Calls (gRPC)[22] and Kafka[4]. They do this by leveraging a new Linux kernel technology called Berkeley Packet Filter (BPF)[30, 7], dynamically inserting BPF bytecode into the Linux kernel, as can be seen in Figure 2.4b. This CNI plugin creates BPF programs in the Linux kernel that control the network access. The BPF programs are Just In Time (JIT) compiled to CPU instructions to allow for native execution performance. Operators can do a lot of different things with this technology: simply monitoring packets, redirecting packets and blocking packets. Only recent versions of the Linux kernel have BPF capabilities; version 4.8.0 or newer is required.

Figure 2.4: (a) Project Calico component overview[1] (b) Cilium component overview[10]

Chapter 3

Problems with Virtual Container Networks

First we list the features required by operators to manage virtual networks in Section 3.1, then we explain the current difficulties found in managing virtual networks on DC/OS in Section 3.2. Next we present the approach taken by another popular container orchestration platform, Kubernetes, to handle virtual networks in Section 3.3. To finish we provide an overview of the problems found at the end of this chapter in Section 3.4.

3.1 Required features

We created a list of features that an operator needs to be able to manage virtual networks and policies on a container orchestration platform. Below is the list of features with an accompanying explanation from the viewpoint of an operator of the cluster. As an operator I want to:

• list virtual networks: to see which networks are available in my cluster and see properties such as the name, tags, subnets and available IP addresses.

• select a virtual network: to select a virtual network to launch my workload on.

• create a virtual network: to provision and create new virtual networks in my cluster without manually configuring every agent in the cluster.

• list network policies: to get an overview of all the configured network policies in the cluster, which can be filtered based on service, tenant or label to search for specific rules that are blocking traffic.

• create a network policy: to apply new firewall policies in the cluster without worrying about the different implementations of different plugin vendors.

• know the available IPv4 addresses: to determine if the solution is usable for the number of services in the cluster.

• have IPv6 support: to allow for native IPv6 support for my services and a bigger pool of available IP addresses for the growing number of applications with their own IP address.

3.2 Current state of virtual networks in DC/OS

DC/OS provides virtual networks out of the box in the form of overlay networks, which can be used by the Docker and Mesos containerizers. This overlay is prepackaged and enabled by default on the cluster. CNI plugins are also supported on DC/OS and can be used by both containerizers. There are three different networking modes available for containers on DC/OS:

Figure 3.1: DC/OS overlay in action[14]

Host networking The container shares the same network namespace as the host. This restricts the ports an application can use, as the TCP/UDP port range is shared with the host.

Bridge networking The container is launched on a bridge interface, has its own network namespace and applications are only reachable over port mappings to the host.

Container networking The container has its own network namespace and the underlying network is provided by CNI plugins. This networking mode is the most flexible as it allows operators to run applications on any port without dealing with port conflicts or port mappings.

3.2.1 DC/OS virtual networks
The overlay network was introduced to enable an IP-per-container model in DC/OS. This allows operators to run applications on the default ports without worrying about port conflicts. This is done by creating an overlay network using Virtual Extensible LAN (VXLAN)[28], which is supported by the Linux kernel. It works by creating two network bridges on each host, one for each containerizer. Containers on the same host and bridge can communicate directly with each other over the network bridge. A packet from a Mesos container to a Docker container on the same host will be routed through both bridges. A packet from a Mesos container on Agent 1 to a Docker container on Agent 2 follows a different path. First the packet is routed to the Mesos bridge, where the network stack of Agent 1 consumes the packet and encapsulates it using VXLAN. Next, Agent 2 decapsulates this packet and sends it up to the Docker bridge to be delivered to the Docker container, as can be seen in Figure 3.1. A container gets assigned a virtual network interface with the help of the DC/OS overlay CNI plugin.

IP addresses are managed by the agent itself instead of a central location. A central location would require a reliable and consistent way of assigning IP addresses to containers in the cluster. The default configuration uses a 9.0.0.0/8 subnet, which is divided into smaller chunks: a /24 managed by every agent itself. On the agent this subnet is divided into two equal subnets, one for each containerizer, resulting in a usable /25 per container type on every agent (a toy sketch of this split is shown after the list below). The agent's overlay module can request its allocated subnet from the overlay module running on the masters, based on its identifier. To launch a container on a virtual network an operator can select the required virtual network from the DC/OS web interface when configuring a task, as DC/OS is aware of the default overlay networks. The default overlay network currently has the following limitations:

• A limit on the number of usable IP addresses on each host, determined by the network prefix size. The default maximum is 32 Mesos and 32 Docker containers.

• IPv6 support only for Docker containers; Mesos containers will fail to start, as the overlay network does not support IPv6 for them.

• Operators are only able to create and delete virtual networks during the installation of the cluster.

• The length of network names is limited to 15 characters by the Linux kernel.

• Marathon cannot execute health checks in the default configuration, as Marathon itself is not running on the virtual network.

• No API to create and manage network policies on the virtual networks.
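To make the subnet carving described above concrete, the toy function below (our own illustration, not DC/OS code; the real allocation is negotiated by the overlay modules based on agent identifiers) shows how the default 9.0.0.0/8 overlay space breaks down into a /24 per agent and a /25 per containerizer.

package main

import "fmt"

// agentSubnets returns, for an agent numbered n, the /24 carved out of the
// default 9.0.0.0/8 overlay space and the two /25 halves used by the Mesos
// and Docker containerizers. This only illustrates the split; the real
// assignment is done by the overlay modules on the masters and agents.
func agentSubnets(n int) (agent, mesos, docker string) {
	second := byte(n / 256)
	third := byte(n % 256)
	agent = fmt.Sprintf("9.%d.%d.0/24", second, third)
	mesos = fmt.Sprintf("9.%d.%d.0/25", second, third)
	docker = fmt.Sprintf("9.%d.%d.128/25", second, third)
	return
}

func main() {
	a, m, d := agentSubnets(3)
	fmt.Println(a, m, d) // 9.0.3.0/24 9.0.3.0/25 9.0.3.128/25
}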

3.2.2 CNI plugins on DC/OS
To allow for more flexibility with virtual networks on DC/OS, operators can also choose to create virtual networks with other providers. Plugins following the CNI specification as discussed in Section 2.2 can be installed on DC/OS. This works by placing the plugin in a dedicated folder on every agent. Networks can be created and configured by placing their relevant configuration files on every agent. The plugin location is /opt/mesosphere/active/cni/ and the configuration location is /opt/mesosphere/etc/dcos/network/cni/ (a small sketch of reading these files on an agent is shown at the end of this subsection). A restart of the Mesos agent is required to make use of the new plugin or new configuration. The method mentioned above is only possible when using the Mesos containerizer. For Docker we need to create a Docker network on every agent with the CNI plugin as its driver.

Next, any container can be launched on a virtual network by adding the following JSON to the configuration of the application: {"ipAddress": {"networkName": "<network-name>"}}. This instructs the Mesos scheduler to request an IP address from that network using the configured CNI plugin. The plugin takes care of IPAM and of creating and deleting the network interface on the host machine. Depending on the features of the CNI plugin, network policies can be applied by using the API of the plugin.

The CNI plugin model for DC/OS makes it flexible for operators to configure different virtual networks. However, there is a drawback to this approach, as these networks are not visible to the operators and users of the cluster, as stated in the documentation[15]:

“NOTE: The network tab currently displays information about containers that are associated with virtual networks managed by DC/OS overlay. It does not have information about containers running on virtual networks managed by any other CNI/CNM provider.”

This makes it confusing for developers who want to deploy their applications. Colleagues on the client's project are having a hard time understanding how their applications are deployed and how to connect to them. Containers now live on a different network and are not always reachable from every machine. For example, the SSH jumpbox used to connect to the cluster is not set up with Calico and is unaware of the virtual networks. It does not have the routes in its route table to connect to the container workloads on a virtual network.

There is also a problem with the different types of Domain Name System (DNS) providers within DC/OS. The mesos-dns component provides a Fully Qualified Domain Name (FQDN) to every service with the structure <task>.<framework>.mesos. This domain name however returns the IP address of the host of the container. In the case that the container is running on a virtual network it returns the wrong IP address. A connection is not possible because the port is only in use on the container network interface, resulting in a time-out for a user request. It is possible, with a non-recommended workaround, to instruct Mesos to return the container IP address instead of the host IP address[33]. The other and newer DNS provider is dcos-dns, which provides different addresses based on the network type of the container. These entries are listed in Table 3.1 together with those of mesos-dns.

DNS name                              | Container IP         | Host IP
*.autoip.dcos.thisdcos.directory      | Container networking | Host or bridge networking
*.agentip.dcos.thisdcos.directory     |                      | X
*.containerip.dcos.thisdcos.directory | X                    |
*.mesos                               |                      | X

Table 3.1: Different DNS entries for DC/OS tasks

To summarise we have the following drawbacks of using CNI plugins on DC/OS to create virtual networks:

• No method to retrieve or select the virtual networks created with CNI in the DC/OS web and command line interface. Operators can specify a virtual network name for a container to attach it to that network, however there is no place to see the available networks created with CNI plugins.

• No generic method of configuring network policies, as each plugin provider has its own dedicated API.

• An operator has to manually distribute the plugin and configuration files to every slave using a custom-built solution.

• Different DNS providers return different IP addresses for container workloads on a virtual network.
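As referenced in Section 3.2.2 above, the CNI network definitions are plain configuration files on each agent, so today the only way to see them is to log in and read them. The sketch below is a small helper of our own (not part of DC/OS; we assume the configuration files use a .json extension) that lists the networks configured on a single agent by parsing those files.

package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// netConf holds the two fields every CNI network configuration must have.
type netConf struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

func main() {
	// Directory where DC/OS expects CNI network configurations on an agent.
	dir := "/opt/mesosphere/etc/dcos/network/cni"
	files, err := filepath.Glob(filepath.Join(dir, "*.json"))
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		data, err := os.ReadFile(f)
		if err != nil {
			fmt.Fprintf(os.Stderr, "skipping %s: %v\n", f, err)
			continue
		}
		var c netConf
		if err := json.Unmarshal(data, &c); err != nil {
			fmt.Fprintf(os.Stderr, "skipping %s: %v\n", f, err)
			continue
		}
		fmt.Printf("network %q provided by plugin %q (%s)\n", c.Name, c.Type, f)
	}
}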

3.3 Virtual networks in Kubernetes

There are different ways of connecting containers in Kubernetes. The most basic one is in a pod, where containers in the same pod can communicate via localhost. Pods get their own unique IP address, called the IP-per-pod model, so operators do not have to deal with port mappings and Network Address Translation (NAT) to connect to pods from another pod or host. This network can be implemented by a CNI plugin or other SDN solutions[24] which are available in Kubernetes. Furthermore, if a plugin supports network policies, an operator can leverage the plugin to configure network policies in the cluster. All the virtual network related options can be configured using the kube-apiserver as discussed in Section 2.1.2, such as configuring the network provider and managing its network policies, regardless of the chosen implementation. This makes it easy and transparent to the operators of a Kubernetes cluster. To connect to a pod, from another pod or externally through a load balancer, we cannot use the pod IP address, as this IP can change when pods are rescheduled by the scheduler or when we have a set of identical pods for high availability. We can define a service which uses the kube-proxy, which is present on every node, to route the requests to the right set of pods. By default traffic from the client is captured by iptables[39] on the host and redirected to be forwarded by the kube-proxy.

Feature                  | DC/OS overlay networks            | DC/OS CNI plugins                                                            | Kubernetes
List virtual networks    | Available                         | Operator has to manually inspect the CNI configuration files on every host  | Available
Select virtual network   | Available                         | Operator has to manually provide a virtual network name based on the list obtained from above | Available
Create virtual network   | Only at installation time         | Manual on every host                                                         | Available
List network policies    | Network policies are not provided | Operator has to use the plugin specific API                                  | Available
Create network policy    | Network policies are not provided | Operator has to use the plugin specific API                                  | Available
Available IPv4 addresses | 32 per containerizer per host     | Plugin specific                                                              | Plugin specific
IPv6 support             | Only for Docker containers        | Plugin specific                                                              | Plugin specific

Table 3.2: Available features for virtual container networks
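To illustrate the Kubernetes column of Table 3.2: a network policy there is just another object submitted to the kube-apiserver, independent of the plugin that enforces it. The sketch below uses the official Go client types; the kubeconfig path, namespace and labels are placeholders of ours, and the exact Create call signature differs slightly between client-go versions.

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load credentials from a kubeconfig file (path is a placeholder).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	tcp := corev1.ProtocolTCP
	port := intstr.FromInt(80)

	// Only allow ingress on TCP/80 to pods labelled role=webserver;
	// all other ingress to those pods is dropped by the policy.
	policy := &networkingv1.NetworkPolicy{
		ObjectMeta: metav1.ObjectMeta{Name: "webserver-ingress"},
		Spec: networkingv1.NetworkPolicySpec{
			PodSelector: metav1.LabelSelector{
				MatchLabels: map[string]string{"role": "webserver"},
			},
			PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress},
			Ingress: []networkingv1.NetworkPolicyIngressRule{{
				Ports: []networkingv1.NetworkPolicyPort{{Protocol: &tcp, Port: &port}},
			}},
		},
	}

	created, err := client.NetworkingV1().NetworkPolicies("default").
		Create(context.TODO(), policy, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created policy", created.Name)
}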

3.4 Problems with virtual networks on DC/OS

In the first section of this chapter we showed that operators can choose to launch a container on a virtual network provided by DC/OS. Using the default overlay network, operators can select the network from the web interface, as DC/OS is aware of the networks that have been created with the overlay module. However, this is not the case with virtual networks provided by CNI plugins. Table 3.2 gives an overview of the virtual network related features available in DC/OS and Kubernetes. As you can see, Kubernetes already provides most of the features, as it stores the network state in the kube-apiserver.

Chapter 4

Proof of Concept

In Chapter 3 we explained the difficulties of managing virtual networks and policies in DC/OS. The main problem is that operators of clusters cannot get an overview of the virtual networks created and the network policies applied by CNI plugins. In this chapter we propose two solutions: first we explain how we could do this by expanding the CNI specification, then we report on the solution we implemented in the PoC.

4.1 Extending CNI

One of the first ideas we had was to extend CNI to be bi-directional. This addition would allow operators to retrieve information about the virtual networks, to see which virtual networks are present and which containers are connected to them. The specification already has an ADD action that returns the network state of a container when it has been created. However, this does not allow querying of the current state of the networks. It would require creating a custom solution to keep track of the current network state and follow the changes to the containers. To make this work we would have to adapt the CNI specification to include a new action that allows querying of the network state of containers. This new action could logically be called GET, returning the same information as when the ADD action was invoked, and allowing for caching of the results.

We decided not to touch the CNI specification for this thesis, as it would have needed too much time: creating a new action in the CNI specification would require coordination and collaboration with the different people and companies developing CNI plugins. At the beginning of this thesis the community was already discussing the need for a new action that would retrieve the current state of the network of a container. Maintainers did not yet agree on whether this should be implemented. Not only would we have to adjust the specification, we would also have to extend the CNI plugins in use at the client to support a new action. Furthermore, this action would only expose information about the network state of a container, not the state of the virtual networks in the cluster, which is what operators want to see. Network policies are not included in the CNI specification. With the new GET action in place we would still need a different solution for managing network policies.

4.2 Network Object API

Instead of extending the CNI specification we decided to build a PoC for a new system in DC/OS: a central API for managing components related to virtual container networks. We introduce a network object model. In this model an operator can define a network object with the relevant configuration for a network, much like the objects used in the kube-apiserver. First we explain the model in Section 4.2.1, then the overall system design and features in Section 4.2.2, and we conclude with the implemented PoC in Section 4.2.3.

4.2.1 Network Object Model
The network object model contains different properties for defining networks in DC/OS. The specification for this model can be found in Listing 4.1 and is written in Protocol Buffers format. We explain the basic idea for each component. Note: this is a first version and can evolve over time.

Listing 4.1: Network Object Model in Protocol buffers format

message NetworkObject {
  repeated VirtualNetwork network = 1;
  repeated NetworkDriver driver = 2;
  repeated NetworkService service = 3;
}

message VirtualNetwork {
  required string name = 1;
  repeated string namespace = 2;
  required string driver = 3;
  repeated string subnet = 4;
  repeated string service = 5;
  repeated NetworkPolicy policy = 6;
}

message NetworkService {
  enum Type {
    Firewall = 1;
    DNS = 2;
    L4LB = 3;
    QoS = 4;
  }
  required Type type = 1;
  required string name = 2;
}

message NetworkPolicy {
  enum Type {
    Ingress = 1;
    Egress = 2;
  }
  required Type type = 1;
  repeated Selector selector = 2;
  repeated Port port = 3;
}

message Selector {
  enum Type {
    VirtualNetwork = 1;
    Task = 2;
  }
  required Type type = 1;
  required string matcher = 2;
}

message Port {
  required string protocol = 1;
  required int32 port = 2; // int32: Protocol Buffers has no plain "int" type
}

message NetworkDriver {
  // We need some more fields here
  required string name = 1;
}

Virtual Network Specifies the basic properties of a network, containing the name and subnet of a virtual network.

Network Driver Driver for the specified network, which can be a DC/OS overlay network, a standard plugin, or a CNI plugin.

Network Policy A set of firewall rules for a given network. These rules can be either level 4 rules for packet filtering based on port and host, or level 7 rules that inspect traffic at the application level. Rules can also be applied based on labels to firewall specific applications or traffic.

Network Service Services can be a set of different things to attach to a network, for example a cloud provider integration for ingress traffic into the cluster.
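To illustrate how the model could be used from code, the structs below mirror the Protocol Buffers messages of Listing 4.1 by hand. In the PoC the Go types would be generated by protoc instead; the network name, driver and subnet values are made-up placeholders.

package main

import "fmt"

// Hand-written mirror of the Network Object Model, for illustration only;
// protoc-generated code would be used in practice.
type NetworkPolicy struct {
	Type      string // "Ingress" or "Egress"
	Selectors []Selector
	Ports     []Port
}

type Selector struct {
	Type    string // "VirtualNetwork" or "Task"
	Matcher string
}

type Port struct {
	Protocol string
	Port     int32
}

type VirtualNetwork struct {
	Name     string
	Driver   string
	Subnets  []string
	Policies []NetworkPolicy
}

func main() {
	// A virtual network with a single ingress policy for webservers,
	// matching the example used later in Chapter 5.
	net := VirtualNetwork{
		Name:    "calico-1",
		Driver:  "calico",
		Subnets: []string{"192.168.0.0/16"},
		Policies: []NetworkPolicy{{
			Type:      "Ingress",
			Selectors: []Selector{{Type: "Task", Matcher: "role==webserver"}},
			Ports:     []Port{{Protocol: "tcp", Port: 80}},
		}},
	}
	fmt.Printf("%+v\n", net)
}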

4.2.2 Overall System Design
The idea is that this new service runs on the masters of a DC/OS cluster, residing in the networking box on the master nodes in Figure 2.2. We choose the masters as they are already highly available in a default setup. Both CockroachDB[11] and Zookeeper[6] are already running on the masters, providing distributed storage for other DC/OS components. It is also possible to use etcd[17] as a central key-value store on the masters. However, this decision is out of scope for this thesis, as the PoC in Section 4.2.3 uses a different approach. The system is based on the idea of the kube-apiserver from Kubernetes, as discussed in Section 2.1.2, where an operator can query the current state and submit the desired state. The system would not only provide insight into the available virtual networks and policies, but also be open to integration with cloud providers, for example for North-South traffic into the cluster. The new system should have the following features:

• Run highly available on the masters

• Use a distributed, highly available storage solution

• Show an overview of the available virtual networks

• Allow operators to create, configure and delete virtual networks

• Give an overview of the applied network policies per network, task or framework

• Allow operators to create, configure and delete network policies

• Integrate with cloud providers to configure ingress traffic

4.2.3 PoC Architecture
We want to show a basic PoC with a limited set of features where an operator can see and configure network policies for a given container on a virtual network. The code developed for this PoC can be accessed via an online repository hosted on GitHub1. Showing the available networks can easily be achieved with this new API in place; however, creating virtual networks is more difficult. To create a virtual network on demand we would have to distribute the CNI plugin and its configuration to every node and configure and restart every agent in the cluster, which would require a custom-built solution. Mesosphere is working on a system to distribute files to nodes for container volume mounts, which we could later use to distribute network specific files to every node.

Mesosphere recently added support for running Kubernetes as a service on DC/OS. We reuse the kube-apiserver with Custom Resource Definitions (CRDs) to submit and query network objects, which are stored in etcd. A CRD is a custom object that works the same as the other objects used by the kube-apiserver explained in Section 2.1.2. This saves us from the burden of writing our own api-server and setting it up manually on the masters. The kube-apiserver runs as a single container, however it depends on other Kubernetes components such as etcd to store information. Instead of trying to strip the kube-apiserver from those components, we run an entire Kubernetes cluster on our DC/OS cluster. This does cause some overhead, but we are still able to run a full cluster on a single MacBook Pro with 16GB of RAM. It did take some time to tune the CPU and memory usage to allow it to run on a single developer machine. For a production ready solution we should implement both the api-server and its storage to run on the masters, as discussed in Section 4.2.2.

Next we created a custom controller written in Go[38], based on one of the examples provided by Kubernetes for dealing with CRDs. This controller listens for updates to a specific CRD on the kube-apiserver. With a custom Dockerfile we can build a Docker container and run it on the DC/OS cluster with Marathon. We connect it to the kube-apiserver, allowing us to rapidly prototype and try out new queries for our network objects. The only difficulty is getting the configuration and credentials for access to the Kubernetes cluster into our custom controller. Luckily Marathon has support for volume mounts, so we can generate the configuration file once on the host and allow the container to read it from the volume mount to connect to the Kubernetes cluster. When we submit a CRD containing a network object with network policies to the kube-apiserver, our custom controller listens for those events, see Figure 4.1. Currently we support create, delete and list commands; updates are handled as a delete and a create action to simplify the code base. In the case of an add command the controller checks which driver the network is using, in our case only Project Calico is supported for now, and uses that plugin to create the network policies. In the case of a delete command the controller removes the policy using the plugin. The state of the network policy is saved by the kube-apiserver in etcd and can be queried using the kubectl command from Kubernetes. Listing 4.2 lists the steps to deploy all the components of this PoC; the additional JSON configuration files can be found online in the aforementioned repository.

1 https://github.com/wjglerum/dcos-networkobject-controller

Figure 4.1: Sequence diagram for the custom controller
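Conceptually, the core of the controller is a watch loop on the custom resource that reacts to ADDED and DELETED events. The sketch below shows that idea against the kube-apiserver's watch endpoint using only the Go standard library, assuming kubectl proxy is running locally and that the CRD is served under the made-up group networking.dcos.example with plural networkobjects; the real controller uses the Kubernetes client libraries and then calls the Calico API.

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// watchEvent mirrors the envelope the kube-apiserver sends for each change.
type watchEvent struct {
	Type   string `json:"type"` // ADDED, MODIFIED, DELETED
	Object struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Spec json.RawMessage `json:"spec"`
	} `json:"object"`
}

func main() {
	// kubectl proxy exposes the API server on localhost without TLS/auth.
	// Group, version and plural of the CRD are placeholders.
	url := "http://127.0.0.1:8001/apis/networking.dcos.example/v1/" +
		"namespaces/default/networkobjects?watch=true"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	dec := json.NewDecoder(resp.Body) // the watch is a stream of JSON events
	for {
		var ev watchEvent
		if err := dec.Decode(&ev); err != nil {
			panic(err)
		}
		switch ev.Type {
		case "ADDED":
			fmt.Printf("create policies for %s: %s\n", ev.Object.Metadata.Name, ev.Object.Spec)
			// here the real controller would call the Calico API
		case "DELETED":
			fmt.Printf("remove policies for %s\n", ev.Object.Metadata.Name)
		}
	}
}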

Listing 4.2: Steps to deploy the PoC

$ git clone https://github.com/dcos/dcos-vagrant
$ cd dcos-vagrant
$ cp VagrantConfig-1m-1a-1p.yaml VagrantConfig.yaml
$ vagrant up
$ dcos cluster setup m1.dcos
$ dcos package install etcd --yes --options=etcd.json
$ dcos package install calico --yes
$ dcos package install kubernetes --yes \
    --package-version=1.0.3-1.9.7 \
    --options=kubernetes.json
$ dcos kubernetes kubeconfig
# Move the generated kube config from your local machine to the DC/OS agent
$ dcos marathon app add custom-controller.json

Figure 4.2: Technology stack of the proof of concept

Once the stack is up and running we can inspect the logs of the custom controller to see what is happening and interact with it using kubectl from our local machine. In short we use the following components for this PoC, visualised in Figure 4.2:

• local DC/OS cluster running in VirtualBox[42] using Vagrant[41],

• Kubernetes as a framework on DC/OS, utilising the kube-apiserver with etcd to store our network objects,

• custom controller that listens for events and integrates with the Calico API.

Chapter 5

Validation

This chapter reports on how the PoC from Chapter 4 answers RQ2 and how it addresses the issues from RQ1 as described in Chapter 3. Next we explore how this PoC manifests itself at Lunatech's automotive client.

5.1 PoC

The introduced PoC only addresses a part of the features from Table 3.2. We only focused on network policies and not on creating and listing virtual networks. For the features relevant to network policies we explain how the PoC achieves them below.

List network policies Listing network policies is possible using the plugin specific API. We generalised this API and made it available to operators, allowing them to see all the applied network policies in the cluster without diving into the concrete plugins and their specific APIs. In Listing 5.1 we see an example of an applied network policy; in this case we block all incoming traffic to containers with the role webserver on port 80.

Listing 5.1: Example of applied network policies

"networkPolicies": [
  {
    "name": "test-policy",
    "ports": [
      {
        "port": 80,
        "protocol": "tcp"
      }
    ],
    "selectors": [
      {
        "matcher": "role==webserver",
        "type": "label"
      }
    ],
    "type": "ingress"
  }
]

Create network policies Creating network policies leverages the same plugin model as listing the network policies. An operator is now able to create network policies from a central place. When we apply the network policy from Listing 5.1, the controller transforms the policy into the plugin specific model. Listing 5.2 shows the created policy from the Calico point of view.

Listing 5.2: Example of configured network policy in Calico

[
  {
    "kind": "policy",
    "apiVersion": "v1",
    "metadata": {
      "name": "test-policy"
    },
    "spec": {
      "ingress": [
        {
          "action": "deny",
          "protocol": "tcp",
          "source": {},
          "destination": {
            "ports": [
              80
            ]
          }
        }
      ],
      "selector": "role == 'webserver'",
      "types": [
        "ingress"
      ]
    }
  }
]
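The step from Listing 5.1 to Listing 5.2 is a plain mapping between two data models. The sketch below shows the essence of that mapping in Go; the struct and field names are our own simplification, and the real controller rewrites the selector syntax as needed and submits the result through the Calico API rather than printing JSON.

package main

import (
	"encoding/json"
	"fmt"
)

// Simplified generic policy as stored by the controller (cf. Listing 5.1).
type GenericSelector struct {
	Type    string `json:"type"`
	Matcher string `json:"matcher"`
}

type GenericPort struct {
	Port     int    `json:"port"`
	Protocol string `json:"protocol"`
}

type GenericPolicy struct {
	Name      string            `json:"name"`
	Type      string            `json:"type"` // "ingress" or "egress"
	Selectors []GenericSelector `json:"selectors"`
	Ports     []GenericPort     `json:"ports"`
}

// Simplified Calico v1 policy document (cf. Listing 5.2).
type CalicoDestination struct {
	Ports []int `json:"ports"`
}

type CalicoRule struct {
	Action      string            `json:"action"`
	Protocol    string            `json:"protocol"`
	Destination CalicoDestination `json:"destination"`
}

type CalicoSpec struct {
	Ingress  []CalicoRule `json:"ingress,omitempty"`
	Selector string       `json:"selector"`
	Types    []string     `json:"types"`
}

type CalicoPolicy struct {
	Kind       string            `json:"kind"`
	APIVersion string            `json:"apiVersion"`
	Metadata   map[string]string `json:"metadata"`
	Spec       CalicoSpec        `json:"spec"`
}

// toCalico converts the generic model into the Calico representation.
func toCalico(g GenericPolicy) CalicoPolicy {
	c := CalicoPolicy{
		Kind:       "policy",
		APIVersion: "v1",
		Metadata:   map[string]string{"name": g.Name},
	}
	c.Spec.Types = []string{g.Type}
	if len(g.Selectors) > 0 {
		c.Spec.Selector = g.Selectors[0].Matcher // e.g. "role == 'webserver'"
	}
	for _, p := range g.Ports {
		c.Spec.Ingress = append(c.Spec.Ingress, CalicoRule{
			Action:      "deny",
			Protocol:    p.Protocol,
			Destination: CalicoDestination{Ports: []int{p.Port}},
		})
	}
	return c
}

func main() {
	g := GenericPolicy{
		Name:      "test-policy",
		Type:      "ingress",
		Selectors: []GenericSelector{{Type: "label", Matcher: "role == 'webserver'"}},
		Ports:     []GenericPort{{Port: 80, Protocol: "tcp"}},
	}
	out, _ := json.MarshalIndent(toCalico(g), "", "  ")
	fmt.Println(string(out))
}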

Listing the virtual networks created by CNI plugins is possible with the custom controller, as the state of the network is simply stored using the kube-apiserver. However, creating the virtual networks proved more difficult, requiring custom solutions to distribute the different plugin files and configuration files to every node. We do list how these features could impact the management of virtual networks for operators and developers.

List virtual networks When a network is created, its state gets stored by the custom controller. We can then run a simple query to obtain the available networks in the cluster. This allows operators to get an overview of the networks, useful for example for debugging network problems or analysing network flows.

Select virtual network When we have the list of available networks, developers can use it to select the right network for their application, instead of guessing the right network name or diving into the agent to see the configured networks.

Create virtual network When there is a system in place to manage files across all agents, we could extend our controller to support the creation of virtual networks. An operator would provide the desired configuration and relevant files and the controller takes care of configuring the new virtual network on every agent.

5.2 Concrete use case

Lunatech's automotive client is building a platform to process data generated by cars in real-time. The goal of this platform is to develop new applications that use the collected data to gain new insights into the car that are beneficial to the company. At the moment they have multiple DC/OS clusters for different tenants, zones and stages. Tenants within the cluster could be subsidiaries or third parties that need access to the platform. Zones are different parts of the data processing pipeline and stages are for example acceptance and production environments. They would like to combine multiple clusters to reduce costs and complexity. Furthermore, in this project there are data privacy concerns which call for isolation of services and data flows. To achieve this isolation they use Project Calico on their DC/OS cluster and enforce network policies. In the future they might use for example Cilium to protect certain Kafka topics or REST endpoints.

At the moment it is hard for a developer to see which network policies are blocking traffic, often leading to unwanted solutions where they simply try to remove their application from the virtual network. There is nothing visible in the DC/OS web interface to inspect the created virtual networks and the applied network policies. With the PoC in place we are able to transparently expose the network policies. At the moment this is not yet visible in the web interface of DC/OS, however we could easily add this in a future version using the custom controller we have built. This will allow developers to inspect the relevant network policies for their applications and adjust them if needed. Furthermore, our custom controller is plugin agnostic, meaning that a developer can inspect and create network policies for different plugins with the same interface. This follows the model from Kubernetes, where developers specify the desired state and the controller uses the plugins to configure the relevant networks and policies.

Using the custom controller we can also show to stakeholders, such as the central IT department, that services are indeed isolated from each other. This is especially important as the project has to be approved before it can be used with real data. With the new custom controller in place we can easily show the applied network policies and adjust them quickly if needed.

Chapter 6

Conclusion

This thesis describes the difficulties in managing virtual networks on DC/OS. For operators it is possible to create virtual networks using CNI plugins, and some plugins have support for network policies. However, there is no central place in DC/OS to see and configure those networks and policies. This is unlike Kubernetes, where operators can manage the virtual networks by providing a desired state; the cluster then takes care of configuring the virtual networks and network policies.

We first explored the different technologies in use and explained how they all work. This gives us a good foundation to understand the difficulties in managing virtual container networks. Next we conducted research on the features that are missing to manage those networks and on how other vendors such as Kubernetes handle this.

Next, a PoC was built to improve the management of virtual networks on DC/OS. This PoC demonstrates that it is possible to abstract different SDN solutions into one central API. This API can be used by operators to create new policies regardless of the plugins implemented within the cluster. Furthermore, it enables developers to debug and see which policies are blocking traffic to their applications. Giving insight to developers proved crucial, as developers would otherwise end up removing their applications from the virtual network to make them work. This is an unwanted situation at Lunatech's automotive client because they need to isolate different parts of their data processing pipeline for privacy and security concerns. The newly built api-server is able to provide useful insights into the state of the network.

To conclude, we can say that the new solution improves the management of virtual networks on DC/OS, as operators and developers are able to retrieve and configure the virtual network state from a central place instead of having to dig into the concrete implementations of each SDN provider.

Chapter 7

Future work

At the moment the PoC includes an entire Kubernetes cluster, so that is one obvious point of improvement towards a final solution. As future work we could implement our own api-server for the network objects and make it run highly available on all the masters in a DC/OS cluster. Furthermore, we need to extend the dcos-cli to include the new network objects that we have defined. Once this is in place we can extend the web interface of DC/OS to query the api-server for network objects and display them in the web interface. This will help developers and operators to check which virtual networks are available and which network policies have been applied to their applications. In the future we could also create network diagrams to allow for inspection during a production problem or a security audit.

During this thesis we did not consider the creation of virtual networks from a central place, as this was hard to achieve without a system that could distribute files to all agents. If such a system is in place we could make use of it to distribute the CNI plugins and their configuration files for the virtual networks, allowing developers to create new virtual networks using the custom controller that we implemented. Once the creation of virtual networks is implemented we can also provide the list of available virtual networks back to the users for them to select a network for their container.

Another extension could be a plugin model for cloud providers. At the moment operators need to manually configure load balancers as ingress points on a cloud platform to point to the right services, also called North-South traffic for a cluster. In the future we could create a plugin per cloud provider that automatically configures the load balancers, certificates and DNS records to point to the right services. We could also think of chaining different kinds of plugins together, for example a plugin that logs all packets and another plugin that does traffic shaping of the same traffic.

We could also use the newly available information from the api-server to determine bottlenecks within clusters. When we have the list of current networks and policies we can leverage existing tools to monitor the network. Together with machine learning we could optimise traffic flows and apply the new network configuration using the same api-server. This could be a form of self-adapting networks, where we can ask the operator for confirmation or apply the new configuration automatically.

Bibliography

[1] About Calico. url: https://docs.projectcalico.org/latest/introduction/ (visited on 03/07/2018).
[2] Amazon Web Services. url: https://aws.amazon.com/ (visited on 06/19/2018).
[3] Apache Cassandra. url: https://cassandra.apache.org/ (visited on 06/28/2018).
[4] Apache Kafka. url: https://kafka.apache.org/ (visited on 06/19/2018).
[5] Apache Mesos. url: https://mesos.apache.org/ (visited on 06/28/2018).
[6] Apache Zookeeper. url: https://zookeeper.apache.org/ (visited on 07/04/2018).
[7] BPF and XDP Reference Guide. url: http://docs.cilium.io/en/latest/bpf/ (visited on 06/19/2018).
[8] Calico dataplane performance. url: https://www.projectcalico.org/calico-dataplane-performance/ (visited on 12/22/2017).
[9] Cilium. url: https://cilium.io/ (visited on 06/19/2018).
[10] Cilium: Concepts. url: https://cilium.readthedocs.io/en/latest/concepts/ (visited on 06/19/2018).
[11] CockroachDB. url: https://www.cockroachlabs.com/ (visited on 07/04/2018).
[12] Container Network Interface - networking for Linux containers. url: https://github.com/containernetworking/cni (visited on 06/19/2018).
[13] DC/OS Components. url: https://docs.mesosphere.com/latest/overview/architecture/components/ (visited on 06/28/2018).
[14] DC/OS Overlay. url: https://docs.mesosphere.com/latest/overview/design/overlay/ (visited on 07/01/2018).
[15] DC/OS Software Defined Networking. url: https://docs.mesosphere.com/latest/networking/SDN/ (visited on 07/01/2018).
[16] DC/OS: The Definitive Platform for Modern Apps. url: https://dcos.io/ (visited on 06/19/2018).
[17] Distributed reliable key-value store for the most critical data of a distributed system. url: https://github.com/coreos/etcd (visited on 06/19/2018).
[18] Docker: Build, Ship, and Run Any App, Anywhere. url: https://www.docker.com/ (visited on 06/19/2018).
[19] Docker security. url: https://docs.docker.com/engine/security/security/ (visited on 06/28/2018).
[20] Rajdeep Dua, Vaibhav Kohli, and Santosh Kumar Konduri. Learning Docker Networking. Packt Publishing Ltd, 2016.
[21] Google Cloud Computing, Hosting Services & APIs. url: https://cloud.google.com/ (visited on 06/19/2018).
[22] GRPC. url: https://grpc.io/ (visited on 06/19/2018).
[23] Kubernetes Architecture. url: https://kubernetes.io/docs/concepts/architecture (visited on 06/28/2018).
[24] Kubernetes: Cluster Networking. url: https://kubernetes.io/docs/concepts/cluster-administration/networking/ (visited on 07/23/2018).
[25] Kubernetes: Production-Grade Container Orchestration. url: https://kubernetes.io/ (visited on 06/19/2018).
[26] Arthur Chunqi Li. Calico: A Solution of Multi-host Network For Docker. url: http://chunqi.li/2015/09/06/calico-docker/ (visited on 12/22/2017).
[27] libnetwork - networking for containers. url: https://github.com/docker/libnetwork (visited on 07/23/2018).
[28] Mallik Mahalingam et al. Virtual extensible local area network (VXLAN): A framework for overlaying virtualized layer 2 networks over layer 3 networks. Tech. rep. 2014.
[29] Marathon: A container orchestration platform for Mesos and DC/OS. url: https://mesosphere.github.io/marathon/ (visited on 06/28/2018).
[30] Steven McCanne and Van Jacobson. “The BSD Packet Filter: A New Architecture for User-level Packet Capture.” In: USENIX winter. Vol. 93. 1993.
[31] Mesosphere Universe. url: https://mesosphere.github.io/universe/ (visited on 07/23/2018).
[32] Metronome: Cron Service for DC/OS. url: https://dcos.github.io/metronome/ (visited on 06/28/2018).
[33] Project Calico: Connecting to Tasks. url: https://docs.projectcalico.org/v2.6/getting-started/mesos/tutorials/connecting-tasks (visited on 07/05/2018).
[34] Project Calico Learn. url: https://www.projectcalico.org/learn (visited on 03/07/2018).
[35] Project Calico - Secure Networking for the Cloud Native Era. url: https://www.projectcalico.org/ (visited on 06/19/2018).
[36] Some standard networking plugins, maintained by the CNI team. url: https://github.com/containernetworking/plugins (visited on 06/19/2018).
[37] Supporting Container Images in Mesos Containerizer. url: https://mesos.apache.org/documentation/latest/container-image/ (visited on 06/28/2018).
[38] The Go Programming Language. url: https://golang.org/ (visited on 07/05/2018).
[39] The netfilter.org "iptables" project. url: https://www.netfilter.org/projects/iptables/index.html (visited on 06/19/2018).
[40] Vadim Tkachenko. Testing Docker Multi-Host Network Performance. url: https://dzone.com/articles/testing-docker-multi-host-network-performance (visited on 12/22/2017).
[41] Vagrant by HashiCorp. url: https://www.vagrantup.com/ (visited on 07/27/2018).
[42] VirtualBox. url: https://www.virtualbox.org/ (visited on 07/23/2018).
[43] Weave Net. url: https://www.weave.works/oss/net/ (visited on 06/19/2018).
[44] What is DC/OS? url: https://docs.mesosphere.com/latest/overview/what-is-dcos/ (visited on 06/28/2018).

Acronyms

API Application Programming Interface
AWS Amazon Web Services
BGP Border Gateway Protocol
BIRD BIRD Internet Routing Daemon
BPF Berkeley Packet Filter
cgroup Control Group
CNI Container Network Interface
CNM Container Network Model
CRD Custom Resource Definition
DC/OS The Datacenter Operating System
DHCP Dynamic Host Configuration Protocol
DNS Domain Name System
FQDN Fully Qualified Domain Name
GCP Google Cloud Platform
gRPC gRPC Remote Procedure Calls
HTTP Hypertext Transfer Protocol
IoT Internet of Things
IPAM IP Address Management
JIT Just In Time
NAT Network Address Translation
OS Operating System
PoC Proof of Concept
SDN Software Defined Networking
UCR Universal Container Runtime
VLAN Virtual LAN
VM Virtual Machine
VXLAN Virtual Extensible LAN
