Networking in Containers and Container Clusters

Victor Marmol, Rohit Jnagal, and Tim Hockin

Google, Inc., Mountain View, CA, USA
{vmarmol,jnagal,thockin}@google.com

Abstract
Containers are becoming a popular way of deploying applications quickly, cheaply, and reliably. As with the early days of virtual machines, a variety of container networking configurations have been proposed to deal with the issues of discovery, flexibility, and performance. Using Docker, we present the available networking configurations along with many of the popular networking setups, their uses, and their problems today. A second aspect we explore is containers in clusters, as these systems rely even more heavily on the network. We use Kubernetes as an example of this type of system and present the network setups of its core components: pods and services.

Keywords
containers, Kubernetes, Docker, libcontainer, namespaces, ipvlan, pods, microservices, container cluster, container manager.

Introduction
Containers, as a user-space abstraction, have been around for many years. At Google, our earliest uses of containers started around 2004 with basic chroot and ulimit setups and were enhanced in 2006 with the introduction of cgroups [1]. Docker's [2] release in 2013 brought an ease of use to containers that has helped them become ubiquitous. Developers targeting all kinds of applications, from large distributed systems to embedded networking components, are turning to containers (and Docker containers) to build, run, and deploy these applications. A common trend amongst all these varied uses is the heavy emphasis on network-based communications. More and more we see microservice architectures as the preferred way to use containers. This has reached a point where almost all container-based applications we see today rely on the network to perform their roles. The networking configurations used to achieve this communication are almost as varied as the uses themselves. This paper is a case study of networking in containers and container-based clusters today. Specifically, we explore the most popular networking setups for Docker containers. We also explore the networking setups used in container-based clusters by taking a deeper look at the networking aspects of Kubernetes, Google's container cluster manager [3].
In this paper, when we refer to a container we are referring to a group of processes that are isolated via a set of namespaces, cgroups, and capabilities.

History
The first container versions at Google were aimed at providing resource isolation while keeping high utilization. As such, the only networking goals for these containers were to provide a discoverable address and not lose performance. Port allocation, the allotment of ports as a first-class resource, solved these problems nicely for some time. Since all internal Google applications were trusted, there was no need to isolate the applications' network views from each other, or to protect the host from application containers. Outside Google there was a need for more isolated containers. LXC came up with multiple options to configure networking, starting with virtual Ethernet interfaces and over time covering VLAN, MACVLAN, and exposing dedicated physical devices [4]. These configurations provide a completely isolated networking view for apps running in containers. For clusters, the main networking goal is discovery and addressing for replicated containers as they move around through restarts, updates, and migrations.

Namespaces
Namespaces, as the name implies, restrict a particular kernel resource dimension to its own isolated view. The current set of supported namespaces are pid (process views), mnt (mounts), ipc (shared memory, message queues, etc.), UTS (host and domain names), user (uids and gids), and net (network view). Each net namespace gets its own isolated network interfaces, loopback device, routing table, and iptables chains and rules. By assigning IPs to virtual interfaces within a container, multiple apps in different containers can, for example, all have a port 80 exposed.
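
For illustration, the isolation provided by a net namespace can be observed directly with the ip tool. The following is a minimal sketch; the namespace name demo is arbitrary and the output shown is abridged and approximate. A freshly created net namespace contains only its own (down) loopback device and an empty routing table.

$ sudo ip netns add demo
$ sudo ip netns exec demo ip link list
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
$ sudo ip netns exec demo ip route list
$ sudo ip netns exec demo ip link set lo up
$ sudo ip netns del demo
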
Networking in Containers
The networking aspects of containers primarily center around the use of two of the kernel's namespaces: network and UTS. These namespaces allow a container to present itself to the world and be routed to as if it were a host.
Docker is comprised of various subcomponents that manage the container runtime, filesystem layers, and application images. The networking aspect is handled during container runtime setup. Docker's containers are, by default, created by libcontainer, a native implementation of a Linux container runtime [5]. It is primarily written in Go, with operations outside of the Go runtime written in C. Libcontainer provides a very low-level API, exposing nearly all container-related kernel knobs. This makes libcontainer very powerful, but also difficult to configure at times. In this case study we use Docker's use of libcontainer to showcase container network configurations.

Configurations
Libcontainer exposes its container configuration via a single JSON configuration file [6]. It allows the configuration of the container's hostname, routes, and networks. The hostname is used directly to set up the UTS namespace. The routes configuration passes in any route table entries we want to install inside the namespace. The network configuration specifies how the container is exposed to the outside world. It allows the user to configure the MAC address, IP (both IPv4 and IPv6) address, and default gateway assigned to the container. It also allows the configuration of the MTU and queue length of the network device.
Docker containers can by default reach the outside world, but the outside world cannot talk to the container. To expose container endpoints, Docker requires explicit specification of what ports to expose. A container can either expose all ports, or specific ones. Container ports can be bound to specific host ports, or Docker can pick dynamic ports from a predefined range. All port mappings are managed through iptables rules. Docker sets up a MASQUERADE rule for traffic from Docker containers, and rules to drop packets between containers unless specifically whitelisted (Figure 1).

$ sudo iptables -t nat -L -n
...
Chain POSTROUTING (policy ACCEPT)
target      prot opt source              destination
MASQUERADE  all  --  172.17.0.0/16       0.0.0.0/0

Chain DOCKER (2 references)
target      prot opt source              destination
DNAT        tcp  --  0.0.0.0/0           0.0.0.0/0      tcp dpt:8080 to:172.17.2.13:8080

Figure 1. MASQUERADE rules made by Docker on a host.
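
For illustration, rules like the DNAT entry in Figure 1 result from publishing container ports at docker run time. The commands below are a hedged sketch; the image name myapp and the container name web are placeholders.

# Bind container port 8080 to host port 8080.
$ docker run -d --name web -p 8080:8080 myapp

# Publish all ports exposed by the image on dynamically allocated host ports.
$ docker run -d -P myapp

# Show the resulting host port mappings.
$ docker port web
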
Strategies
Libcontainer supports the following network configurations.
loopback Creating a new network namespace always creates a loopback device. This strategy is primarily used to enable the interface and perform any specific configuration, or for containers that do not require external networking.
veth veth (virtual ethernet) creates a pair of network interfaces that act like a pipe. The usual setup involves creating a veth pair with one of the peer interfaces kept in the host network namespace and the other added to the container namespace. The interface in the host network namespace is added to a network bridge. In the Docker world, most of the veth setup is done by the Docker daemon; libcontainer just expects the network pair to be already set up, and it coordinates adding one of the interfaces to a newly-created container namespace (Figure 2). A manual sketch of this setup is shown at the end of this section.

On host
$ ip address list
...
3: docker0: mtu 1460 qdisc noqueue state UP group default
    link/ether 56:84:7a:fe:97:99 brd ff:ff:ff:ff:ff:ff
    valid_lft forever preferred_lft forever
1037: veth694884b: mtu 1460 qdisc pfifo_fast master docker0 state UP group default qlen 1000
    link/ether 12:0f:8b:dc:27:ac brd ff:ff:ff:ff:ff:ff

Inside container
$ ip address list eth0
1036: eth0: mtu 1460 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 02:42:ac:11:02:08 brd ff:ff:ff:ff:ff:ff
    valid_lft forever preferred_lft forever

Figure 2. Example of a veth container setup in Docker, inside and outside the container. The container's eth0 and the host's veth694884b interfaces form a veth pair linked to the Docker bridge docker0.

In practice, the use of separate bridges allows creating virtual networks between containers. In addition to setting up veth and bridge interfaces, containers are linked to each other and to the wider world through iptables rules. This strategy, with a network bridge, is significantly less performant than using the host network interface directly, since it requires forwarding/routing, which is not penalty free.
netns The netns strategy allows multiple containers to share the same network namespace and have access to the same network devices. A new container can be configured by passing in the name of the container that has loopback and/or veth configured. This option can also be used to directly expose physical host networking to a container by sharing the host network namespace.
MACVLAN/VLAN Libcontainer supports creation of VLAN and MACVLAN interfaces, but these are not used directly by Docker today. Instead, tools like pipework [7] can be used to set up complex networking configurations. This strategy has the best performance of the options mentioned here for environments where VLANs can be used. Libcontainer does not yet support the recently released ipvlan, which works similarly to MACVLAN but switches using L3 [8].
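
The veth strategy referenced above can be sketched manually with the ip tool; the steps below are roughly what Docker and libcontainer arrange between them, shown here only as an illustration. The namespace name c1, the interface names, and the addresses are hypothetical, and docker0 is assumed to already exist with its default 172.17.0.0/16 subnet and address 172.17.42.1.

# Create a network namespace standing in for the container.
$ sudo ip netns add c1

# Create the veth pair and move one end into the namespace.
$ sudo ip link add veth-host type veth peer name veth-c1
$ sudo ip link set veth-c1 netns c1

# Attach the host end to the docker0 bridge and bring it up.
$ sudo ip link set veth-host master docker0
$ sudo ip link set veth-host up

# Inside the namespace: rename the interface, assign an address, add a default route.
$ sudo ip netns exec c1 ip link set veth-c1 name eth0
$ sudo ip netns exec c1 ip address add 172.17.2.13/16 dev eth0
$ sudo ip netns exec c1 ip link set eth0 up
$ sudo ip netns exec c1 ip route add default via 172.17.42.1
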

Networking in Container Clusters
Container clusters are compute clusters where the primary unit of deployment is a container. In a container cluster, all jobs are run in containers, and that becomes the primary abstraction exposed to jobs (as opposed to a machine or a virtual machine). These types of clusters have risen in popularity as containers provide a lightweight and flexible way to isolate and deploy applications. They are also thought to be more efficient than the alternatives and thus provide higher overall machine utilization.
Kubernetes is an open-source cluster container manager released by Google and based on over 10 years of running container clusters internally. It schedules, deploys, and manages groups of containers on clusters of machines. Systems like Kubernetes are designed to provide all the primitives necessary for microservice architectures, and for these the network is vital. Applications perform nearly all their functions through the network: discovery, communication, and synchronization are all done through the network. This makes the network setup of any Kubernetes cluster one of its most important aspects.
Kubernetes provides a set of key abstractions users can use to architect their applications. We will examine two of those abstractions, pods and services, and how their networks are configured.

Pods
A pod, like a pod of whales, is a group of tightly coupled containers [9]. These containers are always deployed side by side on the same machine. In a pod, all containers share resources and share fate. All containers in a pod are placed inside the same network namespace such that they can talk to each other via localhost and share the port space. The pod is also assigned its own routable IP address. In this way pods are exposed outside of the Kubernetes node, are uniquely addressable, and can bind to any network port. To the external world, a pod looks like a virtual machine.
Kubernetes pods are comprised of groups of Docker containers. We create a "POD" infrastructure container with a small no-op binary. The other containers in the pod are then created, joining the existing network namespace of the POD container. Having an infrastructure container helps in persisting the network setup as individual containers in the pod are added, removed, or restarted. Today, we use the veth bridge networking strategy provided by Docker and assign it the pod's IP address. While the performance is tolerable for now, we would like to replace veth with an ipvlan setup instead. Kubernetes is unable to use MACVLAN since most software defined networks (including Google Compute Engine's) do not support this option.
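
The infrastructure-container arrangement described above can be approximated with stock Docker commands, since Docker allows a new container to join an existing container's network namespace. The sketch below is illustrative only; the image names pod-infra-image, app-a, and app-b are placeholders, and in a real cluster Kubernetes performs the equivalent steps automatically.

# Start the small no-op infrastructure container that holds the pod's network namespace.
$ docker run -d --name pod-infra pod-infra-image

# Start the application containers, sharing the infrastructure container's namespace.
$ docker run -d --net=container:pod-infra app-a
$ docker run -d --net=container:pod-infra app-b

The application containers see the same interfaces and IP address, can reach each other over localhost, and can be restarted individually without disturbing the pod's network setup, which is held by the infrastructure container.
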
An alternative to the IP-per-pod approach is dynamic port allocation and mapping. We have found that systems choosing the latter approach are significantly more complex, permeate ports throughout all of their configurations, are plagued by port conflicts, and can suffer from port exhaustion. Ports become a schedulable resource, which brings a tremendous amount of complexity. Applications must be able to change what ports they bind to at runtime and communicate their port(s) to their users and the service-discovery infrastructure. No longer can applications communicate with an IP and a pre-defined port; they now must know the IP and the associated port. We find the complexity of assigning a routable IP to each pod preferable to making ports a resource throughout Kubernetes.

Services
In Kubernetes, pods are considered ephemeral: they can come and go. Machine maintenance can, for example, cause them to be replaced with different instances that serve the same purpose. This makes it inconvenient and incorrect to address a pod by its IP address. This is where services come in. A service is an abstraction that allows the stable addressing of a group of pods (sometimes called a microservice) [10]. It functions very similarly to a group of pods placed in front of a load balancer. A service has an IP that is guaranteed to be stable. Requests to that IP are then load balanced to active pods that are behind the service. As pods come and go, the service updates the routes it can provide to incoming requests.
Services are implemented using a combination of iptables routes and a service proxy running on all Kubernetes nodes. When a pod on a node makes a request to a service through the latter's IP address, the iptables rules re-route the request to the service proxy on the node. The service proxy keeps an updated list of all the pods that can answer requests for this particular service. The proxy watches the shared cluster state for changes in service membership and can enact the change quickly. Given the list of member pods, the service proxy does simple client-side round robin load balancing across the member pods. This service proxy allows applications to function unmodified in a Kubernetes cluster. If the overhead of the service proxy is a concern, applications can perform the membership queries and routing themselves, just as the proxy does. We have not noticed a significant performance impact from the use of the service proxy. We have work underway to replace the service proxy completely with iptables routes in order to remove the need for the proxy.
Services in Kubernetes tend not to be exposed to the outside world, since most microservices simply talk to other microservices in the cluster. Public-facing services are extremely important as well, since some microservice must eventually provide a service to the outside world. Unfortunately, public-facing services are not completely handled today in Kubernetes, since there is no external proxy to act in a manner similar to the on-node service proxy. The current implementation of public services has a load balancer targeting any node in a Kubernetes cluster. Requests that arrive on this node are then rerouted by the service proxy to the correct pod answering requests for the service. This is an active area of work where a more complete solution is still being designed.
An alternative to the Kubernetes service abstraction is to use DNS to redirect to pods. This approach was not chosen due to the inconsistent handling of DNS lookups in DNS client libraries. Many clients do not respect TTLs, do not round robin across members, or do not re-do lookups when membership changes, and we felt that DNS clients would not update membership quickly enough.
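
The on-node redirection described above can be pictured, in its simplest form, as a single NAT rule that sends traffic for a service IP to a local proxy port. The rule below is a conceptual sketch, not the rules Kubernetes actually installs; the service IP 10.0.0.10, service port 80, and proxy port 10080 are hypothetical.

# Illustrative only: redirect locally generated traffic for the service IP
# to the service proxy, which load balances across the member pods.
$ sudo iptables -t nat -A OUTPUT -p tcp -d 10.0.0.10 --dport 80 \
    -j REDIRECT --to-ports 10080

A similar rule in the PREROUTING chain would catch traffic arriving from the pods' veth interfaces on the node.
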

Discovery
Services are the key to all communication within a Kubernetes cluster. Discovering the IP address of those services is done in two ways. The first approach is through environment variables exposed in the pods: each pod in the cluster has a set of environment variables exposed with the IP addresses of each service. Of course, this does not scale well with larger clusters. More recently, service discovery was delegated to an internal cluster DNS service. Each service gets a DNS address in the cluster that resolves to the service IP. This was found not to have the DNS problems mentioned before, since the service IP address is unique and stable.

Configurations
We will briefly describe how the Kubernetes network is configured in some of the more popular network strategies.
Andromeda Andromeda is the software defined network underlying Google Compute Engine [11]. Andromeda allows us quite a bit of flexibility in programming the underlying network fabric. We allocate a subnet of 256 internal 10-dot IPv4 IPs on each node to be used for pods and ask the fabric to route this /24 range to the specific node (a minimal sketch of this per-node routing appears below). Service IP addresses are allocated from a different part of the 10-dot pool. Pod-to-pod communication uses the internal 10-dot IPs, but communication with the internet gets NAT'ed through the node, as nodes have internet-routable IPs.
Flannel Flannel is an overlay network developed by CoreOS [12]. Flannel uses universal TUN/TAP and UDP to encapsulate IP packets. It creates a virtual network very similar to the one described above for Andromeda by allocating a subnet to each host for use by its pods.
Others Using a virtual bridge on each node and connecting the bridges across nodes is another way to deploy more complex topologies. OVS on Kubernetes is enabled with such a setup [13]. Weave uses a similar technique to build an overlay network [14].
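
The per-node subnet scheme used by Andromeda and Flannel above amounts to ordinary IP routing: each node owns a /24 for its pods, and the rest of the network routes that /24 to the node. The sketch below shows the idea with plain ip route commands; the addresses (pod subnet 10.245.1.0/24, node address 10.240.0.2) and the bridge name cbr0 are hypothetical, and in practice the fabric or the overlay programs the equivalent state.

# Elsewhere in the network: reach node A's pod subnet via node A.
$ sudo ip route add 10.245.1.0/24 via 10.240.0.2

# On node A: the pod subnet is reachable through the local container bridge.
$ sudo ip route add 10.245.1.0/24 dev cbr0
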
Resource Management
At the cluster level, networking resources can be managed by detecting network capacity at each node and allocating it to running containers like any other resource. However, a practical resource model for inter-cluster networking requires detailed topology information to be really effective. In most setups, network bandwidth management responsibility is shared between the fabric provider and the cluster manager. Given a topology model, containers can be scheduled to (re-)balance the networking load. In addition, when hotspots are identified, a node-local action can be initiated to limit bandwidth-hogging containers.

Future Work
Kubernetes would like to move to virtual migratable IPs so that container migration becomes possible within the cluster. There is also work underway to introduce "real" load balancing in the service proxy. This will allow the load balancer to balance on things like the utilization and health of the pods behind the service.

Acknowledgements
The authors would like to acknowledge and thank the entire Kubernetes, Docker, and libcontainer teams, contributors, and communities. Their work is being highlighted and described here. We would also like to thank John Wilkes and Mahesh Bandewar for their review and guidance on this paper.

References
1. Victor Marmol, Rohit Jnagal, "Containers @ Google", Sunnyvale, CA, accessed February 6, 2014, http://www.slideshare.net/vmarmol/containers-google
2. Docker, "Docker", accessed February 6, 2014, https://github.com/docker/docker
3. Kubernetes, "Kubernetes", accessed February 6, 2014, https://github.com/googlecloudplatform/kubernetes
4. Daniel Lezcano, "lxc.container.conf", accessed February 6, 2014, http://man7.org/linux/man-pages/man5/lxc.container.conf.5.html
5. Libcontainer, "Libcontainer", accessed February 6, 2014, https://github.com/docker/libcontainer
6. Libcontainer, "Libcontainer Config", accessed February 6, 2014, https://github.com/docker/libcontainer/blob/master/config.go
7. Jérôme Petazzoni, "Pipework", accessed February 6, 2014, https://github.com/jpetazzo/pipework
8. Mahesh Bandewar, "ipvlan: Initial check-in of the IPVLAN driver.", accessed February 6, 2014, http://lwn.net/Articles/620087/
9. Kubernetes, "Pods", accessed February 6, 2014, https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/pods.md
10. Kubernetes, "Services", accessed February 6, 2014, https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md
11. Amin Vahdat, "Enter the Andromeda zone - Google Cloud Platform's latest networking stack", accessed February 6, 2014, http://googlecloudplatform.blogspot.com/2014/04/enter-andromeda-zone-google-cloud-platforms-latest-networking-stack.html
12. CoreOS, "Flannel", accessed February 6, 2014, https://github.com/coreos/flannel
13. Kubernetes, "OVS Networking", accessed February 6, 2014, https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/ovs-networking.md
14. Zettio, "Weave", accessed February 6, 2014, https://github.com/zettio/weave
