Budapesti Műszaki és Gazdaságtudományi Egyetem Villamosmérnöki és Informatikai Kar Távközlési és Médiainformatikai Tanszék

Application of Extended Berkeley Packet Filters in Cloud Environment

Diplomaterv

Készítette: Bartha Csongor
Konzulens: Dr. Simon Csaba
Ipari konzulens: Szabó Gergely

December 21, 2019

Contents

Kivonat
Abstract
1 Introduction
2 Berkeley Packet Filter mechanisms and system components
2.1 Concepts
2.1.1 Cloud computing
2.1.2 Kubernetes
2.1.3 User- and kernelspace
2.1.4 System calls
2.1.5 Userspace packet-filtering
2.2 The classic Berkeley Packet Filters
2.2.1 Compiling cBPF
2.2.2 PCAP
2.2.3 Tcpdump
2.3 Extended Berkeley Packet Filters
2.3.1 How eBPF works
2.3.2 Code verification
2.3.3 Compiling eBPF
2.3.4 XDP - eXpress Data Path
2.3.5 An eBPF vulnerability
2.3.6 The overhead of eBPF
2.3.7 Further possibilities
2.4 Monitoring toolset
2.4.1 Prometheus
2.4.2 Grafana
2.4.3 AlertManager
2.5 Projects leveraging eBPF technology
2.5.1 Tools and libraries using eBPF
3 The main approaches of programming eBPF
3.1 Raw BPF
3.2 BCC - BPF Compiler Collection
3.2.1 Probe-types
3.3 High level tracing languages
3.3.1 BPFtrace
3.3.2 DTrace
3.3.3 Ply
4 eBPF in practice
4.1 Using eBPF in Kubernetes
4.1.1 Cilium
4.1.2 Weave Scope
4.2 Tracing and monitoring with BCC and BPFtrace
4.2.1 BCC
4.2.2 BPFtrace
4.3 BCC in practice
4.3.1 The BPF section
4.3.2 The Python section
4.3.3 Getting data from the kernel
4.4 The ebpf-exporter
4.4.1 Configuration and structure
5 The implementation in Kubernetes
5.1 The infrastructure of the Kubernetes cluster
5.2 Setting up Prometheus
5.2.1 General configuration
5.3 Creating the ebpf exporter
5.3.1 The Dockerfile
5.3.2 Privileged mode
5.3.3 Accessing the image from Google Cloud Platform
5.4 Deploying the monitoring stack in Kubernetes
5.4.1 Deploying the exporter
5.4.2 Deploying Prometheus
5.4.3 Deploying Grafana
5.5 The BPF programs in the exporter
5.5.1 Cachestat
5.5.2 Tcp-counter
5.5.3 The resulting time-series
6 Measuring system performance
6.1 The test system
6.1.1 Using Prometheus
6.1.2 The exporters
6.1.3 Visualizing the data with Grafana
6.2 Programs and scripts for testing
6.2.1 Testing tcp-counter
6.2.2 Testing cachestat
6.2.3 The spikes on the graphs
6.3 Discussion
7 Summary
Bibliography
Appendices
A.1 Filelife
A.2 Ebpf-exporter Dockerfile
A.3 The yaml file configuring the ebpf-exporter daemonset
A.4 The yaml file of the configMap for Prometheus

HALLGATÓI NYILATKOZAT

Alulírott Bartha Csongor, szigorló hallgató kijelentem, hogy ezt a diplomatervet meg nem engedett segítség nélkül, saját magam készítettem, csak a megadott forrásokat (szakirodalom, eszközök stb.) használtam fel. Minden olyan részt, melyet szó szerint, vagy azonos értelemben, de átfogalmazva más forrásból átvettem, egyértelműen, a forrás megadásával megjelöltem.

Hozzájárulok, hogy a jelen munkám alapadatait (szerző(k), cím, angol és magyar nyelvű tartalmi kivonat, készítés éve, konzulens(ek) neve) a BME VIK nyilvánosan hozzáférhető elektronikus formában, a munka teljes szövegét pedig az egyetem belső hálózatán keresztül (vagy autentikált felhasználók számára) közzétegye. Kijelentem, hogy a benyújtott munka és annak elektronikus verziója megegyezik. Dékáni engedéllyel titkosított diplomatervek esetén a dolgozat szövege csak 3 év eltelte után válik hozzáférhetővé.

Budapest, December 21, 2019

Bartha Csongor
hallgató

Kivonat

A modern felhő alapú megoldások egyik fontos komponense a Kubernetes konténer menedzsment rendszer. A Kubernetes klaszterekbe telepített nagy komplexitású szolgáltatások üzemeltetésének elengedhetetlen része a rendszert alkotó konténerek és egyéb erőforrások teljesítményének felügyelete, mivel az jelentős mértékben meghatározza a szolgáltatások minőségét.

A szakdolgozatomban bemutatom a hatékony felügyeleti lehetőségeket, valamint az általam egy Kubernetes rendszer teljesítményének monitorozására kiválasztott Berkeley csomagszűrő új változatát (extended Berkeley Packet Filter - eBPF). Az eBPF egy általános célú kernel mechanizmus, mely a felhasználói térben leírt csomagkezelési szabályokon túl egyszerű programok futtatását is lehetővé teszi a kernelben különféle csatlakozási pontok használatával ("probe"), ami által tetszőleges adatok nyerhetők ki az adott rendszerről. Részletesen bemutatom az eBPF programozás menetét és áttekintem a jelenleg elérhető fontosabb eBPF alapú felhasználási eseteket.

Bemutatom egy eBPF alapú Kubernetes monitorozási rendszer tervét, annak alkotóelemeit és a megvalósítás lépéseit. Mérések alapján megvizsgálom, hogy miként lehet alkalmazni két erőforrástípus (TCP hálózati forgalom és cache memória) monitorozására. Dolgozatomat az eredmények értékelésével és a jövőbeli bővítési lehetőségekkel zárom.

Abstract

The wide-scale adoption of container-based virtualization technologies is supported by the Kubernetes container management system, which provides the tools required to implement reliable and scalable services. When operating services of great complexity, it is essential to monitor the containers and the other computing resources that make up a Kubernetes cluster, as their performance influences the quality of those services to a great extent.

In my thesis, I present how the Berkeley Packet Filter, and especially its enhanced version (the extended Berkeley Packet Filter - eBPF), can be used for detailed monitoring of cloud systems. eBPF is a mechanism that allows packet filters and other small data-collecting programs, written in userspace, to be executed inside the kernel by using different kinds of probes. I present eBPF in detail, including its programming possibilities and its applications, with special focus on the networking and monitoring areas. I also illustrate the more important use cases related to eBPF that are currently available.

I present the design of an eBPF-based Kubernetes monitoring system along with its components, and also the detailed steps of its implementation. I examine how this system can be used for collecting metrics and monitoring two kinds of resources (TCP network traffic and cache memory). I finish my thesis by evaluating the results of the measurements, obtained with the help of some of the most popular monitoring tools, and by covering how this system could be extended in the future.

1. Introduction

Nowadays, more and more companies and projects shift their focus to container-based virtualization. Containers have taken the place of virtual machines in many areas, especially in cloud services, primarily due to their much higher performance and lower costs. The cloud model provides convenient, on-demand access to customizable, shared resources like servers, networks, services, etc., with ideal operational management costs compared to traditional models. Cloud services are also scalable and provide high availability.

One of the fundamental concepts covered in this thesis is the Berkeley Packet Filter (BPF), which is an in-kernel virtual machine with the initial purpose of network packet filtering and processing [39]. The later iteration of BPF, called the extended Berkeley Packet Filter (eBPF), however, offers many more possibilities: programs can be defined in userspace and executed in the kernel in a protocol-independent way, while saving a great amount of resources [35]. These features make BPF a very useful tool for monitoring, logging, kernel debugging, security, packet processing, and many more purposes.

The topic addressed in this work is how useful eBPF's tracing and monitoring capabilities can be in a Kubernetes cluster environment. I would also like to find out which aspects of eBPF are worth using in this context, and which kinds of metrics I can retrieve the most information from. I will write some custom tracing programs using eBPF, then import the results into Prometheus for processing and visualizing the collected data.

In the second chapter, I give a technological introduction, explaining the basic concepts that are crucial for this topic, and I also write about the advantages of the classic and the extended Berkeley Packet Filters. The chapter closes with research about projects and companies making use of eBPF. The third chapter focuses on the different ways of eBPF programming, mainly used for tracing with the help of toolkits like the BPF Compiler Collection or BPFtrace. In the fourth chapter, I explain in detail how BCC can be used. The fifth chapter explains the implementation of a monitoring stack running BPF programs in Kubernetes. The sixth chapter is about storing and managing the data extracted from the cluster for evaluation. Finally, I conclude my thesis.

2. Berkeley Packet Filter mechanisms and system components

This chapter presents the monitoring and tracing capabilities of the different versions of the BPF filters, along with the concepts and components necessary to understand them. The chapter also contains market research about the more mainstream products and companies using these filtering and tracing technologies.

2.1 Concepts

In the following section, I introduce the most important ideas and tools that serve as crucial building blocks for this topic, such as how the kernel and userspace work alongside each other, and how packet filtering in userspace works.

2.1.1 Cloud computing

By cloud, we mean a model for providing convenient, on-demand access to customizable, shared resources, which can be servers, networks, storage, applications and other services. From an administrator's point of view, the operational costs of managing these services are minimal compared to traditional models. The other main advantages of using cloud services are scalability and high availability. Cloud services can be grouped in several ways. By access, we differentiate public, private, hybrid and community clouds. By the type of provided services, we can differentiate Software as a Service, Platform as a Service, Infrastructure as a Service, Container as a Service, and many more. To learn more about the history and structure of cloud computing, it is recommended to read IBM's article [27].

2.1.2 Kubernetes

Kubernetes, or k8s, is a portable open-source platform for managing and orchestrating services running in containers in a relatively easy and efficient way. It was originally developed by Google. It automates the deployment, scaling and updating of containerized applications, and also possesses self-healing capabilities along with declarative configuration.

Kubernetes is ideal for running cloud-native [14] applications in clusters whose hosts can span public, private or hybrid clouds. More about Kubernetes can be found on its official website [41].

2.1.3 User- and kernelspace

On most operating systems, including Linux, the execution environment is logically divided into two parts, namely userspace and kernelspace [66]. Userspace provides memory for the normal user processes (other than the kernel) to run in, while kernelspace memory stores the code and the processes of the kernel itself. The kernel can access both the kernelspace and userspace memory locations, but the reach of userspace processes is limited to their own part.

2.1.4 System calls

A system call is the way for programs running in userspace to interact with the operating system. It is a service request issued by a process and addressed to the kernel of the operating system, which then carries it out. These services can range from creating processes to device handling and memory management. The mechanism is made possible via an Application Programming Interface (API) that triggers a software interrupt, which the kernel handles afterwards. System calls are the only way for a userspace application to reach the kernel. When a process makes a system call, a context switch happens between user and kernel modes, which is the procedure of storing the state of a process or a thread for later restoration. This way, the process can resume its execution from the same point where it was halted.
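The following minimal C program is a sketch of this interaction (assuming a Linux system with glibc): it requests the same service, the process ID, once through the usual library wrapper and once through the raw syscall(2) interface; both end up trapping into the kernel.

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    /* Both calls below cross the user/kernel boundary: the CPU switches to
     * kernel mode, the kernel serves the request, and the process resumes
     * in user mode exactly where it was halted. */
    pid_t via_wrapper = getpid();              /* glibc wrapper around the syscall */
    long  via_raw     = syscall(SYS_getpid);   /* raw system-call interface */

    printf("getpid() = %d, syscall(SYS_getpid) = %ld\n", via_wrapper, via_raw);
    return 0;
}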

2.1.5 Userspace packet-filtering

In recent years, the Linux kernel started to reach its limits in terms of high-performance packet filtering. Programmers started to develop tools and methods to overcome them, which led to the technique of kernel bypass. Kernel bypass means that the NIC (Network Interface Controller) and all packet processing can be handled from userspace, while all packets skip the kernel's networking layer, which yields a significant increase in performance and a similar decrease in overhead. Kernel bypass can also mean running processes on the NIC's processing unit instead of the host machine's, but from the point of view of this thesis, the userspace controlling possibilities are more interesting.

Iptables

In the field of packet filtering, the most frequently used type of application is the firewall. On Linux systems, it was first implemented by ipchains in Linux kernel v2.2 in 1998, and it was only capable of stateless packet filtering. Ipchains was the ancestor of the well-known iptables Netfilter module [45], which makes it possible to configure the firewall tables provided by the Linux kernel. Iptables runs in userspace and is capable of IPv4 stateless and stateful packet filtering, packet modification and NAT (Network Address Translation) according to predefined rules organized into specific chains.

By the original design, three main filtering purposes of iptables can be distinguished:

• INPUT chain - Protecting the local environment from unwanted incoming traffic

• OUTPUT chain - Keeping applications from sending specific traffic

• FORWARD chain - Filtering the packets, that the system routes or forwards

The problem with this flow is that the list of rules becomes a bottleneck. Every packet being processed has to be matched against each and every rule of the respective chain, which means the cost increases linearly with every rule added. In an environment that contains thousands of rules, this causes a massive impact on performance and latency, which makes this kind of firewall solution impracticable.

Ipset

To mitigate the problem (or at least the symptoms) that iptables raised, ipset [54] was released next and became part of the Netfilter framework [55]. Ipset is a utility for managing IP sets inside the Linux kernel. The sets can store various types of networking properties, like IP addresses or port numbers, in hash tables, which serves as a compression method for storing the defined rules. This way, a considerable performance gain has been achieved.

Unfortunately, ipset didn't solve every previously occurring problem. For instance, Haibin Michael Xie, a senior architect at Huawei, explains in one of his KubeCon talks [71] that kube-proxy (one of Kubernetes' [41] components, which uses iptables for NAT purposes) shows unsteady latencies and decreasing performance as the number of services increases. Another serious disadvantage is that iptables does not support incremental updates of the rules, which means that adding a single new rule implies replacing the entire list of rules being expanded. At a scale of 20,000 Kubernetes services using 160,000 rules, it can take up to five hours just to do that.

General drawbacks of userspace networking

The major problem with userspace networking is that the programs running in userspace need to manage the hardware directly, bypassing the kernel. These kinds of userspace drivers are usually less tested and get less support than an operating system's kernel. By bypassing the kernel, userspace programs bypass the kernel's networking functionality as well, so they have to re-implement it.

2.2 The classic Berkeley Packet Filters

The Berkeley Packet Filter (BPF) is an in-kernel virtual machine model, first introduced and published in 1992 by Steven McCanne and Van Jacobson [52]. It was originally created for filtering and processing packets. The Linux kernel first included BPF support with kernel version v2.5. With the use of the (classic) Berkeley Packet Filter, one can access raw socket interfaces to the data-link layer ("L2" of the OSI model [65]) in a protocol-independent way. This makes monitoring and fast filtering of packets possible at the kernel level. These filters are defined in userspace, executed in the kernel, and are also architecture-independent.

There are numerous benefits of using BPF. One of them is that packets irrelevant to the defined filter are not copied from the kernel to userspace; they are dropped at the kernel level. This property, the dynamic replacement of filter modules and the JIT (Just-in-Time) compilation right in the kernel mean a great performance gain over other solutions. This legacy version of BPF has since been renamed to "cBPF" (classic BPF), and nowadays BPF usually refers to the extended BPF, which is explained in a later section.

2.2.1 Compiling cBPF

The compilation of cBPF used to include the following steps: the code of the filter was compiled into bytecode with libpcap (explained in subsection 2.2.2), then it could be loaded into the cBPF virtual machine inside the kernel, typically by attaching it to a socket with the SO_ATTACH_FILTER socket option.

Operations

In BPF, several operations can be performed on the packet to be filtered. These can be defined with so-called "opcodes". Writing low-level BPF filters is not a particularly easy job, but there is an assembler built into the kernel tree just for that purpose, under the name "bpf_asm" [22]. It corresponds to a BPF compiler that translates a program into these opcodes. An example BPF program that filters ARP (Address Resolution Protocol) packets, written in the syntax of bpf_asm, is shown on listing 2.1. The tool supports two output formats, presented on the compiled ARP-filtering code: C-style (listing 2.2) and raw (listing 2.3).

There are three main elements of the classic BPF:

• A - a 32-bit accumulator

• X - a 32-bit register

• M[] - an array of sixteen 32-bit registers, also known as "scratch memory"

Instruction types

There are six main instruction types in BPF:

• LOAD - Copy a given value into A or X.

• STORE - Copy A or X into M[].

• BRANCH - Jump to a label by the given condition.

• ALU - Perform an arithmetical or logical operation on A or X.

• MISCELLANEOUS - Operations that do not fit into the other categories.

• RETURN - Return from the filter.

Listing 2.1: A bpf_asm program that only allows ARP packets (0x806) through. It loads a half-word into the A accumulator, checks whether the packet's EtherType is ARP (0x806), and returns accordingly.

ldh [12]
jne #0x806, drop
ret #-1
drop: ret #0

Listing 2.2: C-style output of the same ARP-filtering program, as emitted by bpf_asm. The columns stand for the opcode, the jump offset if the condition is true, the jump offset if the condition is false, and the generic value field (k), in this order.

/* { op, jt, jf, k }, */
{ 0x28, 0, 0, 0x0000000c },
{ 0x15, 0, 1, 0x00000806 },
{ 0x06, 0, 0, 0xffffffff },
{ 0x06, 0, 0, 0x00000000 },

Listing 2.3: Raw (decimal) output of the same ARP-filtering program, as emitted by bpf_asm; the leading number is the instruction count.

4, 40 0 0 12, 21 0 1 2054, 6 0 0 4294967295, 6 0 0 0

You can find more information about the syntax of BPF in the official documentation [39].
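To illustrate how such a compiled filter reaches the kernel, the sketch below attaches the four instructions of the ARP filter (in the C-style form of listing 2.2) to a raw socket with the SO_ATTACH_FILTER socket option. Error handling is reduced to a minimum and running it requires the CAP_NET_RAW capability; it is only meant as a minimal illustration of the mechanism, not a complete tool.

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/filter.h>

int main(void)
{
    /* The ARP filter of listings 2.1-2.3 as {code, jt, jf, k} quadruples. */
    struct sock_filter code[] = {
        { 0x28, 0, 0, 0x0000000c },   /* ldh [12]           load the EtherType */
        { 0x15, 0, 1, 0x00000806 },   /* jne #0x806, drop   is it ARP?         */
        { 0x06, 0, 0, 0xffffffff },   /* ret #-1            accept the packet  */
        { 0x06, 0, 0, 0x00000000 },   /* ret #0             drop the packet    */
    };
    struct sock_fprog prog = {
        .len    = sizeof(code) / sizeof(code[0]),
        .filter = code,
    };

    int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0) { perror("socket"); return 1; }

    /* From this point on, the kernel runs the filter on every frame arriving
     * on the socket; non-ARP frames are dropped at the kernel level. */
    if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
        perror("setsockopt(SO_ATTACH_FILTER)");
        return 1;
    }

    /* recv() on the socket would now only ever return ARP frames. */
    close(sock);
    return 0;
}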

2.2.2 PCAP

PCAP (Packet CAPture) provides an API (Application Programming Interface) for capturing and filtering packets as well as monitoring networks. This API is written in C. On Unix-like systems, PCAP is implemented in the libpcap library, which was developed by the community that created tcpdump, explained in the next subsection.

PCAP uses a high-level filter syntax; compiling such a filter with pcap_compile() converts it into BPF code. Then the compiled filter can be applied with the pcap_setfilter() function, and the packet sniffing can start.

More information about the steps of packet sniffing can be found at the official webpage of tcpdump [4].
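A minimal libpcap sketch of this workflow is shown below; the interface name "eth0" and the "arp" filter expression are arbitrary example choices, and error handling is kept short.

#include <stdio.h>
#include <pcap/pcap.h>

/* Invoked by pcap_loop() for every packet that passed the compiled BPF filter. */
static void on_packet(u_char *user, const struct pcap_pkthdr *hdr, const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("captured %u bytes\n", hdr->caplen);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program fp;

    pcap_t *handle = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    /* pcap_compile() turns the high-level "arp" expression into BPF code,
     * pcap_setfilter() installs it on the capture handle. */
    if (pcap_compile(handle, &fp, "arp", 1, PCAP_NETMASK_UNKNOWN) == -1 ||
        pcap_setfilter(handle, &fp) == -1) {
        fprintf(stderr, "filter error: %s\n", pcap_geterr(handle));
        return 1;
    }

    pcap_loop(handle, 10, on_packet, NULL);   /* sniff ten matching packets */
    pcap_close(handle);
    return 0;
}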

2.2.3 Tcpdump

Tcpdump is a program, running in userspace, for dumping traffic on a given network interface according to filter parameters given as a boolean expression. Tcpdump prints the content only of those packets that match this expression. The program captures packets according to these parameters until its process gets interrupted or killed. More information about specifying filters and other parameters is documented in the official manpage of tcpdump [58].

One of the more interesting facts about tcpdump is that it uses PCAP (2.2.2), and thereby BPF, to filter at the kernel level. We can access the BPF filter generated by the program with optional dumping switches. There are three levels of dumping the filter:

• -d: dump in human-readable bpf_asm format.

• -dd: dump in the form of a C program fragment.

• -ddd: dump as decimal numbers - in "BPF-raw" format.

Using these switches, the same kinds of output can be generated as with bpf_asm, introduced in the listings of the Compiling cBPF subsection (2.2.1).

2.3 Extended Berkeley Packet Filters

Some aspects of BPF did not prove to be sufficient over time. For example, the virtual machine behind BPF had become outdated, as modern processors started using 64-bit registers and gained improved instruction sets required for multiprocessor environments.

As a result, in 2013 cBPF was reshaped and extended with multiple new functionalities and performance improvements. The first kernel version that added support for eBPF was v3.15 [32]. New elements like maps appeared, and the JIT compiler was rewritten. Thanks to these upgrades, eBPF (extended Berkeley Packet Filter) can cut down the performance cost caused by hardware isolation by 25% to 33%, according to engineers at Microsoft [51].

One of the more significant changes introduced new hooks in the kernel that programs can be attached to. All of this means that many new use cases can now be handled with eBPF programs, like intrusion detection, SDN configuration, DDoS mitigation and more.

Other important improvements over cBPF:

• eBPF uses 64-bit registers and increased the number of available registers from two to ten. It also introduced more opcodes.

• It has been separated from the networking subsystem, whereby its capabilities and purposes grew significantly.

• It introduced so-called "tail calls", which get around the BPF program-size limitation (at most 4096 instructions) by giving a program the ability to pass control on to another eBPF program, as sketched below.
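The following restricted-C sketch illustrates the tail-call mechanism, roughly in the style of the kernel's samples/bpf programs; the legacy bpf_map_def map definition, the bpf_helpers.h header and the section names are assumptions of this sketch, and a userspace loader would fill the program array with the programs to jump to.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* A program array: each slot may hold another loaded eBPF program. */
struct bpf_map_def SEC("maps") jmp_table = {
    .type        = BPF_MAP_TYPE_PROG_ARRAY,
    .key_size    = sizeof(__u32),
    .value_size  = sizeof(__u32),
    .max_entries = 4,
};

SEC("socket")
int dispatcher(struct __sk_buff *skb)
{
    /* Hand control over to the program stored at index 0. On success this
     * call never returns, so a long processing pipeline can be split into
     * several smaller programs, each staying below the size limit. */
    bpf_tail_call(skb, &jmp_table, 0);

    /* Reached only if no program is loaded at that index. */
    return 0;
}

char _license[] SEC("license") = "GPL";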

These improvements and extensions in the virtual machine of eBPF are illustrated on figure 2.1.

Figure 2.1: The difference between the virtual machines of cBPF and eBPF 1

The eBPF functionality can be divided into two domains: kernel tracing or event monitoring, and network programming. (The kernel tracing is possible without recompiling the kernel itself.) This thesis focuses only on the former.

Many eBPF example programs can be found at the Linux kernel sources’ repository [67].

2.3.1 How eBPF works

eBPF programs can be attached to suitable code paths in the kernel. When the kernel reaches such a point in the code, all of the attached eBPF programs get triggered and executed. Owing to its ancestor, eBPF is still powerful at network programming, so it is possible to write programs with it that attach to a socket and do network filtering or classification tasks.

1Source: https://www.netronome.com/blog/bpf-ebpf-xdp-and-bpfilter-what-are-these-things-and- what-do-they-mean-enterprise/

The kernel is capable of restricting which process can use which system call. This is achieved with seccomp BPF, which one can read more about in the Linux kernel's documentation [29].
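As an illustration, the sketch below installs a minimal seccomp filter, written as classic BPF, that denies the bpf() system call for the calling process and allows everything else; this is roughly the kind of restriction container runtimes apply by default. The errno value 1 in the deny rule stands for EPERM, and the snippet assumes an architecture where __NR_bpf is defined.

#include <stdio.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
    struct sock_filter filter[] = {
        /* Load the system call number from the seccomp_data structure. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is bpf(), fall through to the deny rule, otherwise skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_bpf, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 1 /* EPERM */),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len    = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a seccomp filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        perror("prctl");
        return 1;
    }

    printf("bpf() is now denied for this process\n");
    return 0;
}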

2.3.2 Code verification

Verification of the BPF or eBPF code is essential, because malicious code could produce kernel crashes and other major security issues. In the first versions, the cBPF bytecode was injected into the kernel from userspace, the verification step came after that, and it was followed by attaching the program to a socket. One of the most important conditions to check is that the eBPF code doesn't contain any loops. Without loops, the program cannot get stuck in infinite loops and thereby cannot hang the kernel. To verify this, the in-kernel verifier runs a depth-first search (DFS) on the eBPF program's control-flow graph. Unreachable instructions in the code are forbidden, as are out-of-bound jumps and accessing data that is out of range. The verifier then simulates the execution of the program by stepping through its instructions one by one, validating the state of the virtual machine before and after executing the current step. Validating at this stage means checking the registers' and the stack's state to see whether the instruction violates the previously mentioned forbidden premises. Another security feature is that the content of uninitialized registers cannot be read. Violating any of these terms causes the eBPF program to be rejected immediately. Figure 2.2 shows where the verifier processes and supervises the BPF bytecode inside the kernel, compiled from the observability program in userspace.

Figure 2.2: The structure of performance analysis in eBPF using probes 2

2Source: https://www.slideshare.net/brendangregg/bpf-tracing-and-more

2.3.3 Compiling eBPF

Since classic BPF, the LLVM Clang [48] compiler has been extended with an eBPF backend. This has made it possible to write eBPF programs in C and compile them into bytecode. The resulting bytecode, contained in ELF object files, can be loaded into the kernel using the bpf() system call. The kernel also provides a library containing helper functions for loading eBPF programs, called libbpf; it functions as a wrapper over bpf() and the other BPF-related syscalls, making development easier. After the verification step, the in-kernel JIT compiler transforms the eBPF bytecode into platform-specific native code (assembly instructions), which is demonstrated on figure 2.3. I explain libbpf, the object files and other helper tools in more detail in subsection 2.5.1.
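A minimal example of such a restricted-C program is sketched below; the tracepoint name, the bpf_helpers.h header and the build command in the comment are assumptions of this sketch rather than the only possible choices.

/* minimal.bpf.c - compiled with something like:
 *   clang -O2 -target bpf -c minimal.bpf.c -o minimal.bpf.o
 * The resulting ELF object can then be loaded with libbpf or via bpf(). */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_execve")
int count_execve(void *ctx)
{
    /* bpf_printk() writes to the kernel trace pipe, handy for debugging. */
    bpf_printk("execve() called\n");
    return 0;
}

char _license[] SEC("license") = "GPL";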

One minor drawback in this area is that in order to compile an eBPF program, it may have to include data structures from the kernel source, which is not always available to everyone. To overcome this problem, the BPF Compiler Collection (BCC) was developed, which I explain in more detail in a later section (3.2).

Figure 2.3: The compiling steps of an eBPF program all the way to native assembly instructions 3

At the end of August 2019, Oracle developed patches that made GCC (the GNU Compiler Collection) support eBPF, which adds another possibility for compiling eBPF next to LLVM Clang. One concern is the interoperability of the eBPF programs that the two different compilers create. The details of the GCC patch, as well as some more features under development, can be read on the GCC mailing list [50] and in a Packt article [59].

3Source: https://www.netronome.com/blog/bpf-ebpf-xdp-and-bpfilter-what-are-these-things-and- what-do-they-mean-enterprise/

2.3.4 XDP - eXpress Data Path

The XDP project uses eBPF for packet processing problems, but not in the usual manner. It provides a programmable data path in the kernel without the need for any specialized hardware. It also takes advantage of the capabilities of eBPF, so packet processing can be done with very high performance. For now, XDP can only be used for the RX queue (incoming packets). A generic demonstration of how XDP communicates with the network devices and the control application is shown on figure 2.4. The figure shows how the eBPF program can be loaded into the XDP packet processor, where it processes the packets coming from the network device (interface). The XDP (eBPF) program can then decide whether the packet should be dropped, processed locally via GRO (Generic Receive Offload [11]) or forwarded to another device or host.

Figure 2.4: The usage of the eXpress Data Path in the kernel 4

The advantage of eBPF in terms of XDP

In this regard, eBPF has another major advantage over cBPF: eBPF can use XDP as an additional hook, which is not available in cBPF. The high performance is due to XDP running the filtering programs at the lowest level of the network stack, providing bare-metal packet processing. It means that the program can start processing (dropping, redirecting or reflecting) packets immediately, before they are assigned an skb (socket buffer) metadata structure. This can yield up to a 5x performance acceleration. The performance of the different groups of hooks in kernelspace and driver space provided by cBPF and eBPF is compared on figure 2.5, given in Mpps (maximum packets per second).

4Source: https://www.iovisor.org/technology/xdp

18 An XDP program can be loaded multiple ways:

• Generic - when the network driver lacks XDP support, but the kernel emulates it. This way, every packet still has to reach the kernel, so there is no performance gain.

• Native - when the driver itself supports XDP and the packets don't have to reach the kernel to be processed.

• Offloaded - when the XDP program is loaded and executed entirely on the NIC (Network Interface Card).

More information about the XDP project can be found on IOVisor’s webpage [60].
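To give an impression of what such a program looks like, the sketch below is a minimal XDP program in restricted C that drops IPv4 TCP packets and passes everything else. The explicit bounds checks against data_end are required by the verifier before any packet data may be read; the headers, the section name and the bpf_htons helper are the usual libbpf conventions and are assumptions of this sketch.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_tcp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)          /* bounds check demanded by the verifier */
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)           /* second bounds check */
        return XDP_PASS;

    if (ip->protocol == IPPROTO_TCP)
        return XDP_DROP;   /* dropped before any skb metadata is allocated */

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";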

Figure 2.5: The comparison of packet processing speeds using different kind of hooks, found in cBPF and eBPF (Mpps = Maximum Packets per Second) 5

2.3.5 An eBPF vulnerability

In December 2017, a major eBPF issue named CVE-2017-16995 [8] appeared, as someone found an exploit in its verifier. The verifier seems to be a single point of failure in eBPF regarding security. This vulnerability allowed unprivileged users to read and write arbitrary memory locations in the kernel. It was a serious situation, especially because eBPF was known for its strong security and safety features. Because of these kinds of security concerns, some Linux kernel patches (e.g. grsecurity [28]) limit eBPF's capabilities.

5Source: https://tthtlc.wordpress.com/2018/12/31/applications-of-ebpf/

Virtualized environments do not have access to the features of eBPF by default: the bpf() system call is not granted to containers, and as mentioned in previous sections, eBPF programs can only be loaded into the kernel with that call. Access control is implemented by the CAP_SYS_ADMIN capability [49], which has to be assigned to a given container for it to be able to use the system call and thereby eBPF. However, if the operating system is configured to allow unprivileged users to run eBPF programs, then the system call is available in the containers running on top of it as well. To avoid this possibility, Docker uses a seccomp profile that disables the bpf() call by default.

2.3.6 The overhead of eBPF

I have found an interesting benchmark related to the overhead of using eBPF. The benchmark uses the getpid() system call as a reference and examines three cases: one without any probes attached and two that increment BPF hash maps with different types of keys. The benchmark's results and the files used for testing can be found on Cloudflare's GitHub [12].

2.3.7 Further possibilities

Due to the fact that the core of eBPF is a virtual machine (which runs in the kernel), this virtual machine does not have to be used exclusively in kernelspace. This opens up the possibility of using eBPF outside the kernel, and an interesting implementation of this case is the Smart Network Interface Card.

Smart Network Interface Cards

Smart NICs are a special kind of network interface card that - unlike traditional NICs - are equipped with a great amount of processing power and memory. The reason for these unusual properties is that smart NICs are capable of processing network traffic to the extent of a full dataplane offload, and they can be programmed to implement arbitrary functionality as well.

To get a glimpse of the scale of these resources, a spectacular example is the Agilio SmartNIC family [56], which features 2 gigabytes of RAM (Random Access Memory) with up to 50 GbE Ethernet ports, coupled with ARM11 processors. Smart NICs are also called Intelligent Server Adapters.

These appealing properties of the Smart NICs make them an ideal platform to use eBPF and XDP with.

Nic Viljoen and Jakub Kicinski's extensive talk about offloading eBPF to hardware [64] is worth watching for more information about this topic. In this presentation, Nic claims that in some of his benchmarks he managed to reach three million packets per second on each FPC (Flexible PIC Concentrator, where PIC stands for Physical Interface Card) on a Netronome SmartNIC.

Reverse BPF

Only a few sources mention the concept of reverse BPF. Reverse BPF is supposed to let the in-kernel Network Interface Controller drivers expose BPF back to userspace as a generic program. Its main purpose would be creating hardware-specific data structures.

2.4 Monitoring toolset

There is a widely used toolchain of open-source programs deployed to monitor the performance of many virtualized and physical systems. In this section, I present three important components, actively maintained by the open-source community.

2.4.1 Prometheus

Prometheus is a tool that comes in really handy when someone uses eBPF for tracing and monitoring. It is an open-source, cloud-native solution serving many monitoring purposes, including collecting various metrics of a system and alerting. Prometheus uses in-memory and local disk storage, and also makes the storage scalable. It operates with a multi-dimensional data model and provides a fitting query language for accessing it, called PromQL (Prometheus Query Language). More information about Prometheus is available on its website [63].

2.4.2 Grafana

For displaying and visualizing the collected data, Prometheus provides multiple options to choose from. It comes with Grafana integration; Grafana is an analytics platform capable of producing graphs, heatmaps, histograms and many more forms of visualization for a better understanding of the given dataset. Grafana is capable of much more, which you can discover on the Grafana Labs webpage [24].

2.4.3 AlertManager

The Alertmanager is tightly connected to the monitoring stack that Prometheus is also part of. Alertmanager is a separate server that handles incoming alerts from client applications like Prometheus, and groups and deduplicates those alerts. It also supports notifying different receivers about the alerts, such as email, PagerDuty, custom webhook implementations and more.

2.5 Projects leveraging eBPF technology

As for projects, more and more of the major ones are starting to use the various possibilities of eBPF technology.

• Cilium, an open-source, secure networking plugin (CNI) for Kubernetes, is fully based on eBPF. Open vSwitch - a production-quality, multi-layer virtual switch implementation - is also working on an eBPF-powered datapath [69].

• Cloudflare takes advantage of eBPF’s security-related capabilities. They have been working on an automatic DDoS mitigation system, using XDP [5].

• Facebook - An application on the scale of Facebook must have a very efficient traffic-optimizing infrastructure and toolset to flawlessly serve the billions of people accessing its services. To satisfy the extremely high number of requests, they created a globally distributed network of servers that initially used VIPs (Virtual IPs) along with an L4LB (Layer 4 Load Balancer). These servers are connected to the datacenters and act as proxies for them. In earlier days, the engineers chose hardware-based solutions (ASICs, FPGAs) for load balancing, as they can take some of the load off the main CPUs, but this led to limited flexibility. So Facebook needed a software-based load balancer as their L4LB. Its first version was the IPVS (IP Virtual Server) [46] kernel module, which also served as a DDoS-protection [44] solution, but it could not operate efficiently beside the other backend services that the servers provided. Eventually, for similar purposes, their engineers developed a tool named Katran [13], which is a software-based, scalable, high-performance load balancer. Recently, they have made it available to the public by open-sourcing it. The specialty of Katran lies in its completely re-engineered forwarding plane, compared to the previous iteration of their L4LB solution. Katran is based on XDP and eBPF, and it can safely run as their load balancer next to the backend applications, co-located on the same machine. Using these kernel technologies, their load balancing became more reliable and easily scalable, running with higher efficiency, and by changing to eBPF and XDP they achieved around ten times better performance compared to IPVS. To find out more about the changes in Facebook's architecture, visit the official blogpost [57].

• Netflix mainly uses eBPF's performance profiling and tracing abilities. Besides, Netflix has been developing an open-source, on-host performance monitoring framework for a while now, called Vector. This solution uses the essentials of Performance Co-Pilot (PCP) - which has been around a bit longer - as a performance monitoring tool, but extends it with a transparent user experience, provided by a user-friendly UI. Lately, they have extended Vector with some eBPF visualization options, like block and filesystem latency heat maps or block I/O top processes. More information about this improvement of Vector can be found on the Netflix Tech Blog [36].

2.5.1 Tools and libraries using eBPF

There are plenty of open-source projects that use, or provide tooling for, eBPF. They can be sorted into three main groups by use, as follows.

Networking

• Cilium - Cilium is a network-security and load-balancing solution for containerized systems. More details about it and how it operates with Kubernetes are given in chapter 4 (4.1.1).

• Suricata - Suricata is a fast, robust, open-source engine focused on the security side of BPF. It provides a network IDS (Intrusion Detection System), IPS (Intrusion Prevention System) and NSM (Network Security Monitoring) engine. Suricata uses eBPF for processing and filtering packets, as well as a BPF-based load balancer. It also uses XDP in order to be able to work with high packet rates. The repository is available on the GitHub page of OISF, the Open Information Security Foundation [15].

• systemd - Systemd is a collection of some of the most important building blocks of Linux systems. At system startup, the systemd system manager's process boots first with PID 1 (PID stands for "process ID"), and this process is responsible for starting every other necessary process and daemon that the rest of the system consists of. Among many other essential tasks, systemd has a per-service IPv4/v6 traffic-accounting capability and makes it possible to implement eBPF-based network access control ("ACL") solutions as well. More about these eBPF-based capabilities can be found in the respective blogpost [1].

• iproute2 - Iproute2 contains networking tools and utilities that run in userspace. It is maintained alongside the Linux kernel, and this suite is responsible for loading eBPF programs into the kernel from the ELF files that the LLVM compiler creates. Iproute2 supports both XDP BPF and tc (traffic control) BPF programs. Iproute2 is available at the official kernel repository [40].

• p4c-xdp - This project is a part of VMware, and it implements a compiler backend for P4, a domain-specific language that makes it possible to describe the packet-processing steps of programmable network elements, like switches and network interface cards. P4c-xdp can translate P4 programs into eBPF C programs, which after compilation can run in the kernel in the XDP layer. Another great feature of P4 is getting rid of platform-specific necessities while writing programs for protocol management. The workflow can be seen on figure 2.6, which shows how p4c-xdp uses the BPF system call along with the match-and-action tables. The p4c-xdp repository is available on VMware's GitHub [68].

6Source: https://slideplayer.com/slide/16587554/

Figure 2.6: The workflow of p4c-xdp on programmable network hardware, using eBPF 6

Tracing

• BCC - The name stands for BPF Compiler Collection. For a detailed explanation of its functionality and how it works, see the respective section (3.2) in chapter 3.

• bpftrace - A higher-level, easier-to-handle alternative to BCC, which you can also read about in a later section (3.3.1) in chapter 3.

• perf - This profiler tool was created by the Linux kernel community and is part of the Linux kernel. It has several alternative names, like perf_events, PCL (Performance Counters for Linux) or LPE (Linux Perf Events). Perf gives the ability to retrieve and post-process the data from a loaded eBPF program using the perf record subcommand. It functions in an event-oriented way and helps find solutions in advanced situations. Unfortunately, not all Linux tracing features are available via perf. There is another tool for getting hold of the rest of these features, called ftrace [38]. To easily use the one that the current situation requires, Brendan Gregg created a collection called perf-tools, including both perf and ftrace, and made it available on his GitHub [25].

• ply - Ply is a dynamic, lightweight tracing tool that is explained in a later section (3.3.3) in chapter 3.

• SystemTap - SystemTap is not just a performance analysis tool, but also a scripting language. Some of its capabilities are extracting, filtering and summarizing the retrieved data in case of performance problems. Its BPF backend is called stapbpf, which translates the given script directly into eBPF code and injects it into the kernel afterwards. It can perform all of this without the need for any additional compiler.

• PCP - PCP stands for Performance Co-Pilot, which serves as an extensible, cross-platform performance analysis framework and toolkit, providing system-level performance monitoring and management. It uses different kinds of agents for data collection. PCP also uses pmdabcc, which is a performance metrics domain agent based on BCC.

• Weave Scope - Weave Scope was created for cloud monitoring: collecting data about different processes, networking properties and connections (e.g. TCP events) with the help of eBPF and kprobes. It relies on the gobpf library. More features of the tool are explained later in section 4.1.2 in chapter 4.

Other purposes

• LLVM - LLVM is a collection of compiler technologies. Its most important subproject as far as BPF is concerned is Clang, an LLVM-native C/C++/Objective-C compiler delivering remarkably fast compile times - it performs about three times better than GCC [47]. It provides the backend for BPF, which makes it possible to compile BPF programs written in C into an ELF file, which contains the compiled BPF instructions.

• libbpf - This library is part of the Linux kernel source tree. Its most relevant capability is loading the ELF files generated by LLVM into the kernel. Libbpf is also used by other kernel projects.

• bpftool - It functions as an eBPF debugging tool, developed by the Linux kernel community. Bpftool can dump the active eBPF programs' instructions and also dump and modify the active eBPF maps. It provides support for interaction with the BPF filesystem.

• gobpf - Gobpf extends the BCC framework with Go bindings and low-level routines, helping to load eBPF programs from ELF files.

• ebpf_asm - Ebpf_asm implements an assembler for BPF programs. More about this kind of assembler syntax can be read in the Compiling cBPF subsection (2.2.1).

3. The main approaches of programming eBPF

3.1 Raw BPF

Raw BPF programs are not easy to write or read. When programming this way, the bytecode of the BPF program has to be included in the C code. Here it is the programmer's responsibility to manage registers and their values manually, using built-in kernel macros that encode the individual BPF instructions. The code of an example raw BPF program can be found in the Linux kernel's (v5.0) source [7], and a part of it can be seen on listing 3.1. Reading the collected data from the maps and buffers is also cumbersome this way.

Listing 3.1: The array containing the bytecode of a BPF program embedded in C code in the form of kernel macros.

struct bpf_insn prog[] = {
    BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
    BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
    BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
    BPF_LD_MAP_FD(BPF_REG_1, map_fd),
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
    BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
    BPF_RAW_INSN(BPF_STX | BPF_XADD | BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), /* r0 += r1 */
    BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
    BPF_EXIT_INSN(),
};

3.2 BCC - BPF Compiler Collection

The BPF Compiler Collection (BCC) [31] provides a set of tools and utilities for writing, compiling and loading mainly kernel-tracing programs with the help of extended BPF, for performance analysis and traffic control. It makes the programming less difficult by letting the kernel-side part be written in C and the front-end parts in Python and Lua. One of its convenient advantages is handling data structures like eBPF maps in the usual ways, e.g. as Python dictionaries. Besides that, BCC tries to give articulate feedback when an eBPF program fails, explaining the reasons of the failure.
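As a taste of what the kernel-side C part of a BCC program looks like, the hypothetical snippet below counts execve() calls per process. BPF_HASH() and lookup_or_init() are BCC-provided constructs rather than raw kernel APIs, and the Python front-end (not shown) would attach the function to a kprobe or tracepoint and read the "counts" map as a dictionary.

// Kernel-side portion of a hypothetical BCC script (embedded as a string
// in the Python front-end). Counts execve() calls per process.
BPF_HASH(counts, u32, u64);

int count_execve(struct pt_regs *ctx)
{
    u32 pid  = bpf_get_current_pid_tgid() >> 32;   // upper 32 bits hold the TGID
    u64 zero = 0, *value;

    value = counts.lookup_or_init(&pid, &zero);    // get or create the counter
    if (value)
        (*value)++;
    return 0;
}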

26 3.2.1 Probe-types

There are eight different types of probes defined in the Linux kernel that BCC can make use of, which can also be seen in a structured way on figure 3.1:

1. Dynamic tracing

(a) Kernel-level
• kprobes - Kernel probes can be inserted at any instruction in the kernel in a dynamic way, and execute user-defined eBPF programs when they are triggered. A kprobe creates a copy of the attached instruction, then changes its first bytes to a breakpoint. When the processor reaches the breakpoint, it triggers a trap (similar to an exception). Consequently, the processor's registers get saved and control gets passed to the kprobe, which starts executing the eBPF program.
• kretprobes - Similarly to kprobes, kernel return probes are triggered when the designated function that the kretprobe has been attached to returns.
(b) User-level
• uprobes - Uprobes handle the beginning of a user-space function's execution.
• uretprobes - Uretprobes trigger when a user-space function returns.

2. Static tracing

(a) Kernel-level
• tracepoints - Attach scripts to statically defined tracepoints.
• raw tracepoints - These kinds of tracepoints extend the basic static tracing functionality.
• system call tracepoints - Attach scripts to system calls in a static way.
(b) User-level
• USDT probes - USDT is short for User Statically Defined Tracing. This kind of probe can be placed in applications or libraries to give tracepoint-like functionality in userspace.

The difference between static and dynamic probes is that the static ones are guaranteed to be stable on every kernel version, while the dynamic ones are not.

1Source: https://github.com/iovisor/bcc

Figure 3.1: The tracing tools provided by BCC 1

3.3 High level tracing languages

Since programming BPF is relatively cumbersome, even using BCC, some developers created alternative solutions - different kinds of front-ends - that are much easier to use. With these kinds of high-level languages, one can access most of BPF's features and almost all of its functionality in the form of one-liners and short scripts. This section presents some of these solutions.

3.3.1 BPFtrace

BPFtrace [33] was created by IOVisor - the developers behind the BPF Compiler Collection. It is a high-level tracing language for BPF, written in C++. Its backend is provided by the LLVM [48] compiler, which converts the scripts to BPF bytecode. It also uses libraries from the previously mentioned BPF Compiler Collection, as well as existing Linux tracing capabilities, like kprobes, uprobes and tracepoints. The probe categories are also similar to BCC's, as seen on figure 3.2.

2Source: https://github.com/iovisor/bpftrace

28 Figure 3.2: The tracing tools provided by BPFtrace 2

The workflow of BPFtrace

Getting access to the functionality of BPF in the easiest possible way has its cost. Since BPFtrace was designed for this purpose, the compilation process is a bit more complicated and needs more steps than in, e.g., BCC.

When compiling, BPFtrace first uses the Lex & Yacc ("A Lexical Analyzer Generator" & "Yet Another Compiler-Compiler") parser [6], which examines the source code of the BPFtrace script and determines its structure. The Lex part splits the code into so-called tokens, which the Yacc part then analyses in terms of hierarchical structure. The output of this first parsing step is an AST ("Abstract Syntax Tree") [2], which is then converted into LLVM IR ("Intermediate Representation") actions, explained in more detail in LLVM's official documentation [10]. As the final step, the LLVM IR actions are transformed into BPF code.

The internal mechanism of BPFtrace and how it interacts with the kernel is shown on figure 3.3. On this figure, the compilation flow, the parsing and analyzing parts and the use of BCC can also be seen.

29 Figure 3.3: The internal mechanism of how BPFtrace uses BCC 2

3.3.2 DTrace

DTrace is one of the early high-level tracing frameworks, providing real-time troubleshooting and analysis of applications running on production systems. DTrace was developed by Sun Microsystems, originally for a Unix operating system called Solaris. It was released in January 2005 as part of FreeBSD, and was ported to Linux in 2008. It has not lost its popularity - in March 2019, Microsoft released their own build of DTrace for Windows 10 [53].

DTrace is capable of tracing resources in real time, like memory and processing performance along with networking and file system usage, and has the ability to trace not only applications but the operating system as well. Writing tracing programs for DTrace is possible using the D programming language, which resembles awk.

Just like in bpftrace, the user can attach probes with specified actions to be performed. A triggered probe can access the call stack and other parts of the memory used in the given context, dump them into a database and even modify the variables. To paint a better picture of the situation, the probes can cooperate while analyzing; they do so by passing information among each other via the modified context variables.

Three example DTrace commands for different use cases are presented in listing 3.2, as explained in its embedded comments.

Listing 3.2: Example commands of DTrace written in D.

# Files opened by process
dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'

# Syscall count by program
dtrace -n 'syscall:::entry { @num[execname] = count(); }'

# Disk size by process
dtrace -n 'io:::start { printf("%d %s %d",pid,execname,args[0]->b_bcount); }'

3.3.3 Ply

Ply is also a high-level tracing solution, but a lightweight one. It uses the BPF virtual machine along with the capabilities of kprobes and tracepoints. Ply is written in C, and it compiles ply scripts into BPF programs, following the "little language" approach. Unlike most tracing solutions that use the LLVM-based BCC toolchain, ply does not require any external dependencies at runtime other than libc (the number of its build-time dependencies is also very limited). As opposed to BPFtrace, ply emits instructions directly, without using the LLVM IR API.

An example ply script is shown on listing 3.3. This script uses a kprobe to gather information about a process every time it uses the open system call to access a file. The information consists of the name and the PID of the calling process, and the name of the file to be accessed.

Listing 3.3: An example tracing ply-script called Opensnoop.

#!/usr/bin/env ply
kprobe:SyS_open
{
	printf("%16s(%5d): %s\n", comm(), pid(), mem(arg(0), "128s"));
}

4. eBPF in practice

This chapter focuses on presenting how eBPF is, and can be, used in real-world scenarios. In the following, I show how eBPF comes in handy inside a Kubernetes-managed cluster, and some other use cases to illustrate how eBPF works and what it can do in practice. There are also numerous open-source tools developed for Kubernetes which make use of eBPF. I selected those that are mature enough and related to my work, and discuss them in detail in the following subsections.

4.1 Using eBPF in Kubernetes

If someone deploys a Kubernetes cluster and makes it available for public use, it can quickly become unsafe and unmanageable for the administrator. The users may start unwanted processes, which can eventually lead to very high use of the cluster's resources that seems untraceable at first sight.

There are a few methods for managing these kinds of situations to choose from, as enumerated below:

Checking the entry points manually

The administrator can investigate the pod definitions of the running containers to find their entry points, but that only shows the first program that the given container started. That entry point could itself start even millions of unwanted processes.

Kubectl trace

Kubectl trace is a plugin for Kubernetes that makes it possible to access the benefits of eBPF tracing in the cluster through the scheduled use of arbitrary BPFtrace programs. It was created by IOVisor, just like the previously mentioned eBPF-related tools. This way, BPFtrace can be run against a Kubernetes pod or node. When tracing is done at the pod level, kubectl trace helps resolve the context of the container's pod, providing a variable that contains the PID of the root process in that container. Figure 4.1 illustrates the flow of an executed kubectl trace script and how it interconnects with a Kubernetes cluster. An example kubectl trace command for listing all the processes running in a given container is shown in listing 4.1.

The only drawback of this is that running the scripts manually only gives snapshot-like results about the current status of the system, and doesn't monitor it all the time. Kubectl trace's repository can be found at IOVisor's GitHub page [34].

Listing 4.1: A kubectl trace one-liner script that lists all the running processes inside a container.

kubectl trace run container -e \
    "tracepoint:syscalls:sys_enter_execve { @[comm] = count() }"

Figure 4.1: The functioning of Kubectl trace on a Kubernetes cluster 1

BPF scripts in sidecar containers

Maybe the most practical way to overcome these drawbacks at the moment is to use BCC tools inside the pods, deployed in so-called sidecar containers. This way, real-time monitoring and logging become available. In Kubernetes, sidecar containers are used when multiple containers need to run beside each other in the same pod - in other words, when the containers inside a pod are tightly coupled, which means that the containers share the same network namespace, IP address and ports. To be able to observe the processes of a container from the sidecar container, they have to share their process namespace as well. From version 1.13, Kubernetes provides a configuration flag called shareProcessNamespace, which makes every process of a container visible to every other container in the shared pod. More information about using this flag is available on Kubernetes' webpage [42].

More information about using eBPF in Kubernetes is available at the respective Kubernetes blogpost [43].

1Source: https://github.com/iovisor/kubectl-trace

4.1.1 Cilium

Cilium is an open source project, integrated in Kubernetes. It is focused on networking, using BPF as one of its main components.

Its main capabilities are:

• transparently securing networks

• traffic routing and filtering

• load-balancing between application workloads

It operates at OSI Layer 3 and Layer 4 to offer traditional networking services, and also in Layer 7 to support the protection of modern applications.

The generic logic of Cilium can be seen on figure 4.2. It can be seen that a BPF program is mapped to each entity using networking services (e.g., NICs and containers), then the Cilium daemon offers an overlay to manage their networking activities by implementing a monitoring task, storing the required policies, providing a command line interface and assuring other orchestration actions through plugins. More about Cilium and how it works can be found at Cilium’s GitHub repository [17].

Figure 4.2: Cilium network controlling with eBPF 2

2Source: https://github.com/cilium/cilium

4.1.2 Weave Scope

Weave Scope is a troubleshooting and monitoring solution for microservice-based applications running in Docker and Kubernetes systems, which also makes use of eBPF. It can be used without any configuration or integration whatsoever - it automatically detects the processes, the containers and the hosts of the system, and provides access to different metrics and metadata.

Weave Scope provides a real-time view of the containers and services, which helps finding the possible issues in them. A screenshot of Weave Scope in use is shown on figure 4.3. Note that Weave Scope is Kubernetes networking module agnostic, meaning that it can operate in Kubernetes clusters with CNI plugins other than WeaveNet, as well. The repository of Weave Scope is available at GitHub [70].

Figure 4.3: The interactive dashboard of Weave Scope, showing containers as circles, communicating with each other 3

3Source: https://thenewstack.io/how-to-detect-map-and-monitor-docker-containers-with-weave-scope- from-weaveworks/

4.2 Tracing and monitoring with BCC and BPFtrace

In the following section, I showcase some of the capabilities of eBPF using some of the example scripts included in the official BCC and BPFtrace libraries.

These example scripts can be found in BCC’s GitHub repository [31], covering the major use-cases of eBPF, and helping to understand the syntax, and how these tools work in general.

4.2.1 BCC

1. Execsnoop The execsnoop script is a tracing script that keeps track of every new process that is started via the exec() syscall. It does not trace processes that are created via the fork() syscall, though. Running execsnoop while executing the "man" command in the background gives the output seen on listing 4.2. As the listing shows, the script displays not only the process ID (PID) but also the parent process ID (PPID) - the PID of the process that started the given process - along with the absolute path of the process' binary.

2. Filelife This script traces the creation and deletion of files, their ages and names and also who created or deleted them, which is especially helpful when someone wants to investigate the short-lived files on the system (see listing 4.3.). It is also a great tool for debugging. The code of the filelife script is shown in Appendix A.1.

Listing 4.2: The output of the execsnoop BCC-script

csongor@csongor-Latitude-5490:/usr/share/bcc/tools$ sudo ./execsnoop
PCOMM       PID    PPID   RET ARGS
bash        17029  9342   0   /bin/bash
lesspipe    17031  17030  0   /usr/bin/lesspipe
basename    17032  17031  0   /usr/bin/basename /usr/bin/lesspipe
dirname     17034  17033  0   /usr/bin/dirname /usr/bin/lesspipe
dircolors   17036  17035  0   /usr/bin/dircolors -b
man         17037  17029  0   /usr/bin/man

Listing 4.3: The output of the filelife BCC-script

csongor@csongor-Latitude-5490:/usr/share/bcc/tools$ sudo ./filelife
TIME     PID   COMM            AGE(s)  FILE
18:10:13 1878  Chrome_IOThread 0.00    .com.google.Chrome.cIQ2Zv
18:10:13 1878  Chrome_IOThread 0.00    .com.google.Chrome.MDz595
18:10:13 1899  CompositorTileW 0.00    .com.google.Chrome.j3SbkG
18:10:13 1899  CompositorTileW 0.00    .com.google.Chrome.YeJiug
18:10:13 1878  Chrome_IOThread 0.00    .com.google.Chrome.3q7xEQ
18:10:13 1878  Chrome_IOThread 0.00    .com.google.Chrome.HtZYOq

4.2.2 BPFtrace

Execsnoop - While the code of the execsnoop script in BCC expands to about 220 lines, the same functionality is achieved with BPFtrace using only 10 - as seen on listing 4.4.

BPFtrace helps create BPF programs that are human-readable and significantly shorter and more understandable than the ones written using BCC. At the same time, using BCC requires more understanding of BPF, the kernel functions and the processing of their data, and it does not hide the low-level operations, so every step of the program can be examined and defined, unlike with BPFtrace. As I have chosen BCC for the implementation part of this thesis for the reasons above, I do not go into much deeper detail about the usage and syntax of BPFtrace, but it is well worth getting to know.

There are plenty of example scripts in its repository [33], most of them reimplementing the ones that reside in BCC's repository.

Listing 4.4: The code of the execsnoop BPFtrace-script

#!/usr/bin/env bpftrace

BEGIN
{
	printf("%-10s %-5s %s\n", "TIME(ms)", "PID", "ARGS");
}

tracepoint:syscalls:sys_enter_execve
{
	printf("%-10u %-5d ", elapsed / 1000000, pid);
	join(args->argv);
}

4.3 BCC in practice

A BCC script is basically a python script. It uses the python language as a wrapper for the C-like BPF code. The first, and most important thing of writing a BPF program with BCC is to import the BPF library, provided by BCC. Since it uses python, any other python libraries and methods can be used as well.

In BCC scripts, the Python part of the code needs to have access to the BPF program (which is written in C). There are two ways to achieve this. The code of the BPF program may be embedded inside the Python script, usually stored in a variable as a string. In this case, after importing the BPF library from BCC, the string containing the BPF code can be given to a BPF object as a parameter of its constructor, preceded by the "text=" keyword. The BPF code does not necessarily have to be stored in a variable; it can be written inside the constructor's argument as a one-liner, or broken into multiple lines. The BPF object's constructor can also accept BPF code referenced from a separate file. Listing 4.5 showcases a basic script using a kprobe (explained later) attached to the sys_clone() kernel function, printing "Hello, World!" every time it triggers.

Listing 4.5: The code of a basic BCC-script

from bcc import BPF

BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello, World!\\n"); return 0; }').trace_print()
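As a small illustration of the second option mentioned above, the same program could also be loaded from a separate C file; this is only a sketch, and the file name hello.c is a hypothetical example:

from bcc import BPF

# hello.c would contain the same kprobe__sys_clone() function as in listing 4.5
b = BPF(src_file="hello.c")
b.trace_print()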

4.3.1 The BPF section

Inside this section lies the BPF code, written in a programming language, that shows strong resemblance to C. The developer can gain access to the kernel-level data with this part of the program. At the beginning part of the BPF-program, the BPF maps and macros can be defined.

BPF maps The BPF maps are an essential part of using BCC and BPF. These kinds of special maps do not behave like ordinary maps in other programming languages. Most of the time they provide the best way to acquire somewhat structured data from the kernel, by acting as helper objects for creating histograms, hashes or tables.

Their most "popular" forms are BPF_HASH, BPF_TABLE or BPF_HISTOGRAM, that serve slightly different purposes.
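As a minimal sketch of how such a map can be used (the probed kernel function, vfs_read(), and all names are my own choices for illustration), the following BCC script counts vfs_read() calls per process ID in a BPF_HASH and then reads the map back from Python:

from time import sleep
from bcc import BPF

prog = """
BPF_HASH(counts, u32, u64);             // key: PID, value: number of calls

int kprobe__vfs_read(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);              // creates the entry if it does not exist yet
    return 0;
}
"""

b = BPF(text=prog)
sleep(5)                                # let the kernel side collect data for a while
for pid, value in b["counts"].items():
    print(pid.value, value.value)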

Function definitions

Every function defined in the C part is going to be executed on a probe, and most of them must also have a struct pt_regs *ctx argument. It is a pointer to a struct that contains the registers and the current BPF context. It can also be substituted with the args struct, which is detailed in the tracepoints subsection (4.3.1). If the defined function does not use the ctx argument at all, the parameter can be cast to void (to avoid compiler warnings). If having the ctx parameter is not the expected behaviour - for example in the case of a helper function - the ctx argument can be left out completely, but then the function must be defined as static inline.

There are two ways to define BPF-functions, which are detailed below.

Functions with prefixes

The first one is using the built-in prefixes and special functions for static and dynamic tracing. This is the simpler way of defining custom behaviour, as it also attaches the function to a tracepoint, a kernel function or a user-defined function at the same time, without many further steps needed.

Kprobes, kretprobes These allow the dynamic tracing of the chosen kernel function calls; however, static tracing with tracepoints is highly recommended instead, because tracepoints have a stable API, unlike kprobe (or kretprobe) based dynamic tracing.

To define a kprobe or a kretprobe with the prefixed function, the following syntax has to be used:

kprobe__<kernel function name>(struct pt_regs *ctx, ...)
kretprobe__<kernel function name>(struct pt_regs *ctx, ...)

Following ctx, more arguments can also be declared if needed, which will refer to the arguments of the given kernel function.
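A short sketch of this, using vfs_write() purely for illustration, could look like the following; the parameters declared after ctx mirror the kernel function's own parameters:

from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>
#include <linux/fs.h>

// The parameters after ctx map to vfs_write()'s own arguments:
// ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
int kprobe__vfs_write(struct pt_regs *ctx, struct file *file,
                      const char __user *buf, size_t count) {
    bpf_trace_printk("vfs_write of %lu bytes\\n", count);
    return 0;
}
"""

BPF(text=prog).trace_print()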

Tracepoints The tracepoint function (which in fact is a macro) looks a bit different:

TRACEPOINT_PROBE(category, event)

In this case, the category and event arguments are needed to define how the tracepoint should operate.

When using it, the arguments of the tracing events can be accessed through the special args struct via dereferencing. This struct gets loaded with the appropriate data automatically; there is no need to initialize or define it. To find out what categories of events can be traced and how to reach the arguments of the traced event, every event has a respective directory containing a format file that describes it. The categories can be found under /sys/kernel/debug/tracing/events. One example of what such a format file contains can be seen on listing 4.6. This output belongs to the tcp_send_reset event and shows what arguments are available to use from BPF.

Listing 4.6: The arguments of the tcp_send_reset kernel event

name: tcp_send_reset
ID: 1334
format:
	field:unsigned short common_type;          offset:0;  size:2;  signed:0;
	field:unsigned char common_flags;          offset:2;  size:1;  signed:0;
	field:unsigned char common_preempt_count;  offset:3;  size:1;  signed:0;
	field:int common_pid;                      offset:4;  size:4;  signed:1;

	field:const void * skbaddr;                offset:8;  size:8;  signed:0;
	field:const void * skaddr;                 offset:16; size:8;  signed:0;
	field:int state;                           offset:24; size:4;  signed:1;
	field:__u16 sport;                         offset:28; size:2;  signed:0;
	field:__u16 dport;                         offset:30; size:2;  signed:0;
	field:__u8 saddr[4];                       offset:32; size:4;  signed:0;
	field:__u8 daddr[4];                       offset:36; size:4;  signed:0;
	field:__u8 saddr_v6[16];                   offset:40; size:16; signed:0;
	field:__u8 daddr_v6[16];                   offset:56; size:16; signed:0;

With this information, printing the TCP source port every time the tcp_send_reset event occurs can be achieved the following way:

bpf_trace_printk("%d", args->sport);
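Wrapped into a complete BCC script, the same tracepoint could be used as sketched below; the Python wrapper and the exact message format are my own additions for illustration:

from bcc import BPF

prog = """
TRACEPOINT_PROBE(tcp, tcp_send_reset) {
    // args is filled automatically based on the tracepoint's format file
    bpf_trace_printk("RST sent, sport=%d dport=%d\\n", args->sport, args->dport);
    return 0;
}
"""

BPF(text=prog).trace_print()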

The available tracepoint events can also be listed with the perf list command, which belongs to the linux-tools-common package.

Raw tracepoints The function (macro) for setting up a raw tracepoint to a kernel event is the following:

RAW_TRACEPOINT_PROBE(event)

The args field is automatically filled up with the chosen raw tracepoint parameters, as in the case of tracepoint definitions. The available raw tracepoints and their arguments can be found at the Linux kernel’s source at /include/trace/events.

System call tracepoints Using a system call tracepoint can be performed the following way:

syscall__<syscall name>(struct pt_regs *ctx, ...)

This method actually creates a kprobe for the system call given after the syscall__ prefix. Following ctx, more arguments can also be declared if needed, which refer to the arguments of the given system call. In order to work, this function must be accompanied by a call to the attach_kprobe method on the BPF object in the Python part of the code; however, the event= parameter cannot simply be the plain name of the system call. The name of the system call has to be passed to the get_syscall_fnname() method of the BPF object first, which returns the related kernel function name, and then the kprobe can be created. (One example for the use of system call tracepoints would be tracing and logging when a new directory is created in a file system, along with the user ID that performed the action; for this, the mkdir system call - exposed as the syscalls:sys_enter_mkdir tracepoint - can be traced.)
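A rough sketch of the mkdir example mentioned above could look like this; the function name syscall__mkdir and the printed message are my own choices:

from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

int syscall__mkdir(struct pt_regs *ctx) {
    u32 uid = bpf_get_current_uid_gid();      // lower 32 bits hold the UID
    bpf_trace_printk("mkdir() called by uid %d\\n", uid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("mkdir"), fn_name="syscall__mkdir")
b.trace_print()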

Defining the BPF functions with prefixes makes attaching them to the kernel functions automated, so no more coding needed at the python part in this regard.

4.3.2 The Python section

The Python section of a BCC script is where the processing of the collected data is performed. Here it can be defined what part of the data collected from the BPF program should be displayed, and in what form.

Normal functions

The other alternative of defining functions with prefixes in the BPF section is creating common C functions instead, and then attaching them in the python part afterwards to the given kernel functions. Basically both of the prefixed and the normal methods achieve the same goal using different ways. They make the attached BPF functions run every time the kernel function is called, which they have been attached to.

After the BPF object has been initialized with the normal C functions, the

attach_kprobe(event=<kernel function name>, fn_name=<name of the BPF function>)

method may be called on it (attach_kretprobe works the same way and may take additional arguments). The first argument is the name of the kernel function to which the BPF program should be attached (written after the event= expression). The second one is the name of the custom function created before in the C section of the BPF program (written after the fn_name= expression).
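The following sketch illustrates this second approach; tcp_v4_connect() is chosen only as an example of a kernel function to attach to:

from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

// A "normal" C function: no special prefix, so it is not attached automatically.
int trace_connect(struct pt_regs *ctx) {
    bpf_trace_printk("tcp_v4_connect() called\\n");
    return 0;
}
"""

b = BPF(text=prog)
# The attachment is done explicitly from the Python part instead:
b.attach_kprobe(event="tcp_v4_connect", fn_name="trace_connect")
b.trace_print()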

Uprobes, uretprobes Uprobes and uretprobes work similarly to kprobes and kretprobes, but they cannot be attached automatically with prefixed function names. The method for attaching them uses the following format:

attach_uprobe(name=<library or binary>, sym=<function name>, fn_name=<name of the BPF function>)
attach_uretprobe(name=<library or binary>, sym=<function name>, fn_name=<name of the BPF function>)

There are some additional macros to use in these cases, like PT_REGS_PARM1(ctx) ... PT_REGS_PARM6(ctx), which contain the arguments of the examined function (uprobes), or PT_REGS_RC(ctx), which contains the function's return value (uretprobes).
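As a short, illustrative sketch (the choice of libc's strlen() is mine, and the output is very verbose because strlen() is called constantly), a uretprobe can be attached and the return value read with PT_REGS_RC():

from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>

int print_strlen(struct pt_regs *ctx) {
    // PT_REGS_RC(ctx) holds the return value of the probed user-space function
    bpf_trace_printk("strlen() returned %d\\n", PT_REGS_RC(ctx));
    return 0;
}
"""

b = BPF(text=prog)
# "name" is the library (resolved to libc here), "sym" is the user-space symbol:
b.attach_uretprobe(name="c", sym="strlen", fn_name="print_strlen")
b.trace_print()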

USDT probes As the user-level equivalents of tracepoints, User Statically Defined Trac- ing probes can be used for tracing libraries or applications running in user space. At the function definition in the C section, the arguments of the examined application’s function are available via bpf_usdt_readarg().

A USDT Python object has to be instantiated for instrumentation with at least one parameter: the process ID of the application; the probe will attach to this process. The next step is calling the enable_probe() method on the object with the probe and function name parameters. The function name refers to the custom BPF function defined in the C section, and the probe refers to the application's probe. (The binary has to provide USDT probes for this to work.) After setting up the object, it has to be passed to the BPF object's constructor in the usdt_contexts parameter.
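A skeleton of this flow might look as follows; the probe name "my_probe" is purely hypothetical and has to be replaced with a probe that the target binary actually provides:

import sys
from bcc import BPF, USDT

prog = """
#include <uapi/linux/ptrace.h>

int trace_probe(struct pt_regs *ctx) {
    u64 arg1 = 0;
    bpf_usdt_readarg(1, ctx, &arg1);    // read the probe's first argument
    bpf_trace_printk("usdt arg1 = %llu\\n", arg1);
    return 0;
}
"""

pid = int(sys.argv[1])                   # PID of the instrumented application
u = USDT(pid=pid)
u.enable_probe(probe="my_probe", fn_name="trace_probe")

b = BPF(text=prog, usdt_contexts=[u])
b.trace_print()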

4.3.3 Getting data from the kernel

There are two ways to write and read the content of the BPF maps.

Writing data bpf_trace_printk() This function is the less stable way of printing. It should only be used outside production systems, mainly for debugging purposes. The bpf_trace_printk() function has to be called inside the C part of the code. It writes the content given in its parameter to a file called trace_pipe, which is located under /sys/kernel/debug/tracing/. This file can be accessed from user space. There is one problem with this approach though.

The trace_pipe is globally shared, but does not support concurrency, so if more than one source (BPF program) tries to write it at the same time, only one will be able to do it, which means data loss. It also has some more limitations, including it can only handle one string argument.

BPF_PERF_OUTPUT() This BCC function is responsible for creating a BPF table, through which arbitrary event-data can be sent to the user space. This functionality is implemented using a ring buffer. This function also has to be called from within the C- part of the code. It takes one argument only, which defines the name of the BPF table to be created. After creating a table with a specified name, it can be used inside the BPF functions for pushing data into it.

To do that, a method called perf_submit() has to be called on the table that was created before. This method takes the BPF context, the data to be written (usually a struct) and the size of the data as arguments; it pushes the data into the buffer and returns 0 if it succeeded.

Reading data trace_fields() To read the contents of the trace_pipe, the trace_fields() BCC function should be used. In the Python part of the code (after defining the BPF program in C), it should be placed inside a loop, so it can periodically poll the trace_pipe file. Every time this function is called, it reads one line from the pipe and returns its contents as fields, so it can easily be processed in Python. The returned fields are task, pid, cpu, flags, ts and msg, in this order.

open_perf_buffer() The open_perf_buffer() method can be called on a BPF object's table in Python. Its purpose is reading the contents of the BPF table (ring buffer) that was filled in the BPF program. It takes a function name as an argument, which serves as a callback connected to the table, so every time an event appears in the table (ring buffer), this function handles that event. The handler function should take three arguments - cpu, data and size, in this order. Inside it, the actual incoming event can be accessed and printed out. Lastly, checking the table for new events is performed by calling the perf_buffer_poll() method on the same BPF object, which is supposed to be called inside an infinite loop.
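Putting the two sides together, a minimal sketch of the perf-buffer based flow could look like this; the probed function, vfs_fsync(), and the structure layout are my own choices for illustration:

from bcc import BPF

prog = """
#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

struct data_t {
    u32 pid;
    char comm[TASK_COMM_LEN];
};

BPF_PERF_OUTPUT(events);                 // the table used to push events to user space

int kprobe__vfs_fsync(struct pt_regs *ctx) {
    struct data_t data = {};
    data.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
"""

b = BPF(text=prog)

def handle_event(cpu, data, size):
    event = b["events"].event(data)      # parse the raw bytes into the data_t struct
    print("fsync by pid %d (%s)" % (event.pid, event.comm.decode(errors="replace")))

b["events"].open_perf_buffer(handle_event)
while True:
    b.perf_buffer_poll()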

4.4 The ebpf-exporter

In order to transfer the metrics from (BCC) BPF programs to Prometheus, an exporter was needed. At the time of creating this project, I have only found one stable Prometheus exporter for ebpf programs, called ebpf_exporter by Cloudflare [18].

Creating a custom configuration for the ebpf exporter is a non-trivial task. The exporter is able to run e.g. the BCC example BPF programs, but not in the same structure. The exporter contains (and runs) both the BPF program's code and its configuration, both of which have to be defined inside a YAML file, which is then passed to the ebpf exporter as its "--config.file" argument.

4.4.1 Configuration and structure

To transplant a BCC program into an ebpf exporter configuration, the code must be modified. The first and most conspicuous part is the lack of any python code - only the C code of the BPF program must be retained. What BCC performs with the python part in its scripts, the exporter can also do as yaml-formatted key-value structure.

The configuration file can be divided into the following parts:

Programs This part should contain the code of the BPF program (or programs) itself in C. The BPF maps and structs defined here are automatically available for use in the other parts of the configuration.

Metrics This part is responsible for defining the metric type to be collected, which can be counters or histograms, depending on the data structures and maps defined in the BPF program.

Labels With these, the labels coming from the kernel maps can be translated into the desired label-format exposed in Prometheus.

Decoders Decoders take byte slices of defined length and transform them. They are useful for converting map keys to strings (static_map), translating kernel addresses to function names (ksym), filtering input with regular expressions (regexp) and so on. This is made possible by using the built-in decoders.

Tracing

Creating and attaching a kprobe or a tracepoint is different than it is done in BCC. Since there is no python part for processing the data, either of them can be instantiated at the root of the yaml tree with their own keywords. The syntax for them is writing their names as key, and referencing the name of the BPF function to be attached as its value. There can be multiple instances created from the same type.

The ebpf exporter also provides some built-in metrics, mainly for debugging purposes. They show every enabled BPF program that the exporter is running and also the kernel functions traced along with the names of the attached BPF functions. More information about the ebpf exporter can be found at Cloudflare’s GitHub repository.

5. The implementation in Kubernetes

In the chapter below, I am going to present my implementation and deployment of the ebpf exporter and Prometheus inside a Kubernetes cluster, running on Google Cloud Platform.

The usual way of using Prometheus in Kubernetes includes not only monitoring the applications that run inside the cluster, but also the resources of Kubernetes itself by discovering the targets automatically. These can be the API resources and nodes of Kubernetes using kube-state-metrics, node exporters, etc. Prometheus pulls these metrics, then creates alerts based on the user-defined rules and pushes them to Alertmanager, which can notify the receiver applications. For visualization, Grafana must be configured with Prometheus as a data source to display the metrics as different kinds of graphs. When I designed my application, I started from a full-featured monitoring architecture and implemented only selected features. Figure 5.1 shows the architecture of such a monitoring system; in this thesis, I have not implemented every part of it, mainly only the ones that are related to BPF.

Figure 5.1: A usual way of Prometheus-based monitoring in a Kubernetes cluster 1

1Source: https://sysdig.com/blog/kubernetes-monitoring-prometheus/

5.1 The infrastructure of the Kubernetes cluster

To implement the monitoring stack, I needed a cluster of physical or virtual machines (VMs) managed by Kubernetes. Creating virtual machines for this purpose on my host machine would not have been the best option, because the resource intensity of running several virtual machines would have had a negative impact on performance. So I have chosen a public cloud provider - the Google Cloud Platform (GCP) - for this task. My work can be used in any major public cloud system (e.g., Amazon Web Services, Microsoft Azure, etc.), but some parts of it might need to be modified to work with the respective API. My choice was based on my familiarity with the GCP API.

GCP has a built-in option for providing managed Kubernetes clusters, called the Google Kubernetes Engine (GKE). GKE makes creating Kubernetes clusters easy and almost effortless with numerous predefined possibilities; however, it does not give as much customizability at the kernel level of the VMs (and the option of running BPF programs) as creating the cluster from scratch with virtual machines. To create virtual machines on GCP, I used the Google Compute Engine and its API. I have created four VMs for the Kubernetes cluster - one running as the master node and three additional worker nodes - using the "tf-kube-" prefix. The created and running virtual machines are shown on figure 5.2, as they appear on the GCP dashboard.

Figure 5.2: The virtual machines that run Kubernetes with their internal and external IP addresses in Google Compute Engine.

The instantiation of the virtual machines and the basic installation of Kubernetes on them was performed using Terraform [26] scripts, that I have created as a part of my previous university project. Terraform by HashiCorp gives the ability of creating Infrastructure as Code (IaC) for managing and provisioning cloud environments automatically.

The container network interface for Kubernetes

In order to connect the pods running in a Kubernetes cluster with more than one physical or virtual node, a container network interface (CNI) plugin has to be deployed on the cluster. There are many options for CNI plugins to choose from these days. My choice was Flannel [19] - an easily deployable plugin that creates a routable, Layer 3 (OSI) network among the Kubernetes cluster’s nodes. Flannel runs only one agent per node and provides subnet allocations for them, and to install it in the cluster, only one yaml file should be applied using kubectl.

The command line interface of Google Cloud Platform

To reach the cluster and the VMs that I have created, I used gcloud, Google’s command line tool for creating, managing and accessing (SSH) Compute Engine resources, like VMs or networks. This feature was the one that I have made use of the most. For example, to get a shell to one of the VMs of the cluster via SSH (after configuring, authenticating and initializing gcloud with the proper account), the following command should be executed:

gcloud compute ssh <instance name>

After getting a shell to the master node with the command above (and with Flannel deployed), the Kubernetes cluster’s master and worker nodes can be listed via kubectl. The output can be seen on figure 5.3.

Figure 5.3: The nodes of the Kubernetes cluster are in ready-state.

More information about the usage of gcloud can be read in the Google Cloud SDK docu- mentation [20].

5.2 Setting up Prometheus

There are a few necessary steps to properly set up Prometheus for monitoring. These are detailed in the next subsections.

5.2.1 General configuration

Prometheus functions only when the desired configurations and settings are defined in at least one yaml-file given to it as an argument. The mandatory file is called prometheus.yaml by default, which contains the following parts:

• global:

– This part defines the default scraping frequency of the given targets, the scraping timeout, the evaluation interval of the rules and also arbitrary labels can be defined to be attached to alerts or time-series in a key-value manner.

• rule_files:

– Multiple filepaths can be written here - these sources define the rules and alerts in the system.

• scrape_configs:

– Multiple scrape configurations can be provided at this part, specifying a set of targets and parameters for scraping them. Usually a job is defined by only one scrape config. The targets can be defined two ways: by static configs, hardcoded, or dynamically, using service-discovery.

• alerting:

– Alertmanagers: it contains alertmanager_configs, which specify the Alertmanager instances that Prometheus can forward alerts to.

– Alert_relabel_configs: it defines how the alerts' labels should be rewritten before they are sent to the Alertmanagers.

• remote_write / remote_read:

– These configurations define remote storage endpoints that Prometheus can write samples to and read samples back from; I have not used them in this project.

Alerting rules

In a separate file, the alerting rules can be defined by configuring conditions for them, and specifying, which external services should handle the notifications about them. The conditions can be written using the Prometheus expression language. Several details can be set for a rule beside its name and the condition - it supports pending alerting states, which means, that when the given condition is met, the alert will only switch to firing state, if the condition is still met after the predefined pending time interval. Different labels containing the severity of the alert can also be set, just like annotations.

Prometheus exporters

Since Prometheus collects metrics in a pull-based way, the applications or services to be monitored should provide their metrics to it in a uniform format that Prometheus understands. The agents or daemons that make this possible are called exporters. The Prometheus project is not able to develop exporters for every scenario, but encourages everyone to do so (although it maintains and publishes some exporters at the official Prometheus GitHub). There are numerous exporters out there to choose from, developed by third parties. A small list of exporters for various purposes is shared in Prometheus's official documentation [61].

5.3 Creating the ebpf exporter

Before getting to run the ebpf-exporter inside Kubernetes, I had to create a Docker image to be able to run it in a container. Then to make sure that it worked, I created a container from the image and ran it with Docker on the host machine (in local environment).

5.3.1 The Dockerfile

In order to create a Docker image, a proper Dockerfile needs to be written beforehand. In the following I explain the structure of the Dockerfile, which is available in Appendix A.2.

• It uses ubuntu 18.04 as its base image.

• The second step updates the package index with the apt package manager, then installs the necessary utilities and commands using the RUN keyword. This part also includes the steps of installing the BCC tools, but not in the simplest way, which would be downloading the precompiled package (bpfcc-tools) with the apt package manager. Although that would be far less complicated, there is a known bug in that package which still does not seem to be fixed, and which is discussed in more detail on the project's GitHub page [30]. Because of this problem, I had to consider other options for installing the exporter, so I have chosen to use the upstream stable and signed packages, and verified that they function correctly. This way, I had to install bcc-tools, the libbcc-examples and the linux-headers for the actual kernel version on the host machine separately. As already mentioned at the introduction of BCC, the kernel headers must be available if someone intends to run BPF programs using BCC. The headers' version, however, needs to match the exact kernel version of the host machine, which can change after an update. To adapt to such a dynamic environment, it is preferable to use the linux-headers-$(uname -r) package name, which always resolves to the currently running kernel version on any machine.

• Some additional installation steps were needed, which include sudo to get administrator rights and wget for downloading files over HTTP(S) or FTP(S), which is needed to download the necessary files of Go. The GCC compiler and the git version control system are needed as well.

• The third step is downloading the Golang binaries using wget, then adding them to the PATH environment variable to allow calling it anywhere from the command line.

• The fourth step is downloading the ebpf-exporter itself using the go command line interface. Its get command takes care of both downloading and installing the given packages.

• The final steps are exposing port 9435 to make the scraped metrics accessible from outside the container, then setting the CMD arguments, which consist of starting the ebpf_exporter binary with a config-file flag that determines which BPF program to run and expose scraped data from. The config file has to be defined in yaml format. By default it referred to a file named bio, which I intended to use for testing the exporter, but which led to some interesting conclusions that I explain in a case study in a subsequent subsection 5.4.2. So instead of using the test variants, I am copying my own yaml file (exporter-configs.yaml) into the container before the CMD section. This file includes the custom BPF programs and configurations that I have created, which are explained in more detail in a later section 5.5.

Building the Docker image

After the Dockerfile is created, the Docker image can be generated accordingly. To build it, I used the following command inside the folder where the Dockerfile resides (the dot at the end refers to the file path):

sudo docker build -t ebpf_exporter .

To check if the image has successfully been created, I used the following command:

sudo docker images

As seen on figure 5.4, the created image now appears in the list of Docker images.

Figure 5.4: The ebpf_exporter Docker image has been created.

Running the image

Docker uses a daemon for managing and running containers. Since the docker daemon connects to a Unix-socket on the system instead of using a TCP port, it must be run as root (or using sudo rights, if any other user wants to use it), because the Unix-socket belongs to the root user.

5.3.2 Privileged mode

By default, every Docker container runs in "unprivileged" mode and is not allowed to access any device. To overcome this problem, Docker containers can be run as "privileged" with the corresponding command switch, which can be seen below in the case of running the container based on the ebpf exporter's Docker image.

sudo docker run -d -p 9435:9435 --privileged ebpf_exporter

Trying to run the container without the privileged argument, the following error is generated, which tells that in the lack of privileged mode, the BPF program cannot read and write the contents of BPF maps, and hence fails:

"could not open bpf map: io_size, error: Operation not permitted 2019/10/02 22:38:37 Error attaching exporter: error compiling module for program ’bio’"

Running the command above using the privileged mode, however, is not enough to get the exporter working. It is also necessary to mount some volumes into the container and to expose the port to be able to get the collected data out of it. The volumes are set the following way: after the -v flag, the directory to be mounted follows, given by its file path on the host machine. A colon comes next, followed by the path inside the container where the volume is supposed to be mounted; then, after a second colon, the access type has to be provided, which can be read-only (ro) or read-write (rw). The whole command with the necessary volume mounts can be seen on listing 5.1.

Listing 5.1: The command for running the eBPF exporter inside a container.

sudo docker run -d --rm -p 9435:9435 --privileged \
    -v /sys/kernel/debug:/sys/kernel/debug:rw \
    -v /lib/modules:/lib/modules:ro \
    -v /usr/src:/usr/src:ro \
    -v /etc/localtime:/etc/localtime:ro \
    ebpf_exporter

After starting the container in detached mode, it runs as a background process. The container runs the BPF program that was specified inside the yaml file given in the "--config.file" argument of the ebpf exporter, and collects metrics from it, which are accessible via the exposed port on localhost.

5.3.3 Accessing the image from Google Cloud Platform

The next task was making sure that the VMs, running in the Compute Engine have access to the image of the ebpf exporter. The only way this can be achieved is by pushing it to the Google Container Registry (GCR), which refers to a secured, private Docker image repos- itory with configurable access control. The GCR also supports the vulnerability scanning of the uploaded images. More about GCR can be found in the official documentation [23].

In my case, pushing the image was performed with the following command:

sudo docker push eu.gcr.io/ebpf-211311/ebpf-exporter

This command pushes the Docker image to the registry, where the host name is "eu.gcr.io", the project ID is "ebpf-211311" and the image name is "ebpf-exporter". After successfully pushing it, the image shows up in the GCP dashboard, as seen on figure 5.5 (it can also be listed via the CLI using gcloud). Now the VMs on GCP can access the ebpf-exporter's image and Kubernetes can also work with it.

Figure 5.5: The ebpf_exporter Docker image is pushed to the Google Container Registry.

5.4 Deploying the monitoring stack in Kubernetes

To deploy the monitoring stack in Kubernetes, as a first step it is good practice to separate every resource connected to it into a new namespace. The namespace I have created is called "monitoring".

The goal of running the ebpf exporter in the Kubernetes cluster is to collect metrics from its different resources and aspects, including the Kubernetes-specific resources as well, like pods and nodes. Because very few components of a Kubernetes cluster are static, I needed a solution that handles changes and adapts to the cluster's dynamic nature. This could not be avoided, because nodes can be added to or withdrawn from the cluster at any time, and this must not disrupt the exporter's operation in any way. The goal was running the exporter on every node, automatically including every new node that joins the cluster. Luckily, Kubernetes has a tool to solve this problem in the form of daemonsets.

5.4.1 Deploying the exporter

To make the ebpf exporter - for which I had previously created a Docker image - run on every node of the Kubernetes cluster, I had to create a daemonset Kubernetes resource. A daemonset ensures that a copy of a specified pod is running on every node (or on selected nodes) at all times. It adds a copy to every newly joining node and garbage-collects the pods from the removed ones. Daemonsets are typically used by agents (Sysdig, Datadog), cluster storage daemons, log collectors (fluentd), or for node monitoring, as in my case.

52 Daemonsets, like most of Kubernetes’ resources can be created using yaml files beside using the command line arguments. I have chosen using the yaml method, because it is much easier to define arguments, when there’s a lot of them.

The first task was setting an annotation in the template's metadata. This was really important, because without it Prometheus could not find out which pods to scrape, so I have added the "prometheus.io.scrape" parameter here, set to "true". This way, every pod that belongs to this daemonset will include this annotation, and Prometheus will know that these pods make up the exporter to be scraped.

In the following, I will introduce some of the more important, remaining parts that the yaml file of the daemonset contains in its "spec" section. This yaml file is available in Appendix A.3.

• I set the hostNetwork parameter to true - this way the network namespace is shared between the daemonset and the host machine, and the ports open on the container will be opened on the host machine as well.

• The first essential part of the configuration is under the containers-section. Here I have defined the name of the container, that will be created, and for the "image" key, I have given the name of the image (along with its proper tag) of the ebpf-exporter Docker image, that I had previously pushed to the Google Container Registry. The image’s name this time must have contained the full name and path of the registry as well: eu.gcr.io/ebpf-211311/ebpf-exporter:latest.

• I set the opened container port to 9435 and set some resource usage limits for the daemonset.

• One of the most important parts of the configuration was setting the privileged field to true under securityContext. This achieves the same effect as using the --privileged flag when running the container with Docker, as described in one of the previous sections 5.3.2.

• The second most important part was the volumes-section, where I defined the name and path of the volumes to be mounted in the container. Kubernetes mounts these volumes by the information provided under the volume mounts-section, where I set the mount paths also similarly to the flags that were used when running the container using the Docker command line interface.

At this point, almost everything was ready to deploy the daemonset; however, one authentication step was still missing. If I had wanted to apply this yaml file in this state, it would not have worked, because the Google Compute VM instance that serves as the master Kubernetes node does not have access to the Google Container Registry by default. To grant access to it, I first had to create a service account on GCP. Service accounts (SA) are similar to user accounts, just meant for compute instances.

53 Service accounts act as identities which can be given to compute instances. This way they can run API requests in the name of the specified user. The instances and the related service accounts must be located within the same project and one instance is only allowed to have one service account.

For this purpose, I have created a service account called "docker-puller" with the full name of docker-puller@ebpf-211311.iam.gserviceaccount.com. What a service account makes an instance capable of can be defined by granting it IAM (Identity and Access Management) roles. This SA was granted the IAM role of reading the cloud storage, which includes reading the Container Registry as well. The next step was attaching the service account to the compute instance on which the Kubernetes master node is running, using the command on listing 5.2.

Listing 5.2: The command for attaching the service account to the master Kubernetes node's instance

gcloud compute instances set-service-account tf-kube-master \
    --zone=europe-north1-a --scopes=storage-ro \
    --service-account=docker-puller@ebpf-211311.iam.gserviceaccount.com

The resulting roles of the binding can be checked using the cli, or on the Google Cloud dashboard, among the details of the tf-kube-master instance, as seen on figure 5.6. The instance has only got read access to storage.

Figure 5.6: The tf-kube-master instance’s granted roles, after binding the ser- vice account to it.

54 Only after these configurating steps was I able to create the daemonset resource, based on the yaml file detailed above. I performed it by executing the following kubectl command on the master Kubernetes-node:

kubectl create -f daemonset.yaml

The state of the created daemonset now can be examined using kubectl, as seen on figure 5.7. The ready field shows three instances, because in my Kubernetes cluster, I use three worker nodes (remember fig. 5.3) and the daemonset has successfully been created on all of them. The "node selector" field’s value is "none", because I didn’t want the daemonset to run only on specific selected nodes - I wanted it to run on all the existing and also the joining ones.

Figure 5.7: The daemonset resource is in ready-state, running on all three worker nodes.

The pods created by the daemonset for each worker node is listed on figure 5.8.

Figure 5.8: The pods, created by the ebpf-exporter daemonset.

Getting the metrics

The metrics data that the ebpf-exporter collects as a daemonset has become available at its exposed port on every worker node's internal IP. It can be checked by sending an HTTP GET request to the <node internal IP>:9435/metrics path. A small portion of the output can be examined on figure 5.9. This output is not quite user-friendly, but it was not designed for human readability. The pods do not even need to expose this port and these metrics outside the cluster, because this format is the one that Prometheus can deal with (inside Kubernetes), making the information easy to work with for its users.

Figure 5.9: The metrics from one of the ebpf exporter daemonset’s pod, run- ning on worker-node-2, executing the bio.yaml BPF program.

5.4.2 Deploying Prometheus

After I had successfully deployed the ebpf exporter, the next task was deploying Prometheus in Kubernetes too, which included the following steps:

• Prometheus, running in Kubernetes, accesses its configuration file a bit differently than normal. For this purpose, I have created a configMap Kubernetes resource, which contains both the prometheus.yaml file and the rule configurations, and I named it "prometheus-server-conf". I have created the configMap using a yaml file (see Appendix A.4). In the file, I have set the namespace to "monitoring" to make it available to every resource that resides there. In the prometheus.yaml section, I have set the scrape and evaluation intervals to 15 seconds. As for the rule files section, I have set its value to the "prometheus-rules" file (which is also defined in this configMap). In the scrape configs part, I have created a Prometheus job called "kubernetes-pods". Here comes one of the most important parameters regarding monitoring in Kubernetes: kubernetes_sd_configs. This configuration option is the key for Prometheus to be able to scrape Kubernetes-related resources via the Kubernetes REST API, while always staying synchronized with the state of the cluster using service discovery. The kind of Kubernetes targets to scrape has to be given in the "role" parameter of kubernetes_sd_configs. There are five options to choose from: node, pod, service, endpoints and ingress. In my case, I have set the role parameter to "pod", so the job can discover the pods and expose their containers as scrape targets. This way, Prometheus can access the pods of the daemonset running the ebpf exporter and collect their metrics. The last part of the file contains the relabeling configurations. Relabeling is an option that Prometheus provides to filter and modify unnecessary, sensitive or unwanted time-series and metrics before storing them in the database. Here, I have set the configuration to only keep the metrics that had obtained the prometheus.io.scrape source label. I also set the target labels for the job and node parameters.

• To save this configmap yaml file as a Kubernetes resource, I applied it with the following command:

kubectl apply -f configmap.yaml

To check if the creation was successful, I used the kubectl command seen on figure 5.10. The output also shows in the "data" field that there are two configurations defined in this map.

• Now that the configurations are ready and accessible within Kubernetes, the next step was deploying Prometheus itself, which I have created a deployment for.

56 Figure 5.10: The configmap for Prometheus has been successfully created, containing two configuration files.

Kubernetes deployments act as controllers for a set of pods - they perform pod and replicaSet updates towards the desired state in a declarative way. I have created a yaml file for setting up the deployment. Obviously, the namespace had to be set to "monitoring" in this case as well. I have set the desired replicas to one, but this can be increased by scaling the deployment up later on. In the pod template section, I have set the labels to "app=prometheus-server". This is an essential part, because services that expose this deployment will only be able to select its pods by this label (selector). In the containers section, I set the image to prometheus:v2.10.0 rather than latest, to avoid possible incompatibilities in the future. The port was set to 9090/TCP, and I have also set two arguments for starting Prometheus in the containers: one for locating the configuration file and one for storage. These file paths are located inside the newly created containers, where I mount the referenced volumes defined in the Volumes section of the configuration. In that part, I have created two volumes, one for configuration and one for storage. The first one's type is "ConfigMap" with the name "prometheus-server-conf", which I had previously created; its optional parameter is set to false, because Prometheus cannot start without it. The second volume's type is set to EmptyDir, which refers to a temporary directory that only exists as long as the pod that uses it. I have created the deployment similarly to the configmap. The result can be seen on figure 5.11.

Figure 5.11: The deployment for Prometheus has been successfully created, running one pod at the moment.

• At this time, Prometheus was already running, but I needed to access its web-based graphical user interface - the Prometheus dashboard. To achieve it, I needed to expose the deployment via a service, so I could reach its surface using a web browser. I created the service by writing another yaml file. After setting the namespace to "monitoring", in the metadata-section, I have set the labels to the same ones, that I had defined in the deployment configuration (app: prometheus-server). Under the spec part, I needed to set the ports for the service. By default, the type of every service is "ClusterIP", which makes the service accessible only from within the cluster. I needed to access it externally, so I have set the type to nodePort, which exposes the service on each node of the cluster on the same, statically specified port. I have set the nodePort to 30137. The targetPort was set to 9090 - this defines, on which port the application is listening inside the pod. After defining the ports, I had to set the selector parameter to "app: prometheus-server".

57 This way, the service could select the deployment’s pods and route traffic to them. The result of creating the service is shown on figure 5.12.

Figure 5.12: The service for Prometheus' deployment has been successfully created with the type of nodePort, exposing port 30137 outside the cluster.

• Unfortunately, I still was not able to access Prometheus from a browser using the IP address and the nodePort of the Google Compute instance that the Kubernetes master node used. I could access it from inside the cluster using curl HTTP requests, but that was not the expected operation. As it turned out, the problem was in the setup of Google VPC (Virtual Private Cloud) networks. The firewall would not allow any ingress traffic on the specified nodePort, so I had to create an explicit rule to fix it. I made the modifications using the GCP dashboard, but it could also be performed via the gcloud CLI. The firewall rule that I have created (with the name k-ext) is set on the default network with ingress direction. It applies to all instances in the network. As for the source filter type, I have chosen "IP ranges" with the 0.0.0.0/0 netmask, which refers to all IPv4 addresses. I have also set the specified protocols and ports to "TCP:30137", which is the nodePort of the service exposing Prometheus. The created rule on the dashboard can be seen on figure 5.13.

Figure 5.13: The Google-VPC firewall rule, that enables incoming traffic to Prometheus, running inside Kubernetes.

Accessing Prometheus

After I configured the firewall, the service had become accessible from outside the cluster, using a web browser. Because of the nodePort (and the firewall-) setting, the Prometheus instance is available at the external IP address of any of the clusters’ nodes. (This could be reduced e.g. only for the master node.) I have accessed it on the master instance’s external IP, using the port previously opened - http://35.228.105.255:30137. This time, the dashboard showed up successfully, which can be seen on figure 5.14.

At the Status button, under "Service discovery", Prometheus shows the services that it has discovered from Kubernetes. Previously, in the Prometheus configuration yaml file, I have set the type of the discoverable resources to "pod". On this page, we can see that Prometheus has successfully discovered all of the pods (22) in every available namespace (including "kube-system") along with their labels, but kept only those as targets that carry the prometheus.io.scrape annotation, which is only true for the pods of the ebpf exporter's daemonset. Figure 5.15 presents the discovered pods and annotations as seen on the dashboard. In this figure, the relabeling can also be inspected for the "instance", the "job" and the "node" target labels, which have been converted from the pods' source labels according to the configuration of Prometheus. A part of the discovered pods can be seen on figure 5.15.

Figure 5.14: The dashboard of the Prometheus service, running in Kubernetes.

The discovered pods of the ebpf exporter can also be seen on the "targets" page of the dashboard. It correctly shows three endpoints to the kubernetes-pods target, because I run three worker nodes and only one is deployed on each of them. Their state with the proper labels attached to them is also shown here, as seen on figure 5.16.

Monitoring Kubernetes resources

Prometheus, when it is deployed in Kubernetes is generally used for monitoring the cluster’s resources, as seen on figure 5.1. As this is not the focus of this thesis, I have not implemented this part, but I am shortly summarizing, how Prometheus is able to monitor Kubernetes itself.

• A new ClusterRole has to be created, that gives authority for Prometheus to ac- cess Kubernetes’ resources. The ClusterRole should let getting, listing and watching the following resources: endpoints, nodes/proxies, pods, services, and getting the ingresses.extensions along with the /metrics non-resource URL.

• The ClusterRole then has to be applied with a ClusterRoleBinding to the ServiceAccount that Prometheus uses in the proper namespace.

59 Figure 5.15: The discovered pods of the ebpf exporter in the Prometheus dash- board after relabeling the discovered Kubernetes labels.

The following components are provided by Kubernetes for monitoring:

• Kube-state-metrics is a metrics endpoint, running in Kubernetes as a service. It provides metrics about the states of objects, like nodes, pods, services. It gets the orchestration metadata from the Kubernetes API server. It runs as a deployment with one replica and can be scraped by Prometheus.

• Metrics-server aggregates the data about resource usage in the Kubernetes cluster. It is the successor of the now deprecated Heapster, and is not meant to provide long-term storage of the metrics.

A more detailed description about this topic can be found for example on Freshtracks.io’s blog post [16].

60 Figure 5.16: The scraped endpoints of the kubernetes-pods target in the Prometheus dashboard.

The effects of kernel upgrades - A case study

While I was testing and documenting the ebpf exporter, as written in the previous subsection, the exporter worked properly, as expected. The snapshot on figure 5.16 was also taken then. However, as I progressed with the thesis, the deployed exporter suddenly stopped working for some reason. I have since managed to find out what caused the problem. The ebpf exporter was running the biolatency BCC BPF program, configured in the bio.yaml file, as written in the deployment section. The biolatency BPF program (that I had deployed) was using kprobes by default: it creates a kprobe for the blk_start_request() kernel function (and some others) and attaches the defined BPF functions to it.

At the time when the exporter with this program was running successfully, the VMs that make up my Kubernetes cluster on Google Cloud Platform were using Linux kernel v4.15 on each machine. However, as I have figured out, Google regularly updates the kernel on VMs that use any of their stock images (for security reasons), so when I checked the version afterwards, it had been automatically updated to v5.0.0 on every machine.

The main issue here was that Linux kernel v4.20 was the highest version, that still included the blk_start_request() function. From kernel version 5.0 and up, this function does not exist, hence the ebpf exporter started failing as well. This kernel function and the related kernel versions can be found at the Linux kernel documentation [3].

This shows that kprobes do not have a stable API, as opposed to tracepoints, which are preferred over them with regard to stability across kernel releases. There are implementations of this BPF program that use tracepoints for this purpose, and this incident clearly showed the difference between the two approaches.

A solution for this could have been pinning the kernel version by using a custom image when creating the VMs, but instead I have changed the BPF program(s) of the ebpf exporter since then. These modifications in the BPF programs and the exporter are detailed in the next section 5.5.

5.4.3 Deploying Grafana

First, I have created a deployment, called grafana-deployment, that will make sure that the given number of Grafana’s pods are running all the time. I have set the replicas to one, but it can be scaled up at will. The yaml file is fairly simple, it sets the containers’ image name to grafana/grafana - Grafana’s official Docker image, then sets the container port to 3000, because it serves at that port by default.

To expose the deployment, I had to create another service, called grafana-service with the type NodePort, which exposes port 3000 to a random port number (from a given range) on every node, since I haven’t set the NodePort explicitly. The port that the service got has been 31303. Just like in the case of exposing the deployment of Prometheus, I had to create a VPC firewall-rule for Grafana’s service too. It happened the same way as before, and the created rule can be seen on figure 5.17.

Figure 5.17: The ingress firewall rule on GCP allowing access to Grafana's dashboard.

Adding the rule gave access to Grafana's dashboard, which I then configured to work with Prometheus.

Configuration

After successfully setting up the admin account, I needed to add the previously created Prometheus instance as a data source in Grafana by entering its exposed IP address and port in the corresponding configuration fields. I did not use any authentication for this purpose, but there are several options for that if needed in a production environment. After the setup, Prometheus became available under data sources on the Grafana dashboard, as seen on figure 5.18.

Figure 5.18: Prometheus has been added to Grafana as a data source.

5.5 The BPF programs in the exporter

In this section I explain what BPF programs I have loaded into the ebpf-exporter daemonset. For this purpose, I have created two BPF programs: one that I have designed and modified based on available example programs and snippets provided by IOVisor (cachestat), and one that I have implemented entirely from scratch (tcp-counter). Both programs are set in the ebpf-exporter's configuration, in the same yaml file under the programs part, and are detailed in the following subsections.

5.5.1 Cachestat

This program is responsible for counting the hits and misses of the file system page cache. It does so by counting the page cache operations grouped by type.

It distinguishes four types of operations:

• add_to_page_cache_lru - adds a page to the LRU (Least Recently Used) list.

• mark_page_accessed - marks a page as accessed when it is read from the cache.

• account_page_dirtied - accounts a page as dirtied, meaning it has been written to since the last sync to disk.

• mark_buffer_dirty - marks a buffer as dirty, i.e. a write into the cache.

More about memory management in the Linux kernel can be read in the official documentation [37].

The BPF program defines a struct, called key_t, and a BPF hash map. The struct stores a 64-bit unsigned integer - this holds the address of the kprobe that the program is attached to - and a 128-byte character array for storing the command name as a string.

The next step is creating the hash map for counting occurrences of the called command types. It takes variables of the previously defined struct type (key_t) as keys, and its name is counts. The program only defines one function, do_count(), which takes the context as a parameter. Inside the function, a new key_t struct is created and its "ip" field is set to the address of the kprobe; to get this address, the PT_REGS_IP(ctx) macro can be used. The struct's "command" field is set to the current process name via the bpf_get_current_comm() function. After filling the struct, it can be added to the hash map with the increment(key) method, which stores the key passed to it as an argument and increments its value by one (by default).
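A minimal sketch of this BPF section, assuming BCC-style C and the field sizes mentioned above (the deployed version may differ in details), could look like the following:

#include <uapi/linux/ptrace.h>

// Key: address of the probed kernel function plus the name of the calling process
struct key_t {
    u64 ip;             // instruction pointer of the attached kprobe
    char command[128];  // name of the current process
};

// Hash map counting occurrences per (kernel function, process) pair
BPF_HASH(counts, struct key_t);

// One common handler, attached to all four page-cache related kprobes
int do_count(struct pt_regs *ctx) {
    struct key_t key = {};
    key.ip = PT_REGS_IP(ctx);                                // which kernel function fired
    bpf_get_current_comm(&key.command, sizeof(key.command)); // which process triggered it
    counts.increment(key);                                   // add one to this key's counter
    return 0;
}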

In the exporter's configuration, four kprobes are created for this program, and they attach the same do_count() BPF function to the four different kernel functions that represent the memory operation types. This means that every time any of the four operations happens, the address of its kprobe and the name of the current process are stored in the hash map (if that key is not in the map already), along with the number of occurrences of that key.

As for the rest of the configuration, under the metrics section the counter's "name" parameter is set to page_cache_ops_total, so the collected metrics are exposed under this name (with the exporter's ebpf_exporter_ prefix). The "table" parameter is set to counts, which is the name of the BPF hash map, and two labels are defined for it, referring to the two fields of the key_t struct. The first one is "op", the 8-byte address of the given kprobe, so in the "decoders" section it is set to "ksym": this decoder type takes a kernel address and translates it to the function name (it relies on the /proc/kallsyms file, which stores the Linux kernel symbols of both the static and the dynamically loaded kernel modules). The second label shows the command name as a plain 128-byte string, so its decoder is of type "string"; it also uses a "regexp" decoder to keep only the metrics generated by the systemd-journal or syslog-ng processes.
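Putting the above together, the relevant part of the exporter configuration can be sketched roughly as follows. The schema follows the ebpf_exporter examples, and the help text and the exact regexp values are assumptions:

kprobes:
  add_to_page_cache_lru: do_count
  mark_page_accessed: do_count
  account_page_dirtied: do_count
  mark_buffer_dirty: do_count
metrics:
  counters:
    - name: page_cache_ops_total
      help: Page cache operations by type and process
      table: counts
      labels:
        - name: op
          size: 8
          decoders:
            - name: ksym
        - name: command
          size: 128
          decoders:
            - name: string
            - name: regexp
              regexps:
                - ^systemd-journal$
                - ^syslog-ng$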

5.5.2 Tcp-counter

The second BPF program in the exporter is a custom one that I have created for experimental purposes. Its purpose is to trace the state changes of TCP connections in the kernel, keeping only the changes that involve a given set of TCP states.

In the code part of the configuration, I began writing the BPF program by defining a struct, also named key_t. It stores the old and the new state of a TCP connection, both as 8-bit integers, and a third, 128-byte character array stores the name of the current process. Next, I defined a BPF hash map, named counts, using the key_t struct type for its keys, just like in the previous program. The program resembles Cachestat, but here I am using a tracepoint instead of kprobes for the tracing function. I implemented it using the prefixed TRACEPOINT_PROBE() form, which takes two parameters: sock and inet_sock_set_state.

These arguments refer to the sock event category and the inet_sock_set_state event in the kernel. I have found the format of this particular event under /sys/kernel/debug/tracing/events/sock/inet_sock_set_state/format - its content can be seen on figure 5.19.

Figure 5.19: The format file of the arguments for the TCP-state change event.

Inside the function, I create integers for saving the old and the new state of the connection. All of the arguments of the traced event are automatically available through the fields of the args variable. After initializing them, I implement the filtering for only a few of the old and new states, which means that only those changes are counted whose old and new state are both in the filtered set. The numeric values of the states and their textual TCP state names are listed in the format file mentioned above.

The set that I am filtering in the program includes the following states:

• TCP_ESTABLISHED

• TCP_SYN_SENT

• TCP_SYN_RECV

• TCP_LAST_ACK

• TCP_LISTEN

• TCP_CLOSING

• TCP_NEW_SYN_RECV

If the event passes the filter, I store its old and new states in a newly created key_t struct, along with the current process's name. Finally, I increment the value belonging to that key in the hash map by one.
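A minimal sketch of this program, assuming the numeric TCP state values from the kernel's tcp_states.h and the same 128-byte command field as before (the deployed version may differ in details), could look like this:

// Key: one TCP state transition plus the name of the current process
struct key_t {
    u8 oldstate;
    u8 newstate;
    char command[128];
};

BPF_HASH(counts, struct key_t);

// Returns nonzero if the state is in the observed set
// (numeric values as defined in include/net/tcp_states.h)
static inline int in_filter(int state) {
    return state == 1      // TCP_ESTABLISHED
        || state == 2      // TCP_SYN_SENT
        || state == 3      // TCP_SYN_RECV
        || state == 9      // TCP_LAST_ACK
        || state == 10     // TCP_LISTEN
        || state == 11     // TCP_CLOSING
        || state == 12;    // TCP_NEW_SYN_RECV
}

// Attached to the sock:inet_sock_set_state tracepoint
TRACEPOINT_PROBE(sock, inet_sock_set_state) {
    int oldstate = args->oldstate;
    int newstate = args->newstate;

    // Keep only transitions where both states are in the observed set
    if (!in_filter(oldstate) || !in_filter(newstate))
        return 0;

    struct key_t key = {};
    key.oldstate = oldstate;
    key.newstate = newstate;
    bpf_get_current_comm(&key.command, sizeof(key.command));
    counts.increment(key);
    return 0;
}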

Under the programs section of the exporter configuration, I create the tracepoint attachment with the tracepoints keyword. Its key mirrors the event traced in the BPF part, and its value has to reference the BPF function in a specific way: since the function does not have a custom, unique name (because I used the prefixed TRACEPOINT_PROBE() form), the string has to follow the generated naming scheme, so the whole tracepoint attachment looks like the following in the configuration:

sock:inet_sock_set_state: tracepoint__sock__inet_sock_set_state

For the metrics, I am using one counter that gets its data from the counts BPF hash map, and I define three labels for it: oldstate, newstate and command, which refer to the fields of the key_t struct defined in the BPF program. The oldstate is returned as a one-byte unsigned integer, so I am using a static_map decoder to translate it to more understandable strings - essentially the same number-to-name mapping that the format file shows. The newstate label is created the same way. The command label is a straightforward 128-byte string.
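The metrics part of this configuration can be sketched roughly as follows. The counter name is inferred from the exposed ebpf_exporter_tcp_msg_counter metric, the schema follows the ebpf_exporter examples, and the textual state names are assumptions:

metrics:
  counters:
    - name: tcp_msg_counter
      help: TCP state transitions by process
      table: counts
      labels:
        - name: oldstate
          size: 1
          decoders:
            - name: static_map
              static_map:
                1: ESTABLISHED
                2: SYN_SENT
                3: SYN_RECV
                9: LAST_ACK
                10: LISTEN
                11: CLOSING
                12: NEW_SYN_RECV
        - name: newstate
          size: 1
          decoders:
            - name: static_map
              static_map:
                1: ESTABLISHED
                2: SYN_SENT
                3: SYN_RECV
                9: LAST_ACK
                10: LISTEN
                11: CLOSING
                12: NEW_SYN_RECV
        - name: command
          size: 128
          decoders:
            - name: string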

5.5.3 The resulting time-series

One way to check whether the exporter works with the newly created configuration and the BPF programs described above is to send an HTTP request to the exporter's service with curl. It works as expected, as seen on figure 5.20.
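For example, assuming the exporter is reachable on its default port 9435 (the IP address below is only illustrative), the following request returns the metrics in the Prometheus text format:

curl http://10.166.0.4:9435/metrics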

Figure 5.20: The ebpf exporter is working properly, giving the expected time series from both of the BPF programs as seen on its REST API response.

6. Measuring system performance

In terms of measurements, my main aims were to perform eBPF-based tracing on a Kubernetes cluster and to find out which aspects of eBPF prove to be useful in this context. My test environment is a self-managed Kubernetes cluster deployed on the Google Cloud Platform, whose design and setup are detailed in the previous section. I generated different types of workloads in the test cluster, then, depending on the workload, I deployed the two BPF-based monitoring programs that I have created - also detailed in the previous section - to inspect their behaviour and collect relevant information about the system. Finally, I used Prometheus and Grafana to analyze and visualize the collected metrics.

6.1 The test system

For testing purposes, I have created a custom Kubernetes cluster in Google Cloud Platform, which I have deployed with the help of Terraform [26], an Infrastructure as Code tool for automating the building and changing of infrastructure on arbitrary cloud platforms.

I have executed the eBPF programs on this cluster's nodes, which are virtual machines created in the cloud environment.

6.1.1 Using Prometheus

The Graph tab of the dashboard lets the user select which kind of aggregated time series should be shown, and also gives an expression field for the most important part: creating custom queries over the collected data. This expression field makes it possible to perform almost any kind of operation on aggregated time-series data in real time. It also provides functions to be used on the data, like rate of increase, logarithm, sorting and many more, with the use of the PromQL query language. More details about the PromQL functions can be found in the official Prometheus documentation [62].

The data evaluated and returned by the PromQL query can be shown in two ways in Prometheus: either on a console as plain text that lists the selected time series, or on a simple graph. Prometheus has a built-in solution for graphing the data, but its capabilities are limited: it can only create line charts based on the executed query expression.

6.1.2 The exporters

After deploying Prometheus and the ebpf exporter in the Kubernetes cluster, I checked whether the exporter is collecting the proper metrics from the system and forwarding them to Prometheus. The metrics forwarded by the exporter can be seen at the top of the list on figure 6.1. It exposes the top four metric types in the list: the two built-in ones - ebpf_exporter_ebpf_programs (figure 6.2) and ebpf_exporter_enabled_programs (figure 6.3) - and the two custom ones - ebpf_exporter_page_cache_ops_total and ebpf_exporter_tcp_msg_counter. The built-in metrics show that there are two enabled BPF programs running on each node - cachestat and tcp-counter - as well as the names of the BPF functions used in them.

Figure 6.1: All the metrics exposed by the ebpf-exporter are visible in the Prometheus dashboard.

Figure 6.2: The built-in metrics of the ebpf_exporter_ebpf_programs as Prometheus discovered them using the ebpf-exporter, as seen in the Prometheus dashboard.

Figure 6.3: The built-in metrics of the ebpf_exporter_enabled_programs, showing the discovered programs as seen on the Prometheus dashboard.

Cachestat

The cachestat BPF program uses counters, which show constantly growing values, but in this raw form they are hard for a human to interpret. It is recommended to use the rate() PromQL function on counters, which shows the per-second rate of increase of the time series given as its argument. The time range over which the rate is calculated can be set between square brackets.
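For example, the following query (using the metric name exposed by the exporter and a five-minute window) gives the per-second rate of page cache operations:

rate(ebpf_exporter_page_cache_ops_total[5m])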

I am using this rate function in the expression field of Prometheus with a five-minute time duration. I checked the results on the Graph view of Prometheus, which can be seen on figure 6.4.

Figure 6.4: The metrics of the cachestat ebpf program on Prometheus’s Graph view, showing normal memory activity. Each color corresponds to a different memory operation on a node.

Tcp counter

The tcp counter BPF program also uses counters, so I used the rate function here as well. The results can be seen on figure 6.5.

Figure 6.5: Testing the metrics of the tcp counter ebpf program on Prometheus’s Graph view. Every color corresponds to a different TCP state-transition on each node.

6.1.3 Visualizing the data with Grafana

After testing the metrics in Prometheus, I switched to Grafana to make the time series more readable and customizable.

I have created a graph and a heatmap from the metrics that Prometheus collects from the ebpf-exporter running on the nodes of the Kubernetes cluster. There are many options to customize the graphs after setting their data source (Prometheus).

I have used the same PromQL expressions for the graphs as in Prometheus - rating the counter values - but with different time durations.

The graph I have created shows the collected data of the tcp-message-counter eBPF program, while the heatmap illustrates the results of page-cache-ops-total. The heatmap is set to show the greater values in brighter and the smaller ones in darker shades of green. The graphs of the metrics over the last 12 hours are shown on figure 6.6.

Figure 6.6: An overview of a graph and a heatmap about the metrics of BPF programs running in the ebpf exporter (12 hour interval), as seen on the dashboard of the Grafana instance, running in Kubernetes.

6.2 Programs and scripts for testing

To test the system, I needed to run applications that utilize the resources the ebpf exporter is monitoring. The two BPF programs needed two different kinds of stimulation, which I explain in this section.

6.2.1 Testing tcp-counter

As this program is related to monitoring network activities, I needed that kind of workload on the cluster. Testing it didn’t even need a separately deployed application.

Since HTTP uses TCP, it was easy to trigger by sending HTTP GET requests with the curl command to given endpoints on the monitored Kubernetes nodes. I decided to ssh into tf-kube-worker-0 and send HTTP requests to the ebpf exporter's port on tf-kube-worker-1.

I sent the requests repeatedly using the watch -n 1 curl 10.166.0.4:9435 command, which executes the curl command every second. It successfully created pronounced spikes on the graph in Grafana, seen on figure 6.7.

Figure 6.7: The result of sending repeated HTTP requests from one node to another in a 30 minute interval.

Thanks to exposing the process names in the BPF program, they can also be seen among the metric labels. I have filtered the metrics of tf-kube-worker-0, and at the spikes they show the curl and watch commands, which I used for sending the requests. They also show that the old state of the connection was TCP_SYN_SENT, which turned into TCP_ESTABLISHED. These filtered sections can be seen on figures 6.8 and 6.9.

Figure 6.8: The result of using the curl command on the node that sent the HTTP request.

Figure 6.9: The result of using the watch command on the node that sent the HTTP request.

The biggest spike belongs to tf-kube-worker-1, the node that received the requests. The command that handled them was swapper, and the old state was TCP_SYN_RECV, which switched into TCP_ESTABLISHED every time. The activity also formed a spike on the heatmap of Cachestat as expected, seen on figure 6.10.

Figure 6.10: SSH-ing into the worker nodes has also created a spike on cachestat - shown in a 30-minute interval.

6.2.2 Testing cachestat

To test cachestat, I needed to access memory locations that were not in the cache on the nodes. I have created a small test program with the purpose of causing a significant number of page faults, whose effects can then be exported and examined in Prometheus and Grafana on the Cachestat graph. For the program's language, I have chosen C because of its explicit dynamic memory allocation capabilities.

In the program, I first define a block size of 512 megabytes in a macro, then in the main function I create an array of type double. The type double takes 8 bytes, so the array itself takes around 1.2 gigabytes of memory.

Then an iteration follows in which I fill the array's elements one by one using C's malloc function, cast to double. malloc allocates a memory block of a given size in bytes but does not initialize it; here, I allocate 512 megabytes (the size defined in the macro) for every element. To actually write values into the allocated block, I use the memset function. This writing section is iterated 30 times, then in another loop I read values from the array at random indexes.
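Reconstructing the behaviour described above, a minimal version of the test program could look roughly like the following. The exact types, loop bounds and printed output are assumptions; the essential point is that far more memory is allocated and dirtied than a node has:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BLOCK_SIZE (512UL * 1024 * 1024) /* 512 megabytes per block */
#define BLOCKS 30

int main(void) {
    double *blocks[BLOCKS];
    srand(time(NULL));

    /* Allocate and dirty 30 blocks of 512 MB each - intentionally more
       memory than a worker node has, so the kernel's OOM killer fires. */
    for (int i = 0; i < BLOCKS; i++) {
        blocks[i] = (double *) malloc(BLOCK_SIZE);
        if (blocks[i] == NULL) {
            perror("malloc");
            return 1;
        }
        memset(blocks[i], 1, BLOCK_SIZE); /* actually touch the pages */
    }

    /* Read values back at random indexes to generate further cache activity */
    for (int i = 0; i < BLOCKS; i++) {
        size_t idx = (size_t) rand() % (BLOCK_SIZE / sizeof(double));
        printf("%f\n", blocks[rand() % BLOCKS][idx]);
    }

    for (int i = 0; i < BLOCKS; i++)
        free(blocks[i]);
    return 0;
}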

To access the program inside Kubernetes, I had to create a Docker image for it and push it into the Google Container Registry. I used GCC's official Docker image (v9.2.0) as the base image, and I simply start the program in the CMD section. I had to tag the image in the same specific way as before, then push it using the sudo docker push eu.gcr.io/ebpf-211311/memtest command.
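The Dockerfile itself can be sketched along these lines; the file names and the compile step are assumptions, only the base image and the use of CMD come from the description above:

FROM gcc:9.2.0
COPY memtest.c /usr/src/memtest.c
WORKDIR /usr/src
RUN gcc -o memtest memtest.c
CMD ["./memtest"]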

In Kubernetes, I created a job for running the memory testing program (memtest). Jobs are a resource type in Kubernetes that create a specified number of pods and make sure they successfully reach the completed state. More about jobs can be found in the Kubernetes documentation [9].
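A minimal Job manifest for this purpose could look as follows; the object name, restartPolicy and completions are assumptions, while the image path mirrors the one pushed above:

apiVersion: batch/v1
kind: Job
metadata:
  name: memtest-job
spec:
  completions: 1
  template:
    spec:
      containers:
        - name: memtest
          image: eu.gcr.io/ebpf-211311/memtest:latest
      restartPolicy: Never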

Analyzing the results

First, I have cleared the cache on the worker nodes using the following command:

sudo echo 1 > /proc/sys/vm/drop_caches

After starting the job, the pods it created began failing and exiting, because the 30-element array, where every element is 512 megabytes, would take up far more memory than a node has. This can be verified by SSH-ing into the nodes that ran these pods and executing the following command, which checks the systemd journal's contents, filtering the entries that refer to OOM, which stands for "out of memory":

journalctl -k | grep -i -e memory -e oom

The command gave the expected result: all of tf-kube-worker-0's RAM had been used up, so the process had been killed. The output is shown on listing 6.1.

Listing 6.1: The logs of journalctl about the out of memory error on one of the worker nodes
Nov 07 22:27:02 tf-kube-worker-0 kernel: Out of memory: Kill process 22302 (memset) score 1868 or sacrifice child
Nov 07 22:27:02 tf-kube-worker-0 kernel: oom_reaper: reaped process 22302 (memset), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

I checked the effects of this event in Prometheus and Grafana. On the graph of Prometheus (figure 6.11), the spike is clearly visible.

Detailed information is also available: the calling rate of the add_to_page_cache_lru kernel function, using a five-minute window, was around 139 on node tf-kube-worker-0, and the account_page_dirtied and mark_page_accessed functions were also called about ten times more often than usual in that time period. The heatmap in Grafana (figure 6.12) shows the same behaviour, just in another format: there has been a hit in the 140-154 bucket, using the same rate function as in Prometheus.

6.2.3 The spikes on the graphs

Both test loads of the BPF programs (TCP counter - 6.2.1, Cachestat - 6.2.2) generated pronounced spikes on the Grafana graphs and diagrams. Although the test loads were repeated at even time intervals, they appear as spikes on the graphs, because the graphs show the rate of a counter-typed metric, and the PromQL rate() function calculates the per-second average rate of increase of the time series over the given range.

Figure 6.11: Running the memory testing program created a spike on the cachestat graph in Prometheus.

Figure 6.12: The effect of running the memory testing program as seen on the Grafana heatmap.

6.3 Discussion

There are many directions in which this project could be continued. Most importantly, the existing BPF programs could also be implemented using BPFtrace, and their number and types could be extended to suit more applications and to collect more kinds of metrics, not only those related to resource usage (e.g. security-related ones).

The monitoring stack could also be extended with an alerting feature. This could be implemented by defining a set of rules in Prometheus for crossing given thresholds, and by deploying an Alertmanager instance in Kubernetes. The Alertmanager could be configured with, for example, an email-type receiver that points to the address of an administrator or operator. When an alert fires in Prometheus, it would be forwarded to the Alertmanager, which - after a given pending interval, if the alert is still firing - would notify the receiver about the problem.
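As an illustration, a threshold rule on one of the exporter's metrics and a matching e-mail receiver could be sketched like this; the metric, the threshold and the addresses are placeholders, not part of the implemented system:

# Prometheus rule file
groups:
  - name: ebpf-exporter-alerts
    rules:
      - alert: HighPageCacheActivity
        expr: rate(ebpf_exporter_page_cache_ops_total[5m]) > 100
        for: 5m
        labels:
          severity: warning

# Alertmanager configuration
route:
  receiver: ops-mail
receivers:
  - name: ops-mail
    email_configs:
      - to: operator@example.com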

User space monitoring

There are also many more possibilities in using the user space equivalents of kprobes and tracepoints: uprobes and USDT. One interesting example of using them is the tracing of a MySQL server [21]. With these kinds of probes, application-level monitoring could be implemented inside Kubernetes.

Other directions

A different way of implementing a cluster monitoring solution involving BPF would have been to use BPFtrace. For this method, kubectl trace, which is based on BPFtrace, could have been used as well.

Investigating the differences and similarities between BCC and BPFtrace would also be an interesting topic: what can and cannot be implemented with BPFtrace that is possible with BCC, and how their capabilities relate to one another.

7. Summary

As part of this thesis, I have presented what the Berkeley Packet Filter is: how it started, what it has become and what capabilities it holds for the future. BPF has been part of the Linux kernel for a while now, but it has long been more than just a tool for packet filtering. It has widened the possibilities of high-performance monitoring by allowing tracing and filtering programs to be written in user space and run inside the kernel. It still provides networking functionality and also serves as a great tool for debugging in both user and kernel space.

I have also evaluated which methods are available for writing BPF programs that can be executed in the in-kernel virtual machine, along with its restrictions and security considerations. From the available toolkits, I selected the most popular ones from IOVisor to try out: the BPF Compiler Collection (BCC) and BPFtrace. They both provide tools and helper functions that make writing efficient BPF programs somewhat easier than writing them raw. In BCC, the BPF part of the program has to be written in C and the frontend part in Python or Lua, while BPFtrace is a simplified tool that saves a lot of work with its one-liners. I chose BCC for the tooling of the implementation part of this project.

I have inspected popular tools and applications that make use of BPF, and also surveyed how some of the largest IT companies utilize its capabilities today and where they are headed with BPF in the future.

I was curious to see what eBPF can be used for in a cloud environment. It has turned out that it serves as a great aid for monitoring all kinds of resources in distributed services. As a demonstration, I have implemented a test monitoring stack in a Kubernetes cluster on Google Cloud Platform and written BPF programs which I then deployed in it. I had to learn how Prometheus and its exporters work, because I needed them for aggregating the collected metrics from the cluster. I wrote a BPF program that traces TCP connection states and used another one for inspecting memory management kernel functions, then deployed both of them in a Prometheus exporter on every node. The metrics from the exporter are pulled by Prometheus, where I created different queries over them, which are visualized via graphs in Grafana. To see whether the BPF programs and the whole flow work correctly, I tested them using two different methods - for one of them I wrote a test program to generate the desired load - then analysed and evaluated the results.

Bibliography

[1] 0pointer.net. Ip accounting and access lists with systemd (2019 april 25.). http: //0pointer.net/blog/ip-accounting-and-access-lists-with-systemd.html.

[2] Gabriele Petronella (Blog.builgo.io). A tour of abstract syntax trees (2019 april 21.). https://blog.buildo.io/a-tour-of-abstract-syntax-trees-906c0574a067.

[3] Bootlin.com. Bootlin: Elixir cross referencer - blk_start_request kernel function documentation (2019 october 28.). https://elixir.bootlin.com/linux/v4.20/ident/blk_start_request.

[4] Tim Carstens. Programming with pcap (2019 march 10.). https://www.tcpdump. org/pcap.html.

[5] Gilberto Bertin (Cloudflare). Integrating xdp into our ddos mitigation pipeline (2019 april 6.). https://www.netdevconf.org/2.1/slides/apr6/bertin_Netdev-XDP. pdf.

[6] Compilertools.net. The lex & yacc page (2019 april 21.). http://dinosaur. compilertools.net/.

[7] Bootlin Elixir cross referencer. Sock_example ebpf code - the linux kernel source (2019 november 6.). https://elixir.bootlin.com/linux/v5.0/source/samples/ bpf/sock_example.c.

[8] Nvd.nist.gov National Vulnerability Database. Cve-2017-16995 vulnerability details (2019 may 16.). https://nvd.nist.gov/vuln/detail/CVE-2017-16995.

[9] Kubernetes documentation. Jobs - run to completion (2019 november 4.). https://kubernetes.io/docs/concepts/workloads/controllers/ jobs-run-to-completion/.

[10] LLVM.org LLVM Compiler Infrastructure Official Documentation. Writing an LLVM backend (2019 april 21.). https://releases.llvm.org/7.0.0/docs/WritingAnLLVMBackend.html.

[11] DPDK.org. Generic receive offload library (2019 december 15.). https://doc.dpdk. org/guides/prog_guide/generic_receive_offload_lib.html.

[12] CloudFlare ebpf_exporter GitHub. ebpf overhead benchmark (2019 november 2.). https://github.com/cloudflare/ebpf_exporter/tree/master/benchmark.

[13] Facebookincubator. Katran - high performance load balancer github repository (2019 may 13.). https://github.com/facebookincubator/katran.

[14] Cloud Native Computing Foundation. Sustaining and integrating open source technologies (2019 may 23.). https://www.cncf.io.

[15] OISF The Open Information Security Foundation. The suricata github repository (2019 may 13.). https://github.com/OISF/suricata.

[16] Bob Cotton, Freshtracks.io. A deep dive into kubernetes metrics — part 6: kube-state-metrics (2019 november 2.). https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-6-kube-state-metrics-14f4e7c8710b.

[17] Cilium GitHub. Cilium - api-aware networking and security using bpf and xdp (2019 march 21.). https://github.com/cilium/cilium.

[18] Cloudflare GitHub. Prometheus exporter for custom ebpf metrics. (2019 november 1.). https://github.com/cloudflare/ebpf_exporter.

[19] CoreOS GitHub. Flannel - a simple and easy layer-3 network fabric designed for kubernetes (2019 october 7.). https://github.com/coreos/flannel.

[20] Google GitHub. Gcloud command-line tool overview (2019 october 5.). https:// cloud.google.com/sdk/gcloud/.

[21] IOVisor BCC GitHub. mysqld_query: Trace mysql server queries. example of usdt tracing. for linux, uses bcc, bpf. embedded c. (2019 november 7.). https: //github.com/iovisor/bcc/blob/2479fbf1d3bc62a3170b2b289a49fb19972078c3/ examples/tracing/mysqld_query.py.

[22] Linus Torvalds' GitHub. The bpf assembler (2019 may 25.). https://github.com/torvalds/linux/blob/master/tools/bpf/bpf_asm.c.

[23] Google. Google container registry documentation (2019 october 6.). https://cloud. google.com/container-registry/docs/.

[24] GrafanaLabs. Grafana - the open platform for beautiful analytics and monitoring (2019 may 23.). https://grafana.com.

[25] Brendan Gregg. Perf-tools github repository (2019 may 5.). https://github.com/ brendangregg/perf-tools.

[26] Hashicorp. Terraform - write, plan, and create infrastructure as code (2019 may 25.). https://www.terraform.io.

[27] IBM. Cloud computing: A complete guide - explore cloud computing and discover what it can bring to your enterprise (2019 november 20.). https://www.ibm.com/cloud/learn/cloud-computing.

[28] Open Source Security Inc. Grsecurity official webpage (2019 may 16.). https:// grsecurity.net/.

[29] Infradead.org. Seccomp bpf (secure computing with filters) (2019 march 18.). https://www.infradead.org/~mchehab/kernel_docs/userspace-api/seccomp_ filter.html.

[30] Iovisor. Bcc binary issue discussion (2019 october 3.). https://github.com/iovisor/ bcc/issues/2119.

[31] IOvisor. Bpf compiler collection github repository (2019 march 12.). https://github. com/iovisor/bcc.

[32] IOvisor. Bpf features by linux kernel version (2019 march 31.). https://github.com/ iovisor/bcc/blob/master/docs/kernel-versions.md.

[33] Iovisor. Bpftrace github repository (2019 march 12.). https://github.com/iovisor/ bpftrace.

[34] IOvisor. Kubectl trace github repository (2019 april 27.). https://github.com/ iovisor/kubectl-trace.

[35] IOvisor.org. Extended berkeley packet filter documents (2019 december 12.). https: //www.iovisor.org/technology/ebpf.

[36] Brendan Gregg, Jason Koch with Martin Spier and Ed Hunter. Extending vector with ebpf to inspect host and container performance (2019 may 30.). https://medium.com/netflix-techblog/extending-vector-with-ebpf-to-inspect-host-and-container-performance-5da3af4c584b.

[37] Kernel.org. Chapter 10 - page frame reclamation (2019 october 30.). https://www. kernel.org/doc/gorman/html/understand/understand013.html.

[38] Kernel.org. ftrace - function tracer documentation (2019 may 5.). https://www. kernel.org/doc/Documentation/trace/ftrace.txt.

[39] Kernel.org. Linux socket filtering aka berkeley packet filter (bpf) (2019 march 8.). https://www.kernel.org/doc/Documentation/networking/filter.txt.

[40] Stephen Hemminger Kernel.org. Index: iproute2/iproute2.git (2019 april 25.). https: //git.kernel.org/pub/scm/network/iproute2/iproute2.git/.

[41] Kubernetes.io. Kubernetes - production-grade container orchestration (2019 may 23.). https://kubernetes.io.

[42] Kubernetes.io. Share process namespace between containers in a pod (2019 april 30.). https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/.

[43] Kubernetes.io. Using ebpf in kubernetes (2019 march 21.). https://kubernetes.io/ blog/2017/12/using-ebpf-in-kubernetes/.

[44] Kaspersky lab. What is a ddos attack? - ddos meaning (2019 april 2.). https: //www.kaspersky.com/resource-center/threats/ddos-attacks.

[45] Linux.die.net. iptables(8) - linux (2019 may 25.). https://linux.die. net/man/8/iptables.

[46] Linuxvirtualserver.org. The netfilter module for kernel 2.6 (2019 april 2.). http://www.linuxvirtualserver.org/software/ipvs.html.

[47] LLVM.org. End-user features - fast compiles and low memory use (2019 april 9.). http://clang.llvm.org/features.html#performance.

[48] LLVM.org. Llvm compiler infrastructure (2019 march 12.). https://llvm.org/.

[49] Man7.org. Overview of linux capabilities (2019 may 16.). http://man7.org/linux/ man-pages/man7/capabilities.7.html.

[50] Jose E. Marchesi. [patch v6 00/11] ebpf support for gcc (2019 october 30.). https: //gcc.gnu.org/ml/gcc-patches/2019-08/msg01987.html.

[51] Mark Aiken, Manuel Fahndrich, Chris Hawblitzel, Galen Hunt, Jim Larus (Microsoft). Deconstructing process isolation - acm sigplan workshop on memory systems performance and correctness, october 2006 (2019 april 17.). https://www.microsoft.com/en-us/research/publication/deconstructing-process-isolation/.

[52] Steven McCanne and Van Jacobson. The bsd packet filter: A new architecture for user-level packet capture (2019 march 14.). http://www.tcpdump.org/papers/ bpf-usenix93.pdf.

[53] Hari Pulapaka Microsoft. Dtrace on windows (2019 april 21.). https://techcommunity.microsoft.com/t5/Windows-Kernel-Internals/ DTrace-on-Windows/ba-p/362902.

[54] Netfilter.org. Ip sets (2019 april 2.). http://ipset.netfilter.org/.

[55] Netfilter.org. Netfilter - firewalling, nat, and packet handling for linux (2019 april 2.). https://www.netfilter.org/.

[56] Netronome.com. Agilio cx smartnics (2019 april 17.). https://www.netronome.com/ products/agilio-cx/.

[57] Ranjeeth Dasineni, Nikita Shirokov. Open-sourcing katran, a scalable network load balancer (2019 may 13.). https://code.fb.com/open-source/open-sourcing-katran-a-scalable-network-load-balancer/.

[58] Tcpdump.org official webpage. Manpage of tcpdump (2019 march 8.). https://www. tcpdump.org/manpages/tcpdump.1.html.

[59] Amrata Joshi Packpub.com. Oracle introduces patch series to add ebpf support for gcc (2019 october 30.). https://hub.packtpub.com/ oracle-introduces-patch-series-to-add-ebpf-support-for-gcc/.

[60] IOVisor project. Xdp - express data path (2019 march 17.). https://www.iovisor. org/technology/xdp.

[61] Prometheus. Exporters and integrations (2019 october 9.). https://prometheus.io/ docs/instrumenting/exporters/.

[62] Prometheus. Prometheus querying functions (2019 october 26.). https:// prometheus.io/docs/prometheus/latest/querying/functions/.

[63] Prometheus.io. Prometheus - from metrics to insight (2019 may 23.). https:// prometheus.io.

[64] Nic Viljoen & Jakub Kicinski (Netronome Systems). ebpf offload to smartnics: cls_bpf and xdp (2019 april 17.). https://netdevconf.org/1.2/papers/eBPF_HW_OFFLOAD. pdf.

[65] Techtarget.com. Definition - data-link layer (2019 march 8.). https:// searchnetworking.techtarget.com/definition/Data-Link-layer.

[66] TechTerms.com. Technical terms: Userspace definition (2019 may 25.). https:// techterms.com/definition/user_space.

[67] Linus Torvalds. ebpf sample programs (2019 march 14.). https://github.com/ torvalds/linux/tree/v4.19/samples/bpf.

[68] VMWare. The p4c-xdp github repository (2019 may 13.). https://github.com/ vmware/p4c-xdp.

[69] William Tu (Open vSwitch). Offloading ovs flow processing using ebpf (2019 april 6.). http://www.openvswitch.org//support/ovscon2016/7/1120-tu.pdf.

[70] WeaveWorks. Weave scope - troubleshooting & monitoring for docker & kubernetes (2019 april 30.). https://github.com/weaveworks/scope.

[71] Haibin Michael Xie. Scale kubernetes to support 50,000 services (2019 april 2.). https://www.slideshare.net/LCChina/scale-kubernetes-to-support-50000-services.

Appendices

A.1 Filelife

Listing A.1.1: The code of the filelife BCC python-script (continues on the next listing) #!/usr/bin/python # @lint-avoid-python-3-compatibility-imports # USAGE: filelife [-h] [-p PID] # Copyright 2016 Netflix, Inc. # Licensed under the Apache License, Version 2.0 (the "License") # # 08-Feb-2015 Brendan Gregg Created this. # 17-Feb-2016 Allan McAleavy updated for BPF_PERF_OUTPUT from __future__ import print_function from bcc import BPF import argparse from time import strftime

# arguments examples = """examples: ./filelife # trace all stat() syscalls ./filelife -p 181 # only trace PID 181 """ parser = argparse.ArgumentParser( description="Trace stat() syscalls", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=examples) parser.add_argument("-p", "--pid", help="trace this PID only") parser.add_argument("--ebpf", action="store_true", help=argparse.SUPPRESS) args = parser.parse_args() debug = 0

# define BPF program bpf_text = """ #include #include #include struct data_t { u32 pid ; u64 delta ; char comm[TASK_COMM_LEN]; char fname[DNAME_INLINE_LEN]; }; //...

85 Listing A.1.2: The second part of the code of the filelife BCC python-script //... BPF_HASH(birth, struct dentry *); BPF_PERF_OUTPUT(events); // trace file creation time int trace_create(struct pt_regs *ctx, struct inode *dir, struct dentry *dentry) { u32 pid = bpf_get_current_pid_tgid(); FILTER u64 ts = bpf_ktime_get_ns(); birth.update(&dentry, &ts); return 0; }; // trace file deletion and output details int trace_unlink(struct pt_regs *ctx, struct inode *dir, struct dentry *dentry) { struct data_t data = {}; u32 pid = bpf_get_current_pid_tgid(); FILTER u64 *tsp, delta; tsp = birth.lookup(&dentry); if (tsp == 0) { return 0; // missed create } delta = (bpf_ktime_get_ns() - *tsp) / 1000000; birth.delete(&dentry); struct qstr d_name = dentry->d_name; if (d_name.len == 0) return 0; if (bpf_get_current_comm(&data.comm, sizeof(data.comm)) == 0) { data.pid = pid; data.delta = delta; bpf_probe_read(&data.fname, sizeof(data.fname), d_name.name); } events.perf_submit(ctx, &data, sizeof(data)); return 0; } """ if args.pid: bpf_text = bpf_text.replace(’FILTER’, ’if (pid != %s) { return 0; }’ % args.pid) else : bpf_text = bpf_text.replace(’FILTER’, ’’) if debug or args.ebpf: print(bpf_text) if args.ebpf: exit ()

# initialize BPF b = BPF(text=bpf_text) b.attach_kprobe(event="vfs_create", fn_name="trace_create") # newer kernels (say, 4.8) may don’t fire vfs_create, so record (or overwrite) # the timestamp in security_inode_create(): b.attach_kprobe(event="security_inode_create", fn_name="trace_create") b.attach_kprobe(event="vfs_unlink", fn_name="trace_unlink")

//...

86 Listing A.1.3: The third part of the code of the filelife BCC python-script //... # header print("%-8s %-6s %-16s %-7s %s" % ("TIME", "PID", "COMM", "AGE(s)", "FILE")) # process event def print_event(cpu, data, size): event = b["events"].event(data) print("%-8s %-6d %-16s %-7.2f %s" % (strftime("%H:%M:%S"), event.pid, event.comm.decode(’utf-8’, ’replace’), float(event.delta) / 1000, event.fname.decode(’utf-8’, ’replace’))) b["events"].open_perf_buffer(print_event) while 1: try : b.perf_buffer_poll() except KeyboardInterrupt: exit ()

A.2 Ebpf-exporter Dockerfile

Listing A.2.1: The Dockerfile for the ebpf-exporter FROM ubuntu:18.04

RUN apt-get update && apt-get install -y sudo gnupg2 lsb-release ca-certificates && \ apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4052245BD4284CDD && \ echo "deb https://repo.iovisor.org/apt/$(lsb_release -cs) $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/iovisor.list && \ apt-get update && \ apt-get install -y wget gcc git bcc-tools libbcc-examples linux-headers-$(uname -r)

RUN wget https://dl.google.com/go/go1.13.1.linux-amd64.tar.gz \ && tar -C /usr/local -xzf go1.13.1.linux-amd64.tar.gz

ENV PATH=$PATH:/usr/local/go/bin

RUN go get -u -v github.com/cloudflare/ebpf_exporter/... RUN echo $PATH

COPY exporter-configs.yaml /usr/share WORKDIR /root/go EXPOSE 9435 CMD ["./bin/ebpf_exporter", "--config.file=/usr/share/exporter-configs.yaml"]

A.3 The yaml file configuring the ebpf-exporter daemonset

Listing A.3.1: The first part of the yaml file configurating the Kubernetes daemonset resource for the ebpf-exporter apiVersion: apps/v1 kind: DaemonSet metadata : name: ebpf-exporter-ds namespace: monitoring labels : k8s-app: ebpf-exporter-monitoring spec : selector : matchLabels: # Label selector that determines which Pods belong to the DaemonSet name: ebpf-exporter template : metadata : labels : name: ebpf-exporter # This annotation lets Prometheus know to scrape its pods annotations: prometheus.io.scrape: "true" spec : # Node label selector could determine on which nodes Pod should be scheduled # nodeSelector: # type: some-type hostNetwork: true containers: - name: ebpf-exporter image: eu.gcr.io/ebpf-211311/ebpf-exporter:latest ports : - containerPort: 9435 protocol: TCP resources : limits : memory: 200Mi requests : cpu : 100 m memory: 200Mi volumeMounts: - name: varlog mountPath: /var/log - name: debug mountPath: /sys/kernel/debug - name : src mountPath: /usr/src readOnly: true - name: modules mountPath: /lib/modules readOnly: true - name: varlibdockercontainers mountPath: /var/lib/docker/containers readOnly: true securityContext: privileged: true //...

88 Listing A.3.2: The second part of the yaml file configurating the Kubernetes daemonset resource for the ebpf-exporter //... terminationGracePeriodSeconds: 30 volumes : - name: varlog hostPath : path: /var/log - name: varlibdockercontainers hostPath : path: /var/lib/docker/containers - name: debug hostPath : path: /sys/kernel/debug - name: modules hostPath : path: /lib/modules - name : src hostPath : path: /usr/src

A.4 The yaml file of the configMap for Prometheus

Listing A.4.1: The yaml file describing a ConfigMap Kubernetes resource for configurating Prometheus apiVersion: v1 kind: ConfigMap metadata : name: prometheus-server-conf labels : name: prometheus-server-conf namespace: monitoring data : prometheus.rules: |- groups : - name: Ebpf-exporter alerts rules : # Just a dummy rule for illustration - alert: High Pod Memory expr: sum(container_memory_usage_bytes) > 1 for : 1m labels : severity: slack annotations: summary: High Memory Usage prometheus.yml: |- global : scrape_interval: 15s # Scrape targets every 15 seconds. evaluation_interval: 15s # Evaluate rules every 15 seconds. # Attach these extra labels to all timeseries # collected by this Prometheus instance. external_labels: monitor: ’ebpf-monitor’ rule_files: - ’prometheus.rules’ scrape_configs: # Scrape the pods (for daemon sets) - job_name: ’kubernetes-pods’ kubernetes_sd_configs: - role : pod # The following annotations was specified on the pods of the # ebpf-exporter daemonset: # prometheus.io.scrape: true relabel_configs: # This config filters the pods that has the annotation - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] action: keep regex : true - source_labels: [__meta_kubernetes_pod_namespace, __meta_kubernetes_pod_label_name] separator: ’/’ target_label: job - source_labels: [__meta_kubernetes_pod_node_name] target_label: node
