DEGREE PROJECT IN ELECTRICAL ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2016

Linux Kernel Packet Transmission Performance in High-speed Networks

CLÉMENT BERTIER

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
Kungliga Tekniska Högskolan

Master thesis

Linux Kernel packet transmission performance in high-speed networks

Clément Bertier

August 27, 2016

Abstract

The Linux Kernel protocol stack is getting more and more additions as time goes by. As new technologies arise, more functions are implemented, which might result in a certain amount of bloat. However, new methods have been added to the kernel to circumvent common throughput issues and to maximize overall performance under certain circumstances. To assess the ability of the kernel to produce packets at a given rate, we will use the pktgen tool. Pktgen is a loadable kernel module dedicated to UDP-based traffic generation. Its philosophy is to sit at a low position in the kernel protocol stack to minimize the amount of overhead caused by the usual APIs. As measurements are usually done in packets per second rather than bandwidth, the UDP protocol makes perfect sense to minimize the time spent creating a packet. It has several options which will be investigated, and for further insight its transmission algorithm will be analysed. But software is not just a compiled piece of code, it is a set of instructions run on top of hardware. And this hardware may or may not comply with the design of one's software, making the execution slower than expected or, in extreme cases, not functional at all. This thesis aims to investigate the maximum capabilities of Linux packet transmission in high-speed networks, e.g. 10 or 40 Gigabit per second. To go deeper into the understanding of the kernel behaviour during transmission we will use profiling tools, such as perf and the newly adopted eBPF framework.

Abstract

Linux Kernel protokollstacken blir fler och fler tillägg som tiden går. Som ny teknik uppstår, fler funktioner har genomförts och kan leda till en viss mängd svälla. Men nya metoder har lagts till kärnan för att kringgå vanliga genomströmningproblem och att maximera den totala föreställningar, med tanke på vissa omständigheter. Att fastställa förmågan hos kärnan för att producera paket med en given hastighet, kommer vi att använda pktgen verktyget. Pktgen är en laddbar kärnmodul tillägnad trafikgeneration baserad på UDP. Dess filosofi var att vara i en låg position i kärnan protokollstacken för att minimera mängden av overhead orsakad av vanliga API:er. Som mätningarna görs vanligtvis i paket per sekund i stället för bandbredd, gör UDP-protokollet vettigt att minimera mängden tid på att skapa ett paket. Det har flera alternativ som kommer att undersökas, och för ytterligare insikter sin sändningsalgoritmen kommer att analyseras. Men en programvara är inte bara en kompilerad bit kod, är det en uppsättning instruktioner sprang ovanpå hårdvara. Och den här maskinvaran kan eller inte kan följa med utformningen av en programvara, vilket gör utförandet långsammare än väntat eller i extrema fall även fungerar inte. Denna avhandling syftar till att undersöka de maximala kapacitet Linux paketsändningar i höghastighetsnät, t.ex. 10 gigabit eller 40 Gigabit. För att gå djupare in i förståelsen av kärnan beteende under överföringen kommer vi att använda profilverktyg, som perf och det nyligen antagna ramen eBPF.

Contents

1 Introduction
  1.1 Problem
  1.2 Methodology
  1.3 Goal
  1.4 Sustainability and ethics
  1.5 Delimitation
  1.6 Outline

2 Background
  2.1 Computer hardware architecture
    2.1.1 CPU
    2.1.2 SMP
    2.1.3 NUMA
    2.1.4 DMA
    2.1.5 Ethernet
    2.1.6 PCIe
    2.1.7 Networking terminology
  2.2 Linux
    2.2.1 OS Architecture design
    2.2.2 /proc pseudo-filesystem
    2.2.3 Socket Buffers
    2.2.4 xmit_more API
    2.2.5 NIC drivers
    2.2.6 Queuing in the networking stack
  2.3 Related work – Traffic generators
    2.3.1 iPerf
    2.3.2 KUTE
    2.3.3 PF_RING
    2.3.4 Netmap
    2.3.5 DPDK
    2.3.6 Moongen
    2.3.7 Hardware solutions
  2.4 Pktgen
    2.4.1 pktgen flags
    2.4.2 Commands
    2.4.3 Transmission algorithm
    2.4.4 Performance checklist
  2.5 Related work – Profiling
    2.5.1 perf
    2.5.2 eBPF

3 Methodology
  3.1 Data yielding
  3.2 Data evaluation
  3.3 Linear statistical correlation

4 Experimental setup
  4.1 Speed advertisement
  4.2 Hardware used
    4.2.1 Machine A – KTH
    4.2.2 Machine B – KTH
    4.2.3 Machine C – Ericsson
    4.2.4 Machine D – Ericsson
  4.3 Choice of
  4.4 Creating a virtual development environment
  4.5 Empirical testing of settings
  4.6 Creation of an interface for pktgen
  4.7 Enhancing the system for pktgen
  4.8 pktgen parameters clone conflict

5 eBPF Programs with BCC
  5.1 Introduction
  5.2 kprobes
  5.3 Estimation of driver transmission function execution time

6 Results
  6.1 Settings tuning
    6.1.1 Influence of kernel version
    6.1.2 Optimal pktgen settings
    6.1.3 Influence of ring size
  6.2 Evidence of faulty hardware
  6.3 Study of the packet size scalability
    6.3.1 Problem detection
    6.3.2 Profiling with perf
    6.3.3 Driver latency estimation with eBPF

7 Conclusion
  7.1 Future work

A Bifrost install
  A.1 How to create a bifrost distribution
  A.2 Compile and install a kernel for bifrost

B Scripts

C Block diagrams

List of Figures

2.1  Caches location in a 2-core CPU
2.2  Theoretical limits of the link according to packet size on a 10G link
2.3  Theoretical limits of the link according to packet size on a 40G link
2.4  Tux, the mascot of Linux
2.5  Overview of the kernel [4]
2.6  How pointers are mapped to retrieve data within the socket buffer [18]
2.7  Example of a shell command to interact with pktgen
2.8  pktgen transmission algorithm
2.9  Example of call-graph generated by perf record -g foo [38]
2.10 Assembly code required to filter packets on eth0 with tcp ports 22
3.1  Representation of the methodology algorithm used
3.2  Pearson product-moment correlation coefficient formula
4.1  Simplification of block diagram of the S7002 motherboard configuration [46, p. 19]
4.2  Simplification of block diagram of the ProLiant DL380 Gen9 motherboard configuration
4.3  Simplification of block diagram of the S2600IP [47] motherboard configuration
4.4  Simplification of block diagram of the S2600CWR [48] motherboard configuration
4.5  Output using the --help parameter on the pktgen script
6.1  Benchmarking of different kernel versions under bifrost (Machine A)
6.2  Performance of pktgen on different machines according to burst variance
6.3  Influence of ring size and burst value on the throughput
6.4  Machine C parameter variation to amount of cores
6.5  Machine C bandwidth test with MTU packets
6.6  Throughput to packet size, in million of packets per second
6.7  Throughput to packet size, in Mbps
6.8  Superposition of the amount of cache misses and the throughput "sawtooth" behaviour
C.1  Block diagram of motherboard Tyan S7002
C.2  Block diagram of the motherboard S2600IP
C.3  Block diagram of the motherboard S2600CW
C.4  Patch proposed to fix the burst anomalous cloning behaviour

List of Tables

2.1  PCIe speeds
2.2  Flags available in pktgen
6.1  Comparison of throughput with eBPF program

Chapter 1

Introduction

Throughout the evolution of network interface cards towards high speeds such as 10, 40 or even 100 Gigabit per second, the amount of packets to handle on a single interface has increased drastically. Whilst the enhancement of the NIC is the first step for a system to handle more traffic, there is an inherent consequence to it: the remainder of the system must be capable of handling the same amount of traffic. We are in an era where the bottleneck of the system is shifting towards the CPU [1], due to a more and more bloated protocol stack. To ensure the capability of the operating system to produce or receive a given amount of data, we need to assess it through the help of network testing tools. There are two main categories of network testing tools: software and hardware based. Hardware network testing tools are usually seen as accurate, reliable and powerful in terms of throughput [2], but expensive nonetheless. While software-based testing might in fact be less trustworthy than hardware-based, it has the tremendous advantage of malleability. Modifying the behaviour of the software (e.g. after a protocol update) is easily realized; on the other hand it is not only complex in the case of hardware but also likely to increase the price of the product [3], and usually impossible for the consumer to tamper with, as such testers are commonly proprietary products. Neither approach is inherently better: they are two different approaches to the same problem, and hence testing a system from both perspectives, if possible, is recommended. However, in this document we will focus solely on software testing, as we did not have specialised hardware at our disposal.

The Linux operating system will be used to conduct this research as it is fully open-source and recent additions aiming to enable high performance have been developed for it. It is based on a monolithic-kernel design, meaning the OS can be seen as split into two parts: kernel-space and user-space [4]. The kernel-space is a contiguous chunk of memory in which everything related to the hardware is handled, as well as core system functions, for instance process scheduling. The user-space is where regular user programs are executed; they have much more freedom, as they ignore the underlying architecture and access it through system calls: secure interfaces to the kernel. The issue in this model for a software-based network tool is the trade-off regarding the level at which the software will be located: a user-space network testing program is likely to be slowed down by the numerous system calls it must perform, and has no control over the path the packet is going to take through the stack. A kernel-space network testing program will be faster but much more complex to design, as the rules within the kernel are paramount to its stability: since it is a single address space, any part of the kernel can call any other function located in the kernel. This paradigm can result in disastrous effects on the system if not manipulated cautiously.

As we require high performance to achieve line rate, we will therefore use a kernel-space network testing tool: pktgen [5]. It is a purely packet-oriented traffic generator which does not mimic actual protocol behaviour, located at the lowest level of the stack, allowing minimum overhead. Its design allows it to take full advantage of the symmetric multiprocessing capabilities found on commodity hardware nowadays, enabling parallelization of tasks by having dedicated queues for each CPU. Due to the overhead caused by the treatment of each packet, we will orient our research towards performance with minimum-sized packets, which is also the common practice within network testing. However, MTU-sized packets are a good way to benchmark the ability of a system to handle a maximum

amount of throughput, as smaller-sized packets should never yield a higher throughput given the same parameters. A notable advantage of pktgen is the fact that the module is part of the official kernel and therefore does not require any further installation, and can be found in all common distributions. Having a low-level traffic generator is not enough to tell if the system is correctly optimized, since it does not always reveal the bottleneck of the system. To go deeper into the performance we must get a profile: an overview of the current state of the system. In order to perform such investigations we will use perf events [6], a framework to monitor the performance of the kernel by monitoring well-known bottleneck functions or hardware events likely to reveal problems, and outputting a human-readable summary. To complete the profiling done by perf, we will use the extended Berkeley Packet Filter, aka eBPF [7]. It is an in-kernel VM which is supposedly secure (e.g. does not crash, always terminates) thanks to strict verification of the code before it is executed. It can be hooked onto functions and will be used to monitor certain points of interest by running small programs inside the kernel and reporting their results to user-space.

1.1 Problem

While the speed of network interface cards increases, Linux's protocol stack is also gaining more additions, for instance to implement new protocols or to enhance already-existing features. More packets to treat, as well as more instructions per packet, intrinsically result in heavier CPU loads. However, some countermeasures have been introduced to mitigate the performance loss due to outdated designs in certain parts of the kernel: for instance NAPI [8], which reduces the flow of interrupts by switching to a polling mode when overwhelmed, or the recently added xmit_more API [9], which allows bulking of packets to defer the usual per-packet actions to groups of packets. Considering all the recent improvements, can the vanilla kernel scale up to performance high enough to saturate 100G links?

We will assess the kernel's performance at the lowest possible level to minimize overhead and therefore allow maximum packet throughput, hence the use of pktgen. It is important to understand that pktgen's results will not reflect any kind of realistic behaviour, as its purpose is to determine the performance of a system by doing aggressive packet transmission, and the absence of overhead is the key to its functionality: it has to be seen as a tool to reveal underlying problems rather than a model of regular protocol stack overhead. In other words, it is the first step in verifying a system's transmission abilities and should therefore be seen as an upper bound to real-life transmission. This implies that if the results fall below the maximum NIC speed, actual transmission scenarios cannot exceed that result either. The follow-up question being: can pktgen's performance scale up to 100G-link saturation?

Ideally the performance indicated by pktgen should be double-checked, meaning a second method is needed to testify to the accuracy of pktgen's reported performance. Hence we will use eBPF as a way to bind a program onto the NIC driver's transmission function in order to measure the throughput. Can eBPF correctly quantify the amount of outgoing packets, knowing that each call potentially lasts on the order of nanoseconds? If so, do the measured performances match pktgen's results?

We hypothesize that with the current technologies added to the kernel we will be able to reach line rate at 100G with minimum-sized packets using the pktgen module, given proper hardware.

1.2 Methodology

We will use an empirical approach throughout this thesis. The harvesting method will consist of running pktgen while modifying certain parameters to assess the impact of each given parameter on the final result. This will be done by iterating over the parameters with a stepping big enough to finish within a reasonable amount of time but small enough to pinpoint any important change in the results. The value of the stepping will require prior tuning. Each experiment, in order to assert its validity, has to be run several times and on different machines with similar software configuration. To make the results

human-readable and concise, they will be processed into relevant figures, comparing the parameters which were adjusted with their related pktgen experiment results. To realize the performance assessment, the following experiments will be performed in order:

• Draw a simple baseline with straightforward parameters.

• Verify whether the kernel version is improving or downgrading the performances, and select the best-suited one for the rest of the experiments.

• Assess the performance of the packet bulking technique through the xmit_more API option of pktgen, and verify whether it improves the packet throughput.

• Tamper with the size of the NIC’s buffer as an attempt to increase the performance of packet bulking.

• Find the optimal performance parameter of pktgen.

We will also be monitoring certain metrics through profiling frameworks which are not guaranteed to be directly correlated with the experiment. To test the linear correlation of two variables (i.e. an experiment result and a monitored metric) we will use a complementary statistical analysis through the help of the Pearson product-moment correlation coefficient.
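For reference, the coefficient can be computed directly over two series of measurements. The helper below is a small illustrative sketch in C written for this report, not part of any existing tool:

#include <math.h>
#include <stddef.h>

/* Pearson product-moment correlation coefficient of two series x and y of
 * length n. Returns a value in [-1, 1]; values close to +1 or -1 indicate a
 * strong linear correlation, values close to 0 indicate none. */
static double pearson(const double *x, const double *y, size_t n)
{
    double sx = 0, sy = 0;
    for (size_t i = 0; i < n; i++) { sx += x[i]; sy += y[i]; }
    double mx = sx / n, my = sy / n;

    double num = 0, dx2 = 0, dy2 = 0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mx, dy = y[i] - my;
        num += dx * dy;                 /* covariance term    */
        dx2 += dx * dx;                 /* variance of x      */
        dy2 += dy * dy;                 /* variance of y      */
    }
    return num / sqrt(dx2 * dy2);
}

A result close to +1 or -1 indicates a strong linear relation between the monitored metric and the measured throughput, while a result close to 0 indicates the absence of such a relation.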

1.3 Goal

As the performance reached by NICs is now extremely high, we need to know if the systems they are supposed to be used in are capable of handling the load without any extra products or libraries. The purpose is to understand whether or not 100Gb/s NICs are in fact of any use for vanilla Linux kernels. Therefore the goal is to provide a study of the current performance of the kernel.

1.4 Sustainability and ethics

As depicted above, the goal is to assess the ability of the kernel to output a maximum amount of packets. In other words, with perfect hardware and software tuning the system should be able to reach a certain value of packets per second, sent over a wire. Whilst this does not address the environmental aspect directly (e.g. power-saving capabilities are disabled in favour of performance), assessing the global tuning of the system will logically help to understand if a system is using more resources than it should, hence also indirectly assessing its power consumption. If an issue that somehow reduces the global throughput is revealed, it could imply that machines running under the same configuration also have to spend extra computing power to counteract the loss, which raises ecological issues on a larger scale.

1.5 Delimitation

To limit the length of the thesis and impose boundaries, we will solely focus on the transmission side. Only simple examples of the use of eBPF will be provided, to avoid going into too much detail about that framework. Regarding the bulking of packets, we will exclusively look into (packet) throughput performance, while in reality such an addition might in fact introduce some latency and therefore could create dysfunctions in latency-sensitive applications. The kernel should not be modified with extra libraries specialized in packet processing.

1.6 Outline

The thesis is divided as follows:

• Chapter 2 will provide a background on hardware, software, profiling and pktgen uses.

• Chapter 3 will explain the methodology used behind the experiments.

• Chapter 4 will summarize the experimental setup including:

– Detailed hardware description.
– Research behind the performance optimization.
– Practical description of the realization of the experiments and how they were exported into consequential data.
– How a prototype of an interface for pktgen was realized to standardize the results.

• Chapter 5 will be a brief introduction to BCC programming, presenting the structure to create programs with the framework.

• Chapter 6 will hold the most probing results from the experiments into graphical data and their associated analysis.

• Chapter 7 will conclude and wrap-up the results.

Chapter 2

Background

This section will be dedicated to providing the required knowledge to the reader to fully understand the results at the end of the thesis. Going into deep details of the system was necessary to interpret the results and hence a great part of this thesis was dedicated to understanding various software/hardware techniques and technologies. To do so we will follow a path divided in several sections:

• Firstly we will introduce technical terms related to hardware, as those factors will be investigated to give a deeper overview of the system. This will be done by examining different bottlenecks like the speed of a PCIe bus or the maximum theoretical throughput on an Ethernet wire.

• Secondly we will dig into the inner working of the Linux operating system, to mainly understand the global architecture of the system but also to provide insights of how the structures and different sections interact together to transmit a packet over the wire. This will include interaction with the hardware, hence a brief study of the drivers.

• Then we will review the related work on software traffic generation to compare the perks and drawbacks of each tool. Afterwards, a thorough study of the pktgen module will be realised, from its internal workings to the parameters with the most influence on throughput performance.

• Last but not least, there will be a brief introduction to profiling, which consists of tracing the system to assess its choke-points by analysing the amount of time spent executing functions. We will also explain how eBPF, an extended version of the Berkeley Packet Filter originally created for simple packet analysis, has become a fully functional in-kernel virtual machine that may now be used to investigate parts of the kernel by binding small programs to certain functions.

2.1 Computer hardware architecture

As we are going to introduce numerous terms that are closely acquainted with the hardware of the machine, this section will be there to clarify most of those to the reader.

2.1.1 CPU A CPU, or central processing unit, is the heart of the system as it executes all the instructions stored in the memory.

CPU Caches are a part of the CPU that stores data which is supposedly going to be needed again by the CPU. An entry in the cache table is called a cache line. When the CPU needs to access data, it first checks the cache, which is directly implemented inside the CPU. If the needed data is found, it is a hit, otherwise a miss. In case of a miss, the CPU must fetch the needed data from the main memory, making the whole process slower. In principle, the size of the cache needs to be small, for two reasons: first, it is implemented directly in the CPU, making the lack of space an issue; and secondly, the bigger the cache, the longer the lookup, therefore introducing latency inside the CPU.

Multi-level caches are a way to counteract the trade-off between cache size and lookup time. There are different levels of caches, which are all "on-chip", meaning on the CPU itself.

• The first-level cache, abbreviated L1 Cache, is small, fast, and the first one to be checked. Note that in real-life scenarios this cache is actually divided in two: one that stores instructions and one that stores data.

• The second-level cache, abbreviated L2 Cache, is bigger than the L1 cache, with about 8 to 10 times more storage space.

• The third and last level cache, abbreviated L3 Cache, is much larger than the L2 cache, however this characteristic varies vastly with the price of the CPU. This cache is not implemented in all brands of CPUs, however the ones that were used for this thesis did have it (cf. Methodology – Hardware used, 4.2). Moreover, L3 caches have the particularity of being shared between all the cores, which leads us to the notion of Symmetric Multiprocessing.


Figure 2.1: Caches location in a 2-core CPU.

Please note that Figure 2.1 is a simplification of the actual architecture.
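As an illustration of why cache behaviour matters for packet processing, the following user-space sketch (hypothetical, written for this report, not used in the experiments) touches the same amount of memory twice, once sequentially and once with a stride larger than a cache line; the strided walk misses the cache far more often and is typically several times slower:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)            /* 16 Mi ints, larger than any cache */

static double walk(volatile int *a, size_t stride)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t s = 0; s < stride; s++)          /* same total number of accesses */
        for (size_t i = s; i < N; i += stride)   /* but jumping across cache lines */
            a[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    int *a = calloc(N, sizeof(*a));
    if (!a)
        return 1;
    printf("sequential: %.2f s\n", walk(a, 1));   /* cache friendly              */
    printf("strided:    %.2f s\n", walk(a, 64));  /* 64 ints = 256 B per jump    */
    free(a);
    return 0;
}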

2.1.2 SMP Symmetric Multiprocessing involves two or more processing units on a single system which will run the same operating system, share a common memory and I/O devices, e.g. hard drives or network interface cards. The notion of SMP applies to both completely separate CPUs and CPUs that have several cores. The obvious aim of having such an architecture is benefiting from the parallelism of programs to maximize the speed of the overall tasks to be executed by the OS.

Hyperthreading is Intel's proprietary version of SMT (Simultaneous Multi-Threading), another technique to improve parallel thread execution, which adds logical cores on top of the physical ones.

2.1.3 NUMA

Non-Uniform Memory Access is a design in SMP architectures which states that CPUs should have dedicated spaces in memory which can be accessed much faster than the others due to their proximity. This is done by segmenting the memory and assigning a specific part of it to a CPU. CPUs are joined by a dedicated bus (called the QPI, for Quick Path Interconnect, on modern systems). The memory segment assigned to a specific CPU is called the local memory of that CPU. If it needs to access another part of the memory than its own, that part is designated as remote memory, since the CPU must go through a network of bus connections in order to access the requested data. This technique aims to mitigate the issue of memory access on an SMP architecture, as a single bus for all the CPUs is a latency bottleneck in modern system architecture [10]. A NUMA system is sub-divided into NUMA nodes, which represent the combination of a local memory and its dedicated CPU. With the help of the command lscpu one can view all the NUMA nodes that are present on a system. It also reports the latency of accessing one memory node from another.
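On such a system, memory placement can also be controlled explicitly from a program through the libnuma API. The fragment below is a minimal sketch (assuming libnuma is installed; link with -lnuma) of allocating a buffer backed by pages local to a chosen node:

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {           /* kernel or hardware without NUMA */
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int node = 0;                         /* target NUMA node */
    size_t len = 16 * 1024 * 1024;

    /* Memory allocated on the local bank of 'node'. */
    void *buf = numa_alloc_onnode(len, node);
    if (!buf)
        return 1;

    /* ... use buf, e.g. as a buffer kept close to the NIC's node ... */

    numa_free(buf, len);
    return 0;
}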

2.1.4 DMA

Direct Memory Access is a technique to avoid having the CPU intervene between an I/O device and the memory to copy data from one to the other. The CPU simply initiates the transfer with the help of a DMA controller, which then takes care of the transfer between the two entities. When the transfer is done, the device involved raises an interrupt in order to notify the OS, and therefore the CPU, that the operation has been completed and that the consequent actions should be taken, e.g. processing the packets in the case of a reception, or cleaning up the buffer memory in the case of a transmission.

2.1.5 Ethernet

Ethernet is the de-facto standard for layer-2 frame transmission and the one we will be using throughout this thesis. The minimum size of a frame in Ethernet was originally 64 bytes, due to the CSMA/CD technique being used on the link. The idea was to have a minimum time-slot, ensured by this fixed size, so that the time taken sending those bits on the wire would be enough for all stations within a maximum cable radius to hear the transmission of the frame before it ended. Therefore, if two stations started transmitting over the common medium (i.e. the wire), they would be able to detect the collision. When a collision happens, a jam sequence is sent by the station noticing it. Its aim is to make the CRC (located at the end of the frame) bogus, making the NIC discard the entire frame. The minimum packet size of 64 bytes makes sense in 10Mb/100Mb Ethernet networks, as the maximum length of the cable is respectively 2500 meters and 250 meters. However, if we push the same calculation to 1000Mb aka 1G Ethernet, the maximum length of 25 meters can be considered too small, not to mention 2.5 meters on 10G Ethernet. Whilst there are techniques in 1G Ethernet to extend the slot size while keeping the minimum frame size at 64 bytes, we will not consider them in this thesis as we will be using 10G Ethernet, which is full-duplex and therefore has no need for medium-sharing techniques. The 64-byte minimum will still be used as a standard.

In reality, when one sends a 64-byte packet on the wire, there are in total 84 bytes that can be counted per frame.

• A 64-byte frame composed of:

– a 14-byte MAC header: destination MAC, source MAC and packet type;
– a 46-byte payload, typically an IP packet with TCP or UDP on top of it;
– a 4-byte CRC at the end.

• An 8-byte preamble, for the sender and receiver to synchronise their clocks.

• 12 bytes of inter-frame gap. There is no actual transmission, but it is the required amount of bit-time that must be respected between frames.

Theoretical limit As shown above, for a 60-byte payload (including IP and TCP/UDP headers) we must in reality count 84 bytes on the wire. This implies that for a 10-Gigabit transmission we will have a maximum of:

Max = Bandwidth / Frame size = (10 × 10^9) / (84 × 8) = (10 × 10^9) / 672 ≈ 14 880 952 ≈ 14.88 × 10^6 frames per second

We can conclude that the maximum number of minimum-sized frames that can be sent over a 10G link is 14.88 million per second. By applying the same calculation to a 40G and a 100G link we find respectively 59.52 and 148.81 million per second.
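The same arithmetic generalizes to any link speed and frame size. The small stand-alone C program below (an illustrative sketch written for this report, not part of pktgen) reproduces the numbers above by adding the preamble and inter-frame gap to the frame size:

#include <stdio.h>

/* Theoretical packets-per-second limit of an Ethernet link.
 * frame_len is the frame size including the CRC (64..1518 bytes);
 * 8 bytes of preamble and 12 bytes of inter-frame gap are added on the wire. */
static double max_pps(double link_bps, unsigned frame_len)
{
    double wire_bits = (frame_len + 8 + 12) * 8.0;
    return link_bps / wire_bits;
}

int main(void)
{
    const double speeds[] = { 10e9, 40e9, 100e9 };
    for (unsigned i = 0; i < 3; i++)
        printf("%3.0fG link, 64-byte frames: %.2f Mpps\n",
               speeds[i] / 1e9, max_pps(speeds[i], 64) / 1e6);
    return 0;
}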

Figure 2.2: Theoretical limits of the link according to packet size on a 10G link.

Figure 2.3: Theoretical limits of the link according to packet size on a 40G link.

Figure 2.3 will be useful as a benchmark during our experiments, as it represents the upper bound.

2.1.6 PCIe

Peripheral Component Interconnect Express, usually called PCIe, is a type of bus used to attach components to a motherboard. It was developed in 2004 and as of 2016 its latest release is version 3.1, although only 3.0 products are available. A new 4.0 standard is expected in 2017. PCIe 3.0 (sometimes called Revision 3) is the most common type of bus found among high-speed NICs, because older standards are in fact too slow to provide the required bus speed to sustain 40 or even 10 Gigabit per second if the number of lanes is too small (see the next paragraphs).

Bandwidth To actually understand the speed of PCIe buses we must define the notion of "transfer", as the speed is given in "Gigatransfers per second" (GT/s) in the specification [11]. A transfer is the action of sending one bit on the channel; however it does not by itself specify the amount of useful data sent, because one needs the channel encoding to compute it. In other words, without the number of payload bits sent per transfer we cannot calculate the actual bandwidth of the channel. Leaving aside the complex design details, PCIe versions 1.0 and 2.0 use an 8b/10b encoding [11, p. 192]. This forces 10 bits to be sent for every 8 bits of data, implying an overhead of 1 − 8/10 = 20% for every transfer. The 3.0 revision uses a 128b/130b encoding, limiting the overhead to 1 − 128/130 ≈ 1.5%. Now that we know the encoding, we can calculate the bandwidth B per direction:

B = Transfers × (1 − overhead) × number of lanes

Table 2.1 holds the results of the bandwidth calculation. We highlighted the bandwidths compatible with 10G in blue and with 40G in red (10G being compatible with 40G). Using a bus whose bandwidth is lower than the theoretical throughput of the NIC will still function (provided the device fits the number of lanes), but packets will be throttled by the bus speed.

Version        1.1        2.0        3.0
Speed          2.5 GT/s   5 GT/s     8 GT/s
Encoding       8b/10b     8b/10b     128b/130b
Bandwidth 1x   2 Gb/s     4 Gb/s     7.88 Gb/s
Bandwidth 4x   8 Gb/s     16 Gb/s    31.50 Gb/s
Bandwidth 8x   16 Gb/s    32 Gb/s    63.01 Gb/s
Bandwidth 16x  32 Gb/s    64 Gb/s    126.03 Gb/s

Table 2.1: PCIe speeds
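The entries of table 2.1 follow directly from the formula above; the short sketch below (written for illustration only) reproduces them:

#include <stdio.h>

/* Usable PCIe bandwidth in Gb/s, per direction, for a given transfer rate
 * (GT/s), encoding efficiency and number of lanes. */
static double pcie_gbps(double gts, double efficiency, int lanes)
{
    return gts * efficiency * lanes;
}

int main(void)
{
    const int widths[] = { 1, 4, 8, 16 };
    for (unsigned i = 0; i < 4; i++)
        printf("x%-2d  gen1: %6.2f  gen2: %6.2f  gen3: %6.2f Gb/s\n",
               widths[i],
               pcie_gbps(2.5, 8.0 / 10.0,    widths[i]),   /* 8b/10b   */
               pcie_gbps(5.0, 8.0 / 10.0,    widths[i]),   /* 8b/10b   */
               pcie_gbps(8.0, 128.0 / 130.0, widths[i]));  /* 128b/130b */
    return 0;
}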

2.1.7 Networking terminology

DUT Device Under Test: the targeted device whose performance we aim to assess.

Throughput The throughput is the fastest rate at which the count of test frames transmitted by the DUT is equal to the number of test frames sent to it by the test equipment. [12]

2.2 Linux

The Linux operating system started in 1991 as a common effort, initiated by Linus Torvalds, to provide a fully open-source operating system. It is UNIX-based and the usage of Linux is between 1 and 5% of the global market, implying that it is scarcely used by end users. However this data is quite unreliable, as most companies or researchers rely on publicly available data, for instance the User-Agent header passed in an HTTP request, which can however be forged, or worldwide device shipments, which tend to be unreliable as well since most laptops will at least allow dual-booting a second OS. While not frequently used by the major part of the population, it is extremely popular in the server market; its stability, open-source code and constant updates make it a weapon of choice for most system administrators. Whilst it will be referred to as "Linux" in this document, the correct term would be GNU/Linux, as the operating system is a collection of programs on top of the Linux kernel and Linux depends on GNU software.

Figure 2.4: Tux, the mascot of Linux

2.2.1 OS Architecture design

Linux uses a monolithic kernel design [4, p. 7], meaning that it is loaded as a single binary image at boot, stored and run in a single address space. In other words, the base kernel is always loaded into one big contiguous area of real memory, whose real addresses are equal to its virtual addresses [13]. The main perk of such an architecture is the ability to run all needed functions and drivers directly from kernel space, making it fast. However it comes at the price of stability: as the whole kernel runs as a single entity, if there is an error in any subset of it, the stability of the system as a whole cannot be guaranteed.

Whilst such drawbacks could seem an impediment for the OS, monolithic kernels are not only mature nowadays but the almost-exclusive design used in industry. They stand in opposition to micro-kernels, which we will not detail as they are outside the scope of this study. It is however not realistic to talk about a "pure" monolithic kernel, as Linux actually has ways to dynamically load code inside kernel space, more precisely pre-compiled portions of code called loadable kernel modules, or LKMs. As this code cannot be loaded inside the same address space that the base kernel uses, its memory is allocated dynamically [13]. The flexibility offered by LKMs is absolutely crucial to Linux's malleability: if every component had to be loaded at boot, the size of the boot image would be colossal.

The operating system can be seen as being split into three parts: the hardware, which is obviously fixed, the kernel-space and the user-space. This segmentation makes sense when it comes to memory allocation, as explained above. The kernel-space is static and contiguous; it runs all the functions that interact directly with the hardware (drivers) and its code cannot change (unless the code being executed is an LKM). The user-space is much more free in its actions, as the memory it uses can be allocated dynamically, therefore making the loading and evolution of programs quite seamless. However, to interact with hardware, e.g. memory or I/O devices, user-space programs must go through system calls: functions that request a service from the kernel while abstracting away the underlying complexity.

Figure 2.5: Overview of the kernel [4]
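To give a concrete idea of the loadable kernel modules mentioned above, the listing below shows a minimal module, a classic hello-world sketch unrelated to pktgen; it only needs an init and an exit function:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

/* Called when the module is inserted (insmod / modprobe). */
static int __init hello_init(void)
{
    printk(KERN_INFO "hello: module loaded\n");
    return 0;                       /* a non-zero value would abort the load */
}

/* Called when the module is removed (rmmod). */
static void __exit hello_exit(void)
{
    printk(KERN_INFO "hello: module unloaded\n");
}

module_init(hello_init);
module_exit(hello_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Minimal example of a loadable kernel module");

Once compiled against the headers of the running kernel, it can be loaded with insmod and removed with rmmod, and its messages appear in the kernel log.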

2.2.2 /proc pseudo-filesystem

The /proc folder is actually a whole other file-system on its own, called procfs [4, p. 126]. Mounted at boot, its purpose is to be a way to harvest information from the kernel. In reality it does not contain any physical files (i.e. files written to hard disks): all of the files represented inside of it are actually stored in the memory of the computer (a RAM-based file-system) rather than on a hard drive, which also implies they disappear at shutdown. It was designed to gather any kind of information the user could need to inspect about the kernel, often related to performance. A lot of programs interact directly with this information to gain knowledge of the system; for instance the well-known command ps makes use of different statistics included in /proc. However it is even more powerful than that, as we can directly "hot-plug" functionalities inside the kernel by interacting with /proc: for instance, which CPUs are pinned to a particular interrupt can be changed with its help. Needless to say, not all functionalities inside the kernel can be changed by simply writing a number or a string inside /proc. This becomes a key element when linked not only to the vanilla kernel but also to its modules. As explained previously, we can load or unload LKMs, and as they are technically part of the kernel we will therefore find their status and configuration interfaces in /proc.

Other information systems

: another ram-based files-system this time whose goal is to export kernel data structures, their attributes, and the linkages between them to userspace [14]. Usually mounted on /sys.

• configfs: complementary to sysfs, it allows the user to create, configure and delete kernel objects [15].

2.2.3 Socket Buffers

Socket buffers, or SKBs, are single-handedly the most important structure in the Linux networking stack. For every packet present in the operating system, an SKB must be affiliated to it in order to store its data in memory. This has to be done in kernel space, as the interaction with the driver happens inside the kernel [16]. The structure sk_buff is implemented as a doubly linked list in order to loop through the different SKBs easily. Since the content of the sk_buff structure is gigantic we will not go into too much detail here, but here are the basics [17]:

Figure 2.6: How pointers are mapped to retrieve data within the socket buffer [18].

The socket buffers were designed to encapsulate any kind of protocol easily, hence there are ways to access the different parts of the payload by moving a pointer around and mapping its content onto a structure.

As shown in figure 2.6, the data is located in a contiguous chunk of memory and pointers indicate the location of each part of the structure. When going up the stack, extra pointers are mapped to easily recognize and access the desired part of the packet, e.g. the IP header or the TCP header. Important note: the data pointer DOES NOT refer to the payload of the packet, and reading from it will most likely end up in gibberish values for the user. With the help of system calls, SKBs are abstracted away from user-space programs, which most likely will not make use of the underlying stack architecture. However those system calls are not accessible from inside kernel-space. To decode data easily from within the kernel, pre-existing structures with the usual fields of the protocols are available, and by mapping a pointer onto such a structure one can make the packet content trivial to interpret.

Reference counter Another very important variable the structure holds is called atomic_t users. It is a reference counter: a simple integer that counts the number of users currently holding the SKB. It is implemented as an atomic integer, meaning that it must be modified only through specific functions that ensure the integrity of the data among all cores. It is initialized to the value 1, and if it reaches 0 the SKB ought to be deleted. Users should not interact with such counters directly; however, as we will see with pktgen, that rule is not always respected.
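As an illustration, the hedged sketch below (not taken from the Linux sources) shows how kernel code typically maps those pointers onto protocol structures, and how the users counter is manipulated through helpers rather than directly:

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/udp.h>

/* Assumes the network and transport header offsets of the SKB have already
 * been set, as they are once the stack has parsed the packet. */
static void inspect_udp(struct sk_buff *skb)
{
    struct iphdr  *iph = ip_hdr(skb);     /* maps a struct onto the IP header  */
    struct udphdr *uh  = udp_hdr(skb);    /* maps a struct onto the UDP header */

    pr_debug("UDP %pI4:%u -> %pI4:%u, %u bytes\n",
             &iph->saddr, ntohs(uh->source),
             &iph->daddr, ntohs(uh->dest), ntohs(uh->len));

    /* Keep the buffer alive: atomically bump the 'users' counter ...          */
    skb_get(skb);
    /* ... and drop the reference later; the SKB is only freed once it hits 0. */
    kfree_skb(skb);
}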

2.2.4 xmit_more API

Since kernel 3.18, efforts have been made to optimize the global throughput through batching, i.e. bulking packets as a block to be sent instead of treating them one by one. Normally, when a packet is given to the hardware through the driver, several actions are performed, like locking the queue, copying the packet to the hardware buffer, telling the hardware to start the transmission, etc. [9]. The idea was to simply communicate to the driver that several more packets are coming, so that it can delay some of those actions, knowing it is a better fit to postpone them until there are no more packets to be sent. It is important to note that the driver is not forced in any way to delay its usual procedures, and remains the one taking the decision. To make this functionality available to drivers while not breaking compatibility with old ones, a new boolean named xmit_more has been added to the SKB structure. If set to true, the driver knows there are more packets to come.
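On the driver side the contract is simple. The fragment below is a simplified, driver-like sketch (hypothetical foo_* names, not taken from any real driver) of how the flag is typically honoured in an ndo_start_xmit() implementation on a 4.x kernel, where the flag still lives in the SKB itself:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/io.h>

struct foo_priv {                       /* hypothetical driver state */
    void __iomem *doorbell;
    u32 tx_tail;
};

static void foo_post_descriptor(struct foo_priv *p, struct sk_buff *skb);

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct foo_priv *priv = netdev_priv(dev);
    struct netdev_queue *txq =
            netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

    foo_post_descriptor(priv, skb);     /* place the packet in the TX ring */

    /* The doorbell (an expensive MMIO write) is only rung for the last
     * packet of a batch, or if the queue was stopped in the meantime.   */
    if (!skb->xmit_more || netif_xmit_stopped(txq))
        writel(priv->tx_tail, priv->doorbell);

    return NETDEV_TX_OK;
}

Deferring the doorbell write in this way is precisely what pktgen's burst option, described in section 2.4, takes advantage of.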

2.2.5 NIC drivers NIC drivers handle the communication between the NIC and the OS, primarily to handle the packet sending and reception. There are two solutions to receiving packets:

• Interrupt: in case of reception of a packet, the NIC sends an interrupt to the OS in order for it to retrieve the packet. But in case of a high-speed reception, the CPU will most likely be overwhelmed by the interrupts, as they are all executed with a higher priority than other tasks.

• NAPI: to mitigate the latter issue, interrupts are temporarily disabled and the driver switches to a polling mode. This is done through the New API, an extension to the packet processing framework [8]: the interrupts of a NIC are switched off when it reaches a certain threshold fixed at driver initialization [16]. (The usual polling pattern is sketched right after this list.)
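The polling pattern referred to above usually looks as follows; this is a generic sketch with hypothetical foo_* helpers, not code from an actual driver:

#include <linux/netdevice.h>
#include <linux/interrupt.h>

static struct napi_struct foo_napi;     /* registered elsewhere with netif_napi_add() */

/* Hypothetical helpers touching device registers and the RX ring. */
static void foo_disable_rx_irq(void *data);
static void foo_enable_rx_irq(void);
static int  foo_rx_clean(int budget);

/* Interrupt handler: mask further RX interrupts and defer to polling. */
static irqreturn_t foo_isr(int irq, void *data)
{
    foo_disable_rx_irq(data);
    napi_schedule(&foo_napi);
    return IRQ_HANDLED;
}

/* Poll function: called by the kernel with a packet budget. */
static int foo_poll(struct napi_struct *napi, int budget)
{
    int done = foo_rx_clean(budget);    /* hand up to 'budget' packets to the stack */

    if (done < budget) {                /* ring empty: go back to interrupt mode */
        napi_complete_done(napi, done);
        foo_enable_rx_irq();
    }
    return done;
}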

Here are the common pitfalls that can influence the NIC driver performance [19]:

• DMA should have better performance than programmed I/O, however due to the high overhead caused by it, one should not allow DMA under a certain threshold.

• For PCI network cards (the only relevant type for high-speed networks nowadays) the DMA burst size is not always fixed and must be determined. It should coincide with the cache size of the CPU, making the process faster.

• Some drivers have the ability to compute the TCP checksums, offloading the work from the CPU and gaining efficiency thanks to optimized hardware.

2.2.6 Queuing in the networking stack

The queuing system in Linux is implemented through an abstraction layer called Qdisc [20]. Its uses range from a classical FIFO algorithm to more advanced QoS-oriented queuing (e.g. HTB or SFQ). Those methods can however be circumvented if one user-level application fires multiple flows at once [21].

Driver queues are the lowest-level networking queues that can be found in the OS. They interact directly with the NIC through DMA. However this interaction is done asynchronously between the two entities (the opposite would make the communication extremely inefficient), hence the need for locks to ensure data integrity. The lowest-level function one can use to interact directly with the driver queue is dev_hard_start_xmit(). Nowadays most NICs have multiple queues, to benefit best from the SMP capabilities of the system [22]. For instance, the 82580 NIC from Intel and its variants support multiple queues. Some frameworks (e.g. DPDK, cf. 2.3.5) allow direct access to the NIC registers for better analysis and tuning of the hardware.

2.3 Related work – Traffic generators

2.3.1 iPerf

iPerf [23] is a user-space tool made to measure the bandwidth of a network. Due to its user-space design, it cannot achieve high packet rates because of the need to use system calls to interact with the lower interfaces, e.g. NIC drivers or even qdiscs. To mitigate this overhead issue the user might use a zero-copy option to make access to the packet content faster. It is able to saturate links through the use of large packets, and can even report the MTU if unknown to the user. It may measure jitter/latency through UDP packets. Both a server and a client instance of iPerf must be running for a measurement to take place. An interesting new option is the handling of the SCTP protocol in version 3. The simplicity of installation and use makes it a weapon of choice for network administrators who wish to check their configurations. It is important to note that this project is still maintained and updated frequently at the time of this thesis. Note that this is the only purely user-space-oriented traffic generation tool that we will describe here, as the performance of such tools cannot match the other, more optimized frameworks. Other user-space examples include (C)RUDE [24], NetPerf [25], Ostinato [26] and lmbench [27].

2.3.2 KUTE

KUTE [28] is a UDP in-kernel traffic generator. It is divided into two LKMs, a sender and a receiver. Once loaded, the sender computes a static inter-frame gap based on the speed specified by the user during setup. One improvement they advertise is to use the cycle counter located in CPU registers directly, instead of the usual kernel function to check the time, as the latter was considered not precise enough. Note that as this technology is from 2005, this information might be outdated. An interesting feature is that they do not handle the layer-2 header, making it theoretically possible to use it over any L2 network. The receiver module provides statistics to the user at the end, when it is unloaded.

2.3.3 PF_RING

PF_RING [29] is a packet processing framework developed by the ntop company. The idea was, as for pktgen and KUTE, to put the entire program inside the kernel. However it goes a step further by proposing actual kernel to user-space communication. The architecture, as the name suggests, is based on a ring buffer. It polls packets from the driver buffers into the ring [30] through the use of NAPI. While it does not require particular drivers, the addition of PF_RING-aware drivers is possible and should provide extra efficiency. Entirely implemented as an LKM, they advertise a speed of 14.8 Mpps "and above" on a "low-end 2,5GHz Xeon". However they do not state clearly whether that concerns packet transmission or capture, leaving the statement ambiguous.

PF_RING ZC is a proprietary variant of PF_RING, and is not open-source. On top of the previous features it offers an API which is able to handle NUMA nodes, as well as zero-copy packet operation, supposedly enhancing the global speed of the framework. In this version traffic generation is explicitly possible. It can also share data among several threads easily, thanks to the ring buffer architecture coupled with zero copying.

2.3.4 Netmap

Netmap [31] aims to reduce kernel-related overhead issues by bypassing the kernel with its own home-brewed network stack. They advertise a 10G wirespeed (i.e. 14.88 Mpps) transfer with a single core at 1.2 GHz. Among their improvements, they:

• Do a shadow copy (snapshot) of the NIC’s buffer into their own ring buffer to support batching, bypassing the need of skbuffers, hence gaining speed on (de)allocations.

20 • Efficient synchronization to make the best use of the ring buffer.

• Natively supports multi-queues for SMP architectures through the settings of interrupt affinities.

• The API is still completely independent from the hardware used. The devices driver ought to be modified to interact correctly with the netmap ring buffer, but those changes should always be minimal.

• Netmap does not block any kind of “regular” transmission from or to the host even with a NIC being used by their drivers.

• They also handle the widely used libpcap library by implementing their own version on top of the native API.

• The interaction with the API is done through /dev/netmap, and the content is updated by polling. The packets are checked by the kernel for consistency.

It is also implemented as a LKM making it easy to install however drivers might need to be changed for full usability of the framework.

2.3.5 DPDK

The Data Plane Development Kit [32] is a "set of libraries and drivers for fast packet processing". It was developed by Intel and is only compatible with Intel's x86 processor architecture. They advertise a speed of 80 Mpps on a single Xeon CPU (8 cores), which is enough to saturate a 40G link. DPDK moves its entire process into user-space, including ring buffers, NIC polling and other features usually located inside the kernel. It does not go through the kernel to push those changes or actions, as it features an Environment Abstraction Layer: an interface that hides the underlying components and bypasses the kernel by loading its own drivers. They offer numerous enhancements regarding software and hardware, e.g. prefetching or setting up core affinity, among many other concepts.

2.3.6 Moongen

Moongen [33] is based on the DPDK framework, therefore inheriting its perks and drawbacks. Moongen brings new capabilities to the latter framework by adding several paradigms as "rules" for the software: it must be fully implemented in software, and therefore run on off-the-shelf hardware; it must be able to saturate links at 10G wirespeed (i.e. 14.88 Mpps); it must be as flexible as possible; and, last but not least, it must support precise time-stamping and rate control (i.e. inter-packet latency). They found that these requirements were best fulfilled by implementing malleability through Lua scripting, as the language also performs well thanks to JIT support (cf. 2.5.2). The architecture behind Moongen lies on a master/slave interaction, set up within the script the user must provide. The master process sets up the counters, including the ones located on the NICs, and the slaves perform the traffic generation. An interesting feature introduced in this traffic generator is a new approach to rate control. As explained previously, NICs have an asynchronous buffer to take care of packet transmission, and the usual approach to control the rate is to wait between packets. However the NIC might not send the packets exactly as they are handed over. Instead of waiting, Moongen fills the inter-packet gap with a faulty packet: it forges a voluntarily incorrect packet checksum so that the receiving NIC will discard it upon arrival. This method is however limited by the NIC having a minimum-size packet acceptance, forcing the faulty packets to be of a certain size, which can be impractical in some situations. They advertise a speed of 178.5 Mpps at 120 Gbit/s, with a CPU clock at 2 GHz.

2.3.7 Hardware solutions

There are numerous examples of hardware technologies oriented towards network testing that we could provide, but as this work mostly focuses on the software spectrum of traffic generation, we will not expand too much on this topic. Companies like Spirent [34] or IXIA [35] provide such solutions.

2.4 Pktgen

Introduction

pktgen is a module of the Linux kernel that aims to analyse the networking performance of a system by sending as many packets as possible [5]. It was developed by Robert Olsson and was integrated into the Linux main tree in 2002 [36]. The usage of the tool is made through procfs: all the pktgen-related files mentioned in the following paragraphs are located in /proc/net/pktgen. To interact with the module, one must write into pre-existing files representing the kernel threads dedicated to pktgen. There are as many threads as there are cores; for instance the file "kpktgend_0" is the file bound to the thread for core number 0. This information is important as nowadays CPUs all have SMP, hence the need to support such architectures. The user then passes commands by directly writing inside those files.

# echo "add_device eth0" > /proc/net/pktgen/kpktgend_0

Figure 2.7: Example of a shell command to interact with pktgen.

Figure 2.7 shows a typical example of interaction between the pktgen module and the user. By redirecting the output of the echo command, we pass the command "add_device" with the argument "eth0" to thread 0. Please note that all writing operations in the proc filesystem must be done as superuser (aka root). If the operation is unsuccessful, there will be an I/O error on the command. While this might seem slightly disturbing at first, this interface is a consequence of the module being in-kernel, making a proc directory the simplest design to allow interaction with the user.

Example Now that the interaction between the user and the module has been clarified, here is a representative description of how to typically use pktgen, that can be logically split in 3 steps.

1. Binding: The user must bind one or more NICs to a kernel thread. Fig 2.7 is an example of such action.

2. Setting: If the operation is successful, a new file will be created, matching the name of the NIC (or associated queue). For instance by executing the command in Fig 2.7, a new file eth0 will be created in the folder. The user must then pass the parameters that he or she wishes by writing in the latter file. A non exhaustive list of parameters would be:

• count 100000 – Send an amount of 100000 packets.
• pkt_size 60 – Set the packet payload to 60 bytes. This does include the IP/UDP headers. Note that 4 extra bytes are added by the CRC on the frame.
• dst 192.168.1.2 – Set the destination IP.

3. Transmitting: When all the parameters are set, the transmission may start by passing the parameter start to the pktgen control file pgctrl. The transmission will stop either by interrupting the writing operation (typically CTRL+C in the terminal) or when the total amount of packets to be sent is matched by the pktgen counter. The transmission statistics, such as the time spent transmitting or the number of packets per second, will be found in the file(s) matching the name of the interfaces used in the second step, e.g. eth0.

While it is possible to associate one thread with several NICs, the opposite is not possible. However pktgen has a workaround to be able to profit from multi-core capacities, by adding the number of the core after the name of the NIC: eth0@0 will result in interacting with the NIC eth0 through core 0.
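Putting the three steps together, the same sequence can be driven from a small user-space helper instead of echo. The sketch below was written purely for illustration (assuming interface eth0, a single thread, and root privileges); it simply writes the commands of this section into the pktgen proc files:

#include <stdio.h>
#include <stdlib.h>

/* Write one pktgen command into a /proc/net/pktgen file. */
static void pg_write(const char *path, const char *cmd)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fprintf(f, "%s\n", cmd);
    if (fclose(f) != 0) {               /* a failed flush means the command was rejected */
        perror(path);
        exit(1);
    }
}

int main(void)
{
    /* 1. Binding: attach eth0 to the thread of core 0. */
    pg_write("/proc/net/pktgen/kpktgend_0", "rem_device_all");
    pg_write("/proc/net/pktgen/kpktgend_0", "add_device eth0");

    /* 2. Setting: minimum-sized packets, destination, amount to send. */
    pg_write("/proc/net/pktgen/eth0", "pkt_size 60");
    pg_write("/proc/net/pktgen/eth0", "dst 192.168.1.2");
    pg_write("/proc/net/pktgen/eth0", "count 100000000");
    pg_write("/proc/net/pktgen/eth0", "flag QUEUE_MAP_CPU");   /* see 2.4.1 below */

    /* 3. Transmitting: blocks until the count is reached or interrupted. */
    pg_write("/proc/net/pktgen/pgctrl", "start");
    return 0;
}

The resulting statistics can then be read back from /proc/net/pktgen/eth0, exactly as when the commands are issued from a shell.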

2.4.1 pktgen flags

pktgen has several flags that can be set upon configuration of the software. The following list is complete and up to date, as it was directly fetched and interpreted from the latest version of the code (v2.75 – Kernel 4.4.8).

Flag             Purpose
IPSRC_RND        Randomize the IP source.
IPDST_RND        Randomize the destination IP.
UDPSRC_RND       Randomize the UDP source port of the packet.
UDPDST_RND       Randomize the UDP destination port of the packet.
MACSRC_RND       Randomize the source MAC address.
MACDST_RND       Randomize the destination MAC address.
TXSIZE_RND       Randomize the size of the packets to send.
IPV6             Enable IPv6.
MPLS_RND         Get random MPLS labels.
VID_RND          Randomize the VLAN ID label.
SVID_RND         Randomize the SVLAN ID label.
FLOW_SEQ         Make the flows sequential.
IPSEC            Turn IPsec on for flows.
QUEUE_MAP_RND    Match packet queue randomly.
QUEUE_MAP_CPU    Match packet queue to the bound CPU.
NODE_ALLOC       Bind memory allocation to a specific NUMA node.
UDPCSUM          Include UDP checksums.
NO_TIMESTAMP     Do not include a timestamp in packets.

Table 2.2: Flags available in pktgen.

The flags highlighted in grey in table 2.2 represent the most important ones for enforcing the performance of the system. QUEUE_MAP_CPU is a huge performance boost because of the threading behaviour of pktgen. In short, when the pktgen module is loaded it creates a thread for each CPU core detected on the system, including logical cores; then a queue is created to handle the packets to be sent (or received) for each thread, so that they can all be used independently instead of a single queue that would require great concurrency to function. It also benefits from the ability of recent NICs to do multi-queuing. Setting this flag ensures the queue the packet will be sent to is handled by the same core as the one currently treating the packet. NODE_ALLOC is obviously only needed in a NUMA-based system, and allows binding an interface (or queue, as explained) to a particular NUMA memory bank, avoiding the latency caused by having to fetch from remote memory. Note that within the scope of this thesis we will not be treating pktgen options that change or modify the protocol used during the transmission, e.g. VLAN tagging, IPsec, or MPLS. These are outside the scope as we only care about maximum throughput and therefore have no use for such technologies.

2.4.2 Commands there are quite a few commands that can be passed to the module.

1. The commands used on the threads ”kpktgend X” are straightforward: add device to add a device, append ’@core’ to the device name create new queue associated, and rem device all removes all associated devices (and their configuration) of a thread.

23 2. The commands on used on the ”pgctrl” file are also obvious: start begins the transmission (or reception)and stop ends it.

3. Most of the commands passed to the device are easily understandable and well documented in [37]. We will only list commands that need to be explained:

• node <integer>: when the NODE_ALLOC flag is on, this binds the selected device to the wanted memory node.

• xmit_mode <mode>: sets the mode pktgen should run in. By default the value is start_xmit, the normal transmission mode, which we detail further in the next paragraph. The other mode is netif_receive, which turns pktgen into a receiver instead of a transmitter. We will not go into the details of that algorithm as it is not charted here; however it is summarized through a diagram in the appendix.

• count <integer>: selects the number of packets to be sent. A value of zero results in an infinite loop until stopped. It is important to note that, because of the granularity of the timestamping inside pktgen, too small a packet count will result in a biased advertised speed. As a recommendation the program must run for at least a few milliseconds, therefore the count must be matched to the speed of the medium.

• clone_skb <integer>: this option aims to mitigate the overhead caused by having to do a memory allocation for each packet sent. This is done by "recycling" the SKB structure used, hence sending a carbon copy of the packet over the wire. It works through a simple increment of the reference counter, to avoid the structure's destruction by the system. The integer passed as an argument is the number of copies sent over the network for one SKB allocation. For example, with clone_skb 1000, packets 1 to 1000 will be identical, then packets 1001 to 2000 will be identical, and so on.

• burst <integer>: this option is the most important one for maximum throughput, as testified by the experiments further on. It makes use of the xmit_more API, hence allowing bulk transmission as explained previously. (A minimal configuration sketch using these commands follows.)
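As an illustration of the per-device commands just listed, the sketch below writes a typical configuration to the device control file; the device name eth0@0 and every value are examples only, not recommendations.

#!/usr/bin/env python
# Sketch of the per-device commands discussed above, written to the pktgen
# /proc file of an already-added device. "eth0@0" and all values are examples.

DEV = "/proc/net/pktgen/eth0@0"

def pgset(cmd):
    with open(DEV, "w") as f:
        f.write(cmd + "\n")

pgset("count 10000000")        # 0 would mean "send until stopped"
pgset("clone_skb 1000")        # reuse each allocated SKB for 1000 packets
pgset("burst 10")              # queue 10 packets per xmit_more bulk
pgset("delay 0")               # no artificial inter-packet delay
pgset("flag QUEUE_MAP_CPU")    # map the TX queue to the transmitting core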

2.4.3 Transmission algorithm

Through a code review, we will now explain the internal workings of pktgen when it comes to packet transmission. The following explanation concerns the pktgen_xmit function located in net/core/pktgen.c. Everything commented in this section is condensed in figure 2.8.

1. At the start, options such as burst (equal to 1 by default) are retrieved, through atomic access if necessary. The device is checked to be up and to have a carrier; if not, the function returns. This implies that if the device is not up, no error is returned to the user.

2. If there is no valid SKB to be sent or it is time for a new allocation, pktgen frees the current SKB pointer with kfree_skb (if it is null the function simply returns). A new packet is then allocated and filled with the correct headers through the fill_packet() function. If the latter fails, the function returns.

3. If inter-packet delay is required, the spin() function is fired.

4. The final steps before sending out packets are to retrieve the correct transmission queue, disable software IRQs (as bottom halves could delay the traffic), and lock the queue for this CPU.

5. Increment the reference counter by the amount of bursting data about to be sent. This should not happen here, and is discussed in section 4.8.

6. Start the sending loop: send a packet with the xmit_more-compliant function netdev_start_xmit(). The latter takes as an argument, among others, a boolean indicating whether there is more data to come: if the SKB is unique it is set to false, otherwise it is set to true until we run out of bursting data to send.

7. If netdev_start_xmit() returns a transmission error, the loop exits, except when the device was busy, in which case one more attempt is made.

8. In case of success, update the counters: number of packets sent, number of bytes sent and sequence number.

9. If there is still data to be sent (i.e. burst > 0) and the queue is not frozen, go back to the start of the sending loop. Otherwise, exit the loop.

10. Exit the loop: unlock the queue bound to the CPU and re-enable software IRQs.

11. If all programmed transmissions are complete, pktgen checks that the reference counter of the last SKB is 1, then stops the program.

12. Otherwise the function ends here.

Figure 2.8: pktgen transmission algorithm
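To make the control flow of steps 1-12 easier to follow, the following is a schematic simulation written in Python; it is not the kernel C code, and the FakeSkb class and driver_ok callback are placeholders standing in for the real SKB and driver. Only the burst/reference-counter bookkeeping described above is mirrored.

# Schematic simulation of the pktgen_xmit() control flow summarized above.
# All names are placeholders that mimic the kernel operations; this is not
# the kernel code, only an illustration of the burst/refcount bookkeeping.

class FakeSkb:
    def __init__(self):
        self.users = 1                  # reference counter, as in struct sk_buff

def pktgen_xmit_sketch(burst, driver_ok=lambda more: True):
    skb = FakeSkb()                     # step 2: allocate and fill a packet
    skb.users += burst                  # step 5: bump refcount by the burst size
    sent = 0
    while burst > 0:                    # step 6: sending loop
        more = burst > 1                # "xmit_more" hint passed to the driver
        if not driver_ok(more):         # step 7: an error (other than busy) exits
            break
        sent += 1                       # step 8: update counters
        skb.users -= 1                  # the "driver" consumes one reference
        burst -= 1                      # step 9: loop while data remains
    return sent, skb.users              # step 11 expects the refcount back at 1

print(pktgen_xmit_sketch(burst=4))      # -> (4, 1): four copies sent, one reference left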

2.4.4 Performance checklist

Turull et al. [36] issued a series of recommendations to make sure the system is properly configured to yield the best performance from pktgen traffic generation.

• Disable frequency scaling, as we will not focus on energy matters.

• The same goes for CPU C-states: their purpose being power saving, we should limit their use so that the CPU avoids creating latency by falling into a "sleep" state.

• Pin the NIC queue interrupts to the matching CPU (or core), aka "CPU affinity". This recommendation was already issued by Olsson [5].

• Because of the latter recommendation, one should also deactivate interrupt load balancing, as it spreads the interrupts among all the cores.

• NUMA affinity, which maps a packet to a NUMA node, can be a problem if the node is far from the CPU used. As explained previously, pktgen supports assigning a device to a specific node.

• Ethernet flow control has the possibility of sending a "pause frame" to temporarily stop the transmission of packets. We will disable this as we will not focus on the receiver side.

• Adaptive Interrupt Moderation (IM) must be kept on for maximum throughput and to minimize CPU overhead.

• Place the sender and receiver on different machines to avoid having the bottleneck located on the bus of a single machine.

We will later be carefully adjusting the parameters of the machines used through the help of scripting and/or Kernel/BIOS settings if possible.

2.5 Related work – Profiling

Profiling is the act of collecting records of a system (or several systems), called the profile. It is commonly used to evaluate the performance of a system by estimating whether certain parts of it are too greedy or too slow, e.g. taking too many CPU cycles for their operations compared to the rest of the actions to be executed. We will only pay attention to Linux profiling, as the entire subject was based on this specific OS, and therefore will talk about techniques that might not be shared among other commonly used operating systems (e.g. Windows or BSD based). Two profiling systems were investigated during this thesis: the first one is perf [38], and the second one is in fact more than a profiling tool, as it has several other purposes and was only recently turned into a profiling tool in the latest kernel versions: eBPF [7].

2.5.1 perf

Perf, also called perf_events, has a fairly broad spectrum of profiling capabilities. It is based on the notion of events, which are measurement points that perf pre-programmed inside the kernel. The tool has, by default, several events from different sources [39]:

• Hardware events: use CPU performance counters to gain knowledge of CPU cycles used, cache misses and so on.

• Software events: low-level events based on kernel counters, for example minor faults, major faults, etc.

• Tracepoint events: based on ptrace, which is the same library used by gdb to debug user-space programs, perf has several pre-programmed tracepoints inside the kernel. They are located on "important" functions, meaning functions that are almost mandatory for a system call to function correctly. For example, the tracepoint to deallocate an SKB structure is called sock:skb_free. The list of tracepoints usable by perf can be found by running sudo perf list tracepoints.

• Dynamic tracing: this is NOT exclusive to perf, it is a kernel facility that perf uses for monitoring. The principle is to create a "probe", called kprobe if located in kernel code or uprobe if in user code. This is an interesting functionality as it brings the ability to monitor any precise function we wish to investigate, instead of relying on general-purpose functions (tracepoints).

• Sampling frequency: perf is able to take snapshots at a given frequency to check the CPU usage. The more a function is called, the more samples are aggregated and the more CPU the function is considered to take (as a percentage of total samples). One of the perks is perf's ability to also record the PID and call stack, providing full knowledge of what and who caused the system to use the CPU, as a single function name might not only be hard to pinpoint but might also be called from several spots.

Kernel symbols

The kernel keeps a mapping of addresses to names to be able to translate the results of executions into a human-readable output. The names can be matched to several things, but we will only pay attention to function and variable names.

Overhead calculation

With the -g option, which walks the entire call stack when calculating the total percentage of utilization, perf shows two percentages per function: self and children. This is because a function can obviously call other functions recursively, making the "actual" total amount of time spent in the caller function biased. Therefore the split makes perfect sense: the "self" number represents the percentage of samples taken within the function itself, and the "children" number corresponds to the total percentage induced by the function, including the function calls it performs, whose percentages are therefore included in that number too.

Figure 2.9: Example of call-graph generated by perf record -g foo [38]

2.5.2 eBPF

BPF

Historically, there was a need for efficient packet capturing. Other programs existed, but they were usually costly. Along came BPF, for Berkeley Packet Filter, with the idea of making user-level packet capture efficient. eBPF is the extended version of BPF, which has been greatly enhanced in recent versions of the kernel; we will discuss those differences shortly. The idea is to run user-space programs, i.e. filters, inside the kernel space. While this may sound dangerous, the code produced by user space MUST be secure: only a few instructions can actually be put inside such a filter. To restrict the available instructions, BPF has created its own interpreted language, a small assembly-like instruction set.

There is a structure (linux/filter.h) that can be used by the user-space defined program to explicitly pass the BPF code to the kernel:

struct sock_filter {    /* Filter block */
        __u16 code;
        __u8  jt;
        __u8  jf;
        __u32 k;
};

Listing 2.1: Structure of a BPF program

The variables within this structure being:

• code: unsigned integer which contains the opcode to be executed.

• jt: unsigned char containing the address to jump to in case of test being true

• jf: unsigned char containing the address to jump to in case of the test being false

• k: unsigned 32-bit value usually containing a test constant or an address to load from/store to.

To attach a filter to a socket (as it was originally designed for) one must pass through another structure:

struct sock_fprog {             /* Required for SO_ATTACH_FILTER. */
        unsigned short len;     /* Number of filter blocks */
        struct sock_filter __user *filter;
};
[...]
struct sock_fprog val;
[...]
setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));

Listing 2.2: Binding to a socket

The first field is simply the number of instructions and the second one a pointer to the previous structure. The __user macro adds an attribute telling the kernel that the code it is about to run shall not be trusted; this is needed for security. Last but not least, to actually make the connection between the sock_fprog structure and the socket itself, setsockopt() is called as shown above, assuming we correctly opened a socket with the file descriptor sock.

The complexity of BPF programming resides in being forced to write this pseudo-assembly. Tools such as libpcap or tcpdump generate it automatically; however, for programs written in C this quickly becomes too inconvenient and should not be done by hand.

Figure 2.10: Assembly code required to filter packets on eth0 with tcp ports 22.

Figure 2.10 illustrates how complex creating even a simple BPF program is. As you might recognize from listing 2.1, each row is indeed divided into the four fields explained above.

Extended BPF

Over the last few years, BPF has been remodelled. It is not limited to packet filtering any more, and can now be seen as a virtual machine inside the kernel thanks to its specific instruction set [7]. Breaking the shackles of packet filtering came with a wealth of new features, which we will now explain.

• The size of the registers and arithmetic operations switched from 32 bits to 64 bits, unlocking the power of today's CPUs, which are 64-bit oriented; at least for performance-oriented systems.

• While eBPF programs are not backward-compatible with classical BPF, the translation between the two is done before execution, making it seamless for the user.

• Instead of being bound to a socket, there is now a dedicated system call, bpf(), to be able to insert eBPF programs easily from the user-space.

– The system call is unique and takes as parameters the different actions that can be executed. There are wrapper functions that abstract the use of the system call, making it more human-readable.

– On execution of the system call, a verifier checks that the instructions are considered "secure". The program is in fact simulated to check whether any access might be problematic for the security of the system.

– Since kernel 4.4 (released at the beginning of this thesis), the bpf() syscall does not require root to be launched; however this is of course only relevant for what is accessible to a regular user, and is therefore limited to socket filters [40].

• The framework now has integrated maps:

– The maps are a simple key/value storage format.

– They can be accessed either from user space or kernel space.

– The key and value formats can be custom structures.

• eBPF programs can be used as kprobes, mainly because their "secure" property makes sure they will not leave the system hanging. However, certain functions are not allowed to be used as kprobes.

• eBPF programs can be used as a tc classifier.

Tool chain

eBPF uses Just-In-Time (JIT) compilation, which makes the compilation into machine code happen at run-time [41]. We will not get into the details of how this actually makes the process faster, but it is said to increase performance 3.5x to 9x for interpreted languages [42]. Note that it has to be turned on through the procfs to function, and the kernel must have the CONFIG_BPF_JIT option turned on. To generate optimized code through JIT, the tool chain behind it is complex, but it is black-boxed through the use of the Low Level Virtual Machine (LLVM): a project that provides a modern source- and target-independent optimizer, along with code generation support for many popular CPUs. Its libraries are built around the IR language used to represent data [43]. To compile from the C language, the clang program is used; developed alongside LLVM, it is a fast compiler front-end that produces code for LLVM.
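Assuming the usual sysctl location, enabling the JIT at runtime amounts to a single write; the snippet below is a sketch that requires root privileges as well as a kernel built with CONFIG_BPF_JIT.

#!/usr/bin/env python
# Enable the BPF JIT compiler at runtime (root required).
# /proc/sys/net/core/bpf_jit_enable is the usual sysctl path; the kernel must
# have been built with CONFIG_BPF_JIT for this knob to exist.
with open("/proc/sys/net/core/bpf_jit_enable", "w") as f:
    f.write("1\n")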

BCC

The above paragraph shows how complex getting an eBPF program from C code to execution truly is. And because of the lack of examples on how to actually compile eBPF, even producing a typical "hello world" is not straightforward. The silver bullet to this problem is brought by the Iovisor project, which provides a "compiler collection" to automate and simplify the creation of eBPF programs: the BPF Compiler Collection [44]. Understanding and building programs with BCC was a great part of the work of this thesis, and will be detailed in the BCC programming section.

Chapter 3

Methodology

In this chapter we will be describing how the experiments were carried out.

3.1 Data yielding

During the experiments, the ability to create a lot of data without having to constantly monitor the execution quickly became a need. The principle is to create convenient methods that automatically vary the parameters, more or less aggressively according to the needs, and store the results, along with the experiment settings, into a file to be post-processed into human-understandable data, e.g. plots. The solution was to create a program taking as parameters the setting(s) to vary, the stepping and the limit to be reached.

Figure 3.1: Representation of the methodology algorithm used

3.2 Data evaluation

The data was acquired through empirical testing, adjustment and tuning to fit the situation. The data acquired must follow several rules in order to be kept as a final result of this work:

• Reproducibility: the experiment must yield the same results when run over the same settings. While this may sound obvious, a lot of data has been discarded after several hundred tests due to the behaviour not being exactly reproducible. This does not necessarily mean that the results are bogus; it is either due to bad measuring or to the anomaly being spurious and taking too much time to pinpoint.

• Granularity: As this thesis focused mostly on high performance, a single byte might or might not change the outcome of an experiment. Therefore the experiments were first run with an average stepping; meaning with settings variation large enough to end a set of experiments in a reasonable amount of time (e.g. few hours) but small enough to reduce an anomaly to a particular range of settings. Of course finding this trade-off has also been part of the work and required experimenting.

• Interpretation: to ensure that the results are correctly interpreted, extra profiling tests were always run to be certain that they were not being compromised by another program that could conflict in any way.

3.3 Linear statistical correlation

Throughout the thesis we used profiling whose goal was to find a correlation between a problem and its origin. The idea was to create a batch of tests, measure a particular event along with each test, and see whether there is a possible match between the two sets. For instance, we would run a throughput test with pktgen while increasingly growing the size of the packet; for every experiment, we would also record the number of cache misses. To find out whether or not those factors are linearly correlated, we use the Pearson product-moment correlation coefficient:

r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}

Figure 3.2: Pearson product-moment correlation coefficient formula.

Without getting into the details of the formula, the r value it yields lies in the range −1 ≤ r ≤ 1. The interpretation goes as follows: a value of 1 indicates a positive correlation between the two sets of data, −1 indicates a negative correlation, and 0 implies no correlation. With realistic data none of these exact values will ever occur; r will rather be a real value between −1 and 1, and the results will be interpreted as follows [45]:

• 0.00 ≤ |r|≤ 0.19 ”Very weak”

• 0.20 ≤ |r|≤ 0.39 ”Weak”

• 0.40 ≤ |r|≤ 0.59 ”Moderate”

• 0.60 ≤ |r|≤ 0.79 ”Strong”

• 0.80 ≤ |r|≤ 1 ”Very Strong”
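As a concrete illustration of the formula, here is a minimal computation of r for two made-up measurement series; the values are hypothetical and only serve to show the interpretation scale above.

#!/usr/bin/env python
# Minimal Pearson product-moment correlation, following the formula above.
# The two series are hypothetical examples (packet size vs. cache misses).
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
    return num / den

pkt_size     = [64, 128, 256, 512, 1024]           # example settings
cache_misses = [1.2e6, 1.9e6, 3.1e6, 5.8e6, 11e6]  # example measurements
print(pearson_r(pkt_size, cache_misses))           # close to 1: "very strong"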

Chapter 4

Experimental setup

While most of the documented effort of this thesis is aimed towards unveiling the underlying framework and its associated performance bottlenecks, a lot of the concrete work resided in installing and setting up environments. Needless to say, the reader does not need to know every detail; however, interesting issues that arose will be mentioned.

4.1 Speed advertisement

During this research, a lot of traffic generators and libraries have been examined. No matter what their perks were, they share a common problem: the "speed advertised" by almost all of them is unreliable. This is a direct consequence of the lack of common practices within the traffic-generation community; as a result there are no academic benchmarks set for performance reviews, making the advertised speed more of a marketing figure than a factual processing indicator. There are three major factors that should be advertised to be able to assess the throughput of a traffic generator accurately.

Hardware

The first and most important factor should always be an accurate description of the machine's architecture. Describing it merely as "commodity hardware" is too vague: even though it heavily implies off-the-shelf components were used, such a wide range should not be tolerated in a careful investigation. The most important criteria are (but not exclusively):

• CPU(s): model, clock speed, number of cache level including their respective sizes, number of cores, maximum PCIe version capability.

• NIC(s): model, maximum theoretical speed, PCIe version and number of lanes, multi-queue capability.

• Motherboard: model, block diagram, QPI if needed.

Those key performance criteria are obviously subject to change especially with new features being added.

Underlying software

While this factor is less relevant in some cases, e.g. DPDK which bypasses the Linux kernel, the configuration may still be relevant as it is very often subject to change and might still affect the overall performance of the system. The criteria that should be reported are straightforward: the version of the kernel used and any kind of performance-affecting options. The drivers and their versions should also be mentioned, along with any optimizations that could affect performance.

Scalability

SMP being effectively the only architecture one can buy nowadays, giving a single example of a software's performance is not good enough: the scalability of the process must be documented. And this is not limited to a single NIC; as link aggregation is a very common technique, showing the results of the software running several processes over different NICs is an excellent way to testify to its scalability.

4.2 Hardware used

This section is about hardware-specific information. In total, four machines were provided as support for this thesis, two from the KTH CoS laboratory and two from the Ericsson performance lab. A thorough description of their components is given, along with a block diagram as a summary.

4.2.1 Machine A – KTH

First and foremost, this is the machine that helped calibrate and carry out most of the experiments, serving as a benchmark. It was chosen not because it possessed the most recent hardware, but rather because of its convenient accessibility in the laboratory at Electrum, Kista.

CPU (×1): Xeon E5520 @ 2.27 GHz
  L1i cache: 32K, L1d cache: 32K
  L2 cache: 256K, L3 cache: 8192K
  QPI: 5.86 GT/s

Motherboard  The motherboard used was the Tyan S7002. While it supports up to 2 CPUs, only one was present while carrying out the tests. This implies there are NUMA nodes; however, the machine was set up so that the memory bank is always local to the unique CPU, making NUMA nodes almost irrelevant in our case. The CPU and the NIC are linked through a northbridge.

Memory Total available memory: 31GB.

NIC  The network interface card assessed was the 82599ES 10-Gigabit controller, using an SFP+ transceiver. The driver used was ixgbe version 4.3.15.


Figure 4.1: Simplification of block diagram of the S7002 motherboard configuration [46, p. 19]

4.2.2 Machine B – KTH

CPU (×2): Xeon E5-2650 v3 @ 2.30 GHz
  L1i cache: 32K, L1d cache: 32K
  L2 cache: 256K, L3 cache: 25600K
  QPI: 9.6 GT/s

Motherboard  The exact name of the product is ProLiant DL380 Gen9, an HP server board from 2014. Two CPUs were used, hence implying NUMA nodes. Important note: the official block diagram was not found.

Memory A total of 98GB of RAM were present on the system.

NIC  The card from Machine A was moved over to this machine to check performance differences between the two, hence the same 82599ES 10-Gigabit controller was present. The driver used was ixgbe version 4.3.15.

Distribution  To carry out experiments on pktgen, the same Bifrost 7.2 distribution along with kernel 4.4 was tested on this machine. A Fedora 23 server version, also with a 4.4 kernel, was used to work on eBPF.


Figure 4.2: Simplification of block diagram of the ProLiant DL380 Gen9 motherboard configuration.

4.2.3 Machine C – Ericsson

CPU (×2): Xeon E5-2658 v2 @ 2.40 GHz
  L1i cache: 32K, L1d cache: 32K
  L2 cache: 256K, L3 cache: 25600K
  QPI: 8 GT/s

Motherboard  The Intel motherboard S2600IP was used to carry out the experiments on this machine. Please note that this board was faulty, as explained in the results section, and in no way do we endorse its use for experiments related to high-speed networks.

Memory A total of 32GB of RAM were present on the system.

NIC Intel’s Ethernet Controller XL710 for 40GbE QSFP+ transceivers was connected to this machine. The driver used was i40e version 1.5.16.



Figure 4.3: Simplification of block diagram of the S2600IP [47] motherboard configuration.

On a side note, we did not have physical access to this machine, but we were allowed to supervise its configuration to check that the hardware was put in the correct PCI slots, as one cannot map a bus number to a physical slot from the command line.

4.2.4 Machine D – Ericsson

CPU (×1): Xeon E5-2680 v4 @ 2.40 GHz
  L1i cache: 32K, L1d cache: 32K
  L2 cache: 256K, L3 cache: 35840K
  QPI: 9.6 GT/s

Motherboard  The Intel motherboard S2600CWR was used to carry out the experiments on this machine.

Memory A total of 132GB of RAM were present on the system.

NIC Intel’s Ethernet Controller XL710 for 40GbE QSFP+ transceivers was connected to this machine. The driver used was i40e version 1.5.16. As previously, we did not have direct access to the machine but we were granted a permission to check MACHINE D Memory channel QPI 9.6 GT/s DDR3 Xeon DDR3 E5-2680 v4 Empty DDR3 CPU slot DDR3 2.40GHz 14 cores


Figure 4.4: Simplification of block diagram of the S2600CWR [48] motherboard configuration.

4.3 Choice of Linux distribution

ELX

To follow Ericsson's policy on security, it was strongly advised to install ELX, which is Ericsson's homebrew version of Ubuntu with enhanced security updates. It was primarily used to compile whatever version of the kernel was needed for the experiments and to transfer the resulting boot image to the target distribution. The installation is trivial as there is an included GUI (as for Ubuntu) that makes all the choices for the user, e.g. encrypting the hard drive by default.

Arch Linux [49]

On the recommendation of an employee at Ericsson we installed Arch Linux, the principal reasons being the very active community and the constant updates being brought to the distribution. The drawback is that it does not include any kind of graphical interface by default, making the installation fairly lengthy on the command line. On the other hand, seeing that, preferably, the latest stable release of the kernel should be used to carry out experiments, it was the best choice to easily compile and use new kernel versions. This distribution was also mandatory, as Machine C from Ericsson was pre-configured with it and we did not have the rights to modify it.

Bifrost – 7.2 [50]

Bifrost is a distribution that aims to be a small, network-oriented Linux distribution. Its small size is the result of a no-frills mentality, stripping out a lot of commands and programs that are commonly found (e.g. Python, Perl), but it comes with extra packages designed to monitor and help manage network-related attributes of the machine. This distribution's kernel is not trivial to modify, as it has a special initramfs that has to be compiled with the kernel in order for it to work. On another side note, in order to easily boot different kernels, a Syslinux bootloader is included with the Bifrost image by default. While this avoids the need for the user to install one, tweaking its content is a rather painful manoeuvre, and we made the choice of installing GRUB to simplify the update process to a single command. The installation of Bifrost is fairly simple, as there are two ways to install it: either decompress the OS directly onto the root of the key (but several commands must then be executed in order to install the boot-loader), or do a carbon copy of the image provided on their website. This second method comes with the drawback of having a mandatory file system with a fixed size of 1 GiB; however, this can be extended through several commands, cf. appendix A.1.

Ubuntu – 16.04 [51] For the same reason as Arch Linux, Ubuntu was mandatory as it was installed on Machine D from Ericsson and we did not have the rights to modify it.

Fedora – 23 [52]

As we ran into numerous troubles with the installation of the BCC framework (notably a total breakdown of the package manager pacman on Arch), we ultimately decided to switch to a distribution we were used to manipulating, and that had pre-compiled binaries for the framework. We went for the server version to get fewer GUI-bloated packages, as having a GUI on a system installed on a USB stick may cause severe latency at boot.

4.4 Creating a virtual development environment

As we did not have direct access to machines upon our arrival at Ericsson, we decided to set up a virtual machine to be able to develop without risking the safety of our machine and especially the office network. Therefore we installed, on referral from a colleague, Arch Linux. We then compiled our own version of the kernel 4.4 to acclimatize ourselves to the procedure. However, the limits of such an infrastructure were quickly reached: not only performance-wise, but as

a virtualized architecture is substantially different from an actual one, several problems occurred. For instance, trying to profile the virtual machine became a hassle as perf did not have access to hardware events.

4.5 Empirical testing of settings

A good portion of our time was spent trying to find settings that could influence the overall performance of pktgen, and hence of the kernel itself. The first step was to check a large range of pktgen parameters and see whose presence or absence led to the most significant change. Quickly, the burst parameter along with clone_skb turned out to make the overall traffic sky-rocket. Running a "vanilla" pktgen experiment, meaning without options supposed to enhance the speed or latency of the system, turned out to be quite slow but drew a good baseline to compare with.

An important note regarding pktgen experiments using the Bifrost distribution: version 7.0 and onwards include patches from Daniel Turull that have not been added to the official kernel tree. This matters little since those patches concern the receiver side, and since we only care about transmission we can disregard the change. However, during our profiling of the system, the functions introduced by those patches often turned out to be among those with the largest number of samples collected, implying they are still somehow called from the sender side and perhaps lowering the maximum throughput. Note that this could be a side-effect of perf rather than an actual problem.

This process of repeated trials, even though automated through scripts, took at least a hundred hours to carry out all the required experiments, usually because we ran several nested loops to see whether two parameters somehow conflict or benefit from each other depending on their values.

The scripting itself was first realised with a bash script. An interesting note: on Bifrost, the built-in echo command does not function correctly when redirected towards /proc/net/pktgen, hence one should use /bin/echo instead. To avoid having to constantly monitor the experiment and manually stop it, we always limit the number of packets. As explained in the literature review, pktgen must run for at least a few milliseconds to ensure reliable results. To enforce this rule, we always set the count to at least a million packets when running minimum-sized packet transfers; this is usually enough for 10G and 40G networks. After a pktgen experiment has run, the results are caught and stored in a simple text file, as there is no requirement for complex encoding or compression. Moreover, it makes the operation fast, making the loop run faster.

Post-processing  To make interesting data out of the harvested results, we used Perl scripts to easily loop through them. With the -n or -p switch, Perl adopts a behaviour similar to that of the awk command, but provides more flexibility with its built-in regular expression parser, making the recognition of text patterns easy. As we usually ran the script on 1 to 8 or 10 cores, the expected number of result lines is easily calculated (e.g. if run on 5 cores, 5 lines of results are expected), which allowed short and elegant parsing solutions like the one provided in appendix B, even when looped over several hundred times.

4.6 Creation of an interface for pktgen

While we believe that pktgen is not suited to people who have little interest in kernel performance, who may prefer plain bandwidth tests perfectly served by tools like iperf, we think that having data from a larger set of users would be interesting. But we believe the kernel interaction through the procfs is too esoteric for pktgen to be used by some users. On top of that, the documentation provided in [37] never in fact clearly stipulates how to interact with the /proc files, which can be misleading for neophytes. In the same documentation the links at the bottom are not reachable any more; on the

other hand, in the kernel source tree the directory samples/pktgen is filled with concrete examples. We created a program whose aim was to kill two birds with one stone:

• Provide a simple command-line interface for pktgen. This includes short-cuts to different settings to be provided, and the possibility to aggregate them to several threads instead of having to program each thread one by one. Also it stores the current configuration to a subsidiary configuration file for the user to re-create a carbon copy of the experiment.

• Standardize the performance results of pktgen through the ability to easily export its results and the numerous system metrics that might be of influence. As said previously, performance figures turn out to be meaningless if not coupled with this context, hence the program aimed to export in a portable format that could be:

– Parsed by the original program to produce a simple, and if needed reduced, output.

– Understood by browsers, as sharing on blogs/websites is a common practice in the kernel development community.

– Reasonably easy for a human to read if needed.

To fulfil the above requirements, the output was produced in the JSON format.

This program was also created because of a simple problem: constantly varying parameters with scripts quickly became messy, as constant editing of the same file, or having different versions of the same file, often ended up in confusion, at least on the scale of thousands of different experiments over several machines. The program was written in the Perl language for several reasons:

• We already had a certain affinity with it.

• It is included with most Linux distributions.

• It has more advanced features than Bash.

• Several sample scripts included with the Linux kernel are already written in Perl.

• It is allegedly the language with the most performance in text parsing [53].

The software is about 300 lines of code and stores the custom configuration in a temporary file, to allow re-using the same configuration very easily. It prints the available options when called with the --help parameter.

The primary strength of this script is to allow configuring all threads on a single line: when passing the -t argument you can either give:

• A single integer

• A list separated by commas

• A range separated by a dash.

For example, pktgen -t 0-3 -d eth0 -c 10000 -b 10 will configure pktgen threads 0 to 3 included to use the interface (device) eth0 with 10000 packets and an associated burst value of 10. To launch the program, simply do pktgen run or append run to the previous command. Re-issuing the same command will launch the exact same configuration. To ensure two instances of the script cannot be run concurrently, lockfiles were added. When given the -w FILE parameter, the output is written to the given file; it contains the results from each thread along with the full set of parameters pktgen took.

Here is the helper:

pktgen v0.1
 -p            Print current configuration.
 -r            Remove all devices.
 -f            Flush: clean all configuration and remove all devices.
 -t NUMBER     Bind actions to a specific thread number.
 -d INTERFACE  Bind actions to a specific interface. Mandatory.
 -c NUMBER     Set number of packets. 0 for infinite generation.
 -s NUMBER     Set size of packet. Minimum 60, maximum should be MTU.
 -D NUMBER     Set delay between packets.
 -C NUMBER     Set amount of cloned packets.
 -b NUMBER     Set amount of bursted packets.
 -md MAC       Modify MAC destination address.
 -ms MAC       Modify MAC source address.
 -ad IP        Modify IP destination address.
 -as IP        Modify IP source address.
 -w FILE       Output the results to a JSON file.

Figure 4.5: Output using the --help parameter on the pktgen script.

4.7 Enhancing the system for pktgen

As explained in 2.4.4 there are several things we can tune for pktgen to achieve maximum throughput.

Disabling frequency scaling  The purpose is to avoid frequency scaling, which could skew the results, especially if the transmission is short. Frequency scaling is a great power saver and should not be disabled on a normal basis [54]. On kernels after version 3.9, frequency scaling is in fact regulated through a driver. For Intel CPUs, which were the only brand tested here, the driver called intel_pstate manages the frequency scaling. There are ways to interact with the driver; however, the commands are not available on all distributions, so to facilitate and generalize the procedure we simply disabled it. To do so, one must add intel_pstate=disable to the kernel boot line. For instance, if you use Syslinux:

1. Search for the configuration file. Usual locations are /boot/syslinux/syslinux.cfg, /syslinux/syslinux.cfg and /syslinux.cfg.

2. With a text editor, open the file and find the section matching your kernel version (uname -r prints it).

3. On the line starting with "APPEND", add intel_pstate=disable.

4. Reboot

You can now set a CPU frequency governor for your cores [55]; it is the policy your CPU will follow. We have to select the "performance" governor to make sure the frequency stays at its maximum without variation. To do so, you must write into the /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor file. Example: echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor. To check that the frequency is pinned correctly you can monitor the frequency of all cores by running watch 'grep "cpu MHz" /proc/cpuinfo'. If you see variation, the governor was not set properly or the driver is still in place.
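For completeness, here is a small sketch that applies the same governor to every core; it is the scripted equivalent of the echo command above and assumes the standard sysfs layout.

#!/usr/bin/env python
# Apply the "performance" governor to every core (root required); this is the
# scripted equivalent of the echo command above.
import glob

for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
    with open(path, "w") as f:
        f.write("performance\n")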

IRQ Pinning The kernel sets an interrupt ”affinity” for each interrupt that is registered, which can be translated as a

list of CPUs allowed to catch and treat the interrupt. This is implemented as a bit-mask corresponding to the allowed cores. The list of interrupts registered by the OS, with their associated numbers, can be found in /proc/interrupts. To check the allowed cores we must read the value of /proc/irq/X/smp_affinity, with X being the corresponding interrupt number. When you need to pin an interrupt to a particular core, you calculate the bit-mask as 2^core; you can also stack several cores by adding their bit-masks together. Example, to pin interrupt 40 to core 3: 2^3 = 8, then echo "8" > /proc/irq/40/smp_affinity. Keep in mind that the core numbering starts from 0.
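The same pinning can be scripted; the sketch below computes the mask and writes it in hexadecimal, with the IRQ number and core taken from the example above.

#!/usr/bin/env python
# Pin an interrupt to a single core by writing the hexadecimal bit-mask to
# smp_affinity (root required). IRQ number and core are example values.

def pin_irq(irq, core):
    mask = 1 << core                     # e.g. core 3 -> 0b1000 = 0x8
    with open("/proc/irq/%d/smp_affinity" % irq, "w") as f:
        f.write("%x\n" % mask)           # smp_affinity expects a hex mask

pin_irq(40, 3)                           # same example as in the text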

Interrupt load balancing  On certain systems there is a daemon that takes care of setting the interrupt masks to balance the system load. It is called "irqbalance"; however, as we do manual IRQ pinning, it will collide with and change our settings, hence if it is present on your system you must disable it.

C-states  These could be an issue as they were shown to introduce added latency. To disable them one must go into the BIOS, look for the C-state settings and set them to "performance" or an equivalent setting to have minimum issues with them.

Adaptive Interrupt Moderation  The hardware creates interrupts at a certain interval when receiving and sending frames. With one interrupt per frame, the overhead caused can result in such a CPU usage that it ultimately becomes the bottleneck. Adaptive Interrupt Moderation has to be left on to achieve maximum throughput, as it saves CPU consumption. This can be set up by passing a parameter to the modprobe command; for instance, with the i40e driver, modprobe i40e InterruptThrottleRate=1 enables it, although this is the default value.

Segmenting sender and receiver  This scenario was always respected: as we tried to gather the best possible performance from one machine, we did not want issues caused by sharing the same system for the two functions. However, the receiver was never investigated, as it was used as a black hole for packets; its only purpose was to give the interface a carrier and to check that connectivity was established correctly, to avoid flooding a regular network with pktgen packets.

4.8 pktgen parameters clone conflict

While examining the code of pktgen we found out that the current implementation of the xmit_more API (i.e. the burst parameter) collides with the cloning of packets (i.e. the clone_skb parameter). The code manually tampers with the reference counter, incrementing it by the value of the burst parameter, effectively making a clone even though no clone_skb value had been passed.

atomic_add(burst, &pkt_dev->skb->users);

xmit_more:
        ret = netdev_start_xmit(pkt_dev->skb, odev, txq, --burst > 0);

Listing 4.1: Lines proving the incoherent behaviour in pktgen.c (v2.75)

This does not create any problems when using both parameters, but according to the program's specification there is no obligation to clone the packet when using the xmit_more API. In the current state of things you cannot use burst without an inherent clone_skb. A patch was crafted to fix this issue (cf. appendix C.4); however, due to the lack of review it received, it was not applied. For our experiments this problem is not critical, as we are looking for the best performance achievable, hence stacking the xmit_more capabilities on top of cloning would have been mandatory anyway.

Chapter 5

eBPF Programs with BCC

This section is dedicated to showing the reader how the eBPF programs were created. Figuring out how to create eBPF programs was a lengthy process. For the reader to understand the code created in the results section, we give a short overview of its structure with the help of BCC [44].

5.1 Introduction

Programming with BCC is divided into two parts: the eBPF program, which is the in-kernel code executed, and the front-end interaction, written in Python, to read the results from the execution of the former. eBPF programs can be attached at several points in the kernel:

• socket: as originally intended, you can bind an eBPF program to a socket to inspect the traffic passing through it.

• xt_bpf: kernel module for netfilter.

• cls_bpf: kernel module for traffic-shaping classifiers.

• act_bpf: BPF-based action (since Linux 4.1).

• kprobes: BPF-based kprobes.

The above list is not exhaustive; however, we only use the kprobe hook.

5.2 kprobes

While they are not a new technology, kprobes are an efficient way of putting a tracing point in the kernel. The traditional way of adding them requires compiling a whole new kernel module (or modifying an existing one) and adding it to the system, for the OS to register and activate the kprobe. This method is complex, especially as the user should see kprobes as a dynamic, convenient way to add breakpoints to functions that require monitoring or tracing. The perf_events tool allows adding kprobes but cannot run dedicated programs on them. This is where eBPF comes into play.

Hello, world!  The bcc/examples/hello_world.py file gives a good and simple overview of how to add a kprobe; however, for the sake of segmenting the code into relevant parts we tweak it into the following two code listings. This segmentation between C and Python should always be used for the sake of code clarity, even though the original hello_world.py found in the repository does not follow it.

#include <uapi/linux/ptrace.h>

void handler(void *ctx) {
        bpf_trace_printk("Hello, World!\n");
}

Listing 5.1: hello_world.c

45 The listing 5.1 code shows the minimum amount of code needed to create a kprobe with a .c file attached.

1. Inclusion of the user-space API of ptrace (uapi/linux/ptrace.h), required to bind a kprobe with BCC.

2. The handler function will be called when the probe is hit, and MUST have a context pointer, as all other eBPF functions do when in kprobes. If needed, extra arguments can be added to fit the prototype of the function probed and then access their values.

3. bpf_trace_printk() is a helper included in helpers.h that simplifies printing from kernel space towards user space, as the original printk normally redirects to the dmesg output.

1  #!/usr/bin/python
2  from bcc import BPF
3  b = BPF(src_file="hello_world.c")
4  b.attach_kprobe(event="sys_clone", fn_name="handler")
5  b.trace_print()

Listing 5.2: hello_world.py

The above code represents a classical kprobe monitoring program with BCC. The first two lines are mandatory, and will not work without the kernel headers being installed on the distribution; this is because the .c file includes the user-space API for ptrace, but the error printed is not explicit about it. The 3rd line initializes the program by giving it a source file for the eBPF program to be run. Note that the BPF object recognizes keywords inside the program and automatically adds the corresponding headers if they are missing; however, that list is small and the user should not rely on it. When initializing, you can pass the program either as a separate C file, as recommended, or as a string of text; doing both will not function correctly. The 4th line creates and attaches the kprobe to the given event, which is a kernel symbol; the handler function must be provided in the "fn_name" parameter. Last but not least, the 5th line simply waits indefinitely for data from a bpf_trace_printk call and prints it to the user.

5.3 Estimation of driver transmission function execution time

We aimed to use eBPF to check whether the performance reported by pktgen was accurate or not. In this case we aimed to verify whether some unusual driver latency could be revealed, so we used eBPF to calculate the latency by binding kprobes onto the driver. This experiment was based on the assumption that, when running a single pktgen thread, only a single core should be generating traffic, hence the driver functions should not be called concurrently. Therefore, one can calculate the amount of time the driver took to send packets by taking a timestamp at the beginning of a driver function and taking the difference with a timestamp taken at the return of the same function. With pktgen there is no way to uniquely identify each SKB: because of the burst option, the same packet is being passed to the driver for copy, hence there is no way to differentiate two SKBs, as they are exactly the same when cloned. In our case, the function traced was i40e_lan_xmit_frame. The program was split into two parts as explained in the work section. The C code is composed of two handler functions:

BPF_TABLE("hash", u64, u64, start, 262144);

int handler_entry(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *netdev)
{
        u64 ts = bpf_ktime_get_ns();
        u64 key = 0;
        start.update(&key, &ts);
        return 0;
}

Listing 5.3: kprobe at the entry of i40e_lan_xmit_frame

The BPF_TABLE macro creates the eBPF map, called start. The handler_entry function executes as follows:

• First, we fetch the current timestamp through the eBPF helper function bpf_ktime_get_ns().

• We store the value of the timestamp in the start map under the key 0.

• Important note: Anything that has to be stored inside an eBPF map has to be done through a variable pointer with an initialized value. If one tries to store by giving an immediate and getting its pointer (e.g. &0) the program will not compile. Hence the ”key” variable with value 0.

• We store the timestamp value in the map under key 0. This is done to avoid having to dedicate a second map to this sole variable. It is not a problem since the other keys, representing the length of time taken by the function to run, cannot be equal to 0.

void handler_return(struct pt_regs *ctx, struct sk_buff *skb, struct net_device *netdev)
{
        u64 *tsp = NULL, delta = 0;
        u64 key = 0;

        tsp = start.lookup(&key);
        if (tsp != 0) {
                delta = bpf_ktime_get_ns();
                delta -= *tsp;
                start.increment(delta);
        }
}

Listing 5.4: latency measurement from driver interaction

For handler_return:

• We create a u64 pointer to hold the timestamp value, and initialize a delta value which will be the difference between the current time and the one stored in the map.

• The key has the same purpose as in the handler, i.e. retrieving the timestamp and respecting the eBPF access paradigms.

• We retrieve the value of the previous timestamp in tsp.

• If the timestamp has a value ≠ 0, we retrieve the current time, take the difference and store it in delta.

• The increment method of the table looks up the key and increments the associated value. Hence the map content is a set of pairs, each key representing an execution time and each value the number of occurrences of that execution time; e.g. key 450 with value 12 means that an execution time of 450 nanoseconds happened 12 times.

The associated Python code is minimal.

#!/usr/bin/env python
from bcc import BPF
from time import sleep

b = BPF(src_file="xmit_latency.c")
b.attach_kprobe(event="i40e_lan_xmit_frame", fn_name="handler_entry")
b.attach_kretprobe(event="i40e_lan_xmit_frame", fn_name="handler_return")
print "Probe attached"
try:
    sleep(100000)
except KeyboardInterrupt:
    for k, v in b["start"].items():
        # Calculate mean, variance and standard deviation
        pass

Listing 5.5: Python code to attach the probes and retrieve the map data

• We must import the bcc module in order for it to function.

• We associate the C program with the python front-end.

• Then the functions are bound as a kprobe and a kretprobe onto i40e_lan_xmit_frame, which we want to investigate.

• We wait for the user to create an interrupt (CTRL+C)

• We then loop through the key/value pairs present in the map, from which the statistics are computed (a possible implementation of this post-processing is sketched below). Be cautious to ignore key 0, as its value holds a timestamp and would completely skew the statistics.
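A possible implementation of that post-processing step, assuming the key/occurrence-count layout described above and skipping key 0, is sketched below; the helper name is ours, not part of BCC.

#!/usr/bin/env python
# Possible post-processing for the "start" map: each key is an execution time
# in nanoseconds and each value the number of times it was observed; key 0 is
# skipped because it stores the entry timestamp.
from math import sqrt

def histogram_stats(hist):
    """hist: {execution_time_ns: occurrence_count}, key 0 already excluded."""
    total = sum(hist.values())
    mean = sum(t * c for t, c in hist.items()) / float(total)
    var = sum(c * (t - mean) ** 2 for t, c in hist.items()) / float(total)
    return mean, var, sqrt(var)

# Inside the KeyboardInterrupt handler of listing 5.5 (BCC table entries are
# ctypes values, hence .value):
#   hist = {k.value: v.value for k, v in b["start"].items() if k.value != 0}
#   print histogram_stats(hist)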

Chapter 6

Results

This chapter will be dedicated to the results of the pktgen experiments along with the profiling realised with perf and eBPF.

6.1 Settings tuning

This section regroups the results from the different settings tested on each machine, to establish which ones currently produce the greatest amount of throughput. All the data in this section uses a packet size of 64 bytes.

6.1.1 Influence of kernel version

It was required to investigate the influence of the kernel version on the global results. To do so, all the long-term versions from 3.18 (which introduced the xmit_more API) onwards were tested: 3.18, 4.1, 4.4, 4.5 and 4.6 (a release candidate at the time of the tests). Figures 6.1a and 6.1b clearly demonstrate how close the performance is between the different kernel versions. Kernel 3.18.32 seems to have a slightly off throughput calculation, as it showed performance above the achievable theoretical limit. On the other hand, kernel 4.5.3 seems to be slightly under the others, perhaps due to another miscalculation or simply because the kernel is less efficient in that version. Because of this benchmark we decided to settle for version 4.4:

• It was the latest long term version available at the beginning of this thesis.

• It seemed to show accurate performance.

• As of April, machines C and D were patched to version 4.4 by the administrator, hence using it on machines A and B for parity seemed a fair strategy.

• It had extra eBPF features compared to older versions, which could come in handy for later.

6.1.2 Optimal pktgen settings

There are only a few options that can be modified to vary the performance: clone_skb and burst. The following graphs only show a few of the tested parameters; the actual testing range was in fact much greater. For instance, if a graph shows burst values of 10 and 1000, the range from 10 to 100 was also tested, as well as 1000, 10000, and so on; but as they did not show any significant difference they are hidden for readability purposes. As it turns out, the clone_skb parameter is currently implied when using burst (cf. 4.8), and is not shown on the graphs, also for readability purposes. A value of "burst 1" is the baseline: it is the default setting and does in fact not profit from the xmit_more API.

49 (a) pktgen throughput with no options

(b) pktgen throughput with burst 10

Figure 6.1: Benchmarking of different kernel versions under Bifrost (Machine A)

Interpreting the results:

• Figure 6.2a shows a slight speed advantage with a burst value of 10, until the different profiles merge into the line rate (14.88 Mpps) at around 4 cores.

• Figure 6.2b, on the other hand, shows a small speed disadvantage with a burst value of 10, until the different profiles merge into the line rate (14.88 Mpps) at around 4 cores.

• Figure 6.2c is on another scale than the others because of its 40G NIC. While the results are similar during the starting phase (1 to 4 cores), from 5 cores onwards there seems to be an advantage by a great distance.

On a machine with default ring settings, it seems a burst value of around 10 provides the best and most consistent performance across all machines.
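For reference, the 14.88 Mpps figure is simply the theoretical 10 Gbit/s line rate for minimum-sized frames, counting the 8-byte preamble and 12-byte inter-frame gap occupied on the wire:

#!/usr/bin/env python
# Theoretical line rate for minimum-sized Ethernet frames on a 10 Gbit/s link.
link_bps = 10e9
frame    = 64            # minimum Ethernet frame, in bytes
preamble = 8
ifg      = 12            # inter-frame gap
on_wire  = (frame + preamble + ifg) * 8   # bits actually occupied per packet
print(link_bps / on_wire / 1e6)           # ~14.88 Mpps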

50 (a) burst variance on machine A – 10G

(b) burst variance on machine B – 10G

(c) burst variance on machine D – 40G

Figure 6.2: Performance of pktgen on different machines according to burst variance.

6.1.3 Influence of ring size

The default transmission ring size can be obtained with the help of ethtool -g <dev>. It represents the size of the ring buffer shared between the kernel and the NIC, and is managed by the driver. In the official pktgen documentation [37] it is advised to increase the size of the original buffer, as "the default NIC settings are (likely) not tuned for pktgen's artificial overload type of benchmarking, as this could hurt the normal use-case.". This recommendation was likely issued by Jesper Brouer [56], as he is also the author of the xmit_more API. Therefore we created a nested loop to test the influence of the ring size along with the associated burst value, while monitoring the throughput. This was done on a single core, on machine D.

Figure 6.3: Influence of ring size and burst value on the throughput

The above figure starts its burst value at 5 and ends at 100. This choice was made because the baseline (burst 1) yielded results too small to be shown on the graph properly, far under 8 million packets per second at any ring size. The collected data depicts two things:

• The original ring size (512) is indeed too small to reach maximum performance. Setting a value between 1024 and 4096 does not seem to influence the maximum throughput of pktgen.

• Whilst a burst size of 10 may in fact be the best setting for small ring sizes (512-640 on this graph), the best results ever achieved were around a burst value of 25 to 40, with a ring size from 800 to 4096.

The recommended value of 1024 seems to be more of a mnemonic than a factual threshold that greatly influences the overall performance of the software, but increasing the ring size until maximum performance is reached is necessary.

6.2 Evidence of faulty hardware

We will now showcase the discovery of a hardware-related problem on machine C, which led us to completely discard the results acquired from it.

Figure 6.4: Machine C, parameter variation versus number of cores

The above figure shows a classical benchmark of the system, varying the burst and clone_skb values. However, the plateau reached when burst is set to 10 or 100 barely exceeds a total of 22 Mpps. While one might think this is due to a limitation of either the CPU or the kernel, a test with MTU-sized packets revealed that even a simple 40G bandwidth test did not work correctly. Figure 6.5 clearly shows a bandwidth ceiling maintained around 26 to 28 Gigabits per second. Since figure 6.4 shows the hardware is capable of producing at least 20 million packets per second, it is very unlikely that the NIC cannot produce the 3.289 million packets per second required to saturate the link with MTU-sized packets. A good hint to the underlying issue was found in the kernel buffer output, read with dmesg | grep "i40e": "PCI-Express bandwidth available for this device may be insufficient for optimal performance". We first assumed the administrator had not placed the NIC in the correct slot, as the block diagram showed another slot with PCIe 2.0 instead of 3.0. But even after moving the NIC to the fastest slot available (Slot 1 – PCIe 3.0 x16), the results stalled at the same total throughput and the kernel message remained. After further research we found that Intel has issued several technical advisories [57] [58] describing issues with PCIe connections. It is therefore very likely that the board automatically downgrades the PCIe 3.0 link to 2.0. According to [57], upgrading the BIOS should fix the issue; however, we did not have physical access to the machine, which is located in a data-center. We asked an administrator to perform the upgrade, but were told that a previous BIOS-upgrade attempt had not been successful and that it would therefore not be attempted on live machines. This is what led us to discard all performance results from this machine.
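For reference, the 3.289 Mpps figure quoted above follows from dividing the 40 Gbit/s line rate by the time each MTU-sized packet occupies the wire. It matches the convention behind the 14.88 Mpps figure used for 10G (packet size plus 20 bytes of framing overhead); the exact overhead is an assumption on our part, as the text does not spell it out:

\[
  \frac{40 \times 10^{9}\ \mathrm{bit/s}}{(1500 + 20)\ \mathrm{bytes} \times 8\ \mathrm{bit/byte}} \approx 3.289 \times 10^{6}\ \mathrm{packets/s}
\]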

Figure 6.5: Machine C bandwidth test with MTU packets.

6.3 Study of the packet size scalability

The experiments in this section were carried out entirely on machine D, as it was the only one with a 40G NIC. We ran similar tests on machines A and B, but the same behaviour could not be reproduced, most probably because the throughput was too low for the issue to be noticeable.

6.3.1 Problem detection

The idea of the test was fairly simple: incrementally scale the size of the packet until line rate is reached.

Figure 6.6: Throughput versus packet size, in millions of packets per second.

The expected behaviour is a constant packet rate until the theoretical line-rate limit is reached, after which the achievable packet rate must follow that limit downwards as the packet size grows. Figure 6.6 instead clearly shows regular drops in packet throughput at growing intervals, for instance at packet sizes of 606, 703, 840, 1044 and 1386 bytes. Note that varying the burst size (which must be greater than 1, otherwise line rate is not reached at all) or the ring buffer size does not prevent those drops.
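Using the same framing convention as the 3.289 Mpps figure of section 6.2 (packet size plus 20 bytes of overhead, an assumption on our part), the theoretical limit for a line rate R in bit/s and a packet size s in bytes is:

\[
  \mathrm{pps}_{\max}(s) = \frac{R}{(s + 20) \times 8}
\]

For R = 40 x 10^9 this gives roughly 59.5 Mpps at s = 64 and roughly 3.3 Mpps at s = 1500, which is the envelope the measured curve should follow once the CPU is no longer the bottleneck.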

Another way to visualize the problem is to simply plot the throughput in Mbps. This is done in figure 6.7, and another problem becomes obvious: because of the drops, the 40G line speed is not met when using MTU-sized (1500 bytes) packets, whereas smaller packets under the exact same configuration (e.g. 1200 bytes) do achieve 40G bandwidth. This implies that on a 100G board, because of this "sawtooth" profile, it is very unlikely that pktgen would be able to saturate the link with a single core.

Figure 6.7: Throughput versus packet size, in Mbps.

6.3.2 Profiling with perf

As we did not have another machine with a 40G NIC to compare against, we first tried to monitor events with perf. The idea was to run a batch of perf stat measurements over particular events (mostly hardware events, as we suspected cache issues on a hunch) and to compute the Pearson product-moment correlation coefficient between the event counts and the throughput. Since that formula only captures linear relationships and might hide other behaviours, we also plotted the event counts on top of the graph of figure 6.7 to see whether any events matched the drops. Needless to say, the same event-monitoring experiment had to be run several times, as spurious events might have skewed the results and turned a coincidence into an apparently meaningful relation between problem and explanation. However, none of the hardware metrics tested corroborate a direct implication of the hardware in this issue: Pearson's formula always yielded r values considered "very weak", often between 0.05 and 0.1. Figure 6.8 is an example of this procedure; the green data shows the same pattern as figure 6.7, slightly compressed because the range of the y axis has been extended. This choice was made to align the two data sets and make it easier to spot a correlation.
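For reference, the Pearson product-moment correlation coefficient between the throughput samples x_i and the event-count samples y_i (n samples each) is computed as:

\[
  r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
           {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}}
\]

where \bar{x} and \bar{y} are the sample means. An r close to +1 or -1 indicates a strong linear relation, while values around 0.05 to 0.1, as found here, indicate close to none.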

By interpreting this specific result, one cannot see an obvious match between the two data sets, implying there might not be one at all. The same sort of result was found for all the perf hardware events tested, including: LLC-store-misses, LLC-stores, branch-load-misses, branch-loads, iTLB-loads, node-load-misses, node-loads, node-store-misses, dTLB-load-misses, dTLB-loads, dTLB-store-misses, iTLB-load-misses, cpu-migrations, page-faults, context-switches. So far none of them revealed an issue that would rationalize the problem.

Figure 6.8: Superposition of the amount of cache misses and the throughput ”sawtooth” behaviour.

Applying the Pearson product-moment correlation coefficient formula to the data in figure 6.8 yields an r value of ≈ 0.09, which we interpret as close to no linear correlation at all.

6.3.3 Driver latency estimation with eBPF

Another attempt at explaining the problem was to attach a kprobe through an eBPF program and measure the time the driver takes to send each packet. Unusual latency there could indicate hardware problems on the NIC or driver issues. The procedure was explained in section 5.3.
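For illustration, a minimal bcc-style eBPF probe along the lines of the procedure of section 5.3 could look as follows. It is a sketch, not the exact program we used: the kprobe/kretprobe pair is attached from BCC's Python bindings (attach_kprobe and attach_kretprobe) to the driver's transmit routine, whose name depends on the driver and is therefore an assumption here (for instance i40e_lan_xmit_frame on an i40e NIC).

// Sketch of a per-packet driver latency probe (bcc restricted C).
#include <uapi/linux/ptrace.h>

BPF_HASH(start_ts, u32, u64);   /* entry timestamp, keyed by CPU */
BPF_HISTOGRAM(xmit_lat_ns);     /* log2 histogram of driver latencies */

int xmit_entry(struct pt_regs *ctx)
{
        u32 cpu = bpf_get_smp_processor_id();
        u64 ts  = bpf_ktime_get_ns();

        start_ts.update(&cpu, &ts);          /* remember when we entered */
        return 0;
}

int xmit_return(struct pt_regs *ctx)
{
        u32 cpu  = bpf_get_smp_processor_id();
        u64 *tsp = start_ts.lookup(&cpu);

        if (tsp) {
                u64 delta = bpf_ktime_get_ns() - *tsp;
                xmit_lat_ns.increment(bpf_log2l(delta));  /* bucket the latency */
                start_ts.delete(&cpu);
        }
        return 0;
}

Every probe hit executes both handlers plus two map operations, which at several million packets per second is enough to explain the throughput collapse visible in table 6.1.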

Execution overhead

While the program works on the system, it is inapplicable in real-life high-speed network situations because of the overhead it causes.

Size of packet   Without eBPF   With eBPF    Mean latency measured
64 bytes         5200 Mb/s      686 Mb/s     520 ns
500 bytes        33100 Mb/s     4410 Mb/s    542 ns
1000 bytes       39600 Mb/s     10800 Mb/s   533 ns
1500 bytes       38100 Mb/s     16170 Mb/s   550 ns

Table 6.1: Comparison of throughput with eBPF program

The table shows the overhead created by the eBPF probe. Note that the run with 1500-byte packets and the eBPF program loaded is in fact the only scenario found where a better throughput is achieved than with a size of 1000 or 1200 bytes. The interpretation is straightforward: with minimum-sized packets there are more packets per second and therefore more kprobe hits, causing more overhead; a bigger packet size causes less overhead because fewer packets are sent. Moreover, the mean latency did not increase with packet size, perhaps because the xmit_more API delays the actual sending, making the function execution time almost static. The measured latencies allow us to draw two conclusions:

• The time spent executing the function is in fact not related to the size of the packet. Although the function name contains "xmit" (for transmission) and it is the lowest-level function the kernel has access to for packet transmission, it does not immediately transfer the packet to the NIC. Instead, it copies the packet content located inside the SKB to the driver's ring buffer. Hence the roughly constant latency measurements are expected.

• It is quite hard to assess the accuracy of such measurements because of the extreme granularity required. Nanosecond-precise measurements are very hard to achieve, since any instruction takes several nanoseconds to execute and can therefore not be neglected. In this case we cannot assess the time spent storing and retrieving the timestamps in the eBPF maps, so these operations (or any other eBPF instruction, for that matter) might skew the results dramatically. Moreover, a measured latency of ≈ 500 nanoseconds (cf. table 6.1) would imply that copying a 64-byte (512-bit) packet into the ring buffer takes as long as sending 2500 bytes (20000 bits) over the wire, which is unrealistic for high-speed transmission (see the check after this list).
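A quick check of the figure quoted in the last item: at 40 Gbit/s, 500 ns of wire time corresponds to

\[
  40 \times 10^{9}\ \mathrm{bit/s} \times 500 \times 10^{-9}\ \mathrm{s} = 20\,000\ \mathrm{bits} = 2500\ \mathrm{bytes},
\]

which is why a per-packet cost of ≈ 500 ns cannot plausibly be the raw copy time of a 64-byte packet.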

While this attempt at using eBPF for profiling was not successful, there are currently efforts to add a new eBPF hook optimized for throughput performance assessment [59].

Chapter 7

Conclusion

The results gathered provide us with several insights:

• The clone_skb option of pktgen is currently useless when coupled with burst, since burst provides the same advantage and is even more efficient thanks to the xmit_more API.

• To use the full potential of the machine, a ring size of 512 is not enough. However, the recommended value of 1024 is not mandatory either, as anything in the range of 800 to 4096 works as well. An associated burst value of 30 to 40 is the best setting found, with a rate of 13.2 million packets per second achieved on a single core of machine D.

• There is an issue with the packet size on machine D: experiments with a single core and MTU-sized packets produce less bandwidth than packets with a size of 1000 bytes. Hardware profiling did not reveal the cause.

• eBPF profiling seems like an interesting option but introduces too much overhead to provide usable data.

We are now able to answer our initial hypothesis. With pktgen, the 10G line rate is reached with 2 to 4 cores depending on the machine. The 40G line rate with minimum-sized packets, however, is not reached; approximately 40 million packets per second were achieved on high-end hardware. We therefore conclude that aiming for the 100G line rate is currently unrealistic given current kernel and hardware conditions. During this thesis pktgen was investigated in depth, and we also gave a solid technical background ranging from computer hardware to profiling tools. We also gave the reader an insight into the eBPF technology and its possible uses, notably through kprobes. As this project is constantly evolving, it might become a very powerful technology for profiling entire frameworks in the future, perhaps even drivers.

7.1 Future work

It is paramount that the packet-size scaling problem is addressed, and that the problem is reproduced on machines with different configurations but equipped with a 40G NIC, as we believe 10G is too slow to trigger it. Additional profiling techniques combining eBPF and perf could provide a new angle of approach and help pinpoint the problem. Rewriting the pktgen interface, currently written in Perl, in C should help make it portable and usable on all distributions, hence growing the community of users. And the more data the community provides, the more opportunities there are to study the capabilities of a Linux-based system through pktgen.

Bibliography

[1] Sebastian Gallenmüller et al. "Comparison of Frameworks for High-Performance Packet IO". In: ANCS '15 Proceedings of the Eleventh ACM/IEEE Symposium on Architectures for Networking and Communications Systems (2015), pp. 29–38.
[2] Alessio Botta, Alberto Dainotti, and Antonio Pescapé. "Do You Trust Your Software-Based Traffic Generator?" In: IEEE Communications Magazine (2010), pp. 158–165.
[3] Olof Hagsand, Robert Olsson, and Bengt Gördén. "Open-source routing at 10Gb/s". In: (2009).
[4] . Linux Kernel Development. 4th ed. Addison-Wesley, 2010.
[5] Robert Olsson. "pktgen the linux packet generator". In: Proceedings of the Linux Symposium 2 (2005). url: https://www.kernel.org/doc/ols/2005/ols2005v2-pages-19-32.pdf.
[6] Arnaldo Carvalho de Melo. "The New Linux 'perf' tools". In: Linux Kongress (2010).
[7] Jonathan Corbet. Extending extended BPF. 2 July 2014. url: https://lwn.net/Articles/603983/.
[8] . NAPI. 2009. url: http://www.linuxfoundation.org/collaborate/workgroups/networking/napi.
[9] Jonathan Corbet. Bulk network packet transmission. 17 May 2016. url: https://lwn.net/Articles/615238/.
[10] Christoph Lameter. "NUMA (Non-Uniform Memory Access): An Overview". In: acmqueue 11.7 (2013). url: https://queue.acm.org/detail.cfm?id=2513149.
[11] PCI-SIG. PCI Express Base Specification. Specification. Version Rev. 3.0. PCI-SIG, Nov. 2010, pp. 192–200.
[12] S. Bradner and J. McQuaid. Benchmarking Methodology for Network Interconnect Devices. RFC. IETF, 1999.
[13] Bryan Henderson. Linux Loadable Kernel Module HOWTO. 10. Technical Details. Version v1.09. 2016. url: http://www.tldp.org/HOWTO/Module-HOWTO/x627.html.
[14] Patrick Mochel and Mike Murphy. sysfs - The filesystem for exporting kernel objects. 16 August 2011. url: https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt.
[15] Joel Becker. configfs - Userspace-driven kernel object configuration. 31 March 2005. url: https://www.kernel.org/doc/Documentation/filesystems/configfs/configfs.txt.
[16] Thomas Petazzoni. Network drivers. Free Electrons. 2009. url: http://free-electrons.com/doc/network-drivers.pdf.
[17] David S. Miller. David S. Miller Linux Networking Homepage. 2016. url: http://vger.kernel.org/~davem/skb.html.
[18] Hyeongyeop Kim. Understanding TCP/IP Network Stack & Writing Network Apps. CUBRID. 2013. url: http://www.cubrid.org/blog/dev-platform/understanding-tcp-ip-network-stack/.
[19] Sreekrishnan Venkateswaran. Essential Linux Device Drivers. 2008.
[20] Dan Siemon. "Queueing in the Linux network stack". In: 2013.231 (July 2013).
[21] Martin A. Brown. Traffic Control HOWTO. 2006-10-28. url: http://www.tldp.org/HOWTO/Traffic-Control-HOWTO/classless-qdiscs.html.

[22] Tom Herbert and Willem de Bruijn. Scaling in the Linux Networking Stack. 2015. url: https://www.kernel.org/doc/Documentation/networking/scaling.txt.
[23] Jon Dugan et al. iPerf. 2016-04-12. url: https://iperf.fr/.
[24] Juha Laine, Sampo Saaristo, and Rui Prior. RUDE & CRUDE. 17 May 2016. url: http://rude.sourceforge.net/.
[25] rick jones. RUDE & CRUDE. 17 May 2016. url: http://rude.sourceforge.net/.
[26] P. Srivats. Ostinato. 17 May 2016. url: http://ostinato.org/.
[27] Larry McVoy. lmbench. 17 May 2016. url: http://www.bitmover.com/lmbench/.
[28] Sebastian Zander, David Kennedy, and Grenville Armitage. KUTE A High Performance Kernel-based UDP Traffic Engine. Technical report. Centre for Advanced Internet Architecture, 2005.
[29] ntop. PF_RING Website. 17 May 2016. url: http://www.ntop.org/products/packet-capture/pf_ring/.
[30] L. Deri. "nCap: wire-speed packet capture and transmission". In: End-to-End Monitoring Techniques and Services, 2005. Workshop on (15 May 2005), pp. 47–55.
[31] Luigi Rizzo. netmap. 2016-04-12. url: http://info.iet.unipi.it/~luigi/netmap/.
[32] Intel. DPDK. 2016-04-12. url: http://dpdk.org/.
[33] Paul Emmerich et al. "MoonGen: A Scriptable High-Speed Packet Generator". In: Internet Measurement Conference 2015 (IMC'15). Tokyo, Japan, Oct. 2015.
[34] Spirent Communications. Website. 2016-04-12. url: http://www.spirent.com/.
[35] IXIA. Website. 2016-04-12. url: https://www.ixiacom.com/.
[36] Daniel Turull, Peter Sjödin, and Robert Olsson. "Pktgen: Measuring performance on high speed networks". In: Computer Communications 82 (Mar. 2016), pp. 39–48.
[37] Robert Olsson. HOWTO for the linux packet generator. 17 May 2016. url: https://www.kernel.org/doc/Documentation/networking/pktgen.txt.
[38] Stephane Eranian. Perf tutorial. 14 May 2016. url: https://perf.wiki.kernel.org/index.php/Tutorial.
[39] Brendan Gregg. Linux Perf Examples. 14 May 2016. url: http://www.brendangregg.com/perf.html#Events.
[40] Kernel Newbies. Linux 4.4. 14 May 2016. url: http://kernelnewbies.org/Linux_4.4.
[41] Jay Schulist, Daniel Borkmann, and Alexei Starovoitov. Linux Socket Filtering aka Berkeley Packet Filter (BPF). 24 Aug 2015. url: https://www.kernel.org/doc/Documentation/networking/filter.txt.
[42] Suchakra Sharma. BPF Internals - I. 24 Aug 2015. url: https://github.com/iovisor/bpf-docs/blob/master/bpf-internals-1.md.
[43] Alexei Starovoitov. LLVM Website. 17 May 2016. url: http://llvm.org/.
[44] Iovisor project. BCC repository. 17 May 2016. url: https://github.com/iovisor/bcc.
[45] statstutor. Pearson's correlation. 20 June 2014. url: http://netoptimizer.blogspot.se/2014/06/pktgen-for-network-overload-testing.html.
[46] MiTAC Computer Corporation. S7002 technical specification. 2009.
[47] Intel. Server Board S2600IP. March 2015. url: http://www.intel.com/content/dam/support/us/en/documents/motherboards/server/sb/g34153004_s2600ip_w2600cr_tps_rev151.pdf.
[48] Intel. Server Board S2600CW. April 2016. url: http://www.intel.com/content/dam/support/us/en/documents/server-products/S2600CW_TPS_R2_1.pdf.
[49] Judd Vinet and Aaron Griffin. Arch Linux. 2016-04-12. url: https://www.archlinux.org/.
[50] Bifrost Network Project. bifrost. 2016-04-12. url: http://www.bifrost-network.org/.

[51] Canonical Ltd. Ubuntu Website. 14 May 2016. url: http://www.ubuntu.com/.
[52] Red Hat, Inc. Fedora. 2016-04-12. url: https://getfedora.org/.
[53] Tim O'Reilly and Ben Smith. The Importance of Perl. 17 May 2016. url: http://archive.oreilly.com/pub/a/oreilly/perl/news/importance_0498.html.
[54] Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. "Fine-Grained Dynamic Voltage and Frequency Scaling for Precise Energy and Performance Trade-off based on the Ratio of Off-chip Access to On-chip Computation Times". In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24.1 (27 December 2004), pp. 18–28.
[55] Dominik Brodowski and Nico Golde. CPU frequency and voltage scaling code in the Linux(TM) kernel. 17 May 2016. url: https://www.kernel.org/doc/Documentation/cpu-freq/governors.txt.
[56] Jesper Dangaard Brouer. Pktgen for network overload testing. 4 June 2014. url: http://netoptimizer.blogspot.se/2014/06/pktgen-for-network-overload-testing.html.
[57] Intel Corporation. PCI Express* 3.0 Add-in Adapter Support Issue. Technical advisory. Intel, 2014. url: http://www.intel.com/content/dam/support/us/en/documents/motherboards/server/sb/ta_102105.pdf.
[58] Intel Corporation. PCIe link width may intermittently downgrade to x4 or x2 with one third party PCIe add-in card. Technical advisory. Intel, 2012. url: http://www.intel.com/content/dam/support/us/en/documents/server-products/ta1000.pdf.
[59] Tom Herbert and Alexei Starovoitov. eXpress Data Path. 10 June 2016. url: https://github.com/iovisor/bpf-docs/blob/master/Express_Data_Path.pdf.

Appendix A

Bifrost install

A.1 How to create a bifrost distribution

The following instructions assume the device you are writing to is a USB stick, most likely mounted asynchronously. In case there was sensitive data on the device you are about to overwrite, consider using the shred command first. Before you get started, it is important to check which of your devices maps to /dev/sdX, for instance by running fdisk -l with administrator rights.

Burning the image on the device

• mkdir /tmp/bifrost && cd /tmp/bifrost

• wget http://bifrost-network.org/files/distro/bifrost-7.2.img.gz #Or download thru browser

• gzip -d bifrost-7.2.img.gz

• dd if=bifrost-7.2.img of=/dev/sdX bs=4096, where X is the letter of the device you want to burn to. You can check the list of connected devices by running fdisk -l with admin rights.

Scaling the filesystem, in case you wish to use the entire space of your stick:

• parted /dev/sdX

• (parted) resizepart 1 -1s

• (parted) quit

• resize2fs /dev/sdX1

• fsck /dev/sdX1

• sync

A.2 Compile and install a kernel for bifrost

You probably want to have the same configuration as the original kernel provided through the image. You can copy the previous configuration with zcat /proc/config.gz > .config

• Assuming you have downloaded and extracted the kernel source, are currently in its directory, and have mounted the bifrost distribution on /media/user/bifrost:

• wget http://jelaas.eu/pkg64/bifrost-initramfs-15.tar.gz

• tar xvf bifrost-initramfs-15.tar.gz ./boot/initramfs.cpio -O > initramfs.cpio

• make (this should take a while)

• cp arch/x86/boot/bzImage /media/user/bifrost/boot/kernel-XXX

• make modules_install

• sync

Appendix B

Scripts

Example of a simple Perl post-processing script to yield data for gnuplot.

#!/usr/bin/perl -nw
# Aggregates per-core pktgen result lines into one total per core count.
use List::Util qw(sum);

BEGIN { $nbcore = 1; $i = 0; }

if (grep /bps/, $_) {
    # Extract the packets-per-second figure and scale it down.
    (my $pps) = $_ =~ /(\d+?)pps/;
    $pps =~ s/\d{3}$//;
    push @res, $pps;

    # Extract the bandwidth in Mb/sec.
    (my $bps) = $_ =~ /(\d+)Mb\/sec/;
    push @res2, $bps;
    $i++;
}

if ($i == $nbcore) {
    $i = 0;
    print $nbcore++, " ", sum(@res) / 1000, " Mpps ";
    print sum(@res2) / 1000, " Mbps\n";
    @res  = ();
    @res2 = ();
}

../pkt-dat.pl

Appendix C

Block diagrams

Figure C.1: Block diagram of motherboard Tyan S7002

Figure C.2: Block diagram of the motherboard S2600IP

Figure C.3: Block diagram of the motherboard S2600CW

--- pktgen_old.c	2016-06-04 15:47:00.881493623 +0200
+++ pktgen.c	2016-06-04 15:46:17.953491778 +0200
@@ -3447,10 +3447,8 @@
 		pkt_dev->last_ok = 0;
 		goto unlock;
 	}
-	atomic_add(burst, &pkt_dev->skb->users);
-
-xmit_more:
-	ret = netdev_start_xmit(pkt_dev->skb, odev, txq, --burst > 0);
+	atomic_inc(&(pkt_dev->skb->users));
+	ret = netdev_start_xmit(pkt_dev->skb, odev, txq, pkt_dev->sofar % burst != 0);
 
 	switch (ret) {
 	case NETDEV_TX_OK:
@@ -3458,8 +3456,6 @@
 		pkt_dev->sofar++;
 		pkt_dev->seq_num++;
 		pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
-		if (burst > 0 && !netif_xmit_frozen_or_drv_stopped(txq))
-			goto xmit_more;
 		break;
 	case NET_XMIT_DROP:
 	case NET_XMIT_CN:
@@ -3478,8 +3474,7 @@
 		atomic_dec(&(pkt_dev->skb->users));
 		pkt_dev->last_ok = 0;
 	}
-	if (unlikely(burst))
-		atomic_sub(burst, &pkt_dev->skb->users);
+
 unlock:
 	HARD_TX_UNLOCK(odev, txq);

Figure C.4: Patch proposed to fix the burst anomalous cloning behaviour

TRITA-ICT-EX-2016:118

www.kth.se