NVMe-Based Caching In Video-Delivery CDNs

Thomas Colonna, 1901942
Double Degree INSA Rennes
Supervisor: Sébastien Lafond
Faculty of Science and Engineering
Åbo Akademi University
2020

In HTTP-based video-delivery CDNs (content delivery networks), a critical component is the caching servers that serve clients with content obtained from an origin server. These caches store the content they obtain in RAM or on disks, so that additional clients can be served without fetching the content from the origin again. For most use cases, access to the disk remains the limiting factor, thus requiring a significant amount of RAM to avoid these accesses and achieve good performance, but increasing the cost. In this master's thesis, we benchmark various approaches to providing storage, such as regular disks and NVMe-based SSDs. Based on these insights, we design a caching module for Nginx relying on kernel bypass, implemented using the reference framework SPDK. The outcome of the master's thesis is a caching module leveraging specific properties of NVMe disks, and benchmark results for the various types of disks with the two approaches to caching (i.e., regular filesystem-based or NVMe-specific).

Contents

1 Introduction

2 Background
  2.1 Caching in the context of CDNs
  2.2 Performance of the different disk models
    2.2.1 Hard-Disk Drive
    2.2.2 Random-Access Memory
    2.2.3 Solid-State Drive
    2.2.4 Non-Volatile Main Memory
    2.2.5 Performance comparison of 2019-2020 storage devices
  2.3 Analysing Nginx
    2.3.1 Event processing
    2.3.2 Caching with Nginx
    2.3.3 Overhead
    2.3.4 Kernel bypasses

3 Related Work
  3.1 Kernel bypass for network
  3.2 Kernel bypass for storage
  3.3 Optimizing the existing access to the storage
  3.4 Optimizing the application
  3.5 Optimizing the CDN

4 Technical Choices
  4.1 First measurements
  4.2 Data visualization
  4.3 Disk performance comparison

5 NVMe-specific caching module
  5.1 Architecture of the module
    5.1.1 Asynchronous programming
  5.2 Integration with Nginx
  5.3 Experimental protocol
  5.4 Results

6 Conclusion
  6.1 Future work

1. Introduction

Broadpeak is a company created in 2010 that designs and provides components for content delivery networks. The company focuses on content delivery networks for video content, such as IPTV, cable, and on-demand video. Broadpeak provides its services to content providers and internet service providers such as Orange.

The classical ways of consuming videos and films, broadcast television and DVDs, have become less popular since the emergence of services providing on-demand video over the internet. To provide this kind of service, technologies such as the HLS [1] and DASH [2] protocols have been created. These protocols were built on standards and protocols already existing for the web, allowing video streaming to reuse a large part of the infrastructure created for the web. HLS and DASH contributed to popularizing on-demand video streaming by allowing a large number of devices (smartphones, laptops, televisions...) to access the system and by creating a standard ensuring interoperability between all the devices and servers in the network.

Now, the focus is on creating systems able to manage the increasing number of users switching from traditional television to OTT (Over The Top) streaming, because of the advantages of this technology: choice, replay, time-shifting... The lockdowns caused by COVID-19 show the importance of having optimized and scalable systems able to handle the huge demand for content. In the context of video distribution, one key element is the Content Delivery Network (CDN). The role of the CDN is to deliver the content once the user is connected to it. Because video content is large (e.g., 1.8 GB for an hour of video delivered at 4 Mbps), the pressure on CDN throughput is higher for video than for any other content. Note that this trend is strengthening, as the shift to large screens and Ultra HD means that video bitrates double. In addition, video CDNs have stringent performance requirements. Indeed, a CDN must be able to store this huge amount of data and at the same time answer user requests at a constant rate. The constant rate is important because even the slightest fluctuation can lead to a playback pause for one or more users, which has a huge negative impact on the user experience.

Video streaming relies heavily on I/O operations.

I/O operations are the bottleneck of a Content Delivery Network (CDN). To increase the performance of the system, we can increase the number of servers in the CDN and, thus, its cost and energy consumption. Improving I/O performance allows reducing the number of servers without losing performance, leading to lower cost and lower energy consumption.

Concerning I/O, with the evolution of storage device technology, the software overhead, negligible until now, is becoming huge compared to the time spent in the disk, especially the overhead related to the isolation between the kernel and user space. This isolation requires that every time a storage system is accessed, a context switch is done [3]. The POSIX API used to access files is also an element that slows down the process.

In this master's thesis, we analyze the performance of a streaming server using standard modules of Nginx. We show the time distribution between the kernel and the disk, and compare it to a configuration where an NVMe disk is used, to illustrate the impact of the overhead (Chapter 4). We present a new Nginx caching module using SPDK [4], aiming to eliminate this overhead when using NVMe disks (Chapter 5). The design of our module is similar to EvFS [5], but we specialized it for caching purposes, and we do not expose a POSIX API. We moved the API to interact with the storage device outside the kernel, reducing the overhead due to the file system API. We also designed it knowing the specific workload of the caching server, to maximize the lifetime of the flash memory of the disks. Several projects aim to achieve a similar objective: Bluestore [6] and NVMeDirect [7] are two examples.

2. Background

2.1 Caching in the context of CDNs

A Content Delivery Network (CDN) is a network composed of servers located in various geographical places [8]. Its purpose is to quickly deliver content such as web pages, files, and videos. The infrastructure of a CDN is designed to reduce the network load and the server load, leading to an improvement in the performance of the applications using it [9].

The infrastructure of a CDN relies heavily on the principle of caching. Web caching is a technology developed to allow websites and applications to cope with the increasing number of internet users [11, 10, 12, 13]. Web caching can reduce bandwidth consumption, improve load balancing, reduce network latency, and provide higher availability. To achieve caching we need at least two servers: the first one is the origin and the second one the caching server. The principle is that when a user requests content from the server, the request first goes through the caching server, which checks whether the requested content is already in the cache. If it is, the caching server provides the content to the user. If not, the caching server requests the content from the origin server that stores the data, provides it to the user, and keeps a copy of the content for the next time someone requests it, as illustrated in figure 2.1.

Figure 2.1: caching diagram.

Using the same principle, a CDN is composed of one or several origin servers, which store the content to be delivered, and multiple caching servers spread over the geographical area it serves, as shown in figure 2.2. When a user requests content from the CDN, the request goes through the closest caching server. Thus, the latency is greatly reduced, because the data does not need to travel all around the world, and the workload is spread between all the servers.

This infrastructure is widely used by the industry because of its efficiency, reliability, and performance. However, in some cases, this is not enough, as in a video CDN. In the context of a video CDN, objects are large, so the request payload is larger and the number of requests is reduced. Since there are fewer requests to serve, the load on the CPU is reduced. Because the content to deliver is larger, a video CDN requires more storage capacity. To improve the performance of these systems, we can leverage the technology used to store the data.
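The hit/miss logic of figure 2.1 can be summarized in a few lines of C; cache_lookup(), fetch_from_origin(), cache_store() and send_to_client() are hypothetical helpers used only to make the flow explicit, not functions of any real caching server.

    typedef struct object object_t;                 /* cached content (opaque here) */

    /* Hypothetical helpers illustrating the flow of figure 2.1. */
    object_t *cache_lookup(const char *url);
    object_t *fetch_from_origin(const char *url);
    void      cache_store(const char *url, object_t *obj);
    void      send_to_client(object_t *obj);

    void handle_request(const char *url)
    {
        object_t *obj = cache_lookup(url);   /* is the content already cached? */
        if (obj == NULL) {                   /* cache miss */
            obj = fetch_from_origin(url);    /* ask the origin server */
            cache_store(url, obj);           /* keep a copy for the next request */
        }
        send_to_client(obj);                 /* serve the user */
    }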

2.2 Performance of the different disk models

2.2.1 Hard-Disk Drive

The Hard-Disk Drive (HDD) is a permanent storage technology made of rotating magnetic disks that store the information. To read or write data with this kind of storage, the disk needs to place the reading head on the correct location and then start reading the data. The main advantage of this kind of storage is that it is a well-known technology and devices can be produced that store a huge amount of information. HDDs have the best capacity/price ratio, making them appealing for use in caching servers.

However, due to the multiple steps required to locate the data before reading or writing on the disk, this technology has a high latency. The latency is even greater when accessing small files; in a video CDN context these could be the video manifests or the subtitle files. The limited number of reading heads installed on an HDD increases the latency of concurrent accesses to the data stored on it.

Figure 2.2: CDN diagram.

Before the popularization of the technologies presented below, some techniques to improve the performance of HDDs were developed. Some writing strategies perform better than others: it is more efficient to split the writes onto several disks than to write everything to one disk [14]. The main inconvenience of this technique is that it requires more disks while the storage capacity is not increased. This principle is used in RAID (Redundant Array of Independent Disks) technology. It requires a large number of disks and greatly improves all aspects of a storage system's performance: speed, reliability, and availability. However, this system still performs worse than the other systems presented below.

2.2.2 Random-Access Memory

The Random-Access Memory (RAM) is a kind of data storage that is used to store "working data". It is mainly used for this purpose because it is a volatile memory, meaning that when the power supply of the system is shut off, all the data stored in the RAM is lost. In the context of a caching server, since the server is supposed to be constantly running, this is not a real problem, and a backup policy can be implemented to periodically copy all the data stored in the RAM to a more reliable storage device.

The interest of storing data in the RAM is the extremely low latency. It is the fastest memory technology (excluding CPU caches). On a caching server delivering video to a large number of clients, the performance gained by storing the data in the RAM instead of on HDDs is crucial. This is why some server models have terabytes of RAM. Although RAM performs well, it is also orders of magnitude more expensive than HDDs, as shown in table 2.1.

2.2.3 Solid-State Drive

Solid-State Drives (SSDs) are a more recent type of storage device. They are composed of electronic components, namely NAND flash memory. Unlike the RAM, SSDs are non-volatile memory storage. They are still slower to access than the RAM, but they are orders of magnitude faster than HDDs. SSDs are now so fast that the protocol used to interact with the disks is the bottleneck.

To interact with storage disks, a computer must use a protocol supported by the disks. Today there are multiple protocols. The most used are Serial AT Attachment (SATA) and Serial Attached SCSI (SAS). These protocols were designed with the slow Hard-Disk Drive (HDD) in mind. To use the full potential of SSDs, a new protocol has been created: the Non-Volatile Memory Host Controller Interface Specification, or NVM Express (NVMe). This protocol requires that the disk is connected to the computer through PCI Express (PCIe). It allows more efficient access to the data: higher speed and better parallel access. With this new protocol, SSDs offer great performance and are cheaper than RAM.

Since then, some hardware manufacturers have made significant progress and have created disks that can be considered Ultra-Low Latency SSDs [15, 16]. These disks are designed to exploit the full potential of the NVMe protocol.

2.2.4 Non-Volatile Main Memory

Non-Volatile Main Memory (NVMM) is a type of storage device composed of flash memory, like SSDs, but connected to the computer through DIMM slots. One example of this kind of storage is the Intel Optane DIMM [17].

2.2.5 Performance comparison of 2019-2020 storage devices

Table 2.1 shows the characteristics and performance of a range of 2019-2020 storage devices (HDD, SATA SSD, NVMe SSDs, and Optane devices) compared to standard RAM. Even though the RAM has the best performance, it comes at a high cost. In order to create a caching server with a large capacity, NVMe SSDs are a viable solution thanks to their price and capacity.

2.3 Analysing Nginx

Nginx is an asynchronous web server. Two large CDN and content operators have publicly stated that they use Nginx: (i) Cloudflare and (ii) Netflix. It is also used as the load balancer for HTTP services in Kubernetes. It can fulfill multiple roles: web server, reverse proxy, load balancer, mail proxy, and HTTP cache.

2.3.1 Event processing

Nginx uses a multi-process pattern where a master process controls worker processes. This allows reloading the configuration of the server without interrupting the service. Nginx is asynchronous because each request made to the server creates an event that is posted to the list of events, and the event loop picks one event and starts processing the associated request. The pseudo-code of the event loop is presented in figure 2.3.

    while (true) {
        next_timer = get_next_timer();
        wait_for_io_events(XX, next_timer - now);
        process_timer_event();
        process_list_event();
    }

Figure 2.3: Pseudo code of the event loop of Nginx

The main advantage of this model of programming is that requests are non-blocking. Whenever a request is waiting for further I/O (reading or writing from disk or network), it does not block the process so that other requests can be processed in the meantime, minimizing the CPU time spent doing nothing.
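To make this event-driven model concrete, the following is a minimal sketch of a single-threaded event loop built on Linux epoll. It is only an illustrative reduction of the loop in figure 2.3, not Nginx's actual code, and the handle_ready_connection() helper is a hypothetical stand-in for Nginx's request processing.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    /* Hypothetical stand-in for per-connection processing: it must only do
       work that is already possible, and return without blocking. */
    static void handle_ready_connection(int fd)
    {
        (void) fd;
    }

    static void event_loop(int listen_fd)
    {
        int epfd = epoll_create1(0);
        if (epfd < 0) { perror("epoll_create1"); exit(1); }

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

        struct epoll_event ready[MAX_EVENTS];
        for (;;) {
            /* Sleep until at least one descriptor is ready or the timeout
               (standing in for "next_timer - now" in figure 2.3) expires. */
            int n = epoll_wait(epfd, ready, MAX_EVENTS, 100 /* ms */);
            for (int i = 0; i < n; i++) {
                /* Each ready event is handled without blocking, so one worker
                   process can interleave many in-flight requests. */
                handle_ready_connection(ready[i].data.fd);
            }
            /* expired timers would be processed here, as in figure 2.3 */
        }
    }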

2.3.2 Caching with Nginx

Nginx is often used to build a caching server. The standard cache implemented in Nginx uses the abstraction of a file system to store the cached data. The cache can be stored on any connected storage device, even the RAM via a RAMFS (a file system located in the RAM of the computer). To access the cache, Nginx must use a system call to ask the kernel to retrieve the data. This cache is meant to be deployable everywhere; to achieve that, it does not use any special properties of the hardware of the storage device.

Since Nginx is modular, some modules implement different caching mechanisms. For instance, the module "nginx_tcache" is a module developed for Tengine [18], a fork of Nginx. This module uses a cache located in the RAM. It stores the cached data in a large RAM segment allocated at the start of the server. Since the cache is stored in RAM, whenever the server is shut down, all the data stored in the cache is lost. The same drawback applies to the standard cache of Nginx if we choose to use the RAMFS.

2.3.3 Overhead

The role of the operating system is to provide an abstraction of the hardware, allowing applications to run without worrying about the hardware they run on. This abstraction provides an API usable by any program launched on the computer. On UNIX-based systems this API is called POSIX and is used by almost every piece of software developed for UNIX systems. However, when software needs to use a specific feature of a hardware component, or needs faster access to the memory or to the network, this API is not efficient, mostly because it is too generic to expose special features.

Each time we want to access a file on a disk or send a request on the network, the program executes a system call: it calls a function provided by the kernel of the operating system. This call requires a context switch from user space, the area of memory that applications can use, to kernel space, the area of memory that the kernel of the operating system and its extensions use. This context switch is heavy on the CPU, especially if it is done frequently; a large number of cycles are lost because of it. When a program executes a system call, the calling thread initializes some registers. Then it is suspended to allow the kernel to execute the system call. Once the call is finished, the kernel restores the context of the suspended thread where it was stopped and then resumes the execution of the thread. If the system call is an I/O operation, the suspension of the thread can be long, leading to a significant loss of performance.
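To give an idea of the magnitude of this cost, the following micro-benchmark sketch times a large number of small pread() system calls on a file that is assumed to be in the page cache, so that mostly the user/kernel transition is measured. It is an illustrative sketch under those assumptions, not a measurement performed in this thesis.

    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "/tmp/testfile";
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        const long iterations = 100000;

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (long i = 0; i < iterations; i++) {
            /* Every pread() is a system call: two user/kernel transitions. */
            if (pread(fd, buf, sizeof(buf), 0) < 0) { perror("pread"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9
                  + (end.tv_nsec - start.tv_nsec);
        printf("average time per pread(): %.0f ns\n", ns / iterations);

        close(fd);
        return 0;
    }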

2.3.4 Kernel bypasses

Some use cases do not need the abstraction of the operating system and are considerably slowed down by multiple context switches, for instance when an application makes intensive use of the network. To improve performance, one of the available solutions is to bypass the kernel: the application directly uses the hardware it needs without asking the kernel, therefore there is no context switch, leading to a significant improvement in the application's performance. Software with intensive use of the network frequently uses this solution. Open vSwitch [19], a software switch, uses DPDK to improve its performance.

Without using the kernel API to access the network or the storage disks, the application achieves better performance, but it is more difficult to develop, because everything must be handled by the application itself, even the security normally handled by the kernel.
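The programming model that results from bypassing the kernel can be sketched as follows. The functions nvme_submit_read() and nvme_poll_completions() are hypothetical placeholders standing in for the submission and completion calls of a framework such as SPDK; the point is only to contrast the polling style with the system-call and interrupt model described above.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical user-space driver API (placeholders for SPDK/DPDK-style
       submission and completion calls). */
    typedef void (*io_done_cb)(void *ctx, bool ok);
    int nvme_submit_read(void *buf, size_t lba, size_t lba_count,
                         io_done_cb cb, void *ctx);
    int nvme_poll_completions(void);   /* returns number of completed I/Os */

    static volatile bool g_done;

    static void on_read_done(void *ctx, bool ok)
    {
        (void) ctx; (void) ok;
        g_done = true;
    }

    void read_block_polling(void *buf)
    {
        /* The submission writes directly to a device queue mapped in user
           space: no system call, no context switch. */
        nvme_submit_read(buf, 0 /* lba */, 8 /* blocks */, on_read_done, NULL);

        /* Instead of sleeping on an interrupt, the application polls until
           the device posts the completion. */
        while (!g_done) {
            nvme_poll_completions();
        }
    }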

Model                                | Cost  | Throughput (GBps) read/write | IOPS read/write | Latency (µs) read/write | Endurance (PBW) | Connectivity | Normalized throughput (GBps/kC) | Normalized capacity (GB/C)
RAM 128 GB DDR4-2666                 | 1600C | 21.3                | -               | 0.08        | -     | DIMM | 13.31 | 0.08
HDD 2 TB                             | 150C  | 0.19                | -               | 4000        | -     | SATA | 1.26  | 13.65
SSD D3-S4510 1.92 TB 2.5"            | 520C  | 0.56 / 0.51         | 97000 / 35500   | 36          | 7.1   | SATA | 1.07  | 3.78
NVMe SSD P4610 1.6 TB 2.5"           | 760C  | 3.2 / 2.08          | 643000 / 199000 | 77 / 18     | 12.25 | PCIe | 4.21  | 2.15
NVMe SSD P4510 2 TB 2.5"             | 544C  | 3.2 / 2.0           | 637000 / 81500  | 77 / 18     | 2.61  | PCIe | 5.88  | 3.76
NVMe SSD PM1725b 1.6 TB 2.5"         | 650C  | 3.5 / 2.0           | 720000 / 135000 | -           | 8.76  | PCIe | 5.38  | 2.52
NVMe SSD PM1725b 1.6 TB HHHL         | 660C  | 5.4 / 2.0           | 750000 / 135000 | -           | 8.76  | PCIe | 8.18  | 2.48
NVMe Optane P4800X 1.5 TB 2.5"       | 5270C | 2.5 / 2.2           | 550000 / 550000 | 10          | 164   | PCIe | 0.47  | 0.29
NVMe Optane SSD 900P 280 GB 2.5"     | 500C  | 2.5 / 2.0           | 550000 / 500000 | 10          | 5.11  | PCIe | 5.00  | 0.56
DIMM Optane (Apache Pass) 128 GB PMM | 520C  | 6.8 / 1.85          | -               | 0.2         | 292   | DIMM | 13.08 | 0.24
DIMM Optane (Apache Pass) 512 GB PMM | 6000C | 5.3 / 1.89          | -               | 0.2         | 300   | DIMM | 0.88  | 0.08

Table 2.1: Price and performance of storage technologies

3. Related Work

3.1 Kernel bypass for network

One of the first uses of kernel bypass was in networking. The DPDK (Data Plane Development Kit) project is a popular project used to achieve kernel bypass for networking. The F-Stack project is also frequently used for the same purpose.

3.2 Kernel bypass for storage

This is not the first project to use kernel bypass to store data. Ceph's Bluestore [6] is one example; however, its main goal is not to provide a memory cache, but to permanently store data.

EvFS [5] is also a kernel-bypass storage project. Its goal is to provide a POSIX API in user space, meaning that a program that needs access to the data on an NVMe drive does not need to make a system call to retrieve it, eliminating the overhead. Exposing a POSIX API makes it easier to integrate into pre-existing systems.

The NVMeDirect project [7] aims to give applications direct access to the NVMe SSD to improve their performance. With this project, other applications can still use the default I/O stack to access the storage device, whereas the SPDK framework claims the whole device, and it becomes impossible to access the storage device without using SPDK. The project also provides different features aimed at making the interaction with the storage device easier and more flexible.

3.3 Optimizing the existing access to the storage

The I/O stack and the abstractions developed with it were created when storage devices were slow compared to the CPU. Some projects are trying to modernize the existing kernel I/O stacks to exploit the full potential of the new storage technologies.

Enberg, Rao, and Tarkoma proposed a new structure for the operating system, which they called the parakernel [20]. The role of this structure is to partition the hardware of the computer to allow different applications to use it at the same time. Thanks to this separation, it is easier to parallelize access to I/O devices. With this structure, a program that needs some resources asks the parakernel for access, and the parakernel isolates a part of the device and gives access to the program. The program can then interact directly with the device, without involving the kernel.

Lee, Shin, Song, Ham, Lee, and Jeong [21] created a new I/O stack to use with NVMe devices. This I/O stack is optimized for low-latency NVMe SSDs like the Intel Optane [16] or the Samsung Z-SSD [15]. They do not create a new API; they change the implementation of some functions of the POSIX standard to take advantage of the hardware. To improve performance, they developed a special block I/O layer, lighter and faster but only usable with NVMe SSDs. They also overlapped the creation of the structures necessary to store the request result with the data transfer from the device to memory, and they implemented several lazy mechanisms.

3.4 Optimizing the application

KVell [22] is a fast key-value store using NVMe SSDs. This project focuses on optimizing the application, and not the I/O stack, to improve its performance and benefit from the specific hardware. To achieve that, KVell uses a shared-nothing approach: the goal is to minimize shared state between threads, to avoid synchronization overhead. It also reduces the number of I/O system calls to reduce the number of context switches. This project uses existing tools in a different way to adapt to new technologies.

The uDepot project [23] is also a key-value storage application. It provides a programmer-friendly interface to develop applications using storage on NVMe SSDs. It is also used in an experimental implementation of a cloud cache service.

3.5 Optimizing the CDN

This thesis is centered on optimizing the application used by the CDN to deliver the content, but there are also other ways to improve the performance and the cost of the CDN. One of these methods is to carefully plan how the CDN is deployed, to minimize the amount of unnecessary equipment. To do so, we can use the CaDeOp (Cache Deployment Optimization) problem [24].

4. Technical Choices

This chapter presents the results of the performance tests of the existing solutions. These will serve as a reference to evaluate the performance of the module we develop. The measurements are done by requesting 1 MiB files from the server.

4.1 First measurements

Cache technology     | RAM  | SSD  | HDD
Throughput (GB/s)    | 1.47 | 1.04 | 0.12
Average latency (ms) | 22   | 101  | 500
Max latency (ms)     | 300  | 1950 | 2000

Table 4.1: Results of the benchmarks of the existing solutions

Table 4.1 presents the results of the measurements done to evaluate some existing technologies for caching. In this table, the throughput is the throughput of the server, measured on the client side, and the latency is the time spent between the request and the response to this request. The configurations tested here are: a file system in RAM, a file system on an SSD, and a file system on an HDD. These results show that the cache using the hard-disk drive is the slowest and does not correspond to the needs of a caching server. The performance of the SSD is much closer to that of the RAM than to that of the HDD. However, its latency is higher, indicating more irregular performance than RAM. This observation invites reconsidering the split between RAM and disk storage for caching.

4.2 Data visualization

To visualize the results of the benchmarks of the different cache setups, we use a type of graph called "Flamegraphs" [25]. This kind of graph shows the time spent in each function. With an Off-CPU analysis, we can draw a graph that shows the time the CPU spent waiting for Input/Output operations. With the combined graph, we can see the proportion of waiting time and computing time.

On a Flamegraph, each block represents a call to a function. The width of the block represents the proportion of time spent in this function. If the width of the block is equal to the width of the graph, this means that 100% of the time was spent in this function. The stack of blocks represents the stack of function calls.

Figure 4.1: Flamegraph example

Figure 4.1 shows a simplified Flamegraph. In this example, 100% of the time is spent in the function foo. Above the block representing the function foo, there is a block representing the function bar, meaning that the function bar is called by the function foo. The width of the "bar" block is half the width of the "foo" block. We can deduce that half of the time spent in the function foo is spent in the call to the bar function. With this representation, we can easily see where the program spent the most time, and which action took the longest.

There is also a kind of Flamegraph where the time spent waiting for an I/O event is represented. Figure 4.2 is an example of this type of Flamegraph. In this document, "On-CPU" time, the time the CPU spent calculating, will always be represented in shades of yellow/orange, and "Off-CPU" time, the time the CPU spent waiting for events, will always be represented in shades of blue.

4.3 Disk performance comparison

Figure 4.3 shows the time spent by the CPU waiting for accesses to the disk to finish. Compared to the time spent computing, the wait time is huge. With this information, and the results presented in section 4.1, we can say that hard-disk drives are not a viable solution for a high-performance caching server. With the Solid-State Drive configuration (results in figure 4.4), the proportion of time spent waiting to access the disk is similar to the configuration with the HDD. However, since the average latency (table 4.1) is smaller, the server can answer more requests.

Figure 4.2: Flamegraph with Off-CPU time

Figure 4.3: Flamegraph, HDD configuration, HTTP

The configuration with the RAM file system is the most efficient (figure 4.5). It spends less time waiting for the storage to respond than the two other configurations. Using the Flamegraphs and table 4.1, we can conclude that the RAMFS and the SSD are the ones suitable for a high-performance caching server.

Figure 4.4: Flamegraph, SSD configuration, HTTP

When using the HTTPS protocol, the performance of these two configurations is similar. Figures 4.6 and 4.7 show the wait time and the computing time with the HTTPS protocol. With kernel bypass, it is possible to reduce the proportion of time spent waiting for access to the storage in the SSD configuration, and make it comparable to the proportion of time spent waiting in the RAMFS configuration.

Figure 4.5: Flamegraph, RAMFS configuration, HTTP

Figure 4.6: Flamegraph, SSD configuration, HTTPS

Figure 4.7: Flamegraph, RAMFS configuration, HTTPS

5. NVMe-specific caching module

5.1 Architecture of the module

The caching module is composed of two different parts. The first part is completely independent of Nginx and can be compiled alone to simplify the development of the module. This part is responsible for the interaction with the disk; the data is stored on the disk and the metadata is stored in the RAM. The other part of the module is in charge of the communication with Nginx. This part is as small as possible, and is largely inspired by the "tcache" module.
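One way to picture the split between on-disk data and in-RAM metadata is the sketch below; the field layout is illustrative and is not the exact structure used by the module.

    #include <stdint.h>

    /* Illustrative in-RAM metadata describing one object cached on the NVMe
       disk. The payload itself lives only on the disk; the RAM keeps just
       enough information to locate and manage it. */
    struct cache_entry {
        uint64_t key_hash;        /* hash of the cache key (e.g. the request URI) */
        uint64_t start_lba;       /* first logical block of the object on disk */
        uint32_t lba_count;       /* number of blocks occupied */
        uint32_t object_size;     /* exact size in bytes (last block may be partial) */
        uint64_t last_access;     /* used by the eviction decision */
        struct cache_entry *next; /* hash-bucket chaining */
    };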

5.1.1 Asynchronous programming

Because the architectures of Nginx and SPDK are asynchronous, the architecture of the caching module must be asynchronous as well. Nginx uses this to increase the number of requests handled at the same time: when a request arrives, its treatment is immediately started and is non-blocking, allowing Nginx to receive another request. Interacting with a storage device takes time, even if the disk is really fast; to avoid blocking the application, the architecture of the SPDK framework is asynchronous. Asynchronous programming makes the code of the program and its flow of execution harder to understand, because we know when a function is called but not when it will return a value.

Figure 5.1 shows the steps of the treatment of a request. When a request arrives, the program checks whether the requested data is present on the disk. To do so, it uses the metadata stored in the RAM, so this step can be done synchronously. From here there are two possibilities: either the data is present or it is not. In the first case, the program answers the request with the data from the disk. This is done asynchronously because it involves Nginx and SPDK. In the case where the data is not on the disk, the program requests the data from the upstream server. Since we do not know how much time this step will take, and it must not block the program, this is done asynchronously. When the data is received, the server sends it to the client that requested it (this can be done while it is still receiving the data from the upstream server, thanks to the architecture of Nginx). Once all the data is gathered, it is stored onto the disk.

Figure 5.1: Flow of a request treatment.
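A condensed, callback-style view of this flow could look like the following sketch, where lookup_metadata(), disk_read_async(), upstream_fetch_async() and the send/store helpers are hypothetical names standing in for the module's real functions.

    /* Hypothetical types and functions sketching the flow of figure 5.1. */
    typedef struct request     request_t;
    typedef struct cache_entry cache_entry_t;

    cache_entry_t *lookup_metadata(const char *key);       /* synchronous, RAM only */
    void disk_read_async(cache_entry_t *e, void (*cb)(request_t *), request_t *r);
    void upstream_fetch_async(const char *key, void (*cb)(request_t *), request_t *r);
    void send_response(request_t *r);
    void store_on_disk_async(request_t *r);

    static void on_disk_read_done(request_t *r)  { send_response(r); }

    static void on_upstream_done(request_t *r)
    {
        send_response(r);         /* can start while data is still arriving */
        store_on_disk_async(r);   /* keep a copy for the next client */
    }

    void handle_cache_request(request_t *r, const char *key)
    {
        /* Step 1: the metadata lives in RAM, so the lookup is synchronous. */
        cache_entry_t *e = lookup_metadata(key);

        if (e != NULL) {
            /* Hit: read the object from the NVMe disk asynchronously. */
            disk_read_async(e, on_disk_read_done, r);
        } else {
            /* Miss: fetch from the upstream (origin) server asynchronously. */
            upstream_fetch_async(key, on_upstream_done, r);
        }
        /* In both cases control returns to the event loop immediately. */
    }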

As shown in figure 5.2, to store data on the disk, the program needs to check whether the disk has free space; if not, it selects which files to delete. This is done synchronously using the metadata. All the following steps are done asynchronously: the program deletes the marked files and writes the data to store on the disk.

Figure 5.2: Flow of the data storage.
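The storage path follows the same pattern: the eviction decision is taken synchronously on the in-RAM metadata, and only the disk operations are asynchronous. The sketch below again uses hypothetical helper names, not the module's actual functions.

    #include <stddef.h>

    /* Hypothetical helpers sketching the flow of figure 5.2. */
    typedef struct object object_t;
    size_t disk_free_blocks(void);
    size_t blocks_needed(const object_t *obj);
    size_t mark_victim_for_deletion(void);   /* returns blocks to be reclaimed */
    void   disk_delete_async(void (*cb)(object_t *), object_t *obj);
    void   disk_write_async(object_t *obj);

    static void on_victims_deleted(object_t *obj) { disk_write_async(obj); }

    void store_object(object_t *obj)
    {
        size_t needed      = blocks_needed(obj);
        size_t reclaimable = 0;
        int    evict       = 0;

        /* Synchronous part: the decision uses only the in-RAM metadata. */
        while (disk_free_blocks() + reclaimable < needed) {
            reclaimable += mark_victim_for_deletion();
            evict = 1;
        }

        /* Asynchronous part: delete the marked files (if any), then write. */
        if (evict)
            disk_delete_async(on_victims_deleted, obj);
        else
            disk_write_async(obj);
    }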

5.2 Integration with Nginx

To interact with the network and the storage devices, Nginx uses system calls based on the principle of kernel interrupts: the kernel stops the program, does what the program requested, and resumes the execution of the program where it was stopped. The SPDK framework uses a different approach to interact with the storage devices. Since the objective of SPDK is to allow programs running in user space to access the storage device without involving the kernel, the principle of kernel interrupts is not applicable. To solve this, the SPDK framework uses a polling model: the framework periodically checks whether the action performed on the device is finished. To integrate the caching module using SPDK into Nginx, we added a polling mechanism to allow SPDK to work properly.
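A minimal way to graft such a polling model onto Nginx's event loop is to register a recurring timer event whose handler drains the completions. In the sketch below, ngx_event_t and ngx_add_timer() are regular Nginx primitives, while cache_module_poll() is a hypothetical wrapper around the module's SPDK completion processing; the actual integration may differ.

    #include <ngx_config.h>
    #include <ngx_core.h>
    #include <ngx_event.h>

    #define POLL_INTERVAL_MS  1    /* how often the event loop runs the poller */

    /* Hypothetical wrapper around the module's SPDK completion processing
       (e.g. draining the I/O queue pair used by the cache). */
    int cache_module_poll(void);

    static ngx_event_t  poll_event;

    static void
    cache_poll_handler(ngx_event_t *ev)
    {
        /* Drain whatever I/O completions the NVMe device has posted. */
        cache_module_poll();

        /* Re-arm the timer so the event loop keeps polling periodically. */
        ngx_add_timer(ev, POLL_INTERVAL_MS);
    }

    static void
    cache_poller_init(ngx_log_t *log)
    {
        poll_event.handler = cache_poll_handler;
        poll_event.data    = NULL;
        poll_event.log     = log;

        ngx_add_timer(&poll_event, POLL_INTERVAL_MS);
    }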

5.3 Experimental protocol

The experiment is executed on a server with the following configuration: two Intel® Xeon® Gold 5218 CPUs at a frequency of 2.30 GHz [26], 188 GiB of RAM, and Samsung PM1725b NVMe SSDs [27]. The performance of the module developed in chapter 5 is compared to different caching configurations with Nginx. The list of the configurations tested is the following:

• Nginx with a RAM FS: the cache is stored in a file system located in the RAM.

• Nginx with the tcache module: tcache provides a different RAM cache than standard Nginx.

• Nginx with two disks in RAID 0: the cache is stored in a file system on two SSDs mounted in RAID 0.

• Nginx with an FS on an NVMe drive: the cache is stored in a file system on an NVMe SSD.

• Nginx with SPDK: the module developed in chapter 5. The cache is stored directly on the NVMe drive without any file system.

To measure the performance of the different configurations, two virtual machines are configured. On the first one, we execute the caching server with one of the configurations above; it runs on two cores of one of the CPUs available on the server. On the second virtual machine, we execute a benchmark tool called WRK2 [28]; it runs on all the cores of the second processor. This is done to be able to saturate the first virtual machine and see how it behaves. Figure 5.3 shows how the different components of the server are connected.

Figure 5.3: experimental setup diagram.

A few things to note: the virtual machine with the Nginx server is executed on the processor whose socket is connected to the PCI port where the NVMe drive is attached, and both virtual machines have direct access to the hardware of the network card to which they are connected (PCI pass-through). The network cards can handle a 25 Gb/s throughput.

To obtain measurements that are meaningful in the context of video distribution, the workload needs to have a specific format. The experiment is done with each of the following:

• Large object format: this is the format used for MPEG2-TS [1]. Each segment is composed of one file containing both the audio and the video.

• DASH [2]-like format: each segment is composed of three files: one large containing the video, one medium containing the audio, and one small containing the subtitles. This last file is not sent to every user; for the experiment, we consider that only 10% of the users request it.

• HLS [1]-like format: each segment is composed of five files: one large containing the video, one medium containing the audio, one small containing the subtitles, and two small containing the manifests describing how the next segments are named (one for the audio and one for the video). For the subtitles, we again consider that only 10% of the users request them.

• Low-latency HLS [1]-like format: each segment is composed of five files: one medium containing the video, one small-medium containing the audio, one small containing the subtitles, and two small containing the manifests describing how the next segments are named (one for the audio and one for the video). For the subtitles, we again consider that only 10% of the users request them.

The difference between the file sizes of HLS and Low-latency HLS is caused by the length of the segments: an HLS segment is 2 seconds long and a Low-latency HLS segment is 500 milliseconds long. With the same video bitrate, the two protocols therefore do not produce the same file sizes.

The DASH protocol does not need any manifest, because when the transmission starts, a description of the names of the segments is provided. For instance, video files are named with patterns like "Video_XXXX", and audio files are named with patterns like "Audio_XXXX", where XXXX is a number that increments every two seconds (the length of a segment). With this description, there is no need for a manifest.

To perform the measurements, we also need an access pattern for the data. Since Broadpeak has a strong focus on live broadcasting, where all the clients are consuming the same content at the same time, all the responses to the requests are read from the cache.

5.4 Results

Due to the COVID-19 pandemic, the experiment could not be carried out; therefore, there are no results to present.

6. Conclusion

This master's thesis presented a new NVMe-based video caching module for Nginx. It uses the SPDK framework to achieve kernel bypass and improve the efficiency of the module. The overhead of accessing the storage is significantly reduced, allowing lower CPU consumption and full utilization of the storage devices. This cache module still performs worse than a cache module designed with a file system in the RAM; however, NVMe SSDs are cheaper than RAM.

6.1 Future work

This project is just a prototype; to be able to use it in a production environment, some work still needs to be done. First, the cache must implement a caching policy to acquire new video files. This cache policy must be designed to maximize the durability of the NVMe SSDs, to avoid changing them too often. Remember that the goal of this project is to reduce the cost of the caching server: if it burns through SSDs too fast, it will not achieve its goal.

Some work also needs to be done on the way the network is handled. As explained in section 5.2, Nginx was designed to use kernel interrupts to interact with the network and the storage devices. With the caching module, Nginx also uses the polling mechanism to interact with the NVMe disks. The combination of both methods of handling the I/O operations is not optimal. To solve this and achieve better performance, the solution is to use only one of the methods. Since SPDK uses the polling mechanism and is based on the DPDK framework designed to achieve kernel bypass for the network, a solution could be to update Nginx to use DPDK. Some projects, like F-Stack, have already implemented a kernel bypass for the network in Nginx and could serve as examples.

With the current implementation of the caching module, multi-process usage is not possible. To increase performance and scalability, allowing the caching module to use multiple processes is very important.

References

[1] R. Pantos and W. May, RFC 8216 - HTTP Live Streaming. [Online]. Available: https://tools.ietf.org/html/rfc8216.

[2] International Organization for Standardization, ISO/IEC 23009-1:2019 Information technology — Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats.

[3] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe, "Arrakis: The operating system is the control plane," in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO: USENIX Association, Oct. 2014, pp. 1–16, ISBN: 978-1-931971-16-4. [Online]. Available: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter.

[4] Z. Yang, J. R. Harris, B. Walker, D. Verkamp, C. Liu, C. Chang, G. Cao, J. Stern, V. Verma, and L. E. Paul, "SPDK: A development kit to build high performance storage applications," in 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Dec. 2017, pp. 154–161.

[5] T. Yoshimura, T. Chiba, and H. Horii, "EvFS: User-level, event-driven file system for non-volatile memory," in 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19), Renton, WA: USENIX Association, Jul. 2019. [Online]. Available: https://www.usenix.org/conference/hotstorage19/presentation/yoshimura.

[6] A. Aghayev, S. Weil, M. Kuchnik, M. Nelson, G. R. Ganger, and G. Amvrosiadis, "File systems unfit as distributed storage backends: Lessons from 10 years of Ceph evolution," in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP '19, Huntsville, Ontario, Canada: ACM, 2019, pp. 353–369, ISBN: 978-1-4503-6873-5. [Online]. Available: http://doi.acm.org/10.1145/3341301.3359656.

[7] H.-J. Kim, Y.-S. Lee, and J.-S. Kim, "NVMeDirect: A user-space I/O framework for application-specific optimization on NVMe SSDs," in 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 16), Denver, CO: USENIX Association, Jun. 2016. [Online]. Available: https://www.usenix.org/conference/hotstorage16/workshop-program/presentation/kim.

[8] J. Dilley, B. Maggs, J. Parikh, H. Prokop, R. Sitaraman, and B. Weihl, "Globally distributed content delivery," IEEE Internet Computing, vol. 6, no. 5, pp. 50–58, Sep. 2002, ISSN: 1941-0131. [Online]. Available: https://doi.org/10.1109/MIC.2002.1036038.

[9] E. Nygren, R. K. Sitaraman, and J. Sun, "The Akamai network: A platform for high-performance internet applications," SIGOPS Oper. Syst. Rev., vol. 44, no. 3, pp. 2–19, Aug. 2010, ISSN: 0163-5980. [Online]. Available: https://doi.org/10.1145/1842733.1842736.

[10] M. Abrams, C. R. Standridge, G. Abdulla, S. Williams, and E. A. Fox, "Caching proxies: Limitations and potentials," Department of Computer Science, Virginia Polytechnic Institute & State . . ., Tech. Rep., 1995. [Online]. Available: http://hdl.handle.net/10919/19946.

[11] J. Wang, "A survey of web caching schemes for the internet," SIGCOMM Comput. Commun. Rev., vol. 29, no. 5, pp. 36–46, Oct. 1999, ISSN: 0146-4833. [Online]. Available: https://doi.org/10.1145/505696.505701.

[12] H. Yu, L. Breslau, and S. Shenker, "A scalable web cache consistency architecture," in Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, ser. SIGCOMM '99, Cambridge, Massachusetts, USA: Association for Computing Machinery, 1999, pp. 163–174, ISBN: 1581131356. [Online]. Available: https://doi.org/10.1145/316188.316219.

[13] G. Barish and K. Obraczka, "World Wide Web caching: Trends and techniques," IEEE Communications Magazine, vol. 38, no. 5, pp. 178–184, May 2000, ISSN: 1558-1896. [Online]. Available: https://doi.org/10.1109/35.841844.

[14] O. Garrett, Cache Placement Strategies for NGINX and NGINX Plus. [Online]. Available: https://www.nginx.com/blog/cache-placement-strategies-nginx-plus/.

[15] Samsung, Ultra-Low Latency with Samsung Z-NAND SSD. [Online]. Available: https://www.samsung.com/us/labs/pdfs/collateral/Samsung_Z-NAND_Technology_Brief_v5.pdf.

[16] Intel, Breakthrough performance for demanding storage workloads. [Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/optane-ssd-905p-product-brief.pdf.

[17] ——, [Online]. Available: https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html.

[18] Tengine. [Online]. Available: https://github.com/alibaba/tengine.

[19] Open vSwitch. [Online]. Available: https://github.com/openvswitch/ovs.

[20] P. Enberg, A. Rao, and S. Tarkoma, "I/O is faster than the CPU: Let's partition resources and eliminate (most) OS abstractions," in Proceedings of the Workshop on Hot Topics in Operating Systems, ser. HotOS '19, Bertinoro, Italy: ACM, 2019, pp. 81–87, ISBN: 978-1-4503-6727-1. [Online]. Available: http://doi.acm.org/10.1145/3317550.3321426.

[21] G. Lee, S. Shin, W. Song, T. J. Ham, J. W. Lee, and J. Jeong, "Asynchronous I/O stack: A low-latency kernel I/O stack for ultra-low latency SSDs," in 2019 USENIX Annual Technical Conference (USENIX ATC 19), Renton, WA: USENIX Association, Jul. 2019, pp. 603–616, ISBN: 978-1-939133-03-8. [Online]. Available: https://www.usenix.org/conference/atc19/presentation/lee-gyusun.

[22] B. Lepers, O. Balmau, K. Gupta, and W. Zwaenepoel, "KVell: The design and implementation of a fast persistent key-value store," in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP '19, Huntsville, Ontario, Canada: ACM, 2019, pp. 447–461, ISBN: 978-1-4503-6873-5. [Online]. Available: http://doi.acm.org/10.1145/3341301.3359628.

[23] K. Kourtis, N. Ioannou, and I. Koltsidas, "Reaping the performance of fast NVM storage with uDepot," in 17th USENIX Conference on File and Storage Technologies (FAST 19), Boston, MA: USENIX Association, Feb. 2019, pp. 1–15, ISBN: 978-1-939133-09-0. [Online]. Available: https://www.usenix.org/conference/fast19/presentation/kourtis.

[24] S. Hasan, S. Gorinsky, C. Dovrolis, and R. K. Sitaraman, "Trade-offs in optimizing the cache deployments of CDNs," in IEEE INFOCOM 2014 - IEEE Conference on Computer Communications, 2014, pp. 460–468.

[25] B. Gregg, "Visualizing performance with flame graphs," Santa Clara, CA: USENIX Association, Jul. 2017.

[26] Intel® Xeon® Gold 5218 Processor. [Online]. Available: https://ark.intel.com/content/www/us/en/ark/products/192444/intel-xeon-gold-5218-processor-22m-cache-2-30-ghz.html.

[27] Samsung PM1725b NVMe SSD. [Online]. Available: http://image-us.samsung.com/SamsungUS/PIM/Samsung_1725b_Product.pdf.

[28] Wrk2. [Online]. Available: https://github.com/giltene/wrk2.

[29] B. Walker and J. Harris, 10.39M Storage I/O Per Second From One Thread, 2019. [Online]. Available: https://spdk.io/news/2019/05/06/nvme/.

[30] I. Zhang, J. Liu, A. Austin, M. L. Roberts, and A. Badam, "I'm not dead yet!: The role of the operating system in a kernel-bypass era," in Proceedings of the Workshop on Hot Topics in Operating Systems, ser. HotOS '19, Bertinoro, Italy: ACM, 2019, pp. 73–80, ISBN: 978-1-4503-6727-1. [Online]. Available: http://doi.acm.org/10.1145/3317550.3321422.

[31] C. Wu, J. M. Faleiro, Y. Lin, and J. M. Hellerstein, "Anna: A KVS for any scale," 2018. [Online]. Available: https://dsf.berkeley.edu/jmh/papers/anna_ieee18.pdf.

[32] R. Kadekodi, S. K. Lee, S. Kashyap, T. Kim, A. Kolli, and V. Chidambaram, "SplitFS: Reducing software overhead in file systems for persistent memory," in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP '19, Huntsville, Ontario, Canada: ACM, 2019, pp. 494–508, ISBN: 978-1-4503-6873-5. [Online]. Available: http://doi.acm.org/10.1145/3341301.3359631.