Evaluating and improving kernel stack performance for datagram sockets from the perspective of RDBMS applications

Sowmini Varadhan, Tushar Dave
Oracle Corporation, Redwood City, CA
{sowmini.varadhan, tushar.n.dave}@oracle.com

Abstract

Applications implementing Relational Database Management Services (RDBMS applications) use stateless datagram sockets such as RDS [9] and UDP [8]. These workloads are typically highly CPU-bound and sensitive to network latency, so any performance enhancement to these networking paths is attractive for RDBMS workloads. We share some findings from our benchmarking experiments using the kernel stack for datagram-based networking sockets in RDBMS applications, and discuss potential ways of improving in-stack performance using socket types such as PF_PACKET. Socket types such as PF_PACKET allow benefits such as shared memory-mapped buffers between user and kernel, and offer a streamlined data-path with minimal data-copy. However, incorporating these methods into user-space software libraries for Database applications must satisfy existing API constraints, which we describe in this paper.

As part of this effort we also gained some insights into generic performance analysis methods and the restrictions imposed by real-world deployments that run multiple workloads with varying packet processing profiles. We share some of those insights in this paper.

Keywords

RDBMS, PF_PACKET, UDP, Benchmarking

Introduction

Relational Database Management Service environments are composed of a mix of applications, many of which involve transaction-based processing over the network. The services offered by these applications tend to be highly CPU-bound. The performance challenge in this environment is to deal with a large volume of network I/O in an efficient manner.

At the same time, since I/O for these RDBMS applications comes from various sources (network, disk, local file-system and NFS), APIs for performance enhancements are also a critical factor.

Motivated by these goals and constraints, we have investigated a few kernel alternatives to UDP/IP, such as PF_PACKET. Our investigation has used micro-benchmarks such as netperf, and there is an ongoing effort to evaluate PF_PACKET usage in Inter Process Communication (IPC) libraries for actual transaction-oriented RDBMS workloads.

In addition to the actual numbers themselves, interesting points that emerged during our investigation were the gaps between micro-benchmarks and real-world usage, critical features that directly impacted real workloads, and practical factors that impacted deployment for any solution.

The remainder of the paper is organized as follows. We first describe two types of environments commonly encountered in RDBMS deployments that are highly sensitive to network latency, and the factors affecting performance in these environments. We provide an overview of the application constraints on APIs and tuning parameters offered by performance-accelerating schemes in these environments. The Oracle clustering environment currently uses UDP [8] and RDS [9] sockets. We are currently evaluating the usage of PF_PACKET sockets in the Cluster. We describe the micro-benchmark and test-suites used for this evaluation and share the current results from our benchmarking effort as well as ongoing work in this space. Finally, we share some thoughts on ways to improve stack latency that are currently under investigation.

Latency sensitive use-cases encountered in RDBMS

There are two types of use-cases in RDBMS environments that are sensitive to network latency.

1. Cluster applications offering services in a distributed computing environment. These services are CPU-bound, request-response transactions involving UDP flows that can be clearly identified by a 4-tuple.

2. Extract Transform Load (ETL) [2]. Here the input comes in as raw data in JSON or comma-separated values. The Compute Node converts the input to a Relational Database format that is stored to disk. The conversion to the Relational Database format is a CPU-intensive transform that preserves all the information while compressing the amount of data to be stored to the disk. The challenge in these environments is to find the right balance between the CPU cycles needed for the transform itself, while also keeping up with the input rate coming in over the network.

A notable aspect of these environments is that the performance-critical flows involve packets sized close to, or larger than, the MTU of the link. For example, the Distributed Lock Manager is a typical Cluster service where the performance-critical traffic involves client requests of 512 bytes, with responses that are usually 8192 bytes. Although improving the network latency of small (64 byte) packet flows tends to be the focus of common benchmarking and performance investigations, the challenge at the opposite end of the spectrum, namely improving performance for large packets, is less well understood.

The ETL and cluster use-cases differ in the type of transport protocol and socket API used for networking. Cluster services tend to be stateless, with un-connected datagram sockets, whereas the ETL traffic comes over a stateful TCP connection. UDP based cluster services manage user-space state involving sequence number management and acknowledgement tracking with retransmissions to ensure guaranteed, reliable, ordered delivery over unconnected datagram sockets. The stateless nature of the cluster services, and the intrinsic simplicity of the UDP protocol, render them more amenable to techniques that attempt to bypass the kernel protocol implementations. As a result of this observation, the focus of this benchmarking study was on UDP based cluster applications typified by the Lock Management Server.
Lock Management Server

The Lock Management Server (LMS) is a service provided by the Oracle Real Applications Cluster. The LMS is a Distributed Lock Manager that is implemented as a set of processes in the cluster which handle transaction-oriented exchanges with clients that wish to obtain read-only locks on specific buffers owned by back-end database instances. Acquisition of the lock from the LMS is the bottleneck for the Database Transactions; reducing the latency of LMS transactions is thus critical to system performance. The servers listen on a dynamically determined range of ports, and an incoming client request is assigned to a server based on a hash of fields in the UDP payload. The client is blocked until the server's response is received. In addition, the client has to process the server's response before it can send the next request. The interval between subsequent client requests is thus variable, and client input in this environment tends to have a bursty profile.

The server is the actual bottleneck for system performance in the LMS environment. Computations at the server are CPU bound and occur as follows. In the steady state the server has the targeted buffer block loaded into its cache. A single database instance holds the exclusive lock on the buffer in this state, and the block can only be modified by the owner of the exclusive lock. The LMS has to make a consistent read-only copy of the block when it is requested by a client. The read-only copy is sent back as the server response to the client's request.

All of the above computations are CPU intensive. In addition, since the clients assigned to a given server instance are blocked until the current request is serviced, the latency seen by the clients is gated by the server-side latency.

Common techniques for improving network I/O performance include

• I/O batching
• reduction of system call overhead
• efficient management of context switches

The applicability of each of these techniques to the LMS environment is discussed below.

I/O batching: Batching of network input allows the application to receive multiple client requests efficiently and process them as a batch, instead of processing one request at a time with a context switch per request.

For the LMS model, receive-side batching is easily adaptable to the application paradigm. When the LMS is woken from a poll(), select() or epoll, it begins reading packets from the file descriptor until it either runs out of input, or runs out of buffers into which it can process the input. A typical LMS server has about 64 clients hashing to its service port, so the likelihood of finding multiple packets waiting at the input queue is high. Thus batching on the receive side is beneficial to system performance.

Batching of outgoing responses is a more complicated task. Since a client is blocked until the response to an outstanding request comes back, and the client cannot send the next request until the response is processed, inefficient transmit-side batching by the server can aggravate burstiness in the network traffic, resulting in sub-optimal system performance. The LMS implementation that we used for our study implemented receive-side batching of input, but did not batch the outgoing responses.

Reduction of system-call overhead: Current deployments of LMS use UDP and RDS transports for inter-process communication. Both of these transports involve one sendmsg() or recvmsg() call per I/O operation, thus triggering the associated overhead per packet.

Performance boosting methods such as NETMAP [11] and PF_PACKET [7] allow the application to use shared memory buffers with the kernel and eliminate the need for the system call per packet. In addition, the Linux kernel offers the recvmmsg() system call since Linux 2.6.33, which allows the application to read a batch of input datagrams in one system call. Both of these techniques were investigated as part of the benchmarking effort.

Efficient management of context switches: The LMS server has to switch between CPU-intensive back-end computations for servicing the client's request, and CPU cycles to process network I/O. As a consequence of the receive-side batching model, when the server runs out of network input it falls back to poll(). This results in a context switch that allows the back-end computation to progress. The approach of having a dedicated CPU for network I/O and a different CPU for back-end computation has the drawback that it would result in cache-miss latencies and would not mitigate the performance problem. The ability to dynamically adjust the batch-size to the input receive rate can result in improved system performance. In our study, techniques such as PF_PACKET with TPACKET_V3 allowed us to explore this possibility.
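To make the receive-side batching and system-call reduction described above concrete, the following minimal sketch (not the production LMS code; the function name, batch size, buffer sizes and the commented-out response path are illustrative assumptions) drains up to 64 queued datagrams with a single recvmmsg() call after the server is woken from poll(). The 64-packet batch and 10 ms timeout mirror the values used by the netperf server described later in this paper.

    #define _GNU_SOURCE
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <poll.h>
    #include <string.h>
    #include <time.h>

    #define BATCH  64
    #define BUF_SZ 8192

    static void serve_batches(int sock)
    {
        static char bufs[BATCH][BUF_SZ];          /* request buffers      */
        static struct sockaddr_storage peers[BATCH];
        struct mmsghdr msgs[BATCH];
        struct iovec iovs[BATCH];
        struct pollfd pfd = { .fd = sock, .events = POLLIN };

        for (;;) {
            if (poll(&pfd, 1, -1) <= 0)           /* wait for input       */
                continue;

            memset(msgs, 0, sizeof(msgs));
            for (int i = 0; i < BATCH; i++) {
                iovs[i].iov_base = bufs[i];
                iovs[i].iov_len  = BUF_SZ;
                msgs[i].msg_hdr.msg_iov     = &iovs[i];
                msgs[i].msg_hdr.msg_iovlen  = 1;
                msgs[i].msg_hdr.msg_name    = &peers[i];
                msgs[i].msg_hdr.msg_namelen = sizeof(peers[i]);
            }

            /* Up to BATCH queued requests are returned by one system call. */
            struct timespec tmo = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
            int n = recvmmsg(sock, msgs, BATCH, 0, &tmo);

            for (int i = 0; i < n; i++) {
                /* handle_request(bufs[i], msgs[i].msg_len) and the
                 * sendto() of the response are application specific. */
            }
        }
    }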
Constraints imposed by the Clustering environment

The clustered services environment runs a number of services and applications that must co-exist. This requirement places some constraints on any system tuning or feature modifications that may be used for improving network performance and latency.

Some of the services in this environment may be networking services that are not directly related to the database software, e.g., software for running SMTP. Thus global system tuning to improve the performance for one flow-pattern must not cause regressions in the performance of these other networking services.

Our database clustering services work through a shim library called IPCLW that is the intermediary between the database application and the I/O system calls. The IPCLW library allows the database applications to transparently switch between different forms of IPC, such as UDP, and RDS [9] using both TCP and IB as the transport. The library provides this flexibility by assuming that the underlying system-call APIs have POSIX-compatible signatures. A ramification of these assumptions is that it places some constraints on performance-accelerating kernel APIs that may be considered for inclusion into the IPCLW library. These constraints are listed below.

1. The database applications get I/O from multiple sources, such as disk, NFS and network. As a result, the software construct for receiving network I/O must be a handle that is similar to a typical Unix/POSIX file-descriptor or socket, one that can be added to a select(), poll() or epoll() set so that the application can multiplex between multiple sources of I/O without resorting to CPU-intensive busy-poll loops (see the sketch after this list).

2. Software solutions for accelerating I/O must not require a major change in the application's threading model. The solution itself may need custom calls for handle creation, and may require additional wrappers around sendmsg() or recvmsg(). These can be accommodated in the database/cluster libraries via IPC-specific function callbacks. However, radical changes to models that need a full rewrite of the IPCLW library are not acceptable.

3. Accelerating the latency of a subset of flows must not occur at the expense of regressed latency for other packet flows. The solution must co-exist harmoniously with the existing Linux kernel stack implementations for flows corresponding to other network protocols that are not the target of the accelerated path.

4. Packet sharing with other Linux applications like tcpdump must be supported, since these applications may be needed for debugging Production issues.
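Constraint 1 is illustrated by the hypothetical sketch below (not part of IPCLW; the descriptors are placeholders): because a UDP or PF_PACKET socket is an ordinary descriptor, it can sit in the same epoll() set as the application's other I/O sources.

    #include <sys/epoll.h>
    #include <unistd.h>

    /* Wait until one of several unrelated I/O sources is readable and
     * return that descriptor; -1 on error.  udp_fd, packet_fd and
     * disk_fd stand in for the application's real handles. */
    static int wait_for_io(int udp_fd, int packet_fd, int disk_fd)
    {
        int fds[] = { udp_fd, packet_fd, disk_fd };
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };

        for (unsigned i = 0; i < sizeof(fds) / sizeof(fds[0]); i++) {
            ev.data.fd = fds[i];
            epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev);
        }

        struct epoll_event ready;
        int n = epoll_wait(epfd, &ready, 1, -1);   /* block, no busy-poll */

        close(epfd);
        return n == 1 ? ready.data.fd : -1;
    }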
Due to these constraints, some of the popular solutions currently available for improving networking performance could not be considered as candidates for our study. The Data Plane Development Kit (DPDK) [5] provides a programming framework for Intel X86 processors that improves latency by using poll-mode user-space drivers. However, the "Environment Abstraction Layer" (EAL) provided by DPDK does not provide a select()able socket, and the network handles for DPDK are not POSIX-compatible. An additional drawback is that DPDK radically modifies the threading model, making it unsuitable for incorporation into IPCLW.

One significant issue with DPDK is that the paradigm it offers is based on the premise that all packets coming in on the network interface will be processed by a single user-space application. Thus, the expectation is that a hypothetical user-space LMS using DPDK would consume the UDP flows relevant to the benchmark, but it would also need to make sure that other flows, e.g., NFS, SMTP or ARP, are also properly dispatched. DPDK provides the following methods to utilize the Linux kernel implementation paths for these protocols.

• Reinject these packets from user-space back into the kernel. DPDK supports this via the KNI paradigm [1], but this path is known to cause severe regressions in latency.

• Use SRIOV to set up virtual functions (VFs) with Ethernet filters that allow specific UDP flows to be passed up on a VF.

Latency regressions from KNI are not acceptable in the cluster. The SRIOV solution requires complex configuration of the network interfaces of each system and manipulation of control-plane state such as ARP tables and reverse-path filters.

Similar considerations are associated with NETMAP [11], which provides a fast framework for packet I/O that bypasses the Linux kernel stack and provides a memory-map of the driver's packet buffers to user-space. The NETMAP API provides a select()able file-descriptor and meets some of the POSIX expectations of the IPCLW library. However, like DPDK, NETMAP passes up all packets on a NIC to the user-space NETMAP application, so that the Linux stack co-existence problem remains. Although there have been recent commits to allow packet sharing via host-rings, the solutions are not seamless.

In comparison, the PF_PACKET API in the Linux kernel provided many of the benefits of NETMAP via a clean socket API, at the cost of some extra overhead of one extra memcpy from the driver ring to the sk_buff. Preliminary comparative benchmarking of request-response transactions using Netperf [6] did not show us any significant benefits to using NETMAP over PF_PACKET. Especially in consideration of the fact that PF_PACKET was already in the Linux kernel, we focused our benchmarking effort for this study on PF_PACKET.

PF_PACKET Overview

Linux kernels allow applications with administrative (i.e., CAP_NET_ADMIN) privileges to read and write fully formed Layer-2 frames using sockets of the PF_PACKET family. PF_PACKET sockets thus allow privileged applications to bypass the kernel protocol modules such as those for TCP, UDP and ICMP. The PF_PACKET socket provides a POSIX-compatible socket API to the application, meeting the constraints placed by the IPCLW environment. An additional advantage of the PF_PACKET socket is that it can be used in conjunction with shared memory-mapped buffers to avoid a user-kernel copy [7]. This shared memory-mapped ring buffer is made available through the TPACKET extensions to the PF_PACKET API.

An application that wishes to use the PF_PACKET API with TPACKET extensions has to set up a circular ring buffer of non-swappable memory in the kernel by invoking setsockopt() to set the PACKET_TX_RING or PACKET_RX_RING socket option on the PF_PACKET socket. The ring buffer that gets set up by this socket option is a series of blocks, where each block is comprised of frames. The application then invokes mmap() to set up the shared memory mappings for this ring buffer. Subsequently, the application can read and write frames to the ring buffer.
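A minimal sketch of this ring setup for a TPACKET_V3 receive ring follows. The block and frame geometry is illustrative rather than the exact values used in our benchmarks, while the 10 ms block timeout matches the fixed timeout used in the experiments reported later; error handling and binding the socket to a specific interface are omitted.

    #include <sys/socket.h>
    #include <sys/mman.h>
    #include <linux/if_packet.h>
    #include <linux/if_ether.h>
    #include <arpa/inet.h>
    #include <string.h>

    /* Create a PF_PACKET socket and map a TPACKET_V3 receive ring.
     * bind() with a sockaddr_ll to pick one interface is omitted. */
    static unsigned char *setup_rx_ring(int *fd_out, struct tpacket_req3 *req)
    {
        int fd  = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        int ver = TPACKET_V3;

        setsockopt(fd, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));

        memset(req, 0, sizeof(*req));
        req->tp_block_size     = 1 << 16;  /* 64 KB block                 */
        req->tp_frame_size     = 1 << 11;  /* 2 KB frame: 32 frames/block */
        req->tp_block_nr       = 64;
        req->tp_frame_nr       = (req->tp_block_size / req->tp_frame_size)
                                 * req->tp_block_nr;
        req->tp_retire_blk_tov = 10;       /* block timeout, in ms        */
        setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));

        *fd_out = fd;
        /* One mapping covers the whole ring of blocks. */
        return mmap(NULL, (size_t)req->tp_block_size * req->tp_block_nr,
                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }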
Figure 1: TPACKET_V2 extensions of the PF_PACKET socket

Figure 1 shows the layout of the ring buffer for the PF_PACKET socket with the TPACKET_V2 format. Data exchanged between the application and the kernel is formatted with meta-data that is represented by the tpacket2_hdr for TPACKET_V2, and an analogous tpacket3_hdr for TPACKET_V3. The tp_status field in this header is used to indicate the current "owner" of the frame. After the kernel has populated the frame, the tp_status is set to TP_STATUS_USER, so that the application can safely read the frame. After it has completed processing the frame, the application should set the tp_status to TP_STATUS_KERNEL to indicate that the kernel may now re-populate the frame with new data.

The shared memory ring-buffer is a lock-less way to do zero-copy reads and writes between user and kernel space. The optimization that it provides comes at the cost of a slight divergence from traditional POSIX APIs, and assumes that the application can actively participate in the manipulation of the tp_status state.

The TPACKET_V3 extension to the PF_PACKET API provides additional optimizations over TPACKET_V2 by giving the application more control over batching on the ingress packet path. The kernel triggers ->sk_data_ready() to wake up a TPACKET_V2 application in poll() as soon as a single frame is available for reading, but will wake up a TPACKET_V3 application only if one of the following two conditions is satisfied:

• an entire block of frames is available for reading, or,
• a block is partially filled, but an application-defined timeout has expired.

The application can specify the number of frames in a block, as well as the desired timeout, as part of the socket options used to set up the transmit and receive rings. As we show in our results below, controlling these parameters to match the anticipated packet input rate helped our application achieve optimal system throughput with good CPU utilization.

Benchmarks

Our study focused on the evaluation of IPC transports used in Database Clustering environments using the benchmarks described below.

1. Typical networking micro-benchmarks like Netperf [6],
2. micro-benchmarks associated with the Cluster IPC library (IPCLW) used by the Oracle Database software,
3. cluster benchmark suites that simulate database workloads and instrument the IPCLW library performance.

A detailed description of each of these benchmarks follows.

Micro-benchmarking with Netperf

We evaluated three types of IPC transport mechanisms using Netperf request-response micro-benchmarks. These were

• UDP sockets using sendmsg() and recvmsg() [12],
• UDP sockets using recvmmsg(),
• PF_PACKET sockets with TPACKET_V2 and TPACKET_V3.

The recvmmsg() system call [10] is an extension of the POSIX recvmsg() call that allows the caller to batch the reception of multiple messages on a socket at the cost of a single system call.

Our micro-benchmarking used the Netperf UDP_RR (UDP request-response) tests with a stock implementation of the netperf client. We replaced the standard netserver with our own server implementation that could switch between the three transports listed above. The server supported sendmsg() and recvmsg() in a simple batching loop by implementing the pseudo-code below.

    err = poll(..);                       /* wait for input */
    while (/* recvfrom() returns a req */) {
            sendto(/* response */);
    }

The recvmmsg() version implemented a similar loop, with the difference that the recvfrom() call was replaced by a call to recvmmsg() using a batch-size of 64 packets and a timeout of 10 ms. Explicit system calls for sending and receiving packets are not required in the PF_PACKET/TPACKET paradigm; receive-side batching with PF_PACKET was available as a ramification of the shared memory ring buffer between user and kernel. Thus all three transports were conceptually equivalent in functionality.
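For completeness, the following sketch shows what such a PF_PACKET receive path can look like (it is not our actual test server, and the processing callback is a placeholder). It assumes the ring created in the earlier setup sketch and follows the ownership protocol described above; note that TPACKET_V3 tracks status per block (block_status), so a whole block is handed back to the kernel at once.

    #include <poll.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <linux/if_packet.h>

    static void drain_block(struct tpacket_block_desc *pbd)
    {
        struct tpacket3_hdr *frame = (struct tpacket3_hdr *)
            ((uint8_t *)pbd + pbd->hdr.bh1.offset_to_first_pkt);

        for (uint32_t i = 0; i < pbd->hdr.bh1.num_pkts; i++) {
            uint8_t *l2 = (uint8_t *)frame + frame->tp_mac;
            /* process_frame(l2, frame->tp_snaplen): application specific */
            (void)l2;
            frame = (struct tpacket3_hdr *)
                ((uint8_t *)frame + frame->tp_next_offset);
        }

        /* Hand the whole block back to the kernel in one shot. */
        pbd->hdr.bh1.block_status = TP_STATUS_KERNEL;
    }

    static void rx_loop(int fd, unsigned char *ring,
                        const struct tpacket_req3 *req)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        unsigned int cur = 0;

        for (;;) {
            struct tpacket_block_desc *pbd = (struct tpacket_block_desc *)
                (ring + (size_t)cur * req->tp_block_size);

            if (!(pbd->hdr.bh1.block_status & TP_STATUS_USER)) {
                poll(&pfd, 1, -1);   /* sleep until a block is retired */
                continue;
            }
            drain_block(pbd);
            cur = (cur + 1) % req->tp_block_nr;
        }
    }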
gle frame is available for reading, but will wake up a Our test used 64 netperf clients per-server, based on the TPACKET V3 application only if one of the following two approximate client-to-server ration observed in our Cluster conditions are satisfied: environments in Production. The size of the request packet • an entire block of frames is available for reading, or, used was 512 bytes, with response packet size set to 1024 bytes. The test server in our environment was an Intel X86 • a block is partially filled, but an application defined timeout system with an Intel E7 CPU and 256G RAM. The test net- has expired work interface driver was an Intel Ethernet Controller XL710 The application can specify the number of frames in a block, running at 40 Gbps. Results reported in Figure 2 were ob- as well as the desired timeout, as part of the socket options tained after configuring the UDP Receive Side Scaling (RSS) there was a finite latency gap for filling a block after which the server could be woken up from poll() to process the block. The serialized processing in this model resulted in lower throughput, and degraded CPU utilization. A good balance was achieved at fpb=32 where the sys- tem was able to keep the server busy processing a block of 32 frames while the kernel was filling up the next block to be passed up to the server. The optimal system performance was achieved when the value of fpb is approximately half the value of client-to-server ratio. Thus, when the number of clients was increased to 128, optimal system performance was achieved at fpb=32. A conclusion that can be drawn from this observation is that the optimal values of system performance can be achieved by dynamically adjusting the batching parameters based on a statistical sampling of the client input rate. Ex- tending the PF PACKET API to accommodated such dy- Figure 2: Netperf throughput results namic tuning may be a valuable feature to add to the API. Cluster-specific Benchmark Suites As described earlier, clustering applications in the Oracle ”Real Applications Clus- to spread traffic across CPUs with UDP port numbers in- ter” use a library (IPCLW) that acts as the shim between the cluded for RSS hashing based on the description for 10Gbps database application and the IPC specific system calls for adapters provided in [4]. No additional tuning was applied. I/O that are native to the . The IPCLW li- Figure 2 reports the throughput in packets per-second as brary allows the application to switch between various types instrumented by the driver’s hardware counters for the four of transport such as RDS-TCP [9] and UDP. At the time types of transports that we investigated. The TPACKET V2 of this writing, we are also working on adding support for method reported the best throughput numbers at 370K pps PF PACKET sockets to the IPCLW library. and provided a significant improvement over UDP sockets. IPCLW In the case of UDP and TPACKET V2 transports, CPU uti- The library is instrumented using a database spe- lization was at 100% when the maximum throughput reported cific test suite, CRTEST, that simulates a real database work- in Figure 2 was achieved. We found that the maximum pos- load for various types of clustering services like LMS. The sible throughput with TPACKET V3 could match or better CRTEST test suite is a set of Cluster atomic benchmark tests the value for TPACKET V2 with CPU utilization at 100%, that simulates a Lock Management Server described earlier. 
however the graph shows the optimal system performance The CRTEST suite can be run with a varying number of point where the CPU utilization was at 49% with improved clients, and is typically run with 1, 2, 4,..., 64 clients. For throughput in comparison with UDP sockets. We arrived at each value of the number of clients (nclients) the test instru- this optimal performance point after some experimentation ments the throughput and latency for network I/O with the tuning of batching parameters.

    Frames/block (fpb)   Throughput (pps)   CPU-idle (%)
    16                   449543             0.94
    32                   419282             35
    64                   11639              99

Table 1: TPACKET_V3 batching behavior.

With 16 frames per block, the server would always find a ready block and frame for processing. Thus the server was able to keep the CPU 100% busy and process the maximum throughput at fpb = 16. At the other end of the spectrum, with 64 frames per block and input provided by 64 clients, even at steady state there was a finite latency gap for filling a block, after which the server could be woken up from poll() to process the block. The serialized processing in this model resulted in lower throughput and degraded CPU utilization. A good balance was achieved at fpb = 32, where the system was able to keep the server busy processing a block of 32 frames while the kernel was filling up the next block to be passed up to the server. The optimal system performance was achieved when the value of fpb was approximately half the client-to-server ratio. Thus, when the number of clients was increased to 128, optimal system performance was achieved at fpb = 32.

A conclusion that can be drawn from this observation is that optimal system performance can be achieved by dynamically adjusting the batching parameters based on a statistical sampling of the client input rate. Extending the PF_PACKET API to accommodate such dynamic tuning may be a valuable feature to add to the API.

Cluster-specific Benchmark Suites

As described earlier, clustering applications in the Oracle "Real Applications Cluster" use a library (IPCLW) that acts as the shim between the database application and the IPC-specific system calls for I/O that are native to the operating system. The IPCLW library allows the application to switch between various types of transport such as RDS-TCP [9] and UDP. At the time of this writing, we are also working on adding support for PF_PACKET sockets to the IPCLW library.

The IPCLW library is instrumented using a database-specific test suite, CRTEST, that simulates a real database workload for various types of clustering services like LMS. The CRTEST test suite is a set of Cluster atomic benchmark tests that simulates the Lock Management Server described earlier. The CRTEST suite can be run with a varying number of clients, and is typically run with 1, 2, 4, ..., 64 clients. For each value of the number of clients (nclients), the test instruments the throughput and latency for network I/O.

Figure 3: CRTEST Throughput vs Latency results

The primary objective when running the CRTEST suite was to evaluate the performance of RDS-TCP in comparison to UDP. A secondary objective was to attempt to use Jumbo frames as a substitute for UDP Fragmentation Offload. Figure 3 shows the throughput and latency at each value of nclients. The throughput shown in this figure is the value in thousands of 8K blocks per second. Results were obtained on an Intel X86 system with an E5 CPU. The NIC used for the test was an Intel 82599ES 10 Gbps adapter.

The results in Figure 3 highlight some characteristic properties of this particular client-server environment. The shape of the graph for each transport has a pattern of an almost horizontal segment which abruptly becomes a vertical line (the "wall") at which throughput reaches the upper bound for that transport. This shape can be explained as follows. As the number of clients per single-threaded server was increased, the throughput initially increased for the same latency, contributing to the horizontal segment. This trend continued until the limit of the server's processing capacity was reached. At that point, adding more clients caused queue build-up and thus increased latency for the same throughput. This results in the "wall" for that transport type and MTU in Figure 3.

The limiting bound on throughput for RDS-TCP was higher than the limit for UDP, indicating that RDS-TCP had better server-side capacity. Both the RDS-TCP and UDP-based transports in our model were required to provide a reliable datagram delivery model to the application. The kernel TCP state engine ensured guaranteed, reliable, ordered delivery for RDS-TCP. In the case of the UDP-based transport, the application managed the state needed to achieve reliable delivery. This state was comprised of sequence numbers, acknowledgements and timeout-based retransmits for each datagram. The UDP-based model tracked this client-server state at each client, whereas with RDS-TCP the single kernel TCP state machine tracked the equivalent state for an IP address pair. The UDP model also intrinsically had a higher vulnerability to scheduling delays in user-space at the server. These delays caused delayed ACKs, spurious retransmits and other issues that resulted in the lower upper-bound on throughput for UDP.
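As a purely hypothetical illustration (not the actual IPCLW data structures), the per-datagram bookkeeping that such a UDP-based transport carries in user-space might look like the following:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <time.h>

    /* Per-datagram state for user-space reliability over unconnected UDP
     * (illustrative layout only). */
    struct udp_reliable_msg {
        uint64_t        seq;       /* sender-assigned sequence number      */
        struct timespec sent_at;   /* arms the retransmit timeout          */
        unsigned int    retries;   /* number of retransmissions so far     */
        bool            acked;     /* set when the peer's ACK arrives      */
        size_t          len;       /* payload length                       */
        unsigned char  *payload;   /* copy held until the message is acked */
    };

    /* With RDS-TCP, the equivalent state collapses into the kernel's
     * single TCP connection per IP address pair. */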
Both throughput and latency improved significantly for UDP when Jumbo frames were used. Latency for 64 clients went from 2600 µs to 1800 µs, and throughput increased from 22K to 35K blocks per second. RDS-TCP also showed some improvement, though the gain was less than with UDP. UDP with the Jumbo packet size had a much larger gain due to the impact of IP fragmentation and reassembly on the UDP/IP stack. The UDP protocol layer is mostly stateless in comparison with TCP, and the bulk of the work in sending and receiving an 8K block is done at the IP fragmentation and reassembly layers of the UDP/IP stack. By enabling Jumbo frames we removed this work from the packet path and were able to release additional CPU cycles for use by the LMS server. The reduction in latency and CPU utilization in the network stack thus resulted in the improved latency and throughput noted in the graph in Figure 3.

In comparison, TCP with TCP Segmentation Offload (TSO) was able to send large frames to the driver even with a 1500 byte MTU. In addition, TCP had to manage a significant amount of protocol state even in the absence of congestion, and that state remained the same regardless of the interface MTU. Thus the benefits of going to a Jumbo MTU were smaller than with UDP.

Fragmentation of an 8K sized UDP datagram into MTU sized frames is unavoidable work that has to be handled somewhere in the packet output path. This work could be done at the application level, in the Linux kernel, or via UDP Fragmentation Offload. In order to retain the benefits of Equal Cost Multipath (ECMP) routing, it is desirable to break up the 8K bytes of data into smaller MTU-sized UDP packets so that each packet has a copy of the UDP header that can be used by intermediate routers for flow multiplexing. However, this would require the application to manage its own message boundaries, which then logically leads to an architecture that resembles RDS-TCP. Leveraging the kernel implementation of IP fragmentation requires CPU cycles, and it would be preferable to be able to offload this computation to the NIC, similar to TCP Segmentation Offload. The challenges for UDP Fragmentation Offload (UFO) are the difficulty of computing the UDP checksum of very large (e.g., 65K) sized packets, and of tracking state at the receiver for IP reassembly. For these reasons, UFO is not commonly supported on NICs.

As a result of these considerations, our current approach to mitigating the overhead of IP fragmentation and reassembly is to enable Jumbo frames across the Cluster and Datacenter. This solution is sufficient for controlled environments, though it does not scale for packets that traverse long internet paths.
brary cached this information for each destination IP ad- In comparison TCP, with TCP Segmentation Offload dress as a performance optimization, but then had to man- (TSO), was able to send large frames to the driver even with age a listener to stay synchronized with kernel 1500 byte MTU. In addition TCP had to manage a significant control-plane state. In addition, as discussed in the pre- amount of protocol state even in the absence of congestion. vious section, IP fragmentation and reassembly were an That state remained the same, regardless of interface MTU. unavoidable cost in the packet transmit path. Thus the benefits of going to Jumbo MTU were smaller than • Zero-copy and shared memory buffers change the user with UDP. API. An important consideration in selecting PF PACKET sockets for our study was the availability of the shared [2] Oracle Database Data Warehousing Guide . memory ring buffer feature between user-space and kernel https://docs.oracle.com/cd/B19306_01/ for packet I/O. The shared memory avoided one memcpy server.102/b14223/ettover.htm. user-kernel boundary and had the potential of supplying [3] RFC af packet: direct DMA for packet interface. http: a significant performance boost when used in conjunction //patchwork.ozlabs.org/patch/720937/. with batched send/receive. This observation had also been noted by the authors of NETMAP [11]. [4] Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Family of Adapters. In practice we found that incorporating zero-copy APIs https://downloadmirror.intel.com/ IPCLW into application libraries was not seamless. The 22919/eng/README.txt. library follows indirections with fixed signatures for in- put and output over all transports. In the case of non- [5] Intel Corporation. Intel Corporation. 2013a. Intel RDMA transports, this signature is expected to conform Data Plane Development Kit (Intel DPDK) Programmer’s to the POSIX definitions of sendmsg() and recvmsg. Guide. Reference Number: 326003-003. We were able to add support in the IPCLW library to pro- [6] 2004. The Public Netperf Homepage. http://www. vide customized, ring-aware versions for these indirec- netperf.org/netperf/NetperfPage.html. tions to handle PF PACKET sockets without too much ef- [7] PACKET socket interface - Linux kernel documenation. fort. However, since the calling application is unaware Documentation/networking/packet mmap.txt in the Linux of the underlying shared memory semantics for this IPC Kernel Source. method, it also does not have the context of “ownership” via TP STATUS KERNEL/TP STATUS USER [7]. Thus [8] Postel, J. 1980. User Datagram Protocol. RFC 768 (In- the shim function at the time of this writing ends up doing ternet Standard). an explicit memcpy from the ring buffer to the applica- [9] RDS Reliable Datagram Sockets. https: tion’s buffer, detracting from the performance benefits of //oss.oracle.com/projects/rds/dist/ PF PACKET. A possible way to avoid this memcpy over- documentation/rds-3.1-spec.html. head is by changing the PF PACKET API in IPCLW to re- semble the RDMA paths supported in the library, and this [10] recvmsg(2) - Linux man page . is currently under investigation https://linux.die.net/man/2/recvmsg. [11] Rizzo, L. 2012. Netmap: A novel framework for fast Ongoing and Future Work packet i/o. In Proceedings of the 2012 USENIX Confer- ence on Annual Technical Conference, USENIX ATC’12, The PF PACKET shared-memory ring architecture allows for 9–9. Berkeley, CA, USA: USENIX Association. 
Ongoing and Future Work

The PF_PACKET shared-memory ring architecture allows for a zero-copy interface at the user-kernel boundary. This optimization comes with the caveat that applications that wish to exploit these semantics now need to track buffer ownership state between user and kernel. We are currently evaluating the feasibility of evolving the PF_PACKET shared-memory paths in the IPCLW library to follow APIs analogous to those used for RDMA memory registration in Infiniband.

The memory copy from the kernel's sk_buff to the driver DMA ring remains after this optimization. Methods to eliminate that memory copy are under discussion at the time of this writing [3]. The approach being proposed is to introduce a device indirection via ndo_ops in the device drivers, such that packets received by hardware can be transferred directly to/from the PF_PACKET ring via DMA.

Author Biography

Sowmini Varadhan is a Consulting Software Engineer in the Virtual Operating Systems Group at Oracle Corporation, where she works on projects spanning Kernel Networking, Distributed Computing, and Performance. Tushar Dave is a Principal Linux Engineer in the Mainline Linux Kernel group at Oracle Corporation, where he works on networking drivers, DMA, IOMMU and PCIe.

Acknowledgments

Thanks to Yasuo Hirao for providing us access to the CRTEST benchmark results, Rick Jones for helpful hints during netperf navigation, and Shannon Nelson for help with reviewing this document.

References

[1] DPDK Kernel NIC Interface. http://dpdk.org/doc/guides/prog_guide/kernel_nic_interface.html.

[2] Oracle Database Data Warehousing Guide. https://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm.

[3] RFC af_packet: direct DMA for packet interface. http://patchwork.ozlabs.org/patch/720937/.

[4] Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Family of Adapters. https://downloadmirror.intel.com/22919/eng/README.txt.

[5] Intel Corporation. 2013. Intel Data Plane Development Kit (Intel DPDK) Programmer's Guide. Reference Number: 326003-003.

[6] 2004. The Public Netperf Homepage. http://www.netperf.org/netperf/NetperfPage.html.

[7] PACKET socket interface - Linux kernel documentation. Documentation/networking/packet_mmap.txt in the Linux kernel source.

[8] Postel, J. 1980. User Datagram Protocol. RFC 768 (Internet Standard).

[9] RDS: Reliable Datagram Sockets. https://oss.oracle.com/projects/rds/dist/documentation/rds-3.1-spec.html.

[10] recvmsg(2) - Linux man page. https://linux.die.net/man/2/recvmsg.

[11] Rizzo, L. 2012. netmap: A novel framework for fast packet I/O. In Proceedings of the 2012 USENIX Annual Technical Conference (USENIX ATC '12), 9-9. Berkeley, CA, USA: USENIX Association.

[12] Stevens, W. R.; Fenner, B.; and Rudoff, A. M. 2003. UNIX Network Programming, Vol. 1. Pearson Education, 3rd edition.