SKQ: Event Scheduling for Optimizing Tail Latency in a Traditional OS Kernel
Siyao Zhao, Haoyu Gu, Ali José Mashtizadeh
RCS Lab, University of Waterloo

Abstract

This paper presents Schedulable Kqueue (SKQ), a new design of FreeBSD Kqueue that improves application tail latency and low-latency throughput. SKQ introduces a new scalable architecture and event scheduling. We provide multiple scheduling policies that improve cache locality and reduce workload imbalance. SKQ also enables applications to prioritize processing latency-sensitive requests over regular requests.

In the RocksDB benchmark, SKQ reduces tail latency by up to 1022× and extends the low-latency throughput by 27.4×. SKQ also closes the gap between traditional OS kernel networking and a state-of-the-art kernel-bypass networking system by 83.7% for an imbalanced workload.

1 Introduction

Applications and hardware have evolved tremendously. Modern server applications span hundreds to thousands of nodes across the data center to serve user requests. The depth and breadth of the service tree cause users to experience request latency dominated by the tail latency of any node in an application [11]. Advancements in storage and networking are making low-latency services possible [2]. Recent research systems propose using kernel-bypass and custom dataplanes to reduce latency [3, 19, 23, 37].

However, most applications in data centers are still built directly on top of traditional OSes and employ an event-driven approach, which uses an event loop that polls the kernel's event subsystem. Even popular user-level threading systems internally depend on kernel event subsystems to dispatch events efficiently across user-level threads, e.g., the Go language [16], Arachne [38] and Fred [27].
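To make the event-driven pattern concrete, the sketch below shows a minimal single-threaded Kqueue event loop of the kind such applications build on: it registers a listening socket with kevent(), blocks until the kernel reports ready descriptors, and then accepts new connections or dispatches readable ones to a handler. This is an illustrative sketch only; the handle_client() helper, buffer size, and batch size are our assumptions, not code from SKQ or any of the systems cited.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <sys/socket.h>
    #include <err.h>
    #include <unistd.h>

    /* Hypothetical request handler standing in for application logic. */
    static void
    handle_client(int fd)
    {
        char buf[512];

        if (read(fd, buf, sizeof(buf)) <= 0)
            close(fd);    /* closing the fd also removes it from the kqueue */
        /* A real server would parse the request and send a response here. */
    }

    static void
    event_loop(int listen_fd)
    {
        struct kevent change, events[64];
        int kq = kqueue();

        if (kq < 0)
            err(1, "kqueue");

        /* Register interest in readability of the listening socket. */
        EV_SET(&change, listen_fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        if (kevent(kq, &change, 1, NULL, 0, NULL) < 0)
            err(1, "kevent: register");

        for (;;) {
            /* Block until the kernel delivers ready events. */
            int n = kevent(kq, NULL, 0, events, 64, NULL);

            for (int i = 0; i < n; i++) {
                int fd = (int)events[i].ident;

                if (fd == listen_fd) {
                    int conn = accept(listen_fd, NULL, NULL);
                    if (conn < 0)
                        continue;
                    EV_SET(&change, conn, EVFILT_READ, EV_ADD, 0, 0, NULL);
                    kevent(kq, &change, 1, NULL, 0, NULL);
                } else {
                    handle_client(fd);
                }
            }
        }
    }

In a multithreaded server, each worker typically runs a loop like this against either its own private Kqueue or one Kqueue shared by all workers; the trade-offs between those two arrangements are discussed in §2.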
Event subsystems in traditional OSes such as Kqueue [28] in FreeBSD and Mac OS X, epoll [30] in Linux, IO Completion Ports (IOCP) [9] in Windows and the Event Completion Framework [4] in Solaris were developed nearly 20 years ago, predating many innovations in modern hardware. These event subsystems undermine features such as receive side scaling (RSS) [8] that can improve scalability and latency. Userspace event libraries, e.g., libevent [33], are influenced by event subsystems and propagate these issues. Research systems such as Megapipe [17] and Affinity-Accept [35] improve event subsystems but only solve part of the problem.

The evolution of modern server applications and hardware inspired us to revisit event facilities in traditional OS kernels to address the latency problem. Improving tail latency in a traditional OS is challenging. Unlike kernel-bypass and custom dataplanes, existing kernels are multipurpose and complex. Kernel event facilities tightly integrate with many subsystems including storage, network, and process management. Furthermore, we must carefully trade off our system's overhead against the performance benefit to the rest of the system.

This paper presents Schedulable Kqueue (SKQ), a novel event notification facility based on FreeBSD Kqueue that provides a more flexible abstraction and allows applications to use features of modern hardware. SKQ improves tail latency and extends low-latency throughput. SKQ introduces a new architecture with improved scalability, fine-grained event scheduling and event delivery control. Application threads share a single, scalable SKQ instance, which automatically schedules events across all threads. While improvements to kernel event notification facilities have been proposed [17, 35], SKQ differentiates itself with the following contributions:

• SKQ offers multiple application-controlled scheduling policies to improve cache locality and reduce workload imbalance. We present guidelines for policy selection in § 4.3.
• SKQ presents explicit control over event delivery, including event pinning and prioritization.
• SKQ has been extensively tested and deployed on production servers. We also developed event libraries to facilitate integration into existing applications.

We evaluate SKQ with microbenchmarks and multiple real-world workloads with different characteristics. We examine workloads with uniform and non-uniform request service times, and IO-bound workloads. In microbenchmarks, we show that SKQ reduces lock contention and improves cache locality and multicore scalability. In our RocksDB [14] benchmark running the Facebook ZippyDB workload, SKQ extends the low-latency throughput by 27.4× and reduces latency by up to 1022×. Event prioritization allows a saturated server to service a high-priority client's low-latency requests with 8× lower latency while having little performance impact on other clients. Additionally, SKQ closes the gap between traditional OS kernel networking and a state-of-the-art kernel-bypass networking system by 83.7% for an imbalanced workload.

2 Background and Related Work

SKQ was motivated by the design and performance problems with existing event subsystems. While our observations hold on popular platforms, i.e., Linux epoll and Windows IOCP, we will explain everything in the context of FreeBSD Kqueue.

Sources of Latency: Li et al. [29] studied several server applications and identified multiple factors in the OS that contribute to high tail latency. The authors suggested reducing background tasks, pinning threads to cores and avoiding NUMA effects. Even after following all the recommendations, we found two additional factors that lead to high tail latency.

One factor is cache misses due to receive side scaling (RSS) [8] found in modern NICs. RSS creates a send and a receive NIC queue per core and distributes connections across them using hash functions. Recent implementations of RSS also load balance by migrating groups of connections between cores. However, applications are unable to detect this and will process the connection on the original core, losing connection affinity and causing cache misses. In our Memcached benchmark, we found that up to 77% of total L2 cache misses are due to improper connection affinity and are avoidable. SKQ maintains connection affinity and follows connection migration with the CPU affinity scheduling policy.

Another factor is workload imbalance, which arises from differences in request service time and suboptimal connection distribution across worker threads. Workload imbalance oversaturates some worker threads while under-saturating others. To put this into perspective, we measured the difference in total processing time between the most and least busy threads in two workloads. In Memcached, a uniform workload, we measured a 2.8% difference. In a GIS application with a Zipf-like service time distribution, we measured a 46% difference. As a result, the GIS application's tail latency increases much faster than Memcached's as a function of throughput. SKQ offers scheduling policies to reduce workload imbalance.

Existing applications use Kqueue in one of two models: a 1:1 model, in which each thread polls a private Kqueue, or a 1:N model, in which all threads share a single Kqueue. The 1:1 model suffers from two problems. First, it hinders efficient event migration. Moving events between Kqueues requires two system calls that remove events from the source Kqueue and add them to the target Kqueue. This migration process also involves multiple kernel and userspace locks, leading to poor scalability. Second, this model interacts poorly with RSS in modern NICs, as applications can neither detect nor react to RSS connection migration.
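To make this migration cost concrete, the sketch below moves interest in one descriptor's read events from one thread's Kqueue to another's using only stock kevent() calls: an EV_DELETE against the source queue followed by an EV_ADD on the target. The function and variable names are ours, chosen for illustration; the point is simply that every migrated event costs two system calls, plus the locking around them.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <err.h>

    /*
     * Move interest in fd's read events from src_kq to dst_kq.
     * Stock Kqueue has no single-call primitive for this, so the
     * application issues two kevent() system calls per migrated event.
     */
    static void
    migrate_read_event(int src_kq, int dst_kq, int fd)
    {
        struct kevent kev;

        /* System call 1: remove the event from the source Kqueue. */
        EV_SET(&kev, fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
        if (kevent(src_kq, &kev, 1, NULL, 0, NULL) < 0)
            err(1, "kevent: delete from source");

        /* System call 2: add the event to the target Kqueue. */
        EV_SET(&kev, fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
        if (kevent(dst_kq, &kev, 1, NULL, 0, NULL) < 0)
            err(1, "kevent: add to target");
    }

On top of the two kernel crossings, the application must also synchronize the hand-off in userspace so that the old owner thread stops servicing the descriptor before the new owner starts, which is where the userspace locking mentioned above comes in.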
The 1:N model means all threads share a single Kqueue and process connections in a round-robin fashion. This model maximizes CPU utilization and simplifies event scheduling, but suffers from two problems. First, the model leads to lock contention on multicore machines (see §6). Second, the model does not preserve connection affinity, as each connection is processed by different threads, which results in cache misses. As a result, applications and event libraries have mostly avoided the 1:N model except for a few low-throughput services [25].

SKQ uses a hybrid architecture that offers the best of both worlds through our event scheduler. SKQ exposes a 1:N model to applications for efficient event scheduling and delivery while internally using a 1:1 model for multicore scalability.

Need for Application Control: Applications often need to deliver events to a specific thread. For example, one thread needs to notify another thread of control events. Kqueue does not provide applications using the 1:N model with fine-grained event delivery control. SKQ allows applications to pin events to specific threads.

SKQ also introduces application-controlled event prioritization, which allows applications to prioritize latency-sensitive events over batch-processing events. To our knowledge, no existing kernel event subsystem supports event prioritization. The latest-generation NICs provide hardware support for event prioritization with application device queues (ADQs) [20]. SKQ can be used with ADQs to improve event prioritization throughout the networking stack.

RFS, RPS and Intel Flow Director: Receive Flow Steering (RFS) [12] and Receive Packet Steering (RPS) [6] are Linux networking stack features. RFS improves connection locality by enabling the kernel to process packets on the core where the corresponding application thread is running. RPS is a software implementation of hardware RSS, which provides hash-based packet distribution across cores. Intel Flow Director [18] is a hardware implementation of RFS. SKQ eliminates the need for RFS with the CPU affinity policy.

io_uring: io_uring [7] is a recent Linux kernel feature that enables efficient asynchronous IO and avoids system call overhead via polling. Each io_uring object uses a single