Kqueue: A generic and scalable event notification facility

Jonathan Lemon
[email protected]
FreeBSD Project

Abstract

Applications running on a UNIX platform need to be notified when some activity occurs on a socket or other descriptor, and this is traditionally done with the select() or poll() system calls. However, it has been shown that the performance of these calls does not scale well with an increasing number of descriptors. These interfaces are also limited in the respect that they are unable to handle other potentially interesting activities that an application might be interested in; these might include signals, file system changes, and AIO completions. This paper presents a generic event delivery mechanism, which allows an application to select from a wide range of event sources and be notified of activity on these sources in a scalable and efficient manner. The mechanism may be extended to cover future event sources without changing the application interface.

1 Introduction

Applications are often event driven, in that they perform their work in response to events or activity external to the application which are delivered in some fashion. Thus the performance of an application often comes to depend on how efficiently it is able to detect and respond to these events.

FreeBSD provides two system calls for detecting activity on file descriptors: poll() and select(). However, neither of these calls scales very well as the number of descriptors being monitored for events becomes large. A high volume server that intends to handle several thousand descriptors quickly finds these calls becoming a bottleneck, leading to poor performance [1] [2] [10].

The set of events that the application may be interested in is not limited to activity on an open file descriptor. An application may also want to know when an asynchronous I/O (aio) request completes, when a signal is delivered to the application, when a file in the filesystem changes in some fashion, or when a process exits. None of these are handled efficiently at the moment; signal delivery is limited and expensive, and the other events listed require an inefficient polling model. In addition, neither poll() nor select() can be used to collect these events, leading to increased code complexity due to the use of multiple notification interfaces.

This paper presents a new mechanism that allows the application to register its interest in a specific event, and then efficiently collect the notification of the event at a later time. The set of events that this mechanism covers is shown to include not only those described above, but may also be extended to unforeseen event sources with no modification to the API.

The rest of this paper is structured as follows: Section 2 examines where the central bottleneck of poll() and select() is, Section 3 explains the design goals, and Section 4 presents the API of the new mechanism. Section 5 details how to use the new API and provides some programming examples, while the kernel implementation is discussed in Section 6. Performance measurements for some applications are found in Section 7. Section 8 discusses related work, and the paper concludes with a summary in Section 9.

2 Problem

The poll() and select() interfaces suffer from the deficiency that the application must pass in an entire list of descriptors to be monitored, for every call. This has the immediate consequence of forcing the system to perform two memory copies across the user/kernel boundary, reducing the amount of memory bandwidth available for other activities. For large lists containing many thousands of descriptors, practical experience has shown that typically only a few hundred actually have any activity, making 95% of the copies unnecessary.

Upon return, the application must walk the entire list to find the descriptors that the kernel marked as having activity. Since the kernel knew which descriptors were active, this results in a duplication of work; the application must recalculate the information that the system was already aware of. It would appear to be more efficient to have the kernel simply pass back a list of descriptors that it knows are active. Walking the list is an O(N) activity, which does not scale well as N gets large.

Within the kernel, the situation is also not ideal. Space must be found to hold the descriptor list; for large lists, this is done by calling malloc(), and the area must in turn be freed before returning. After the copy is performed, the kernel must examine every entry to determine whether there is pending activity on the descriptor. If the kernel has not found any active descriptors in the current scan, it will then update the descriptor's selinfo entry; this information is used to perform a wakeup on the process in the event that it calls tsleep() while waiting for activity on the descriptor. After the process is woken up, it scans the list again, looking for descriptors that are now active.

This leads to 3 passes over the descriptor list in the case where poll or select actually sleep: once to walk the list in order to look for pending events and record the select information, a second time to find the descriptors whose activity caused a wakeup, and a third time in user space where the user walks the list to find the descriptors which were marked active by the kernel.

These problems stem from the fact that poll() and select() are stateless by design; that is, the kernel does not keep any record of what the application is interested in between system calls and must recalculate it every time. This design decision not to keep any state in the kernel leads to the main inefficiency in the current implementation. If the kernel was able to keep track of exactly which descriptors the application was interested in, and only return a subset of these activated descriptors, much of the overhead could be eliminated.

3 Design Goals

When designing a replacement facility, the primary goal was to create a system that would be efficient and scalable to a large number of descriptors, on the order of several thousand. The secondary goal was to make the system flexible. UNIX based machines have traditionally lacked a robust facility for event notification. The poll and select interfaces are limited to socket and pipe descriptors; the user is unable to wait for other types of events, like file creation or deletion. Other events require the user to use a different interface; notably siginfo and family must be used to obtain notification of signal events, and calls to aiowait are needed to discover if an AIO call has completed.

Another goal was to keep the interface simple enough that it could be easily understood, and also possible to convert poll() or select() based applications to the new API with a minimum of changes. It was recognized that if the new interface was radically different, then it would essentially preclude modification of legacy applications which might otherwise take advantage of the new API.

Expanding the amount of information returned to the application to more than just the fact that an event occurred was also considered desirable. For readable sockets, the user may want to know how many bytes are actually pending in the socket buffer in order to avoid multiple read() calls. For listening sockets, the application might check the size of the listen backlog in order to adapt to the offered load. The goal of providing more information was kept in mind when designing the new facility.

The mechanism should also be reliable, in that it should never silently fail or return an inconsistent state to the user. This goal implies that there should not be any fixed size lists, as they might overflow, and that any memory allocation must be done at the time of the system call, rather than when activity occurs, to avoid losing events due to low memory conditions.

As an example, consider the case where several network packets arrive for a socket. We could consider each incoming packet as a discrete event, recording one event for each packet. However, the number of incoming packets is essentially unbounded, while the amount of memory in the system is finite; we would be unable to provide a guarantee that no events would be lost.

The result of the above scenario is that multiple packets are coalesced into a single event. Events that are delivered to the application may correspond to multiple occurrences of activity on the event source being monitored.

In addition, suppose a packet arrives containing N bytes, and the application, after receiving notification of the event, reads M bytes from the socket, where M < N. The next time the event API is called, there would be no notification of the (N - M) bytes still pending in the socket buffer, because events would be defined in terms of arriving packets. This forces the application to perform extra bookkeeping in order to insure that it does not mistakenly lose data. This additional burden imposed on the application conflicts with the goal of providing a simple interface, and so leads to the following design decision.

Events will normally be considered to be "level-triggered", as opposed to "edge-triggered". Another way of putting this is to say that an event is reported as long as a specified condition holds, rather than only when activity is actually detected from the event source. The given condition could be as simple as "there is unread data in the buffer", or it could be more complex. This approach handles the scenario described above, and allows the application to perform a partial read on a buffer, yet still be notified of an event the next time it calls the API. This corresponds to the existing semantics provided by poll() and select().

A final design criterion was that the API should be correct, in that events should only be reported if they are applicable. Consider the case where a packet arrives on a socket, in turn generating an event. However, before the application is notified of this pending event, it performs a close() on the socket. Since the socket is no longer open, the event should not be delivered to the application, as it is no longer relevant. Furthermore, if the event happens to be identified by the file descriptor, and another descriptor is created with the same identity, the event should be removed, to preclude the possibility of false notification on the wrong descriptor.

The correctness requirement should also extend to pre-existing conditions, where the event source generates an event prior to the application registering its interest with the API. This eliminates the race condition where data could be pending in a socket buffer at the time that the application registers its interest in the socket. The mechanism should recognize that the pending data satisfies the "level-trigger" requirement and create an event based on this information.

Finally, the last design goal for the API is that it should be possible for a library to use the mechanism without fear of conflicts with the main program. This allows 3rd party code that uses the API to be linked into the application without conflict.

While on the surface this appears to be obvious, several counter examples exist. Within a process, a signal may only have a single signal handler registered, so library code typically can not use signals. X-window applications only allow for a single event loop. The existing select() and poll() calls do not have this problem, since they are stateless, but our new API, which moves some state into the kernel, must be able to have multiple event notification channels per process.

4 Kqueue API

The kqueue API introduces two new system calls, outlined in Figure 1.

    int
    kqueue(void);

    int
    kevent(int kq,
           const struct kevent *changelist, int nchanges,
           struct kevent *eventlist, int nevents,
           const struct timespec *timeout);

    struct kevent {
            uintptr_t ident;    /* identifier for event */
            short     filter;   /* filter for event */
            u_short   flags;    /* action flags for kq */
            u_int     fflags;   /* filter flag value */
            intptr_t  data;     /* filter data value */
            void      *udata;   /* opaque identifier */
    };

    EV_SET(&kev, ident, filter, flags, fflags, data, udata)

Figure 1: Kqueue API

The first creates a new kqueue, which is a notification channel, or queue, where the application registers which events it is interested in, and where it retrieves the events from the kernel. The returned value from kqueue() is treated as an ordinary descriptor, and can in turn be passed to poll(), select(), or even registered in another kqueue.

The second call is used by the application both to register new events with the kqueue, and to retrieve any pending events. By combining the registration and retrieval process, the number of system calls needed is reduced. Changes that should be applied to the kqueue are given in the changelist, and any returned events are placed in the eventlist, up to the maximum size allowed by nevents. The number of entries actually placed in the eventlist is returned by the kevent() call. The timeout parameter behaves in the same way as poll(): a zero-valued structure will check for pending events without sleeping, while a NULL value will block until woken up or an event is ready. An application may choose to separate the registration and retrieval calls by passing in a value of zero for nchanges or nevents, as appropriate.

Events are registered with the system by the application via a struct kevent, and an event is uniquely identified within the system by a (kq, ident, filter) tuple. In practical terms, this means that there can be only one (ident, filter) pair for a given kqueue.

The filter parameter is an identifier for a small piece of kernel code which is executed when there is activity from an event source, and is responsible for determining whether an event should be returned to the application or not. The interpretation of the ident, fflags, and data fields depends on which filter is being used to express the event. The current list of filters and their arguments are presented in the kqueue filter section.

The flags field is used to express what action should be taken on the kevent when it is registered with the system, and is also used to return filter-independent status information upon return. The valid flag bits are given in Figure 2.

Input flags:

    EV_ADD      Adds the event to the kqueue.
    EV_ENABLE   Permit kevent() to return the event if it is
                triggered.
    EV_DISABLE  Disable the event so kevent() will not return it.
                The filter itself is not disabled.
    EV_DELETE   Removes the event from the kqueue. Events which are
                attached to file descriptors are automatically
                deleted when the descriptor is closed.
    EV_CLEAR    After the event is retrieved by the user, its state
                is reset. This is useful for filters which report
                state transitions instead of the current state.
                Note that some filters may automatically set this
                flag internally.
    EV_ONESHOT  Causes the event to return only the first occurrence
                of the filter being triggered. After the user
                retrieves the event from the kqueue, it is deleted.

Output flags:

    EV_EOF      Filters may set this flag to indicate
                filter-specific EOF conditions.
    EV_ERROR    If an error occurs when processing the changelist,
                this flag will be set.

Figure 2: Flag values for struct kevent

The udata field is passed in and out of the kernel unchanged, and is not used in any way. The usage of this field is entirely application dependent, and is provided as a way to efficiently implement a function dispatch routine, or otherwise add an application identifier to the kevent structure.
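To make the calling convention concrete, the following minimal sketch (not from the original text) registers a read filter on a descriptor and collects a single event in the same kevent() call; the choice of STDIN_FILENO as the monitored descriptor is an arbitrary placeholder.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <err.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct kevent change, event;
        int kq, n, fd = STDIN_FILENO;   /* any readable descriptor */

        if ((kq = kqueue()) == -1)
            err(1, "kqueue");

        /* Register interest in read activity on fd. */
        EV_SET(&change, fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);

        /* One call both applies the change and waits for an event. */
        n = kevent(kq, &change, 1, &event, 1, NULL);
        if (n == -1)
            err(1, "kevent");
        if (n > 0)
            printf("fd %d readable, %d bytes pending\n",
                (int)event.ident, (int)event.data);

        close(kq);
        return (0);
    }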
4.1 Kqueue filters

The design of the kqueue system is based on the notion of filters, which are responsible for determining whether an event has occurred or not, and may also record extra information to be passed back to the user. The interpretation of certain fields in the kevent structure depends on which filter is being used. The current implementation comes with a few general purpose event filters, which are suitable for most purposes. These filters include:

    EVFILT_READ
    EVFILT_WRITE
    EVFILT_AIO
    EVFILT_VNODE
    EVFILT_PROC
    EVFILT_SIGNAL

The READ and WRITE filters are intended to work on any file descriptor, and the ident field contains the descriptor number. These filters closely mirror the behavior of poll() or select(), in that they are intended to return whenever there is data ready to read, or if the application can write without blocking. The kernel function corresponding to the filter depends on the descriptor type, so the implementation is tailored for the requirements of each type of descriptor in use. In general, the amount of data that is ready to read (or able to be written) will be returned in the data field within the kevent structure, where the application is free to use this information in whatever manner it desires. If the underlying descriptor supports a concept of EOF, then the EV_EOF flag will be set in the flags word of the structure as soon as it is detected, regardless of whether there is still data left for the application to read.

For example, the read filter for socket descriptors is triggered as long as there is data in the socket buffer greater than the SO_LOWAT mark, or when the socket has shutdown and is unable to receive any more data. The filter will return the number of bytes pending in the socket buffer, as well as set an EOF flag for the shutdown case. This provides more information that the application can use while processing the event. As EOF is explicitly returned when the socket is shutdown, the application no longer needs to make an additional call to read() in order to discover an EOF condition.

A non kqueue-aware application using the asynchronous I/O (aio) facility starts an I/O request by issuing aio_read() or aio_write(). The request then proceeds independently of the application, which must call aio_error() repeatedly to check whether the request has completed, and then eventually call aio_return() to collect the completion status of the request. The AIO filter replaces this polling model by allowing the user to register the aio request with a specified kqueue at the time the I/O request is issued, and an event is returned under the same conditions when aio_error() would successfully return. This allows the application to issue an aio_read() call, proceed with the main event loop, and then call aio_return() when the kevent corresponding to the aio is returned from the kqueue, saving several system calls in the process.

The SIGNAL filter is intended to work alongside the normal signal handling machinery, providing an alternate method of signal delivery. The ident field is interpreted as a signal number, and on return, the data field contains a count of how often the signal was sent to the application. This filter makes use of the EV_CLEAR flag internally, by clearing its state (the count of signal occurrences) after the application receives the event notification.

The VNODE filter is intended to allow the user to register an interest in changes that happen within the filesystem. Accordingly, the ident field should contain a descriptor corresponding to an open file or directory. The fflags field is used to specify which actions on the descriptor the application is interested in on registration, and upon return, which actions have occurred. The possible actions are:

    NOTE_DELETE
    NOTE_WRITE
    NOTE_EXTEND
    NOTE_ATTRIB
    NOTE_LINK
    NOTE_RENAME

These correspond to the actions that the filesystem performs on the file and thus will not be explained here. These may be OR-d together in the returned kevent, if multiple actions have occurred; e.g., a file was written, then renamed.

The final general purpose filter is the PROC filter, which detects process changes. For this filter, the ident field is interpreted as a process identifier. This filter can watch for several types of events, and the fflags that control this filter are outlined in Figure 3.

Input/Output flags:

    NOTE_EXIT   Process exited.
    NOTE_FORK   Process called fork().
    NOTE_EXEC   Process executed a new process via execve(2) or a
                similar call.
    NOTE_TRACK  Follow a process across fork() calls. The parent
                process will return with NOTE_TRACK set in the
                fflags field, while the child process will return
                with NOTE_CHILD set in fflags and the parent PID
                in data.

Output flags only:

    NOTE_CHILD     This is the child process of a TRACKed process
                   which called fork().
    NOTE_TRACKERR  This flag is returned if the system was unable to
                   attach an event to the child process, usually due
                   to resource limitations.

Figure 3: Flags for EVFILT_PROC

A PROC filter may be attached to any process in the system that the application can see; it is not limited to its descendants. The filter may attach to a privileged process; there are no security implications, as all information can be obtained through 'ps'. The term 'see' is specific to FreeBSD's jail code, which isolates certain groups of processes from each other.

There is a single notification for each fork(), if the FORK flag is set in the process filter. If the TRACK flag is set, then the filter actually creates and registers a new knote, which is in turn attached to the new process. This new knote is immediately activated, with the CHILD flag set.

The fork functionality was added in order to trace the process's execution. For example, suppose that an EVFILT_PROC filter with the flags (FORK, TRACK, EXEC, EXIT) is registered for process A, which then forks off two children, processes B & C. Process C then immediately forks off another process D, which calls exec() to run another program, which in turn exits. If the application was to call kevent() at this point, it would find 4 kevents waiting:

    ident: A, fflags: FORK
    ident: B, fflags: CHILD,             data: A
    ident: C, fflags: CHILD, FORK,       data: A
    ident: D, fflags: CHILD, EXEC, EXIT, data: C

The knote attached to the child is responsible for returning the mapping between the parent and child process ids.
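As a concrete sketch of the PROC filter described above (this listing is not from the original paper), the following program forks a child and waits for its NOTE_EXIT notification; for NOTE_EXIT, the data field carries the exit status.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <err.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct kevent kev;
        pid_t pid;
        int kq, n;

        if ((kq = kqueue()) == -1)
            err(1, "kqueue");

        if ((pid = fork()) == 0) {      /* child: exit after a moment */
            sleep(1);
            _exit(42);
        }

        /* Watch the child for exit and fork activity. */
        EV_SET(&kev, pid, EVFILT_PROC, EV_ADD | EV_ENABLE,
            NOTE_EXIT | NOTE_FORK, 0, NULL);
        if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1)
            err(1, "kevent register");

        n = kevent(kq, NULL, 0, &kev, 1, NULL);
        if (n > 0 && (kev.fflags & NOTE_EXIT))
            printf("pid %d exited, status word %d\n",
                (int)kev.ident, (int)kev.data);
        return (0);
    }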

5 Usage and Examples

Kqueue is designed to reduce the overhead incurred by poll() and select(), by efficiently notifying the user of an event that needs attention, while also providing as much information about that event as possible. However, kqueue is not designed to be a drop-in replacement for poll; in order to get the greatest benefits from the system, existing applications will need to be rewritten to take advantage of the unique interface that kqueue provides.

A traditional application built around poll will have a single structure containing all active descriptors, which is passed to the kernel every time the application goes through the central event loop. A kqueue-aware application will need to notify the kernel of any changes to the list of active descriptors, instead of passing in the entire list. This can be done either by calling kevent() for each update to the active descriptor list, or by building up a list of descriptor changes and then passing this list to the kernel the next time the event loop is called. The latter approach offers better performance, as it reduces the number of system calls made.

While the previous API section for kqueue may appear to be complex at first, much of the complexity stems from the fact that there are multiple event sources and multiple filters. A program which only wants READ/WRITE events is actually fairly simple. The examples on the following pages illustrate how a program using poll() can be easily converted to use kqueue(), and also present several code fragments illustrating the use of the other filters.

The code in Figure 4 illustrates typical usage of the poll() loop, while the code in Figure 5 is a line-by-line conversion of the same code to use kqueue. While admittedly this is a simplified example, the mapping between the two calls is fairly straightforward. The main stumbling block to a conversion may be the lack of a function equivalent to update_fd, which makes changes to the array containing the pollfd or kevent structures.

    handle_events()
    {
        int i, n, timeout = TIMEOUT;

        n = poll(pfd, nfds, timeout);
        if (n <= 0)
            goto error_or_timeout;
        for (i = 0; n != 0; i++) {
            if (pfd[i].revents == 0)
                continue;
            n--;
            if (pfd[i].revents & (POLLERR | POLLNVAL))
                /* error */
            if (pfd[i].revents & POLLIN)
                readable_fd(pfd[i].fd);
            if (pfd[i].revents & POLLOUT)
                writeable_fd(pfd[i].fd);
        }
        ...
    }

    update_fd(int fd, int action, int events)
    {
        if (action == ADD) {
            pfd[fd].fd = fd;
            pfd[fd].events = events;
        } else
            pfd[fd].fd = -1;
    }

Figure 4: Original poll() code

    handle_events()
    {
        int i, n;
        struct timespec timeout = { TMOUT_SEC, TMOUT_NSEC };

        n = kevent(kq, ch, nchanges, ev, nevents, &timeout);
        if (n <= 0)
            goto error_or_timeout;
        for (i = 0; i < n; i++) {
            if (ev[i].flags & EV_ERROR)
                /* error */
            if (ev[i].filter == EVFILT_READ)
                readable_fd(ev[i].ident);
            if (ev[i].filter == EVFILT_WRITE)
                writeable_fd(ev[i].ident);
        }
        ...
    }

    update_fd(int fd, int action, int filter)
    {
        EV_SET(&ch[nchanges], fd, filter,
            action == ADD ? EV_ADD : EV_DELETE,
            0, 0, 0);
        nchanges++;
    }

Figure 5: Direct conversion to kevent()

If the udata field is initialized to the correct function prior to registering a new kevent, it is possible to simplify the dispatch loop even more, as shown in Figure 6.

Figure 7 contains a fragment of code that illustrates how to have a signal event delivered to the application. Note the call to signal() which establishes a NULL signal handler. Prior to this call, the default action for the signal is to terminate the process. Ignoring the signal simply means that no signal handler will be called after the signal is delivered to the process.

Figure 8 presents code that monitors a descriptor corresponding to a file on a UFS filesystem for specified changes. Note the use of EV_CLEAR, which resets the event after it is returned; without this flag, the event would be repeatedly returned.

    int i, n;
    struct timespec timeout = { TMOUT_SEC, TMOUT_NSEC };
    void (*fcn)(struct kevent *);

    n = kevent(kq, ch, nchanges, ev, nevents, &timeout);
    if (n <= 0)
        goto error_or_timeout;
    for (i = 0; i < n; i++) {
        if (ev[i].flags & EV_ERROR)
            /* error */
        fcn = ev[i].udata;
        fcn(&ev[i]);
    }

Figure 6: Using udata for direct function dispatch

    struct kevent ev;
    struct timespec nullts = { 0, 0 };

    EV_SET(&ev, SIGHUP, EVFILT_SIGNAL,
        EV_ADD | EV_ENABLE, 0, 0, 0);
    kevent(kq, &ev, 1, NULL, 0, &nullts);
    signal(SIGHUP, SIG_IGN);

    for (;;) {
        n = kevent(kq, NULL, 0, &ev, 1, NULL);
        if (n > 0)
            printf("signal %d delivered %d times\n",
                ev.ident, ev.data);
    }

Figure 7: Using kevent for signal delivery

    struct kevent ev;
    struct timespec nullts = { 0, 0 };

    EV_SET(&ev, fd, EVFILT_VNODE,
        EV_ADD | EV_ENABLE | EV_CLEAR,
        NOTE_RENAME | NOTE_WRITE |
        NOTE_DELETE | NOTE_ATTRIB, 0, 0);
    kevent(kq, &ev, 1, NULL, 0, &nullts);

    for (;;) {
        n = kevent(kq, NULL, 0, &ev, 1, NULL);
        if (n > 0) {
            printf("The file was");
            if (ev.fflags & NOTE_RENAME)
                printf(" renamed");
            if (ev.fflags & NOTE_WRITE)
                printf(" written");
            if (ev.fflags & NOTE_DELETE)
                printf(" deleted");
            if (ev.fflags & NOTE_ATTRIB)
                printf(" chmod/chowned");
            printf("\n");
        }
    }

Figure 8: Using kevent to watch for file changes
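The paper gives no listing for the AIO filter; the fragment below is a sketch of how an aio request can be tied to a kqueue on later FreeBSD releases, where the aiocb's sigevent selects kevent-based completion notification. The SIGEV_KEVENT mechanism and field names come from the FreeBSD aio(4) interface rather than the original text, and kq and fd are assumed to already exist (with aio.h, string.h, and sys/event.h included).

    struct aiocb iocb;
    struct kevent ev;
    char buf[4096];

    memset(&iocb, 0, sizeof(iocb));
    iocb.aio_fildes = fd;
    iocb.aio_buf = buf;
    iocb.aio_nbytes = sizeof(buf);
    iocb.aio_offset = 0;

    /* Deliver the completion event to kq instead of using a signal. */
    iocb.aio_sigevent.sigev_notify = SIGEV_KEVENT;
    iocb.aio_sigevent.sigev_notify_kqueue = kq;
    iocb.aio_sigevent.sigev_value.sival_ptr = &iocb;

    if (aio_read(&iocb) == -1)
        err(1, "aio_read");

    /* Later, in the main loop: for EVFILT_AIO the returned ident
       is the pointer to the aiocb that completed. */
    if (kevent(kq, NULL, 0, &ev, 1, NULL) > 0 &&
        ev.filter == EVFILT_AIO)
        printf("aio complete: %zd bytes\n",
            aio_return((struct aiocb *)ev.ident));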

6 Implementation

The focus of activity in the kqueue system centers on a data structure called a knote, which directly corresponds to the kevent structure seen by the application. The knote ties together the data structure being monitored, the filter used to evaluate the activity, the kqueue that it is on, and links to other knotes. The other main data structure is the kqueue itself, which serves a twofold purpose: to provide a queue containing knotes which are ready to deliver to the application, and to keep track of the knotes which correspond to the kevents the application has registered its interest in. These goals are accomplished by the use of three sub data structures attached to the kqueue:

1. A list for the queue itself, containing knotes that have previously been marked active.

2. A small hash table used to look up knotes whose ident field does not correspond to a descriptor.

3. A linear array of singly linked lists indexed by descriptor, which is allocated in exactly the same fashion as a process' open file table.

The hash table and array are lazily allocated, and the array expands as needed according to the largest file descriptor seen. The kqueue must record all knotes that have been registered with it in order to destroy them when the kq is closed by the application. In addition, the descriptor array is used when the application closes a specific file descriptor, in order to delete any knotes corresponding with the descriptor. An example of the links between the data structures is shown in Figure 9.

6.1 Registration

Initially, the application calls kqueue() to allocate a new kqueue (henceforth referred to as kq). This involves allocation of a new descriptor, a struct kqueue, and an entry for this structure in the open file table. Space for the array and hash tables is not initialized at this time.

The application then calls kevent(), passing in a pointer to the changelist that should be applied. The kevents in the changelist are copied into the kernel in chunks, and then each one is passed to kqueue_register() for entry into the kq. The kqueue_register() function uses the (ident, filter) pair to look up a matching knote attached to the kq. If no knote is found, a new one may be allocated if the EV_ADD flag is set. The knote is initialized from the kevent structure passed in, then the filter attach routine (detailed below) is called to attach the knote to the event source. Afterwards, the new knote is linked to either the array or hash table within the kq. If an error occurs while processing the changelist, the kevent that caused the error is copied over to the eventlist for return to the application. Only after the entire changelist is processed is kqueue_scan() called in order to dequeue events for the application. The operation of this routine is detailed in the Delivery section.

6.2 Filters

Each filter provides a vector consisting of three functions: {attach, detach, filter}. The attach routine is responsible for attaching the knote to a linked list within the structure which receives the events being monitored, while the detach routine is used to remove the knote from this list. These routines are needed because the locking requirements and location of the attachment point are different for each data structure.

The filter routine is called when there is any activity from the event source, and is responsible for deciding whether the activity satisfies a condition that would cause an event to be reported to the application.
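The vector described above can be pictured as a small structure of function pointers; the sketch below models the idea with an invented read-style filter. The type names and the helper are illustrative stand-ins, not the kernel's actual definitions.

    /* Illustrative model of a filter vector; not the kernel's code. */
    struct knote_model {
        long data;      /* filled in by the filter as a side effect */
        void *source;   /* object being monitored */
    };

    struct filter_vector {
        int  (*attach)(struct knote_model *kn);
        void (*detach)(struct knote_model *kn);
        int  (*event)(struct knote_model *kn, long hint); /* 1 = activate */
    };

    /* A read-style filter: active while unread bytes remain,
       matching the level-triggered condition described earlier. */
    static int
    sockbuf_read_event(struct knote_model *kn, long hint)
    {
        long pending = *(long *)kn->source;  /* stand-in for the
                                                socket buffer count */

        kn->data = pending;                  /* report byte count */
        return (pending > 0);
    }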


Figure 9: Two kqueues, their descriptor arrays, and active lists. Note that kq A has two knotes queued in its active list, while kq B has none. The socket has a klist for each sockbuf, and as shown, knotes on a klist may belong to different kqueues.

The specifics of the condition are encoded within the filter, and thus are dependent on which filter is used, but normally correspond to specific states, such as whether there is data in the buffer, or if an error has been observed. The filter must return a boolean value indicating whether an event should be delivered to the application. It may also perform some "side effects" if it chooses, by manipulating the fflag and data values within the knote. These side effects may range from merely recording the number of times the filter routine was called, to having the filter copy extra information out to the application.

All three routines completely encapsulate the information required to manipulate the event source. No other code in the kqueue system is aware of where the activity comes from or what an event represents, other than asking the filter whether this knote should be activated or not. This simple encapsulation is what allows the system to be extended to other event sources simply by adding new filters.

6.3 Activity on Event Source

When activity occurs (a packet arrives, a file is modified, a process exits), a data structure is typically modified in response. Within the code path where this happens, a hook is placed for the kqueue system; this takes the form of a knote() call. This function takes a singly linked list of knotes (unimaginatively referred to here as a klist) as an argument, along with an optional hint for the filter. The knote() function then walks the klist making calls to the filter routine for each knote. As the knote contains a reference to the data structure that it is attached to, the filter may choose to examine the data structure in deciding whether an event should be reported. The hint is used to pass in additional information, which may not be present in the data structure the filter examines.

If the filter decides the event should be returned, it returns a truth value and the knote() routine links the knote onto the tail end of the active list in its corresponding kqueue, for the application to retrieve. If the knote is already on the active list, no action is taken, but the call to the filter occurs in order to provide an opportunity for the filter to record the activity.

6.4 Delivery

When kqueue_scan() is called, it appends a special knote marker at the end of the active list, which bounds the amount of work that should be done; if this marker is dequeued while walking the list, it indicates that the scan is complete. A knote is then removed from the active list, and the flags field is checked for the EV_ONESHOT flag. If this is not set, then the filter is called again with a query hint; this gives the filter a chance to confirm that the event is still valid, and insures correctness. The rationale for this is the case where data arrives for a socket, which causes the knote to be queued, but the application happens to call read() and empty the socket buffer before calling kevent. If the knote was still queued, then an event would be returned telling the application to read an empty buffer. Checking with the filter at the time the event is dequeued assures us that the information is up to date. It may also be worth noting that if a pending event is deactivated via EV_DISABLE, its removal from the active queue is delayed until this point.

Information from the knote is then copied into a kevent structure within the eventlist for return to the application. If EV_ONESHOT is set, then the knote is deleted and removed from the kq. Otherwise, if the filter indicates that the event is still active and EV_CLEAR is not set, then the knote is placed back at the tail of the active list. The knote will not be examined again until the next scan, since it is now behind the marker which will terminate the scan. Operation continues until either the marker is dequeued, or there is no more space in the eventlist, at which time the marker is forcibly dequeued, and the routine returns.
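The marker technique can be condensed into the following self-contained model. All the type and helper names here are invented for illustration; the real kernel code also handles locking, wakeups, and knote deletion, so this is a sketch of the control flow rather than the actual kqueue_scan() routine.

    #include <stddef.h>

    struct mknote {
        struct mknote *next;
        int oneshot;    /* models EV_ONESHOT */
        int clear;      /* models EV_CLEAR */
        int (*still_valid)(struct mknote *);  /* filter re-query */
    };

    struct mkqueue {    /* active list: singly linked, tail pointer */
        struct mknote *head, **tailp;
    };

    static void
    enqueue(struct mkqueue *q, struct mknote *kn)
    {
        kn->next = NULL;
        *q->tailp = kn;
        q->tailp = &kn->next;
    }

    static struct mknote *
    dequeue(struct mkqueue *q)
    {
        struct mknote *kn = q->head;

        if (kn != NULL && (q->head = kn->next) == NULL)
            q->tailp = &q->head;
        return (kn);
    }

    int
    scan(struct mkqueue *q, struct mknote **events, int nevents)
    {
        struct mknote marker = { NULL, 0, 0, NULL };
        struct mknote *kn;
        int n = 0;

        enqueue(q, &marker);                 /* bound the scan */
        while ((kn = dequeue(q)) != &marker) {
            if (!kn->oneshot && !kn->still_valid(kn))
                continue;                    /* stale event: drop it */
            events[n++] = kn;                /* deliver to application */
            if (!kn->oneshot && !kn->clear)
                enqueue(q, kn);              /* requeue behind marker */
            if (n == nevents)                /* eventlist full: stop */
                break;
        }
        /* If we stopped early, the marker is still queued; the real
           code forcibly dequeues it here before returning. */
        return (n);
    }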
6.5 Miscellaneous Notes

Since an ordinary file descriptor references the kqueue, it can take part in any operations that can normally be performed on a descriptor. The application may select(), poll(), close(), or even create a kevent referencing a kqueue; in these cases, an event is delivered when there is a knote queued on the active list. The ability to monitor a kqueue from another kqueue allows an application to implement a priority hierarchy by choosing which kqueue to service first.

The current implementation does not pass kqueue descriptors to children unless the new child will share its file table with the parent via rfork(RFFDG). This may be viewed as an implementation detail; fixing this involves making a copy of all knote structures at fork() time, or marking them as copy-on-write.

Knotes are attached to the data structure they are monitoring via a linked list, contrasting with the behavior of poll() and select(), which record a single pid within the selinfo structure. While this may be a natural outcome of the way knotes are implemented, it also means that the kqueue system is not susceptible to select collisions. As each knote is queued in the active list, only processes sleeping on that kqueue are woken up.

As hints are passed to all filters on a klist, regardless of type, when a single klist contains multiple event types, care must be taken to insure that the hint uniquely identifies the activity to the filters. An example of this may be seen in the PROC and SIGNAL filters. These share the same klist, hung off of the process structure, where the hint value is used to determine whether the activity is signal or process related.

Each kevent that is submitted to the system is copied into kernel space, and events that are dequeued are copied back out to the eventlist in user space. While adding slightly more copy overhead, this approach was preferred over an AIO style solution where the kernel directly updates the status of a control block that is kept in user space. The rationale for this was that it would be easier for the user to find and resolve bugs in the application if the kernel is not allowed to write directly to locations in user space which the user could possibly have freed and reused by accident. This has turned out to have an additional benefit, as applications may choose to "fire and forget" by submitting an event to the kernel and not keeping additional state around.
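As one illustration of monitoring a kqueue from another kqueue, the fragment below (not from the paper) registers a high-priority kqueue inside a master kqueue; a read event on the inner descriptor means that the inner queue has knotes pending. The variable names are placeholders.

    int kq_master, kq_hi;
    struct kevent kev;

    kq_master = kqueue();
    kq_hi = kqueue();

    /* A kqueue descriptor reports "readable" when events are queued. */
    EV_SET(&kev, kq_hi, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
    kevent(kq_master, &kev, 1, NULL, 0, NULL);

    /* In the main loop, drain kq_hi before servicing anything else. */
    if (kevent(kq_master, NULL, 0, &kev, 1, NULL) > 0 &&
        (int)kev.ident == kq_hi) {
        /* ... call kevent(kq_hi, ...) to fetch the urgent events ... */
    }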
the overhead needed to establish the internal knote data As hints are passed to all filters on a klist, regardless structure. of type, when a single klist contains multiple event types, As shown in Figure 10, it takes twice as long to add a care must be taken to insure that the hint uniquely iden- new knote to a kqueue as opposed to calling poll. This tifies the activity to the filters. An example of this may implies that for applications that poll a descriptor exactly be seen in the PROC and SIGNAL filters. These share once, kevent will not provide a performance gain, due the same klist, hung off of the process structure, where to the amount of overhead required to set up the knote the hint value is used to determine whether the activity is linkages. The differing results between the socket and signal or process related. file descriptors reflects the different code paths used to Each kevent that is submitted to the system is copied check activity on different file types in the system. into kernel space, and events that are dequeued are After the initial EV ADD call to add the descriptors to copied back out to the eventlist in user space. While the kqueue, the time required to check these descriptors adding slightly more copy overhead, this approach was was recorded; this is shown in the ”kq descriptor” line preferred over an AIO style solution where the kernel di- in the graph above. In this case, there was no difference rectly updates the status of a control block that is kept in between file types. In all cases, the time is constant, since user space. The rationale for this was that it would be there is no activity on any of the registered descriptors. easier for the user to find and resolve bugs in the appli- This provides a lower bound on the time required for a cation if the kernel is not allowed to write directly to lo- given kevent call, regardless of the number of descriptors 1400 700 "kq_register_sockets" "poll_sockets" "kq_register_files" "kq_active_sockets" "poll_sockets" "poll_files" 1200 "poll_files" "kq_active_files" "kq_descriptors" 600

Figure 10: Time needed for initial kqueue call. Note the y-axis origin is shifted in order to better see the kqueue results.

Figure 11: Time required when all descriptors are active.

The main cost associated with the kevent call is the process of registering a new knote with the system; however, once this is done, there is negligible cost for monitoring the descriptor if it is inactive. This contrasts with poll, which incurs the same cost regardless of whether the descriptor is active or inactive.

The upper bound on the time needed for a kevent call after the descriptors are registered would be if every single descriptor was active. In this case the kernel would have to do the maximum amount of work, by checking each descriptor's filter for validity and then returning every kevent in the kqueue to the user. The results of this test are shown in Figure 11, with the poll values reproduced again for comparison.

In this graph, the lines for kqueue are worst case times, in which every single descriptor is found to be active. The best case time is near zero, as given by the earlier "kq descriptor" line. In an actual workload, the actual time is somewhere in between, but in either case, the total time taken is less than that for poll().

As evidenced by the two graphs above, the amount of time saved by kqueue over poll depends on the number of times that a descriptor is monitored for an event, and the amount of activity that is present on a descriptor. Figure 12 shows the accumulated time required to check a single descriptor, for kqueue and for poll. The poll line is constant, while the two kqueue lines give the best and worst case scenarios for a descriptor. Times here are averaged from the 100 file descriptor case in the previous graphs. This graph shows that despite a higher startup time for kqueue, unless the descriptor is polled less than 4 times, kqueue has a lower overall cost than poll.

Figure 12: Accumulated time for kqueue vs poll

7.1 Individual operations

The state of kqueue is maintained by using the action field in the kevent to alter the state of the knotes. Each of these actions takes a different amount of time to perform, as illustrated by Figure 13. These operations are performed on socket descriptors; the graphs for file descriptors (ttys) are similar. While enable/disable have a lower cost than add/delete, recall that this only affects returning the kevent to the user; the filter associated with the knote will still be executed.

Figure 13: Time required for each kqueue operation.

7.2 Application level benchmarks

Web Proxy Cache

Two real-world applications were modified to use the kqueue system call: a commercial web caching proxy server, and the thttpd [9] web server. Both of these applications were run on the platform described earlier. The client machine for running network tests was an Alpha 264DP, using a single 21264 EV6 666MHz processor, and 512MB memory, running FreeBSD 4.3-RC. Both machines were equipped with a Netgear GA620 Gigabit Ethernet card, and connected via a Cisco Catalyst 3500 XL Gigabit switch. No other traffic was present on the switch at the time of the tests. For the web cache, all files were loaded into the cache from the web server before starting the test.

In order to generate a workload for the web proxy server with the equipment available, the http_load [8] tool was used. This was configured to request URLs from a set of 1000 1KB and 10 1MB cached documents from the proxy, while maintaining 100 parallel connections. Another program was used to keep a varying number of idle connections open to the server. This approach follows earlier research that shows that web servers have a small set of active connections, and a larger number of inactive connections [2].

Performance data for the system was collected using the native kernel profiler (kgmon) on the FreeBSD system while the cache was under load.

Figure 14 shows the amount of CPU time that each system call (and its direct descendants) uses as the number of active connections is held at 100, and the number of cold connections varies. Observing the graph, we see that the kqueue times are constant regardless of the number of inactive connections, as expected from the microbenchmarks. The select based approach starts nearing the saturation point as the amount of idle CPU time decreases. Figure 15 shows a graph of idle CPU time, as measured by vmstat, and it can be seen that the system is essentially out of resources by the time there are 2000 connections.

Figure 14: Kernel CPU time consumed by system call.

Figure 15: Amount of idle time remaining.

thttpd Web Server

The thttpd web server [9] was modified to add kqueue support to its fdwatch descriptor management code, and the performance of the resulting server was compared to that of the original code.

40 9 the signal queue, it is now “stale” in the sense that it no

30 longer corresponds to an open file descriptor. What is worse, the descriptor may have been reused for another

Response Time (milliseconds) 20 open file, resulting in a false reporting of activity on the new descriptor. A further drawback to the signal queue 10 approach is that the use of signals as a notification mech- anism precludes having multiple event handlers, making 0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 Number of Idle Connections it unsuitable for use in library code. Figure 16: Response time from httperf get next event

request rate, while the number of errors was negligible This proposed API by Banga, Mogul and Druschel [2] motivated the author to implement system under

( ¢(: in all cases). FreeBSD that worked in a similar fashion, using their The idle time on the server machine was monitored concept of hinting. The practical experience gained from during the test using vmstat. The unmodified thttpd real world usage of an application utilizing this approach server runs out of cpu when the number of idle connec- inspired the concept of kqueue. tions is around 600, while the modified server still has approximately 48connections. While the original system described by Banga, et.al., performs event coalescing, it also suffers from “stale” events, in the same fashion of POSIX signal queues. 8 Related Work Their implementation is restricted to socket descriptors, and also uses a list of fixed size to hold hints, falling back This section presents some of the other work done in this to the behavior of a normal select() upon list overflow. area. SGI’s /dev/imon POSIX signal queues /dev/imon [3] is an inode monitor, and where events POSIX signal queues are part of the Single Unix Speci- within the filesystem are sent back to user-space. This fication [5], and allow signals to be queued for delivery is the only other interface that the author is aware of that to an application, along with some additional informa- is capable of performing similar operations as the VN- tion. For every event that should be delivered to the ap- ODE filter. However, only a single process can read the plication, a signal (typically SIGIO) is generated, and a device node at once; SGI handles this by creating a dae- structure containing the file descriptor is allocated and mon process called fmon that the application may contact placed on the queue for the signal handler to retrieve. to request information from. No attempt at event aggregation is performed, so there is no fixed bound on the queue length for returned events. Sun’s /dev/poll Most implementations silently bound the queue length to a fixed size, dropping events when the queue becomes This system [4] appears to come closest to the design too large. The structure allocation is performed when the outlined in this paper, but has some limitations as com- event is delivered, opening up the possibility of losing pared to kqueue. Applications are able to open /dev/poll events during a resource shortage. to obtain a filedescriptor that behaves similarly to a kq The signal queues are stateless, so the application must descriptor. Events are passed to the kernel by performing handle the bookkeeping required to determine whether a write() on the descriptor, and are read back via an () there is residual information left from the initial event. call. The returned information is limited to an revent The application must also be prepared to handle stale field, similarly to that found in poll(), and the interface events as well. As an example, consider what happens restricted to sockets; it cannot handle FIFO descriptors when a packet arrives, causing an event to be placed on or other event sources (signals, filesystem events). a signal queue, and then dequeued by the signal handler. The interface also does not automatically handle the Before any additional processing can happen, a second case where a descriptor is closed by the application, but instead keeps returning POLLNVAL for that descriptor sources. Since the original implementation was released, until removed from the interest set or reused by the ap- the system has been extended down to the device layer, plication. 
and now is capable of handling device-specific events as The descriptor obtained by opening /dev/poll can not well. A device manager application is planned for this in turn be selected on, precluding construction of hier- capability, where the user is notified of any change in hot- archical or prioritized queues. There is no equivalent to swappable devices in the system. Another filter that is in kqueue’s filters for extending the behavior of the system, the process of being added is a TIMER filter which pro- nor support for direct function dispatch as there is with vides the application with as many oneshot or periodic kqueue. timers as needed. Additionally, a high performance ker- nel audit trail facility may be implemented with kqueue, 9 Conclusion by having the user use a kqueue filter to selectively choose which auditing events should be recorded. Applications handling a large number of events are de- pendent on the efficiency of event notification and deliv- References ery. This paper has presented the design criteria for a generic and scalable event notification facility, as well as [1] BANGA, G., AND MOGUL, J. C. Scalable ker- an alternate API. This API was implemented in FreeBSD nel performance for Internet servers under realistic and committed to the main CVS tree in April 2000. loads. In Proceedings of the 1998 USENIX Annual Overall, the system performs to expectations, and ap- Technical Conference (New Orleans, LA, 1998). plications which previously found that select or poll was a bottleneck have seen performance gains from using [2] BANGA, G., MOGUL, J. C., AND DRUSCHEL, P. kqueue. The author is aware of the system being used A scalable and explicit event delivery mechanism in several major applications such as webservers, web for UNIX. In USENIX Annual Technical Confer- proxy servers, irc daemons, netnews transports, and mail ence (1999), pp. 253–265. servers, to name a few. [3] /dev/imon. http://techpubs.sgi.com/ The implementation described here has been adopted library/tpl/cgi-bin/getdoc.cgi? by OpenBSD, and is in the process of being brought into coll=0650&db=man%&fname=/usr/ NetBSD as well, so the API is not limited to a single share/catman/a_man/cat7/imon.z. . While the measurements in this pa- per have concentrated primarily on the socket descrip- [4] /dev/poll. http://docs.sun.com/ab2/ tors, other filters also provide performance gains. coll.40.6/REFMAN7/@Ab2PageView/ The “tail -f” command in FreeBSD was historically 55123.

implemented by stat’ing the file every ;= second in or- der to see if the file had changed. Replacing this polling [5] GROUP, T. Single unix specification, approach with a kq VNODE filter provides the same 1997. http://www.opengroup.org/ functionality with less overhead, for those underlying online-pubs?DOC=007908799. filesystems that support kqueue event notification. [6] MCVOY, L. W., AND STAELIN, C. lm- The AIO filter is used to notify the application when bench: Portable tools for performance analysis. an AIO request is completed, enabling the main dispatch In USENIX Annual Technical Conference (1996), loop to be simplified to a single kevent call instead of a pp. 279–294. combination of poll, aio error, and aio suspend calls. The DNS resolver library routines (res *) used select() [7] MOSBERGER, D., AND JIN, T. httperf: A tool for internally to order to wait for a response from the name- measuring web server performance. In First Work- server. On the FreeBSD project’s heavily loaded e-mail shop on Internet Server Performance (June 1998), exploder which uses postfix for mail delivery, the system ACM, pp. 59—67. was seeing an extremely high number of select collisions, which causes every process using select() to be woken [8] POSKANZER, J. http load. http://www. up. Changing the resolver library to use kqueue was a acme.com/software/http_load/. successful example of using a private kqueue within a li- [9] POSKANZER, J. thttpd. http://www.acme. brary routine, and also resulted in a performance gain by com/software/thttpd/. eliminating the select collisions. The author is not aware of any other UNIX system [10] PROVOS, N., AND LEVER, C. Scalable network which is capable of handling multiple event sources, nor i/o in , 2000. one that can be trivially extended to handle additional