Exception-Less System Calls for Event-Driven Servers

Livio Soares and Michael Stumm
Department of Electrical and Computer Engineering, University of Toronto

Abstract

Event-driven architectures are currently a popular design choice for scalable, high-performance server applications. For this reason, operating systems have invested in efficiently supporting non-blocking and asynchronous I/O, as well as scalable event-based notification systems.

We propose the use of exception-less system calls as the main operating system mechanism to construct high-performance event-driven server applications. Exception-less system calls have four main advantages over traditional operating system support for event-driven programs: (1) any system call can be invoked asynchronously, even system calls that are not file descriptor based, (2) support in the operating system kernel is non-intrusive as code changes are not required for each system call, (3) processor efficiency is increased since mode switches are mostly avoided when issuing or executing asynchronous operations, and (4) enabling multi-core execution for event-driven programs is easier, given that a single user-mode execution context can generate enough requests to keep multiple processors/cores busy with kernel execution.

We present libflexsc, an asynchronous system call and notification library suitable for building event-driven applications. Libflexsc makes use of exception-less system calls through our Linux kernel implementation, FlexSC. We describe the port of two popular event-driven servers, memcached and nginx, to libflexsc. We show that exception-less system calls increase the throughput of memcached by up to 35% and nginx by up to 120% as a result of improved processor efficiency.
1 Introduction

In a previous publication, we introduced the concept of exception-less system calls [28]. With exception-less system calls, instead of issuing system calls in the traditional way, using a trap (exception) to switch to the kernel for the processing of the system call, applications issue kernel requests by writing to a reserved syscall page, shared between the application and the kernel, and then switching to another user-level thread ready to execute, without having to enter the kernel. On the kernel side, special in-kernel syscall threads asynchronously execute the posted system calls obtained from the shared syscall page, storing the results to the syscall page after their servicing. This approach enables flexibility in the scheduling of operating system work, both in the form of kernel request batching and (in the case of multi-core processors) in the form of allowing operating system and application execution to occur on different cores. This not only significantly reduces the number of costly domain switches, but also significantly increases temporal and spatial locality at both user and kernel level, thus reducing pollution effects on processor structures.

Our implementation of exception-less system calls in the Linux kernel (FlexSC) was accompanied by a user-mode POSIX compatible thread package, called FlexSC-Threads, that transparently translates legacy synchronous system calls into exception-less ones. FlexSC-Threads primarily targets highly threaded server applications, such as Apache and MySQL. Experiments demonstrated that FlexSC-Threads increased the throughput of Apache by 116% while reducing request latencies by 50%, and increased the throughput of MySQL by 40% while reducing request latencies by roughly 30%, requiring no changes to these applications.

In this paper we report on our subsequent investigations into whether exception-less system calls are suitable for event-driven application servers and, if so, whether exception-less system calls are effective in improving throughput and reducing latencies. Event-driven application server architectures handle concurrent requests by using just a single thread (or one thread per core) so as to reduce application-level context switching and the memory footprint that many threads otherwise require. They make use of non-blocking or asynchronous system calls to support the concurrent handling of requests. The belief that event-driven architectures have superior performance characteristics is why this architecture has been widely adopted for developing high-performance and scalable servers [12, 22, 23, 26, 30]. Widely used application servers with event-driven architectures include memcached and nginx.

The design and implementation of operating system support for asynchronous operations, along with event-based notification interfaces to support event-driven architectures, has been an active area of both research and development [4, 7, 8, 11, 13, 14, 16, 22, 23, 30]. Most of the proposals have a few common characteristics. First, the interfaces exposed to user-mode are based on file descriptors (with the exception of [4, 16] and LAIO [11]). Consequently, resources that are not encapsulated as descriptors (e.g., memory) are not supported. Second, their implementation typically involved significant restructuring of kernel code paths into an asynchronous state-machine in order to avoid blocking the user execution context. Third, and most relevant to our work, while the system calls used to request operating system services are designed not to block execution, applications still issue system calls synchronously, raising a processor exception and switching execution domains for every request, status check, or notification of completion.

In this paper, we demonstrate that the exception-less system call mechanism is well suited for the construction of event-based servers and that the exception-less mechanism presents several advantages over previous event-based systems:

1. General purpose. Exception-less system calls are a general mechanism that supports any system call and is not necessarily tied to operations with file descriptors. For this reason, exception-less system calls provide asynchronous operation on any operating system managed resource.

2. Non-intrusive kernel implementation. Exception-less system calls are implemented using light-weight kernel threads that can block without affecting user-mode execution. For this reason, kernel code paths do not need to be restructured as asynchronous state-machines; in fact, no changes are necessary to the code of standard system calls.

3. Efficient user and kernel mode execution. One of the most significant advantages of exception-less system calls is the ability to decouple system call invocation from execution. Invocation of system calls can be done entirely in user-mode, allowing for truly asynchronous execution of user code. As we show in this paper, this enables significant performance improvements over the most efficient non-blocking interface on Linux.

4. Simpler multi-processing. With traditional system calls, the only mechanism available for applications to exploit multiple processors (cores) is to use an operating system visible execution context, be it a thread or a process. With exception-less system calls, however, operating system work can be issued and distributed to multiple remote cores. As an example, in our implementation of memcached, a single memcached thread was sufficient to generate work to fully utilize 4 cores.

Specifically, we describe the design and implementation of an asynchronous system call notification library, libflexsc, which is intended to efficiently support event-driven programs. To demonstrate the performance advantages of exception-less system calls for event-driven servers, we have ported two popular and widely deployed event-based servers to libflexsc: memcached and nginx. We briefly describe the effort in porting these applications to libflexsc. We show how the use of libflexsc can significantly improve the performance of these two servers over their original implementation using non-blocking I/O and Linux's epoll interface. Our experiments demonstrate throughput improvements in memcached of up to 35% and nginx of up to 120%. As anticipated, we show that the performance improvements largely stem from increased efficiency in the use of the underlying processor.

Figure 1: System call (pwrite) impact on user-mode instructions per cycle (IPC) as a function of system call frequency for Xalan. The y-axis shows the IPC degradation, split into its direct and indirect components; the x-axis shows user-mode instructions between exceptions (log scale, 1K to 100K).

2 Background and Motivation: Operating System Support for I/O Concurrency

Server applications that are required to efficiently handle multiple concurrent requests rely on operating system primitives that provide I/O concurrency. These primitives typically influence the programming model used to implement the server. The two most commonly used models for I/O concurrency are threads and non-blocking/asynchronous I/O.

Thread based programming is often considered the simplest, as it does not require tracking the progress of I/O operations (which is done implicitly by the operating system kernel). A disadvantage of threaded servers that utilize a separate thread per request/transaction is the inefficiency of handling a large number of concurrent requests. The two main sources of inefficiency are the extra memory usage allocated to thread stacks and the overhead of tracking and scheduling a large number of execution contexts.

To avoid the overheads of threading, developers have adopted the use of event-driven programming. In an event-driven architecture, the program is structured as a state machine that is driven by the progress of certain operations, typically involving I/O. Event-driven programs make use of non-blocking or asynchronous primitives, along with event notification systems, to deal with concurrent I/O operations. These primitives allow for uninterrupted execution that enables a single execution context (e.g., thread) to fully utilize the processor. The main disadvantage of using non-blocking or asynchronous I/O is that it entails a more complex programming model. The application is responsible for tracking the status of I/O operations and the availability of I/O resources. In addition, the application must support multiplexing the execution of stages of multiple concurrent requests.
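To make the cost of this split concrete, the skeleton below shows the structure that such servers build on Linux's epoll facility. It is an illustrative sketch (not code taken from memcached or nginx); the point is that every readiness notification, status check, and retried read in it is a separate synchronous system call.

#include <sys/epoll.h>
#include <unistd.h>
#include <errno.h>

/* Illustrative epoll-based event loop: each iteration performs several
 * synchronous system calls (epoll_wait, read, ...), and each of them
 * raises a processor exception and switches to kernel mode. */
void event_loop(int epfd)
{
    struct epoll_event events[64];
    char buf[4096];

    for (;;) {
        /* one mode switch to fetch the set of ready descriptors */
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            /* another mode switch per ready descriptor; the call may
             * still return EAGAIN, forcing the application to retry */
            ssize_t len = read(fd, buf, sizeof(buf));
            if (len < 0 && errno == EAGAIN)
                continue;   /* not ready after all */
            /* ... parse the request, issue non-blocking writes, ... */
        }
    }
}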

In both models of I/O concurrency, the operating system kernel plays a central role in supporting servers in multiplexing the execution of concurrent requests. Consequently, to achieve efficient server execution, it is critical for the operating system to expose and support efficient I/O multiplexing primitives. To quantify the relevance of operating system kernel execution experimentally, we measured key execution metrics of two popular event-driven servers: memcached and nginx.

Server (workload)    | Syscalls per Request | User Instructions per Syscall | User IPC | Kernel Instructions per Syscall | Kernel IPC
Memcached (memslap)  | 2                    | 3750                          | 0.80     | 5420                            | 0.59
nginx (ApacheBench)  | 12                   | 1460                          | 0.46     | 6540                            | 0.49

Table 1: The average number of instructions executed on different workloads before issuing a syscall, the average number of system calls required to satisfy a single request, and the resulting processor efficiency, shown as instructions per cycle (IPC) of both user and kernel execution. Memcached and nginx are event-driven servers using Linux's epoll interface.

Table 1 shows the number of instructions executed in user and kernel mode, on average, before changing mode, for nginx and memcached. (Sections 5 and 6 explain the servers and workloads in more detail.) These applications use non-blocking I/O, along with the Linux epoll facility for event notification. Despite the fact that the epoll facility is considered the most scalable approach to I/O concurrency on Linux, management of both I/O requests and events is inherently split between the operating system kernel and the application. This fundamental property of event notification systems implies that there is a need for continuous communication between the application and the operating system kernel. In the case of nginx, for example, we observe that communication with the kernel occurs, on average, every 1470 instructions.

We argue that the high frequency of mode switching in these servers, which is inherent to current event-based facilities, is largely responsible for the low efficiency of user and kernel execution, as quantified by the instructions per cycle (IPC) metric in Table 1. In particular, there are two costs that affect the efficiency of execution when frequently switching modes: (1) a direct cost that stems from the processor exception associated with the system call instruction, and (2) an indirect cost resulting from the pollution of important processor structures.

To quantify the performance interference caused by frequent mode switching, we used the Core i7 hardware performance counters (HPC) to measure the efficiency of processor execution while varying the number of mode switches of a benchmark. Figure 1 depicts the performance degradation of user-mode execution when issuing varying frequencies of pwrite system calls on a high IPC workload, Xalan, from the SPEC CPU 2006 benchmark suite. We used a benchmark from SPEC CPU 2006 as these benchmarks have been created to avoid significant use of system services, and should spend only 1-2% of their time executing in kernel-mode. Xalan has a baseline user-mode IPC of 1.46, but the IPC degrades by up to 65% when executing a pwrite every 1,000-2,000 instructions, yielding an IPC between 0.50 and 0.58.

The figure also depicts the breakdown of user-mode IPC degradation due to direct and indirect costs. The degradation due to the direct cost was measured by issuing a null system call, while the indirect portion is calculated by subtracting the direct cost from the degradation measured when issuing a pwrite system call. For high frequency system call invocation (once every 2,000 instructions or less, which is the case for nginx), the direct cost of raising an exception and the subsequent flushing of the processor pipeline is the largest source of user-mode IPC degradation. However, for medium frequencies of system call invocation (once per 2,000 to 100,000 instructions), the indirect cost of system calls is the dominant source of user-mode IPC degradation.

Given the inefficiency of event-driven server execution due to frequent mode switches, we envision that these servers could be adapted to make use of exception-less system calls. Beyond the potential to improve server performance, we believe exception-less system calls are an appealing mechanism for event-driven programming, as: (1) they are as simple to program to as asynchronous I/O (no retry logic is necessary, unlike non-blocking I/O), and (2) they are more generic than asynchronous I/O, which mostly supports descriptor based operations and which is only partially supported on some operating systems due to its implementation complexity (e.g., Linux does not offer an asynchronous version of the zero-copy sendfile()).

One of the proposals that comes closest to achieving the goals of event-driven programming with exception-less system calls is lazy asynchronous I/O (LAIO), proposed by Elmeleegy et al. [11]. However, in their proposal, system calls are still issued synchronously, using traditional exception based calls. Furthermore, a completion notification is also needed whenever an operation blocks, which generates another interruption in user execution.

3 Exception-Less System Call Interface and Implementation

In this work, we argue for the use of exception-less system calls as a mechanism to improve processor efficiency while multiplexing execution between user and kernel modes in event-driven servers. Exception-less system calls are a mechanism for requesting kernel services that does not require the use of synchronous processor exceptions [28]. The key benefit of exception-less system calls is the flexibility in scheduling system call execution, ultimately providing improved locality of execution of both user and kernel code.

Exception-less system calls have been shown to improve the performance of highly threaded applications, by using a specialized user-level threading package that transparently converts synchronous system calls into exception-less ones [28]. The goal of this work is to extend the original proposal by enabling the explicit use of exception-less system calls by event-driven applications.

In this section, we briefly describe the original exception-less system call implementation (FlexSC) for the benefit of the reader; readers already familiar with exception-less system calls may skip to Section 4. For space considerations, this is a simple overview of exception-less system calls; for more information, we refer the reader to the original exception-less system calls proposal [28].

3.1 Exception-Less System Calls

The design of exception-less system calls consists of two components: (1) an exception-less interface for user-space threads to register system calls, along with (2) an in-kernel threading system that allows the delayed (asynchronous) execution of system calls, without interrupting or blocking the thread in user-space.

3.1.1 Exception-Less Syscall Interface

The interface for exception-less system calls is simply a set of memory pages that is shared between user and kernel space. The shared memory pages, henceforth referred to as syscall pages, contain exception-less system call entries. Each entry records the request status, the system call number, the arguments, and the return value (Figure 2).

Figure 2: 64-byte syscall entry from the syscall page. The fields are the system call number, the number of arguments, the entry status, the arguments (arg 0 through arg 6), and the return value.

The traditional invocation of a system call occurs by populating predefined registers with call information and issuing a specific machine instruction that immediately raises an exception. In contrast, to issue an exception-less system call, user-mode must find a free entry in the syscall page and populate the entry with the appropriate values using regular store instructions. The user thread can then continue executing without interruption. It is the responsibility of the user thread to later verify the completion of the system call by reading the status information in the entry. Neither of these operations, issuing a system call nor verifying its completion, causes an exception to be raised.
3.1.2 Decoupling Execution from Invocation

Along with the exception-less interface, the operating system kernel must support delayed execution of system calls. Unlike exception-based system calls, the exception-less system call interface does not result in an explicit kernel notification, nor does it provide an execution stack. To support decoupled system call execution, we use a special type of kernel thread, which we call a syscall thread. Syscall threads always execute in kernel mode, and their sole purpose is to pull requests from syscall pages and execute them on behalf of the user thread. Figure 3 illustrates the difference between traditional synchronous system calls and our proposed split system call model.

Figure 3: Illustration of synchronous and exception-less system call invocation. The left diagram shows the sequential nature of exception-based system calls, while the right diagram depicts exception-less user and kernel communication through shared memory. Panels: (a) traditional, exception-based system call; (b) exception-less system call.

3.2 Implementation - FlexSC

Our implementation of the exception-less system call mechanism is called FlexSC (Flexible System Call) and was prototyped as an extension to the Linux kernel. Two new system calls were added to Linux as part of FlexSC: flexsc_register and flexsc_wait.

flexsc_register() This system call is used by processes that wish to use the FlexSC facility. Registration involves two steps: mapping one or more syscall pages into the user-space virtual memory of the process, and spawning one syscall thread per entry in the syscall pages.

flexsc_wait() The decoupled execution model of exception-less system calls creates a challenge in user-mode execution, namely what to do when the user-space thread has nothing more to execute and is waiting on pending system calls. The solution we adopted is to require that the user explicitly communicate to the kernel that it cannot progress until one of the issued system calls completes, by invoking the flexsc_wait system call (this is akin to the aio_suspend() or epoll_wait() calls). FlexSC will later wake up the user-space thread when at least one of the posted system calls is complete.

3.2.1 Syscall Threads

Syscall threads are the mechanism used by FlexSC to allow for exception-less execution of system calls. The Linux system call execution model has influenced some implementation aspects of syscall threads in FlexSC: (1) the virtual address space in which system call execution occurs is the address space of the corresponding process, and (2) the current thread context can be used to block execution should a necessary resource not be available (for example, when waiting for I/O).

To resolve the virtual address space requirement, syscall threads are created during flexsc_register. Syscall threads are thus "cloned" from the registering process, resulting in threads that share the original virtual address space. This allows the transfer of data from/to user-space with no modification to Linux's code.

FlexSC would ideally never allow a syscall thread to sleep. If a resource is not currently available, notification of the resource becoming available should be arranged, and execution of the next pending system call should begin. However, implementing this behavior in Linux would require significant changes and a departure from the basic Linux architecture. Instead, we adopted a strategy that allows FlexSC to maintain the Linux thread blocking architecture, while requiring only minor modifications (3 lines of code) to the Linux context switching code, by creating multiple syscall threads for each process that registers with FlexSC.

In fact, FlexSC spawns as many syscall threads as there are entries available in the syscall pages mapped in the process. This provisions for the worst case, where every pending system call blocks during execution. Spawning hundreds of syscall threads may seem expensive, but Linux in-kernel threads are typically much lighter weight than user threads: all that is needed is a task_struct and a small, 2-page stack for execution. All the other structures (page table, file table, etc.) are shared with the user process. In total, only 10KB of memory is needed per syscall thread.

Despite spawning multiple threads, only one syscall thread is active per application and core at any given point in time. If a system call does not block, then all the work is executed by a single syscall thread, while the remaining ones wait on a work-queue. When a syscall thread needs to block, for whatever reason, immediately before it is put to sleep, FlexSC notifies the work-queue, and another thread wakes up to immediately start executing the next system call. Later, when resources become free, current Linux code wakes up the waiting thread (in our case, a syscall thread) and resumes its execution, so it can post its result to the syscall page and return to wait in the FlexSC work-queue.

As previously described, there is never more than one syscall thread concurrently executing per core for a given process. However, in the multicore case, for the same process, there can be as many syscall threads concurrently executing on the entire system as there are cores. To avoid cache-line contention of syscall pages amongst cores, before a syscall thread begins executing calls from a syscall page, it locks the page until all its submitted calls have been issued. Since FlexSC processes typically map multiple syscall pages, each core on the system can schedule a syscall thread to work independently, executing calls from different syscall pages.

4 Libflexsc: Asynchronous system call and notification library

To allow event-driven applications to interface with exception-less system calls, we have designed and implemented a simple asynchronous system call and notification library, libflexsc. Libflexsc provides an event loop for the program, which must register system call requests along with callback functions. The main event loop in libflexsc invokes the corresponding program-provided callback when the system call has completed.

The event loop and callback handling in libflexsc were inspired by the libevent asynchronous event notification library [24]. The main difference between these two libraries is that libevent is designed to monitor low-level events, such as changes in the availability of input or output, and operates at the file descriptor level. The application is notified of the availability, but its intended operation is still not guaranteed to succeed. For example, a socket may contain available data to be read, but if the application requires more data than is available, it must restate interest in the event to try again. With libflexsc, on the other hand, events correspond to the completion of a previously issued exception-less system call. With this model, which is closer to that of asynchronous I/O, it is less likely that applications need to include cumbersome logic to retry incomplete or failed operations.

Contrary to common implementations of asynchronous I/O, FlexSC does not provide a signal or upcall based completion notification. Completion notification is a mechanism for the kernel to notify a user thread that a previously issued asynchronous request has completed. It is often implemented through a signal or other upcall mechanism. The main reason FlexSC does not offer completion notification is that signals and upcalls entail the same processor performance problems as system calls: direct and indirect processor pollution due to switching between kernel and user execution.

To overcome the lack of completion notifications, the libflexsc event main loop must poll the syscall pages currently in use for the completion of system calls. To minimize overhead, the polling for system call completion is performed only when all currently pending callback handlers have completed. Given enough work (e.g., handling many connections concurrently), polling should happen infrequently. In the case that all callback handlers have executed and no new system call has completed, libflexsc falls back on calling flexsc_wait() (described in Section 3.2).
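The internals of this loop are not shown in the paper. The following rough sketch, which reuses the illustrative entry layout from Section 3.1.1 and assumes hypothetical helpers (first_event, next_event), shows how such a loop could alternate between running callbacks and polling syscall pages, falling back on flexsc_wait() when nothing has completed.

/* Sketch only: illustrative types and helpers, not the libflexsc source. */
struct flexsc_event {
    void (*handler)(void *arg);  /* application callback            */
    void *arg;                   /* e.g., the conn of Figure 4      */
    long ret;                    /* return value of the system call */
    struct flexsc_entry *entry;  /* backing entry in a syscall page */
};

int flexsc_main_loop(void)
{
    for (;;) {
        int progress = 0;

        /* Scan registered events for completed syscall entries. */
        for (struct flexsc_event *ev = first_event(); ev; ev = next_event(ev)) {
            struct flexsc_entry *e = ev->entry;
            if (e == NULL || e->status != ENTRY_DONE)
                continue;
            ev->ret = e->ret;
            e->status = ENTRY_FREE;  /* the entry can be reused     */
            ev->handler(ev->arg);    /* drive the next program state */
            progress = 1;
        }

        /* Nothing completed and no callback ran: block in the kernel
         * until at least one posted call finishes (Section 3.2). */
        if (!progress)
            flexsc_wait();
    }
}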

conn master;

int main(void)
{
    /* init library and register with kernel */
    flexsc_init();

    /* not performance critical, do synchronously */
    master.fd = bind_and_listen(PORT_NUMBER);

    /* prepare accept */
    master.event->handler = conn_accepted;
    flexsc_accept(&master.event, master.fd, NULL, 0);

    /* jump to event loop */
    return flexsc_main_loop();
}

/* Called when accept() returns */
void conn_accepted(conn *c)
{
    conn *new_conn = alloc_new_conn();

    /* get the return value of the accept() */
    new_conn->fd = c->event->ret;
    new_conn->event->handler = data_read;

    /* issue another accept on the master socket */
    flexsc_accept(&c->event, c->fd, NULL, 0);

    if (new_conn->fd != -1)
        flexsc_read(&new_conn->event, new_conn->fd,
                    new_conn->buf, new_conn->size);
}

void data_read(conn *c)
{
    char *reply_file;

    /* read of 0 means connection closed */
    if (c->event->ret == 0) {
        flexsc_close(NULL, c->fd);
        return;
    }

    reply_file = parse_request(c->buf, c->event->ret);

    if (reply_file) {
        c->event->handler = file_opened;
        flexsc_open(&c->event, c->fd, reply_file, O_RDONLY);
    }
}

void file_opened(conn *c)
{
    int file_fd = c->event->ret;

    c->event->handler = file_sent;
    /* issue asynchronous sendfile */
    flexsc_sendfile(&c->event, c->fd, file_fd, NULL, file_len);
}

void file_sent(conn *c)
{
    /* no callback necessary to handle close */
    flexsc_close(NULL, c->fd);
}

Figure 4: Example of a network server using libflexsc.

4.1 Example server

A simplified implementation of a network server using libflexsc is shown in Figure 4. The program logic is divided into states which are driven by the completion of a previously issued system call. The system calls used in this example that are prefixed with "flexsc_" are issued using the exception-less interface (accept, read, open, sendfile, close). When the library detects the completion of a system call, its corresponding callback handler is invoked, effectively driving the next stage of the state machine. During normal operation, the execution flow of this example would progress in the following order: (1) main, (2) conn_accepted, (3) data_read, (4) file_opened, and (5) file_sent. As mentioned, file and network descriptors do not need to be marked as non-blocking.

It is worth noting that stages may generate several system call requests. For example, the conn_accepted() function not only issues a read on the newly accepted connection, it also issues another accept system call on the master listening socket in order to pipeline further incoming requests. In addition, for improved efficiency, the server may choose to issue multiple accepts concurrently (not shown in this example; see the sketch below). This would allow the operating system to accept multiple connections without having to first execute user code, as is the case with traditional event-based systems, thus reducing the number of mode switches for each new connection.

Finally, not all system calls must provide a callback, as a notification may not be of interest to the programmer. For example, in the file_sent function listed in the simplified server code, the request to close the file does not provide a callback handler. This may be useful if the completion of a system call does not drive an additional state in the program and the return code of the system call is not of interest.
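The paper mentions issuing multiple accepts concurrently but does not show it. A hypothetical variant of the setup code, using the same illustrative flexsc_* API and conn type as Figure 4 (and assuming each slot's event was initialized like master's), might pre-post a small batch of accept requests:

/* Sketch only: pre-post several concurrent accepts so the kernel can
 * accept connections back-to-back without returning to user code. */
#define ACCEPT_DEPTH 8

conn accept_slots[ACCEPT_DEPTH];

void post_accepts(int listen_fd)
{
    for (int i = 0; i < ACCEPT_DEPTH; i++) {
        accept_slots[i].fd = listen_fd;
        accept_slots[i].event->handler = conn_accepted;
        flexsc_accept(&accept_slots[i].event, listen_fd, NULL, 0);
    }
}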

4.2 Cancellation support

A new feature we had to add to FlexSC in order to support event-based applications is the ability to cancel submitted system calls. Cancellation of in-progress system calls may be necessary in certain cases. For example, servers typically implement a timeout feature for reading requests on connections. With non-blocking system calls, reads are implemented by waiting for a notification that the socket has become readable. If the event does not occur within the timeout grace period, the connection is closed. With exception-less system calls, the read request is issued before the server knows if or when new data will arrive (e.g., the conn_accepted function in Figure 4). To properly implement a timeout, the application must cancel pending reads once the grace period has passed.

To implement cancellation in FlexSC, we introduced a new cancel status value to be used in the status field of the syscall entry (Figure 2). When syscall threads, in the kernel, check for newly submitted work, they now also check for entries in the cancel state. To cancel an in-progress operation, we first identify the syscall thread that is executing the request that corresponds to the cancelled entry. This is easily accomplished since each core has a map of syscall entries to syscall threads for all in-progress system calls. Once identified, a signal is sent to the appropriate syscall thread to interrupt its execution. In the Linux kernel, signal delivery that occurs during system call execution interrupts the system call even if the execution context is asleep (e.g., waiting for I/O). When the syscall thread wakes up, it sets the return value to EINTR and marks the entry as done in the corresponding syscall entry, after which the user-mode process knows that the system call has been cancelled and the syscall entry can be reused.

Due to its asynchronous implementation, cancellation requests are not guaranteed to succeed. The window of time between when the application modifies the status field and when the syscall thread is notified of the cancellation may be sufficiently long for the system call to complete (successfully). The application must check the system call return code to disambiguate between successfully completed calls and cancelled ones. This behavior is analogous to the cancellation support of asynchronous I/O implemented by several systems (e.g., aio_cancel).

5 Exception-Less Memcached and nginx

This section describes the process of porting two popular event-based servers to use exception-less system calls. In both cases, the applications were modified to conform to the libflexsc interface. However, we strived to keep the structure of the code as similar to the original as possible, to make performance comparisons meaningful.

To reduce the complexity of porting these applications to exception-less system calls, we exploited the fact that FlexSC allows exception-less system calls to co-exist with synchronous ones in the same process. Consequently, we have not modified all system calls to use exception-less versions. We focused on the system calls that are issued in the code paths involved in handling requests (which correspond to the hot paths during normal operation).

Server    | Total lines of code | Lines of code modified | Files modified
memcached | 8356                | 293                    | 3
nginx     | 82819               | 255                    | 16

Table 2: Statistics regarding the code size and modifications needed to port applications to libflexsc, measured in lines of code and number of files.

5.1 Memcached - Memory Object Cache

Memcached is a distributed memory object caching system, built as an in-memory key-value store [12]. It is typically used to cache results from slower services such as databases and web servers. It is currently used by several popular web sites as a way to improve the performance and scalability of their web services. We used version 1.4.5 as the basis for our port.

To achieve good I/O performance, memcached was built as an event-based server. It uses libevent to make use of the non-blocking execution available on modern operating system kernels. For this reason, porting memcached to use exception-less system calls through libflexsc was the simpler of the two ports. Table 2 lists the number of lines of code and the number of files that were modified. For memcached, the majority of the changes were done in a single file (memcached.c), and the changes were mostly centered around modifying system calls, as well as calls to libevent.

Multicore/multiprocessor support has been introduced to memcached, despite most of the code assuming single-threaded execution. To support multiple processors, memcached spawns worker threads which communicate via a pipe with a master thread. The master thread is responsible for accepting incoming connections and handing them to the worker threads.

5.2 nginx Web Server

Nginx is an open-source HTTP web server considered to be light-weight and high-performance; it is currently one of the most widely deployed open-source web servers [26]. Nginx implements I/O concurrency by natively using the non-blocking and asynchronous operations available in the operating system kernel. On Linux, nginx uses the epoll notification system. We based our port on the 0.9.2 development version of nginx.

Despite having had to change a similar number of lines as with memcached, the port of nginx was more involved, as evidenced by the number of files changed (Table 2). This was mainly due to the fact that nginx's core code is significantly larger than that of memcached (about 10x), and its state machine logic is more complex.

We substituted all system calls that could potentially be invoked while handling client requests with the corresponding versions in libflexsc. The system calls that were associated with a file descriptor based event handler (such as accept, read and write) were straightforward to convert, as these were already programmed as separate stages in the code. However, the system calls that were previously invoked synchronously (e.g., open, fstat, and getdents) needed more work. In most cases, we needed to split a single stage of the state machine into two or more stages to allow asynchronous execution of these system calls. In a few cases, such as setsockopt and close, we executed the calls asynchronously, without a callback notification, which did not require a new stage in the flow of the program.

Finally, for system calls that not only return a status value, but also fill in a user supplied memory pointer with a data structure, we had to ensure that this memory was correctly managed and passed to the newly created event handler. This requirement prevented the use of stack allocated data structures for exception-less system calls (e.g., programs typically use a stack allocated "struct stat" data structure to pass to the fstat system call).

6 Experimental Evaluation

In this section, we evaluate the performance of exception-less system call support for event-driven servers. We present experimental results of the two previously discussed event-driven servers: memcached and nginx.

FlexSC was implemented in the Linux kernel, version 2.6.33. The baseline measurements were collected using unmodified Linux (same version), with the servers configured to use the epoll interface. In the graphs shown, we identify the baseline configuration as "epoll", and the system with exception-less system calls as "flexsc".

The experiments presented in this section were run on an Intel Nehalem (Core i7) processor with the characteristics shown in Table 3. The processor has 4 cores, each with 2 hyper-threads. We disabled the hyper-threads, as well as the "TurboBoost" feature, for all our experiments to more easily analyze the measurements obtained.

Component          | Specification
Cores              | 4
Cache line         | 64 B for all caches
Private L1 i-cache | 32 KB, 3 cycle latency
Private L1 d-cache | 32 KB, 4 cycle latency
Private L2 cache   | 512 KB, 11 cycle latency
Shared L3 cache    | 8 MB, 35-40 cycle latency
Memory             | 250 cycle latency (avg.)
TLB (L1)           | 64 (data) + 64 (instr.) entries
TLB (L2)           | 512 entries

Table 3: Characteristics of the 2.3GHz Core i7 processor.

For the experiments involving both servers, requests were generated by a remote client connected to our test machine through a 1 Gbps network, using a dedicated router. The client machine contained a dual core Core2 processor, running the same Linux installation as the test machine.

All values reported in our evaluation represent the average of 5 separate runs.

6.1 Memcached

The workload we used to drive memcached is the memslap benchmark that is distributed with the libmemcached client library. The benchmark performs a sequence of memcache get and set operations, using randomly generated keys and data. We configured memslap to issue 10% set requests and 90% get requests.

For the baseline experiments (Linux epoll), we configured memcached to run with the same number of threads as processor cores, as we experimentally observed this yielded the best baseline performance. For our exception-less version, a single memcached thread was sufficient to generate enough kernel work to keep all cores busy.

Figure 5 shows the throughput obtained from executing the baseline and exception-less memcached on 1, 2 and 4 cores. We varied the number of concurrent connections generating requests from 1 to 1024. For the single core experiments, FlexSC employs system call batching, and for the multicore experiments it additionally dynamically distributes system calls to other cores to maximize core locality.

Figure 5: Comparison of Memcached throughput (requests/sec.) of Linux epoll and FlexSC executing on 1, 2 and 4 cores, as a function of request concurrency.

The results show that with 64 or more concurrent requests, memcached programmed to libflexsc outperforms the version using Linux epoll. Throughput is improved by as much as 25 to 35%, depending on the number of cores used.

To better understand the source of the performance improvement, we collected several performance metrics of the processor using hardware performance counters. Figure 6 shows the effects of executing with FlexSC while servicing 768 concurrent memslap connections. The most important metric listed is the cycles per instruction (CPI) of the user and kernel mode for the different setups, as it summarizes the efficiency of execution (the lower the CPI, the more efficient the execution). The other values listed are normalized values of misses on the listed structure (the lower the misses, the more efficient the execution).

Figure 6: Comparison of processor performance metrics of Memcached execution using Linux epoll and FlexSC on 1 and 4 cores, while servicing 768 concurrent memslap connections. All values are normalized to baseline execution (epoll). The CPI columns show the normalized cycles per instruction, while the other columns depict the normalized misses of each processor structure (L3, d-cache, TLB, L2, i-cache, branch); lower is better in all cases.

The CPI of both kernel and user execution, on 1 and 4 cores, is improved with FlexSC. On a single core, user-mode CPI decreases by as much as 22%, and on 4 cores, we observe a 52% decrease in user-mode CPI. The data shows that, for memcached, the improved execution comes from a significant reduction in misses in the performance sensitive L1 caches, both in the data and instruction part (labelled as d-cache and i-cache).

The main reason for this drastic decrease in user CPI on 4 cores is that with traditional system calls, a user-mode thread must occupy each core to make use of it. With FlexSC, however, if a single user-mode thread generates many system requests, they can be distributed and serviced on remote cores. In this experiment, a single memcached thread was able to generate enough requests to occupy the remaining 3 cores. This way, the core executing the memcached thread was predominantly filled with state from the memcached process.

6.2 nginx

To evaluate the effect of exception-less execution on the nginx web server, we used two workloads: ApacheBench and a modified version of httperf. For both workloads, we present results with nginx executing on 1 and 2 cores. The results obtained with 4 cores were not meaningful as the client machine could not keep up with the server, making the client the bottleneck. For the baseline experiments (Linux epoll), we configured nginx to spawn one worker process per core, which nginx automatically assigns and pins to separate cores. With FlexSC, a single nginx worker thread was sufficient to keep all cores busy.

6.2.1 ApacheBench

ApacheBench is an HTTP workload generator that is distributed with Apache. It is designed to stress-test the Web server, determining the number of requests per second that can be serviced, with a varying number of concurrent requests.

Figure 7 shows the throughput numbers obtained on 1 and 2 cores when varying the number of concurrent ApacheBench client connections issuing requests to the nginx server. For this workload, system call batching on one core provides significant performance improvements: up to 70% with 256 concurrent requests. In the 2 core execution, we see that FlexSC provides a consistent improvement with 16 or more concurrent clients, achieving up to 120% higher throughput, showing the added benefit of dynamic core specialization.

Figure 7: Comparison of nginx throughput (requests/sec.) with ApacheBench when executing with Linux epoll and FlexSC on 1 and 2 cores, as a function of request concurrency.

Besides aggregate throughput, the latency of individual requests is an important metric when evaluating the performance of web servers. Figure 8 reports the mean latency, as reported by the client, with 256 concurrent connections. FlexSC reduces latency by 42% in single core execution, and by 58% in dual core execution. It is also interesting to note that adding a second core helps to reduce the average latency of servicing requests with FlexSC, which is not the case when using the epoll facility.

Figure 8: Comparison of nginx latency (ms) replying to 256 concurrent ApacheBench requests when executing with Linux epoll and FlexSC on 1 and 2 cores.

6.2.2 httperf

The httperf HTTP workload generator was built as a more realistic measurement tool for web server performance [19]. In particular, it supports session log files, and it models a partially open system (in contrast to ApacheBench, which models a closed system) [27]. For this reason, we do not control the number of concurrent connections to the server, but instead the request arrival rate. The number of concurrent connections is determined by how fast the server can satisfy incoming requests.

We modified httperf (we used the latest version, 0.9.0) so that it properly handles a large number of concurrent connections. In its original version, httperf uses the select system call to manage multiple connections. On Linux, this restricts the number of connections to 1024, which we found insufficient to fully stress the server. We modified httperf to use the epoll interface, allowing it to handle several thousand concurrent connections. We verified that the results of our modified httperf were statistically similar to those of the original httperf when using fewer than 1024 concurrent connections.

We configured httperf to connect using the HTTP 1.1 protocol and to issue 20 requests per connection. The session log contained requests to files ranging from 64 bytes to 8 kilobytes. We did not add larger files to the session as our network infrastructure is modest, at 1 Gbps, and we did not want the network to become a bottleneck.

Figure 9 shows the throughput of nginx executing on 1 and 2 cores, measured in megabits per second, obtained when varying the request rate of httperf. Both graphs show that the throughput of the server can satisfy the request rate up to a certain value; after that, the throughput is relatively stable and constant. For the single core case, the throughput of Linux epoll stabilizes after 20,000 requests per second, while with FlexSC, throughput increases up to 40,000 requests. Furthermore, FlexSC outperforms Linux epoll by as much as 120% when httperf issues 50,000 requests per second.

In the case of 2 core execution, nginx with Linux epoll reaches peak throughput at 35,000 requests per second, while FlexSC sustains improvements with up to 60,000 requests per second. In this case, the difference in throughput, in megabits per second, is as much as 77%.

Figure 9: Comparison of nginx throughput (Mbps) with httperf when executing with Linux epoll and FlexSC on 1 and 2 cores, as a function of request rate.

Figure 10: Comparison of processor performance metrics of nginx execution using epoll and FlexSC on 1 and 2 cores, while servicing 40,000 and 60,000 req/s, respectively. Values are normalized to baseline execution (epoll). The CPI columns show the normalized cycles per instruction, while the other columns depict the normalized misses of each processor structure (L3, d-cache, TLB, L2, i-cache, branch); lower is better in all cases.
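The paper does not state how the hardware counters behind Figures 6 and 10 were sampled. Purely as an illustration, user-mode IPC for a region of code can be measured on Linux with the perf_event_open interface, roughly as follows (error handling omitted); this is not the authors' measurement harness.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;   /* count user-mode execution only */
    attr.exclude_hv = 1;
    /* measure the calling process, on any CPU */
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

void report_user_ipc(void (*workload)(void))
{
    int insns  = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    uint64_t i, c;

    ioctl(insns,  PERF_EVENT_IOC_RESET,  0);
    ioctl(cycles, PERF_EVENT_IOC_RESET,  0);
    ioctl(insns,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);

    workload();                /* run the region of interest */

    ioctl(insns,  PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    read(insns,  &i, sizeof(i));
    read(cycles, &c, sizeof(c));
    printf("user IPC = %.2f\n", (double)i / (double)c);
}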

Similarly to the analysis of memcached, we collected processor performance metrics using hardware performance counters to analyze the execution of nginx with httperf. Figure 10 shows several metrics, normalized to the baseline (Linux epoll) execution. The results show that the efficiency of user-mode execution doubles in the single core case, and improves by 83% on 2 cores. Kernel-mode execution improves in efficiency by 25% and 21%, respectively. For nginx, not only are the L1 instruction and data caches better utilized (we observe less than half of the miss ratio in these structures), but the private L2 cache also sees its miss rate reduced to less than half of the baseline.

Although we observe an increase in some metrics, such as the TLB and kernel-mode L3 misses, the absolute values are small enough that they do not affect performance significantly. For example, the 80% increase in kernel-mode L3 misses in the 1 core case corresponds to the misses per kilo-instruction increasing from 0.51 to 0.92 (that is, for about every 2,000 instructions, an extra L3 miss is observed). Similarly, the 73% increase in misses of the user-mode TLB in the 2 core execution corresponds to only 4 extra TLB misses for every 1,000 instructions.

The hardware performance information collected during execution driven by ApacheBench showed similar trends (not shown in the interest of space).

7 Discussion: Scaling the Number of Concurrent System Calls

One concern not addressed in this work is that of efficiently handling applications that require a large number of concurrent outstanding system calls. Specifically, there are two issues that can hamper scaling with the number of calls: (1) the exception-less system call interface, and (2) the requirement of one syscall thread per outstanding system call. We briefly discuss mechanisms to overcome or alleviate these issues.

The exception-less system call interface, primarily composed of syscall entries, requires user and kernel code to perform linear scans of the entries to search for status updates. If the rate of entry modifications does not grow in the same proportion as the total number of entries, the overhead of scanning, normalized per modification, will increase. A concrete example of this is a server servicing a large number of slow or dormant clients, resulting in a large number of connections that are infrequently updated. In this case, requiring linear scans of syscall pages is inefficient. Instead of using syscall pages, the exception-less system call interface could be modified to implement two shared message queues: an incoming queue, with system call requests made by the application, and an outgoing queue, composed of system call requests serviced by the kernel. A queue based interface would potentially complicate user-kernel communication, but would avoid the overheads of linear scans across outstanding requests.

Another scalability factor to consider is the requirement of maintaining a syscall thread per outstanding system call. Despite the modest memory footprint of kernel threads and the low overhead of switching threads that share address spaces, these costs may become non-negligible with hundreds of thousands or millions of outstanding system calls.

To avoid these costs, applications may still utilize the epoll facility, but through the exception-less interface. This solution, however, would only work for resources that are supported by epoll. A more comprehensive solution would be to restructure the Linux kernel to support completely non-blocking kernel code paths. Instead of relying on the ability to block the current context of execution, the kernel could enqueue requests for contended resources, while providing a mechanism to continue the execution of enqueued requests when resources become available. With a non-blocking kernel structure, a single syscall thread would be sufficient to service any number of syscall requests.

One last option for mitigating both the interface and threading issues, one that does not involve changes to FlexSC, is to require user-space to throttle the number of outstanding system calls. In our implementation, throttling can be enforced within the libflexsc library by allocating a fixed number of syscall pages, and delaying new system calls whenever all entries are busy. The main drawback of this solution is that, in certain cases, extra care would be necessary to avoid a standstill situation (lack of forward progress).
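As a purely hypothetical illustration of the queue-based alternative mentioned earlier in this section (the paper does not specify such an interface), a pair of shared rings between the application and the kernel might take the following shape; all names, sizes, and synchronization details are assumptions.

/* Sketch only: one possible shape for the two shared message queues. */
#include <stdint.h>

#define RING_SLOTS 1024

struct syscall_msg {
    uint64_t syscall_nr;
    uint64_t args[6];
    uint64_t ret;
    uint64_t cookie;          /* lets the application match completions */
};

struct ring {
    volatile uint64_t head;   /* written by the producer */
    volatile uint64_t tail;   /* written by the consumer */
    struct syscall_msg slots[RING_SLOTS];
};

struct flexsc_queues {
    struct ring incoming;     /* application -> kernel: requests    */
    struct ring outgoing;     /* kernel -> application: completions */
};

/* Application side: enqueue a request instead of scanning for a free
 * syscall entry; completions would be dequeued from 'outgoing'. */
static int submit(struct ring *r, const struct syscall_msg *m)
{
    uint64_t h = r->head;
    if (h - r->tail == RING_SLOTS)
        return -1;                    /* queue full: caller must throttle */
    r->slots[h % RING_SLOTS] = *m;
    r->head = h + 1;                  /* publish only after the copy */
    return 0;
}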
Over the years, there have been numerous proposals and They introduced processor modifications to allow for studies exploring operating system support for I/O concur- hardware migration of threads, and evaluated the effects rency. Due to space constraints, we will briefly describe on migrating threads to specialize cores when they enter previous work that is directly related to this proposal. the kernel. Their simulation-based results show an im- Perhaps the most influential work in this area is Sched- provement of up to 20% on Apache; however, they explic- uler Activations that proposed addressing the issue of pre- itly do not model TLBs and provide for fast thread migra- empting user-mode threads by returning control of execu- tion between cores. On current hardware, synchronous tion to a user-mode scheduler, through a scheduler activa- thread migration between cores requires a costly inter- tion, upon experiencing a blocking event in the kernel [2]. processor interrupt. Recently, both Corey and Factored Operating System 9 Concluding Remarks (fos) have proposed dedicating cores for specific operating system functionality [31, 32]. There are two main differ- Event-driven architectures continue to be a popular de- ences between the core specialization possible with these sign option for implementing high-performance and scal- proposals and FlexSC. First, both Corey and fos require a able server applications. This paper proposes the use micro-kernel design of the operating system kernel in or- of exception-less system calls as the principal operat- der to execute specific kernel functionality on dedicated ing system primitive for efficiently supporting I/O con- cores. Second, FlexSC can dynamically adapt the propor- currency and event-driven execution. We describe sev- tion of cores used by the kernel, or cores shared by user eral advantages of exception-less system calls over tra- and kernel execution, depending on the current workload ditional support for I/O concurrency and event notifica- behavior. tion facilities, including: (1) any system call can be in- voked asynchronously, even system calls that are not file Explicit off-loading of select OS functionality to cores descriptor-based, (2) support in the operating system ker- has also been studied for performance [20, 21] and power nel is non-intrusive as code changes are not required to reduction in the presence of single-ISA heterogeneous each system call, (3) processor efficiency is high since multicores [18]. While these proposals rely on expensive mode switches are mostly avoided when issuing or exe- inter-processor interrupts to offload system calls, we hope cuting asynchronous operations, and (4) enabling multi- FlexSC can provide for a more efficient and flexible mech- core execution for event-driven programs is easier, given anism that can be used by such proposals. that a single user-mode execution context can generate a Zeldovich et al. introduced libasync-smp, a library that sufficient number of requests to keep multiple processors/- allows event-driven servers to execute on multiprocessors cores busy with kernel execution. by having programmers specify events that can be safely We described the design and implementation of handled concurrently [33]. 
8.3 System Call Batching

The idea of batching calls in order to save domain crossings has been extensively explored in the systems community. Closely related to this work is the work by Bershad et al. on user-level remote procedure calls (URPC) [6]. In particular, the use of shared memory to communicate requests, allied with the use of light-weight threads, is common to both URPC and FlexSC. In this work, we explored directly exposing the communication mechanism to the application, thereby removing the reliance on user-level threads.

Also related to exception-less system calls are multi-calls, which are used in both operating systems and paravirtualized hypervisors as a mechanism to address the high overhead of mode switching. Cassyopia is a compiler targeted at rewriting programs to collect many independent system calls and submit them as a single multi-call [25]. An interesting technique in Cassyopia, which could eventually be explored in conjunction with FlexSC, is the concept of a looped multi-call, where the result of one system call can be automatically fed as an argument to another system call in the same multi-call. In the context of hypervisors, both Xen and VMware currently support a special multi-call hypercall feature [5, 29].
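As a rough illustration of the multi-call idea, the sketch below describes a batch of independent system calls with a single descriptor array. The mc_entry structure and multicall_emulated function are hypothetical and emulated in user space (one trap per entry); an actual multi-call, or an exception-less syscall page, would submit the entire batch with a single kernel crossing.

    /* Illustrative multi-call descriptor: each entry names a system call, its
     * arguments, and a slot for its result. This emulation issues one trap per
     * entry; a true multi-call would execute the batch after a single crossing. */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    struct mc_entry {
        long nr;        /* system call number                  */
        long args[3];   /* arguments (three suffice for write) */
        long ret;       /* result, filled in after execution   */
    };

    static void multicall_emulated(struct mc_entry *calls, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            calls[i].ret = syscall(calls[i].nr, calls[i].args[0],
                                   calls[i].args[1], calls[i].args[2]);
    }

    int main(void)
    {
        struct mc_entry batch[] = {
            { SYS_write, { 1, (long)"hello ", 6 }, 0 },
            { SYS_write, { 1, (long)"world\n", 6 }, 0 },
        };
        multicall_emulated(batch, sizeof(batch) / sizeof(batch[0]));
        return 0;
    }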
9 Concluding Remarks

Event-driven architectures continue to be a popular design option for implementing high-performance and scalable server applications. This paper proposes the use of exception-less system calls as the principal operating system primitive for efficiently supporting I/O concurrency and event-driven execution. We describe several advantages of exception-less system calls over traditional support for I/O concurrency and event notification facilities, including: (1) any system call can be invoked asynchronously, even system calls that are not file descriptor-based; (2) support in the operating system kernel is non-intrusive, as code changes are not required for each system call; (3) processor efficiency is high, since mode switches are mostly avoided when issuing or executing asynchronous operations; and (4) enabling multi-core execution for event-driven programs is easier, given that a single user-mode execution context can generate a sufficient number of requests to keep multiple processors/cores busy with kernel execution.

We described the design and implementation of libflexsc, an asynchronous system call and notification library that makes use of our Linux exception-less system call implementation, called FlexSC. We show how libflexsc can be used to support current event-driven servers by porting two popular server applications to the exception-less execution model: memcached and nginx.

The experimental evaluation of libflexsc demonstrates that the proposed exception-less execution model can significantly improve the performance and efficiency of event-driven servers. Specifically, we observed that exception-less execution increases the throughput of memcached by up to 35%, and that of nginx by up to 120%. We show that the improvements, in both cases, derive from more efficient execution through improved use of processor resources.

10 Acknowledgements

This work was supported in part by Discovery Grant funding from the Natural Sciences and Engineering Research Council (NSERC) of Canada. We would like to thank the reviewers and our shepherd, Richard West, for their valuable feedback. Special thanks to Ioana Burcea for the innumerable discussions that influenced the development of this work.

References

[1] AGARWAL, A., HENNESSY, J., AND HOROWITZ, M. Cache performance of operating system and multiprogramming workloads. ACM Trans. Comput. Syst. (TOCS) 6, 4 (1988), 393–431.

[2] ANDERSON, T. E., BERSHAD, B. N., LAZOWSKA, E. D., AND LEVY, H. M. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. ACM Trans. Comput. Syst. 10, 1 (1992), 53–79.

[3] ARON, M., AND DRUSCHEL, P. Soft timers: efficient microsecond software timer support for network processing. ACM Trans. Comput. Syst. (TOCS) 18, 3 (2000), 197–228.

[4] BANGA, G., MOGUL, J. C., AND DRUSCHEL, P. A scalable and explicit event delivery mechanism for UNIX. In Proceedings of the annual conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), USENIX Association, pp. 19–19.

[5] BARHAM, P., DRAGOVIC, B., FRASER, K., HAND, S., HARRIS, T., HO, A., NEUGEBAUER, R., PRATT, I., AND WARFIELD, A. Xen and the art of virtualization. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP) (2003), pp. 164–177.

[6] BERSHAD, B. N., ANDERSON, T. E., LAZOWSKA, E. D., AND LEVY, H. M. User-level interprocess communication for shared memory multiprocessors. ACM Trans. Comput. Syst. 9 (May 1991), 175–198.

[7] BHATTACHARYA, S., PRATT, S., PULAVARTY, B., AND MORGAN, J. Asynchronous I/O support in Linux 2.5. In Proceedings of the Ottawa Linux Symposium (2003), pp. 371–386.

[8] BROWN, Z. Asynchronous system calls. In Proceedings of the Ottawa Linux Symposium (OLS) (2007), pp. 81–85.

[9] CHAKRABORTY, K., WELLS, P. M., AND SOHI, G. S. Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006), pp. 283–292.

[10] DRUSCHEL, P., AND BANGA, G. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proceedings of the 2nd USENIX Symposium on Operating Systems Design and Implementation (OSDI) (1996), pp. 261–275.

[11] ELMELEEGY, K., CHANDA, A., COX, A. L., AND ZWAENEPOEL, W. Lazy asynchronous I/O for event-driven servers. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2004), pp. 21–21.

[12] FITZPATRICK, B. Distributed Caching with Memcached. Linux Journal (2004).

[13] GAMMO, L., BRECHT, T., SHUKLA, A., AND PARIAG, D. Comparing and evaluating epoll, select, and poll event mechanisms. In Proceedings of the 6th Annual Ottawa Linux Symposium (2004), pp. 215–225.

[14] KROHN, M., KOHLER, E., AND KAASHOEK, M. F. Events can make sense. In 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference (Berkeley, CA, USA, 2007), USENIX Association, pp. 7:1–7:14.

[15] LARUS, J., AND PARKES, M. Using Cohort-Scheduling to Enhance Server Performance. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC) (2002), pp. 103–114.

[16] LEMON, J. Kqueue: A generic and scalable event notification facility. In Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference (Berkeley, CA, USA, 2001), USENIX Association, pp. 141–153.

[17] MOGUL, J. C., AND BORG, A. The Effect of Context Switches on Cache Performance. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (1991), pp. 75–84.

[18] MOGUL, J. C., MUDIGONDA, J., BINKERT, N., RANGANATHAN, P., AND TALWAR, V. Using asymmetric single-ISA CMPs to save energy on operating systems. IEEE Micro 28, 3 (2008), 26–41.

[19] MOSBERGER, D., AND JIN, T. httperf – A Tool for Measuring Web Server Performance. SIGMETRICS Perform. Eval. Rev. 26 (December 1998), 31–37.

[20] NELLANS, D., BALASUBRAMONIAN, R., AND BRUNVAND, E. OS execution on multi-cores: is out-sourcing worthwhile? SIGOPS Oper. Syst. Rev. 43, 2 (2009), 104–105.

[21] NELLANS, D., SUDAN, K., BRUNVAND, E., AND BALASUBRAMONIAN, R. Improving Server Performance on Multi-Cores via Selective Off-loading of OS Functionality. In Sixth Annual Workshop on the Interaction between Operating Systems and Computer Architecture (WIOSCA) (2010), pp. 13–20.

[22] PAI, V. S., DRUSCHEL, P., AND ZWAENEPOEL, W. Flash: an efficient and portable web server. In Proceedings of the annual conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1999), USENIX Association, pp. 15–15.

[23] PARIAG, D., BRECHT, T., HARJI, A., BUHR, P., SHUKLA, A., AND CHERITON, D. R. Comparing the performance of Web server architectures. In Proceedings of the 2nd European Conference on Computer Systems (EuroSys) (2007), pp. 231–243.

[24] PROVOS, N. libevent – An Event Notification Library. http://www.monkey.org/~provos/libevent.

[25] RAJAGOPALAN, M., DEBRAY, S. K., HILTUNEN, M. A., AND SCHLICHTING, R. D. Cassyopia: compiler assisted system optimization. In Proceedings of the 9th conference on Hot Topics in Operating Systems (HotOS) (2003), pp. 18–18.

[26] REESE, W. Nginx: the High-Performance Web Server and Reverse Proxy. Linux Journal (2008).

[27] SCHROEDER, B., WIERMAN, A., AND HARCHOL-BALTER, M. Open Versus Closed: A Cautionary Tale. In Proceedings of the 3rd conference on Networked Systems Design & Implementation – Volume 3 (Berkeley, CA, USA, 2006), NSDI'06, USENIX Association, pp. 18–18.

[28] SOARES, L., AND STUMM, M. FlexSC: Flexible System Call Scheduling with Exception-Less System Calls. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2010), pp. 33–46.

[29] VMWARE. VMware Virtual Machine Interface Specification. http://www.vmware.com/pdf/vmi_specs.pdf.

[30] WELSH, M., CULLER, D., AND BREWER, E. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP) (2001), pp. 230–243.

[31] WENTZLAFF, D., AND AGARWAL, A. Factored Operating Systems (fos): The Case for a Scalable Operating System for Multicores. SIGOPS Oper. Syst. Rev. 43, 2 (2009), 76–85.

[32] WICKIZER, S. B., CHEN, H., CHEN, R., MAO, Y., KAASHOEK, F., MORRIS, R., PESTEREV, A., STEIN, L., WU, M., DAI, Y., ZHANG, Y., AND ZHANG, Z. Corey: An operating system for many cores. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2008).

[33] ZELDOVICH, N., YIP, A., DABEK, F., MORRIS, R., MAZIÈRES, D., AND KAASHOEK, F. Multiprocessor support for event-driven programs. In Proceedings of the 2003 USENIX Annual Technical Conference (USENIX) (June 2003).