Cryptographic Device Support for Freebsd Samuel J
Total Page:16
File Type:pdf, Size:1020Kb
Cryptographic Device Support for FreeBSD Samuel J. Leffler Errno Consulting [email protected] ABSTRACT FreeBSD recently adopted the OpenBSD Cryptographic Framework [Keromytis et al, 2003]. In doing so it was necessary to convert the core framework to function correctly in a fully-preemp- tive/multiprocessor operating system environment. In addition several issues with the basic design were found to cause significant performance loss. After addressing these issues we found that FreeBSD outperformed OpenBSD on identical hardware by as much as 100% in tests that exercise only the cryptographic framework. These optimizations result in similar performance improvements for facilities like IPsec that make heavy use of the cryptographic framework. We observed that FreeBSD’s Fast IPsec [Leffler, 2003] typically outperforms OpenBSD’s IPsec implementation [Miltchev et al, 2002] by more than 50% on identical hardware. We conclude that the OCF cryptographic API can be optimized and re-tuned to deliver substan- tially better performance than the original OCF implementation with large gains in both through- put and latency. Moreover these changes can be made with no impact on clients of the crypto- graphic framework: both user and kernel sofware designed for the original OCF is easily ported to the FreeBSD implementation of OCF. 1. Background and Introduction • The OCF was designed for a uniprocessor system Cryptographic transformations are an important com- without kernel preemption. The FreeBSD 5.0 ponent of security applications and protocols. operating system has fine-grained locking and the Because these operations are computationally expen- kernel is fully preemptive. This required a rewrite sive vendors have dev eloped products that accelerate of the core crypto functionality. the calculations and offload the work from the main • The I/O framework used on OpenBSD is different processor. The OpenBSD Cryptographic Framework than that used by FreeBSD. The crypto device (OCF) [Keromytis et al, 2003] was developed to pro- drivers required significant modification to deal vide a uniform interface to cryptographic resources. with these differences. It provides an in-kernel API to cryptographic • The OCF duplicates cryptographic support already resources and a device interface for user-level access present in FreeBSD for the KAME IPsec software. to hardware-accelerated cryptographic operations. In some cases the existing implementations are This functionality is critical for high performance superior to those provided by the OCF. Rather implementations of security protocols such as IPsec than duplicate this software each component was [Kent & Atkinson, 1998], SSL [Freier et al, 1996], evaluated and the best was chosen. TLS [Dierks & Allen, 1999], DNSSEC [Eastlake, 1999], ssh [Ylonen et al, 2002], and Kerberos [Kohl & After addressing these issues we looked at the perfor- Neuman, 1993]. Good cryptographic support is also mance and found that it was suboptimal. [Keromytis important for implementing encrypted secondary stor- et al, 2003] states the ‘‘OCF attains 95% of the theo- age [Gattaneo et al, 2001] and virtual memory retical peak device performance.’’ These results how- [Provos, 2000]. ev erare for relatively slow devices and fail to consider the overhead required to reach that performance. Fur- Recognizing the importance of this facility, FreeBSD thermore, the 95% figure is misleading in that it is recently adopted the OCF. Porting the software how- attained for a case that rarely occurs and where the ev errequired addressing several issues. majority of the overhead of the OCF is hidden by the raw computational cost. In analyzing the performance of OCF we found sev- 1) A core set of code that manages a registry of eral problems that contribute to suboptimal perfor- crypto device drivers, dispatches crypto operations mance for all operations: to drivers, and coordinates the return of results 1) The OCF requires two kernel thread context from drivers to the submitter. switches for each operation. On many systems the 2) Crypto device drivers that submit crypto opera- peak performance of the OCF is limited by the tions to hardware devices and return results to the rate at which the system can do context switches. crypto core. 2) The OCF forces all crypto operations to pass twice 3) The /dev/crypto pseudo-device driver that pro- through a single kernel thread. Combining the vides linkage between user-level software and the dispatch and return processing in a single thread core crypto support. increases latency. The core crypto support and the /dev/crypto driver 3) Crypto device drivers trade off throughput for are simple pieces of software. The crypto device latency. This has a detrimental effect on the per- drivers however represent a significant development formance of network protocols where latency is effort. Some aspects of cryptographic device hard- critical. ware that vendors refuse to disclose have been Furthermore it was observed that when crypto devices reverse-engineered and this work is embodied in these are overloaded the OCF reacts by discarding opera- drivers. In considering the integration of the OCF into tions instead of applying flow-control techniques. FreeBSD it was important to maintain the driver API This results in severe performance loss for network so that existing drivers could be easily reused and so protocols as this action is equivalent to discarding that ongoing development could be shared by both packets. OpenBSD and FreeBSD communities. With these problems corrected FreeBSD was found to The initial port of the OCF to FreeBSD 5.0 was rea- outperform OpenBSD on identical hardware by as sonably straightforward. Instead of using processor much as 100% in tests that exercise only the crypto- priority to insure critical code are not executed con- graphic framework. The overhead of using the cryp- currently, locking primitives were used to guard con- tographic facilities was dramatically reduced raising current access to data structures. This resulted in cer- the peak performance and making more of the CPU tain constructs being recast and some code being available for non-cryptographic work. These opti- rewritten. For example, the OCF blocks concurrent mizations result in similar performance improvements access to all its data structures by raising the proces- for facilities like IPsec that make heavy use of the sor priority to splimp. This effectively blocks all con- cryptographic framework. For example, FreeBSD’s current activity in the lower half of the kernel from re- Fast IPsec [Leffler, 2003] outperforms OpenBSD’s entering the code. In FreeBSD 5.0 however splimp IPsec implementation [Miltchev et al, 2002] by more does nothing; instead concurrency primitives such as a than 50% on identical hardware running netperf over mutex must be used. But simply replacing splimp a 3DES+SHA1 tunnel. usage with a single mutex does not work unless the mutex semantics are relaxed to permit recursive The remainder of this paper is organized as follows. acquisition. Instead each of the data structures used Section 2 describes the cryptographic framework and by the cryptographic framework were given their own discusses the effort needed to make it operate in a mutex and some code was reorganized to insure con- fully preemptive environment. Section 3 discusses sistent lock ordering is used to avoid deadlocks. The performance issues, starting with the tools used to end result is easier to understand and permits greater evaluate performance and progressing through each of concurrency. The only downside is that mutex primi- the issues that were identified in the software. Section tives hav emore overhead than simply manipulating 4 describes performance results for FreeBSD and the processor priority level. In practice this additional compares them to OpenBSD. Section 5 outlines the overhead is swamped by other costs and not critical to the status and availability of this work and talks about overall performance. future work. Section 6 giv esconclusions. 3. Performance Analysis 2. Porting the Cryptographic Framework to FreeBSD 5.0 This section briefly describes the tools and techniques used to evaluate performance and then discusses the The OCF is comprised of three components: performance problems that were identified. 3.1. Tools the peak processing capability of the hardware (mod- OpenBSD provides no statistics or other facilities for ulo the overhead of the device driver). In this case 8 understanding the performance of the OCF. The only kilobytes were processed in an average of 156 mil- tools for evaluating performance treat the system as a liseconds for a peak bandwidth of 410 Mb/s. Profiling ‘‘black box’’; e.g. the openssl tool from the OpenSSL adds a fixed overhead to each operation; reducing per- distribution. formance results by 2-8%. This facility is described in more detail in Section 4. We beg anby instrumenting the core software and each crypto device driver. Every error or failure is 3.2. Problems accounted for separately so problems can be deter- mined immediately. (This is especially important for The normal flow of crypto operations in the OCF is as understanding problems reported by naive users.) follows: Next, tools were developed to exercise the crypto 1) Client formulates a crypto operation and submits functionality. One such tool is the cryptotest pro- it through the ‘‘crypto_dispatch’’ routine. This gram that submits symmetric-key crypto operations routine places the request on a ‘‘dispatch queue’’ through the /dev/crypto device and verifies the and notifies a kernel thread. results. cryptotest measures performance over a 2) The kernel thread removes the operation from the range of operations and parameters and is a basic tool dispatch queue and calls ‘‘crypto_invoke’’ to hand for evaluating software changes and hardware config- the request off to the appropriate crypto device urations1. driver. Finally a profiling facility was added that timestamps 3) The crypto device driver processes the operation. crypto requests as they pass through the system. At Typically this is done by submitting one or more important points in the processing of crypto requests commands to the hardware device.