
Performance study of kernel TLS handshakes

Alexander Krizhanovsky, Ivan Koveshnikov
Tempesta Technologies, Inc., Seattle, USA
[email protected], [email protected]

Abstract

Tempesta TLS, a part of Tempesta FW, implements TLS handshakes in the Linux kernel and focuses on performance to filter out application layer DDoS attacks. We started development from a fork of the mbed TLS library, but it took significant effort to make it fast, so we ended up fully reworking the library. The design of Tempesta TLS was discussed in our previous work [17].

The main focus of this paper is to explore how much performance we can gain for the TLS handshake itself and for the whole application, using HTTPS as an example, by moving the TLS handshake into the kernel space. While performance optimization of the TLS handshake mathematics is still in progress, we observed that Tempesta TLS can establish 40-80% more TLS connections per second than OpenSSL/Nginx and provide up to x4 lower latency in some tests.

In this paper we study software optimizations of server-side full TLS 1.2 handshakes using ECDHE and ECDSA on the NIST P-256 curve, as well as abbreviated TLS 1.2 handshakes. We conclude the paper with the architecture and the user space API proposals for the Linux kernel upstream.

Keywords

TLS, Linux kernel, fast computations on elliptic curves

User space TLS handshakes

The TLS handshake is one of the slowest parts of an HTTPS transaction on a web accelerator or a load balancer. TLS 1.2 session resumption and the 0-RTT mode of TLS 1.3 solve the problem for returning users, but new users still place a significant load on a server. Moreover, some modes, e.g. RSA handshakes or TLS 1.2 session resumption, require more intensive computations on the server side than on the client side. These two factors make TLS handshakes an attractive target for asymmetric application level DDoS attacks [17].

Tempesta FW [18] is an open source hybrid of a web accelerator and an application level firewall embedded into the Linux TCP/IP stack to reach maximum performance and flexibility. The main target for Tempesta FW is to provide an open source alternative to proprietary solutions like F5 BIG-IP and cloud based software stacks like CloudFlare. Such solutions must efficiently process massive legitimate traffic as well as filter out various DDoS and web attacks.

Tempesta TLS [17] is a high performance TLS implementation and is part of the Tempesta FW project. Tempesta TLS is a kernel module and at the moment is tightly coupled with the main Tempesta FW module. However, in this paper we discuss the changes to the module required to include it into the main Linux kernel tree.

In this work we evaluate how much performance we can gain by moving the TLS handshake into the kernel space. While our development work is still in progress, we see that Tempesta TLS establishes 40-80% more TLS connections per second in comparison with OpenSSL/Nginx. We also cover modern research and CPU technologies which are still not employed in TLS libraries such as OpenSSL or WolfSSL, but can significantly improve TLS handshake performance.

Besides higher performance, the kernel TLS handshake implementation allows the management of private keys and certificates to be separated from a web accelerator worker process. In this scheme the worker process doesn't have access to the security sensitive data stored in the kernel space. A separate, privileged process can be introduced to load and manage TLS keys and certificates in the Linux kernel. This is a best practice for large secure web clusters [8], but it is still unavailable for small installations without loss of performance.

KPTI impact on performance

KPTI (Kernel Page Table Isolation) is a mitigation technology against the recent Meltdown [27] vulnerability. Two sets of address space page tables are maintained in the kernel. One is "full" and contains both the kernel and the user space addresses. The other contains user space addresses and only a minimal set of kernel-space mappings. Every syscall or interrupt switches the minimal user space page table to the "full" kernel copy and switches it back on returning to the user space [16]. This causes a notable overhead for syscall-heavy and interrupt-heavy workloads [22, 30].

An in-kernel TLS handshake implementation does not need to switch between the kernel and user spaces, so KPTI is not involved. Our goal was to see how much the performance of a user-space HTTPS server suffers from KPTI. We expected a high KPTI impact on performance due to many network I/O system calls and a lot of memory allocations. But our measurements showed less than a 4% performance drop on a system with security mitigations enabled.

TLS   w/o KPTI   w/ KPTI   Delta
1.2   8'735      8'401     -3.82%
1.3   7'734      7'473     -3.37%

A TLS handshake involves not so many network I/O system calls, so the performance impact was not very high. All the hottest functions are related to cryptographic operations and memory allocations; network I/O is not very high. Glibc caches memory allocations to reduce the number of system calls. Thus, the overall kernel overhead was not high enough to cause a severe performance gap. Network-related functions are also not noticeable in the perf report since no HTTP message exchange happens in this scenario. Below is the CPU utilisation by the hottest functions.

Functions                                 Overhead
libcrypto.so:                             30.7%
  ecp_nistz256_mul_montx,
  ecp_nistz256_sqr_montx,
  sha256_block_data_order_avx2,
  ecp_nistz256_avx2_gather_w7,
  OPENSSL_cleanse,
  ecp_nistz256_ord_sqr_montx,
  ecp_nistz256_point_doublex,
  ecp_nistz256_sub_fromx,
  ecp_nistz256_mul_by_2x,
  ecp_nistz256_point_addx,
  ecp_nistz256_point_add_affinex,
  aesni_ecb_encrypt,
  BN_num_bits_word,
  EVP_MD_CTX_reset
libc.so:                                  13.2%
  _int_malloc, _int_free, malloc,
  malloc_consolidate, cfree,
  __memmove_avx_unaligned_erms,
  __memset_avx2_unaligned_erms
kernel:                                   4.5%
  do_syscall_64, entry_SYSCALL_64,
  prepare_exit_to_usermode,
  syscall_return_via_sysret,
  common_interrupt
nginx: -                                  0%

Since the performance drop for the user-space HTTPS server is not high in this scenario, a user-space HTTPS server cannot benefit much from disabling KPTI.

The kernel and TLS versions impact on performance

As Tempesta TLS was initially based on Linux kernel 4.14 and will be ported to the next LTS release after Linux 5.7, we also compared TLS handshake performance on these kernel versions. All mitigations available in both kernels were switched on.

Kernel   Handshakes/s   95P Latency, ms
TLS 1.2 New sessions
4.14     7'624          498
5.7      7'284          466
         -4.5%          -6.4%
TLS 1.2 Session Resumption
4.14     19'203         246
5.7      17'452         112
         -9%            -54%
TLS 1.3 New sessions
4.14     7'147          315
5.7      6'811          466
         -4.7%          +47%
TLS 1.3 Session Resumption
4.14     6'472          287
5.7      6'183          342
         -4.4%          +19%

The 5.7 kernel in all tests makes 4.5-9% fewer handshakes per second than the 4.14 kernel. It is also interesting that the latency on TLS 1.2 is significantly lower on the 5.7 kernel, while TLS 1.3 shows the opposite picture.

As only handshake performance is evaluated, no application data is sent, and the RTT between our servers is too small for a real life scenario, TLS 1.3 can't benefit from its 0-RTT capability in this test. Moreover, TLS 1.3 handshake performance is always about 7% lower than TLS 1.2 in our tests. This happens because TLS 1.3 performs more encryption/decryption and hashing operations. Also, TLS 1.3 doesn't show a significant performance boost on session resumption; instead it even loses 9%. Unlike in TLS 1.2, the master key is not directly reused in TLS 1.3; instead a unique shared secret is generated and combined with the master key [35]. This performance trade-off saves a TLS 1.3 connection from master key compromise.
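Part of the kernel overhead measured above comes from syscall entry and exit (the do_syscall_64 and entry_SYSCALL_64 samples in the profile), which is exactly where KPTI adds its page-table switch. The per-syscall cost can be estimated with a trivial micro-benchmark (our own sketch, not one of the paper's tools); syscall(SYS_getpid) is used because a raw syscall is never cached by libc:

```c
#define _GNU_SOURCE
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Average cost of a minimal system call, in nanoseconds.
 * Comparing the result on a mitigated kernel and on one booted with
 * mitigations=off exposes the per-entry page-table switch overhead. */
double syscall_cost_ns(long iters)
{
	struct timespec a, b;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (long i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* raw syscall, bypasses libc caching */
	clock_gettime(CLOCK_MONOTONIC, &b);

	return ((b.tv_sec - a.tv_sec) * 1e9
		+ (b.tv_nsec - a.tv_nsec)) / iters;
}
```

Such a number is only a lower bound for the real workload, since interrupts also pay the KPTI switch.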
FPU state manipulations

The Linux kernel provides kTLS [34] acceleration of TLS AES-GCM encryption and decryption: sendfile() can now transfer encrypted TLS payload without going to the user space for encryption. The implementation uses x86-64 CPU extensions involving the FPU, which is responsible for SIMD CPU extensions and which is usually unavailable to Linux kernel code. To use the extensions, the FPU state must be saved with the kernel_fpu_begin() function and then restored with kernel_fpu_end().

The FPU state manipulations aren't cheap. In our previous work [17] we evaluated the cost of the FPU manipulations using a micro-benchmark [19]: copying 1500 bytes with the proper FPU state saving and restoring takes x4.78 more time than the raw copying.

SRBDS impact on performance

Mitigation against the special register buffer data sampling (SRBDS) attack [12] can be applied only as a microcode update. The mitigation affects the performance of the RDRAND processor operation, which is used in handshake processing. The performance impact is higher for the TLS 1.3 protocol, telling us that the RDRAND operation is used in TLS 1.3 more frequently than in TLS 1.2.

TLS   w/o SRBDS   w/ SRBDS   Delta
1.2   8'735       8'281      -5.20%
1.3   7'734       6'605      -14.60%

Both the user-space and kernel-space TLS handshakes suffer from the SRBDS mitigation; unlike KPTI, there is no possibility to avoid the extra overhead by moving code to the kernel.

Compare the kernel- and user-space TLS handshakes

To compare in-kernel and user-space TLS handshake performance, it is required to benchmark the basic cryptographic operations first. For elliptic curve cryptography they are ECDSA, the Elliptic Curve Digital Signature Algorithm, and ECDHE, the Elliptic-curve Diffie-Hellman key agreement protocol.

                  ECDSA, op/s   ECDHE, op/s
OpenSSL 1.1.1d    36'472        16'619
WolfSSL*          41'527        55'548
Tempesta TLS**    27'261        6'690

* - before the non constant-time fixed point multiplication was fixed [37].
** - including ephemeral keys generation.

In the benchmarks, OpenSSL and WolfSSL measure ECDSA signing and ECDHE shared secret generation only. Meantime, for Tempesta TLS we measure the full operations, including the ephemeral keys generation, which for ECDHE involves an additional point multiplication, the slowest operation. However, despite multiple optimisations in the crypto layer (described in the sections below) and heavier computations in the benchmark, we still get lower performance than OpenSSL and WolfSSL. We reference WolfSSL here since we borrow some code from this library as well.

                   RPS      Latency, ms
                            Avg    95P    Max
1 CPU VM: TLS 1.2 New sessions
Nginx/OpenSSL      1'545    1340   1643   2154
Tempesta FW        2'180    307    757    2941
                   +41%     -77%   -53%   +36%
2 CPU bare-metal: TLS 1.2 New sessions
Nginx/OpenSSL      7'426    418    467    12681
Tempesta FW        7'921    622    1280   3460
                   +6%      +48%   +170%  -72%
2 CPU bare-metal: TLS 1.2 Session resumption
Nginx/OpenSSL      19'203   74     246    925
Tempesta FW        35'263   77     268    28262
                   +83%     +4%    +9%    +3000%

In-kernel handshakes allow serving more requests per second in all test cases. While we saw a 40-80% performance improvement for the benchmarks in a virtual environment and for session resumption on bare-metal, we observed only a 6% performance improvement for new TLS sessions in a bare-metal environment. We believe that Tempesta TLS shows only a 6% improvement in the bare-metal case due to its slower cryptographic routines. During the tests we had no hardware with virtual APIC support, and the TLS benchmarks produce many small network packets, which are a known problem [14] for current virtualization solutions.

In the virtual environment the average and 95th percentile latencies are 50-70% lower for in-kernel handshakes. The situation is the opposite in the bare-metal environment: Nginx/OpenSSL always delivers lower average and 95th percentile latency, 50-170% lower for new sessions and only 4-9% lower for session resumption.

In some tests we observed huge tail latencies for the in-kernel TLS handshakes, up to 3000% larger values in the worst case. These tail latency spikes are the other side of the coin of Tempesta TLS processing in SoftIRQ: in the virtual environment there are fewer VM-exit events for Tempesta TLS, so we see better performance, but we may drop more packets under heavy load on bare-metal setups. We are investigating possible ways to fix the problem [24].

TLS handshake in a full HTTPS transaction

Another subject of the research is how big the TLS handshake overhead is in a short HTTPS transaction. Long living connections cannot benefit much from handshake optimisations since the handshake takes a relatively small part of the communication process. Short living connections show the opposite picture: the handshake overhead is big enough to dominate the communication process.

To prove that HTTPS transaction performance depends on handshake processing, we compared server performance in scenarios where a client opens and immediately drops a TLS connection right after the TLS connection was established (no data transfer), right after a 1KB server response transmission, or right after a 10KB server response transmission. In all the scenarios the client sends an HTTP/1.1 request with the usual headers set of popular browsers. Firstly we used unencrypted HTTP connections, then HTTPS. In order to reduce HTTP-related processing and evaluate only the TLS handshake overhead, the server was configured to serve from the web cache.

                                          Handshakes/s   Resp 1Kb/s   Resp 10Kb/s
Plain HTTP                                N/A            77'109       55'060
HTTPS with TLS 1.2 (new sessions)         7'225          6'402        6'238
HTTPS with TLS 1.2 (session resumption)   17'274         13'472       12'630

The TLS handshake overhead is high enough to reduce the performance of encrypted short living connections by x12 for 1KB responses, and by x8.5 for 10KB responses. TLS session resumption reduces the difference by x2, but the handshake still costs a lot. While on unencrypted HTTP connections increasing the response size lowers RPS by 28%, the effect is much smaller on TLS connections: 2.5% for the full handshake and 6.25% for the abbreviated one.

Performance tools and environments

In our performance evaluation of TLS handshakes we used 2 benchmark tools and 2 different environments.

Benchmarking tools

To evaluate the raw speed of TLS handshakes without extra logic, we used the tls-perf [33] tool, which just establishes and immediately drops many TLS connections in parallel. The tool can establish full or abbreviated TLS 1.2 or 1.3 handshakes, and different ciphersuites can be specified for the tests. It is a regular multi-threaded program using epoll for network I/O and OpenSSL for all the cryptographic logic.

We used wrk [38] to evaluate the performance of full HTTPS transactions, involving a full or abbreviated TLS handshake, one HTTP request, and one response.

SUT configurations

We used 2 environments for the system under test. The first one is a 1 vCPU KVM virtual machine running on a host system with an Intel Core i7-6500U CPU (no vAPIC). The virtio_net network driver was used. The virtual machine ran Tempesta FW or Nginx/OpenSSL, while the benchmarking tools were run from the host system.

The second SUT environment is 2 bare-metal servers with Intel Xeon E3-1240v5 CPUs and Mellanox ConnectX-2 Ethernet 10Gbps network adapters. The average round trip time between the servers is 0.06ms.

We used the following sysctl settings in all the performance tests to avoid TCP connection hash table pollution with TIME-WAIT connections:

net.ipv4.tcp_max_tw_buckets=32
net.ipv4.tcp_max_orphans=32
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_fin_timeout=1

In addition to these settings, Tempesta FW also sets

net.core.netdev_max_backlog=10000
net.core.somaxconn=131072
net.ipv4.tcp_max_syn_backlog=131072

Since the load distribution in a TLS handshake is not equal and the client needs more computations than the server, we had to disable some server CPU cores to stress the server. We limited CPUs to the first physical core only (2 hyperthreads) and switched off the rest of the CPUs to avoid side effects on the performance results:

cd /sys/devices/system/cpu/
for cpu in cpu[2-7]*/online; do
        echo 0 >"$cpu"
done

The baseline

Nginx 1.19.1 and OpenSSL 1.1.1d were used as the reference for our evaluations. Tempesta FW is a reverse proxy, so we focused on server side performance. In TLS handshakes elliptic curves significantly outperform RSA on the server side, so we chose to concentrate on elliptic curves only. RSA and the NIST curves P-256, P-384, and P-521 are the only ones allowed for CA certificates [4]. The NIST curves P-384 and P-521 seem not widely used: P-384 is not optimized in OpenSSL at all, and P-521 is even not recommended by IANA [11]. Thus, NIST secp256r1 is used in the study.

The most recent stable kernel version was used - 5.7. In most tests all security mitigations were applied (including microcode updates). The only exception is the tests for mitigations performance evaluation, where mitigations were disabled by the mitigations=off kernel parameter and microcode updates were disabled as well.

The fork of mbed TLS

Originally, Tempesta TLS was forked from the GPLv2 version of mbed TLS 2.8.0 [18, 17]. However, it was significantly reworked to make the code fast, and at the moment only PKI code is left from the original mbed TLS code. The main changes are:

• Zero-copy I/O [17];
• Awareness of the current TCP congestion and send windows [17];
• Using the Linux native crypto API for symmetric ciphers and hashes [17];
• Memory allocations and MPI overheads were removed from the hot paths.

The first three points were addressed in our previous work [17]. While not all possible performance optimizations are finished yet, we have improved the performance of the original mbed TLS code by 40 times. The table below shows tls-perf [33] average results for the original port of mbed TLS [18] and the current Tempesta TLS code. Both implementations were run in a single CPU VM.

Implementation   handshakes/s   latency (ms)
mbed TLS         51             2'566
Tempesta TLS     2'180          307

Cryptographic routines

The previous sections showed that while the raw performance of the Tempesta TLS cryptographic routines is still lower than for OpenSSL, Tempesta FW can establish more TLS connections per second than Nginx with OpenSSL. The reasons for the better performance of the whole TLS handshake are:

• no memory allocations at run time;
• no user/kernel space context switches;
• more efficient network I/O without copies and extra queueing [17];
• zero-copy TLS handshake state machine [17].

This section describes the performance improvements in the Tempesta TLS cryptography and discusses further work directions.

The big integers overhead

After the mbed TLS code integration with the Linux TCP/IP stack and the main Tempesta FW module [17], we came to the following profile for TLS handshakes:

12.12% memset_erms
10.69% ecp_mod_p256
 7.10% __kmalloc
 5.58% kfree
 5.08% memcpy_erms
 4.99% mpi_mul_hlp
 4.48% ttls_mpi_copy
 4.44% ttls_mpi_cmp_abs
 3.91% ttls_mpi_sub_abs
 3.68% ttls_mpi_cmp_mpi
 3.25% mpi_sub_hlp
 2.71% ttls_mpi_free
 2.62% ttls_mpi_shift_r
 2.55% ttls_mpi_mul_mpi
 2.52% ___cache_free
 1.81% ttls_mpi_bitlen
 1.69% ttls_mpi_grow.part.0
 1.21% ttls_mpi_shift_l
 0.88% ttls_mpi_add_abs
 0.88% ecp_modp
 0.83% ttls_mpi_lset

The memset_erms() function zeroes the security sensitive data. This call was easily optimized by using the AVX2 version of memset() [17] and calling it only once at the end of a full cryptographic operation.

The functions with the ttls_mpi prefix are big integer wrappers, inherited from mbed TLS. For the NIST P-256 elliptic curve and a 64-bit machine a big integer is an array of 4 long integers (in other words, 4 limbs). In a simple case, a big integer wrapper can be defined as

typedef struct {
        short sign;
        unsigned short limbs;
        unsigned short used;
        unsigned long *data;
} BigInteger;

The wrapper handles the sign of the big integer, the total number of allocated limbs, the number of currently used limbs, and a pointer to the actual data storing the big integer. However, since mbed TLS uses the same big integer abstraction layer for other algorithms using much bigger integers, the layer employs a lot of conditional statements and, more importantly, implicit memory allocations. For example, if two 4-limb big integers X and Y are multiplied and the result is stored in X, then X can be reallocated by the multiplication function to store the 8-limb result. Also, in this example the multiplication is performed in two nested loops, while the exact number of iterations for each loop is known at compile time and could be efficiently unrolled.

The OpenSSL [28] and WolfSSL [36] libraries use specialized implementations of all elliptic curve routines. Just moving from the generic big integer implementation to specialized routines improved the ECDSA signing time by 12% and the ECDHE key pair and shared secret generation times by 19% in total.

The calls __kmalloc(), kfree(), and ___cache_free() are responsible for implicit dynamic memory allocations on the big integers layer. Mbed TLS allocates temporary big integers from the heap as well, which requires memory zeroing on each freeing. This prevents exposure of security sensitive data if the system allocator returns the freed memory chunk in a subsequent allocation as is.

Tempesta TLS allocates all temporary big integers on the stack as plain arrays of 64-bit integers, without big integer wrappers. All long-living big integers are allocated at once at the beginning of the TLS handshake process. There are no memory allocations among the top 30 hottest functions of the current Tempesta TLS profile.

Side channel attacks restrictions

As [29], we also focus on server side software which is physically not accessible to an attacker, so resistance of the cryptographic computations against power analysis attacks is out of the scope of our work.

Mbed TLS [26] randomizes projective coordinates [5], employs the odd-only comb method [7] and constant time computations, though not everywhere. Non-constant time algorithms, e.g. modular inversion, are coupled with randomization. Neither OpenSSL [28] nor WolfSSL [36] use randomization; instead the libraries use constant time computations in all security crucial places [10]. (The WolfSSL team has fixed the non-constant time elliptic curve fixed point multiplication after our report [37].)

At the moment Tempesta TLS uses the same mix of constant time and non-constant time algorithms with randomization as mbed TLS. However, we consider replacing the constant time comb method for point multiplications with recent non-constant time algorithms (e.g. [6] or [29]) coupled with randomization to achieve better performance.

Randomization

Constant time algorithms are expensive. For example, the elliptic curve fixed point multiplication in OpenSSL and WolfSSL must scan the 150KB precomputed table 36 times [10]. The RDRAND CPU instruction provides a very fast random number generator, so in our case we can replace the full table scans with a point randomization, which implies much lower overhead: one squaring and one multiplication. Moreover, with this approach we can use a much larger precomputed table to get even better performance.

Unfortunately, the mitigation of the recent SRBDS vulnerability [12] leaves only 3% of the original performance of the RDRAND instruction [23]. Hopefully, the newest Ice Lake Intel CPU family fixes the SRBDS vulnerability [13]. Since Tempesta TLS extensively uses randomization, the Tempesta FW system prerequisites require the newest CPUs not affected by SRBDS. These requirements significantly limit the system applicability, so we plan to introduce two versions of the code: one extensively using randomization for newer CPUs, and a second, mbed TLS-like approach mixing randomization with constant time algorithms for legacy hardware. A new Linux kernel configuration variable should control which version of the code is compiled.

Point multiplication

The original mbed TLS code uses the Lim and Lee [25] method to multiply a scalar by fixed and unknown elliptic curve points. The only difference between multiplications of fixed and unknown points is in the size of the precomputed tables.

We wrote a designated function for the fixed point multiplication. The precomputed table is built at compile time. We use comb window size 7, which for the 256-bit NIST P-256 curve leads to ceil(256/7) = 37 iterations in the main loop. To avoid point doublings in the main loop completely, we precompute 37 tables. This leads us to a very similar algorithm and precomputed table as used by OpenSSL [10] and WolfSSL [36].

The whole size of all the precomputed tables is about 152KB, which is significantly larger than the L1d cache size on modern CPUs. Since a secret value is used to access the tables in all the ECDSA and ECDHE computations, cache misses caused by directly accessing the tables may reveal bits of the secret value. To prevent this side channel attack, OpenSSL [10] scans the whole table on each iteration of the point multiplication loop, 36 times in total.
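The full-scan countermeasure can be illustrated with a small sketch (our own illustration, not code from OpenSSL or WolfSSL): every table entry is read regardless of the secret index, and the wanted entry is selected with an arithmetic mask, so the memory access pattern does not depend on the secret.

```c
#include <stdint.h>
#include <string.h>

/* Constant-time lookup of table[idx] for 4-limb (256-bit) entries.
 * All n entries are touched on every call, and the selection uses a
 * bit mask instead of a data-dependent branch or index. */
void ct_table_select(uint64_t out[4], const uint64_t (*table)[4],
		     size_t n, size_t idx)
{
	memset(out, 0, 4 * sizeof(uint64_t));
	for (size_t i = 0; i < n; i++) {
		/* all-ones mask when i == idx, all-zeros otherwise */
		uint64_t mask = (uint64_t)0 - (uint64_t)(i == idx);
		for (int j = 0; j < 4; j++)
			out[j] |= table[i][j] & mask;
	}
}
```

The cost is linear in the table size on every lookup, which is exactly why the randomization approach described above, with direct indexing into a randomized table, can afford much larger tables.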

multiplication loop, 36 times in total. WolfSSL also recently multiplications replaced direct access of the table with full scan [37]. The 250 WolfSSL developers reported 5-17% performance degrada- tion introduced by the change. 200 Instead of scanning the whole precomputed tables, we mul- tiply the secret value by a random number before the point 100 1000 10000 multiplication loop and use direct access to the tables. Since table size (KB) the scalar is randomized, the different access time doesn’t ex- pose bits of the secret value. We use RDRAND (with prereq- Figure 1: Comparison of R-prime and multi-table comb uisites of modern CPUs not affected by SRBDS) to quickly methods generate random numbers for the multiplication. It seems the Lim and Lee [25] algorithm with multiple precomputed tables is still one of the most efficient algo- Unknown point multiplication must pre-compute all values rithms for fixed point multiplication. The algorithm was used in the run time, so the tables must be small. Thus, Lim and with maximum number of precomputed tables, such that only Lee remain the best choice for unknown point multiplication. point additions are required and there are no point doubling From the other side, the fixed point multiplication, having in the main loop. We use mixed point addition formula with fast random generator, can benefit from large precomputed 8M + 3S = 10.4M complexity (expressed in required num- tables. Using constant-time algorithms with full table scan M S ber of scalar multiplications and squarings having that makes sense only for architectures having no fast or secure 0.8 squaring takes of time of multiplication as in [29]), i.e. random number generator. We’re going to evaluate m0m1 10.4M ∗ 36 = 374.4M the whole point multiplication costs and one more non-constant time point multiplication algo- using 152KB of storage. 
rithm proposed by Mohamed et al [6], with different sizes of We compared the Lim and Lee multi-table method with the precomputed tables. recent research by Robert et al [29], which proposes the novel R-prime algorithm. It’s not stated explicitly in their paper, but we believe the single table variation of the Lim and Lee Modular inversion algorithm was used in all their comparisons. We evaluated the required memory size and computation cost (in number Tempesta TLS, mbed TLS, OpenSSL, and WolfSSL use Ja- of required multiplications) for several Lim and Lee window cobian coordinates, so a modular inversion is used to convert values and maximum number of tables: coordinates from Affine to Jacobian and back. The modular w size multiplications storage (KB) inversion is the second most hot spot after the point multipli- 6 452 87 cation [10]. OpenSSL and WolfSSL use SIMD implementa- 8 342 263 tions of the algorithm based on the Fermat’s little theorem, 10 275 848 i.e. using 255 modular squarings and 13 modular multiplica- 11 251 1548 tions [10]. While the Fermat’s little theorem method is com- 13 214 5278 putationally heavy, it works in constant time. 15 187 18432 The original mbed TLS code for the modular inversion uses We compared the numbers with the results for R-prime (ta- non-constant time algorithm, so the implementation must use ble XI for NIST curve P-256 [29]) using the Figure 1. multiplication by a random number to blind a secret value. The figure shows that the smallest table sizes, including our Bernstein and Yang recently proposed a constant time 152KB table, and the largest table sizes are more efficient for modular inversion algorithm [2], significantly faster than the R-prime, but 1548KB table size is bit more efficient for the Fermat’s method. As Tempesta TLS inherits the randomiza- multi-table comb. 
However, it’s more efficient to use Lim and tion for the secret value, we can use non-constant time algo- Lee method with bigger window and smaller number of tables rithm for modular inversion. The original Bernstein and Yang in case of memory constrains. In the Robert et al [29] com- algorithm always executes 741 division steps to inverse a 256- parisons of R-prime with single-table Lim and Lee, Lim and bit integer. In our version the algorithm executes 560 itera- Lee is more efficient on small storage sizes. Thus, R-prime tions in average and also saves several big integer operations should be used only with the table sizes larger than 1548KB, at the end of the algorithm. With moving to the new modular all lower table sizes are better handled by multi-table Lim and inversion algorithm we improved performance of ECDSA for Lee. Both the algorithms work in constant time, but Robert et 56% and ECDHE for 18%. The modular inversion algorithm al [29] also introduce the non-constant time m0m1 algorithm, currently still uses big integers and can be optimized further which is faster than R-prime. by moving to specialized assembly routines. Modular reduction the NIST P-256 and x2559 elliptic curves. This minimal fea- At the moment modular multiplication and squaring domi- tures set limits the size of TLS handshakes code in kernel, nate in all the computations (18% and 11% of the whole TLS but it also limits acceptable TLS options and supports only handshake computations correspondingly). Both the opera- modern clients. tions are merged with modular reduction using FIPS 186- To avoid the restrictions and to support all flavours of TLS 4 D.2 algorithm. We inherited the FIPS modular reduction clients we should keep the user-space handshake processing method from the original mbed TLS implementation. Several as a fallback mechanism. 
optimization strategies were applied to the algorithm, includ- Today Tempesta TLS supports only the modern CPU fam- ing the final table-based subtractions and additions proposed ilies of x86-64 architecture to employ AVX, BMI, and ADX by Bos [3]. After all the optimization the most performance extensions for the maximum performance. There is no profiling samples hit the modular reduction part of the multi- generic implementation of some cryptographic routines. plication and squaring functions. Moreover, the ratio between However, there are other CPU architectures becoming more the execution time of both the functions is 0.9, which is sig- and more wide spread in the cloud and enterprise environ- nificantly more than usual [0.8, 0.86] ratio for the implemen- ments, e.g. ARM. There are also many Linux installations tations using Montgomery reduction. OpenSSL and Wolf- with old hardware. The much wider set of CPUs must be SSL use Montgomery reduction and we believe that moving supported in the upstream version of the TLS handshakes. to Montgomery reduction will significantly improve perfor- mance of modular arithmetics. Architecture overview Unlike current kTLS implementation, working in a process TLS handshakes for the Linux kernel context, TLS handshakes should be done in SoftIRQ con- We propose the kernel TLS handshakes for the inclusion into text, just like TCP handshakes. A separate context performs the main line of the Linux kernel. Current Tempesta TLS TLS handshakes on non-blocking sockets without extra con- implementation takes into account only requirements for ap- text switches [1] and a user-space application is notified only plication delivery controllers, but the generic implementation once, when a handshake completed. To speedup crypto- must take care about wider range of use cases. 
TLS handshakes for the Linux kernel

We propose the kernel TLS handshakes for inclusion into the main line of the Linux kernel. The current Tempesta TLS implementation takes into account only the requirements of application delivery controllers, but a generic implementation must take care of a wider range of use cases. The state of the proposal and the implementation discussions can be found in the corresponding task on GitHub [21].

The Linux kTLS [20] implements only the TLS data transfer part using symmetric cryptography. TLS handshakes were not considered to be a part of the kernel and are processed in the user space. A user-space application must open a TCP connection, complete a TLS handshake over it, and in the end the socket can be configured for kTLS by adding a session key via the setsockopt() system call. When these prerequisites are done, the application can use send()/receive()/sendfile() to send and receive data buffers, or use sendmsg()/recvmsg() to send TLS control messages [20].

With this workflow a server application runs the following set of operations:

1. A TCP handshake is processed in SoftIRQ context for a new incoming connection;
2. A new socket descriptor is passed into the user-space application and a TLS handshake is processed by the application;
3. A setsockopt() system call is used to configure a session key for the socket;
4. send()/receive() operations over the TLS socket are used as for a usual TCP socket, but the kernel transparently encrypts and decrypts all passed data.
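The session key in step 3 is passed as struct tls12_crypto_info_aes_gcm_128 from include/uapi/linux/tls.h. As a sketch of what the application hands to the kernel, the payload can be assembled byte-for-byte in Python; the constants and field order below follow the uapi headers, and a real program would then call sock.setsockopt(...) on the connected socket after finishing the handshake:

```python
import struct

# Constants from include/uapi/linux/tls.h, tcp.h and socket.h.
SOL_TCP, TCP_ULP = 6, 31
SOL_TLS = 282
TLS_TX, TLS_RX = 1, 2
TLS_1_2_VERSION = 0x0303
TLS_CIPHER_AES_GCM_128 = 51

def aes_gcm_128_crypto_info(iv, key, salt, rec_seq):
    """Pack struct tls12_crypto_info_aes_gcm_128 for setsockopt(SOL_TLS)."""
    assert (len(iv), len(key), len(salt), len(rec_seq)) == (8, 16, 4, 8)
    # struct tls_crypto_info { __u16 version; __u16 cipher_type; }
    # followed by iv[8], key[16], salt[4], rec_seq[8].
    return struct.pack('<HH8s16s4s8s',
                       TLS_1_2_VERSION, TLS_CIPHER_AES_GCM_128,
                       iv, key, salt, rec_seq)

ci = aes_gcm_128_crypto_info(b'\0' * 8, b'\1' * 16, b'\2' * 4, b'\3' * 8)
# The sequence from the list above would then be:
#   sock.setsockopt(SOL_TCP, TCP_ULP, b'tls')
#   sock.setsockopt(SOL_TLS, TLS_TX, ci)
assert len(ci) == 40
```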
Limitations

There is no intention to move the whole TLS handshake implementation into the kernel. For example, the whole OpenSSL code base is large, and many outdated features are supported there. Instead, we propose to move into the kernel only the most beneficial features for now: the TLS 1.2 and 1.3 server handshake state machine, including the abbreviated versions, and the ECDHE and ECDSA implementations over the NIST P-256 and x25519 elliptic curves. This minimal feature set limits the size of the TLS handshake code in the kernel, but it also limits the acceptable TLS options and supports only modern clients. To avoid the restrictions and to support all flavours of TLS clients, we should keep the user-space handshake processing as a fallback mechanism.

Today Tempesta TLS supports only the modern CPU families of the x86-64 architecture to employ the AVX, BMI, and ADX extensions for maximum performance. There is no generic implementation of some cryptographic routines. However, other CPU architectures, e.g. ARM, are becoming more and more widespread in cloud and enterprise environments, and there are also many Linux installations with old hardware. A much wider set of CPUs must be supported in the upstream version of the TLS handshakes.

Architecture overview

Unlike the current kTLS implementation, which works in a process context, TLS handshakes should be done in SoftIRQ context, just like TCP handshakes. A separate context performs TLS handshakes on non-blocking sockets without extra context switches [1], and a user-space application is notified only once, when a handshake is completed. To speed up cryptographic computations, SoftIRQ must acquire the FPU context on getting CPU time and release it on exit. This way, TLS handshakes can be processed in batches, eliminating extra FPU context switches. The FPU context save/restore operation was not used in SoftIRQ before and may affect the performance of applications aggressively using FPU registers, which is the majority of modern applications due to auto-vectorization. Thus the operations must be enabled only when the kernel is configured with TLS handshakes.

Heavy cryptographic computations in SoftIRQ may introduce large latencies [24], so designated per-CPU kernel threads can be a good alternative. Meanwhile, the work queue mechanism has too high overheads to be used for TLS handshakes.

The workflow for an application with kernel TLS handshake processing will be changed to:

1. A TCP handshake is processed in SoftIRQ context for a new incoming connection;
2. A TLS handshake is started in SoftIRQ for the same connection:
   (a) if it succeeds, a new socket descriptor with an established TLS session is returned to the user-space application as ready to read or write;
   (b) otherwise (the fallback scenario), if the kernel cannot complete the handshake, a new socket descriptor with only the TCP connection established is returned to the user space, and the TLS handshake is fully processed by the application;
3. The setsockopt() system call configures a session key for the socket to enable kTLS;
4. send()/receive() operations over the TLS socket are used as for current TCP sockets with kTLS enabled.

The usual requirements for TLS servers also include secret key and certificate management and the possibility to run different virtual servers on the same listen address. We propose to use the Kernel Key Retention Service [15] to store and manage keys for every virtual server.
The service already has access management and SELinux support, so it's possible to use a privileged process for sensitive data management, while the worker process can access neither the private key nor the session key while servicing remote user connections. However, such role separation between processes is not enforced, and developers are free to use a single process for both roles.

Virtual servers on the same listening address are differentiated via the Server Name Indication (SNI) extension, while the same listening address might be requested by multiple processes. In our proposal the SNI of a virtual server is a root node in a keyring [15]. Since keyrings allow access rights management, different processes and containers can update or revoke their own certificates and keys, but cannot access the certificates and keys of other processes and containers working on the same host.

Socket API

Before a TLS socket is created, it's required to load a certificate and a private key. The Kernel Key Retention Service provides the add_key(2) system call with the asymmetric-key(7) API, which allows loading X.509 certificates, including ECDSA keys.
Since the functions request_key() and request_key_rcu(), used for key search by the asymmetric keys system, can be called in a process context only, this interface can't be used to find a key during handshake processing. Thus, we should use the setsockopt(2) system call to extract the keys in the process context and place them into an alternate storage associated with the listening socket. The storage must provide a non-sleepable and fast API to be usable in SoftIRQ context and must be keyed by the SNI string.

The TLS implementation must register 2 new key types: tls_cert for TLS certificates and tls_priv for private keys. A keyring defines the pair of a certificate and the corresponding private key for a particular SNI.

The following code can be used to create a new keyring for the SNI example.com:

    key_serial_t my_sni_kr =
            add_key("keyring", "example.com",
                    NULL, 0, 0);

Loading the certificate and the private key for the SNI can be done with:

    key_serial_t cert =
            add_key("tls_cert", "tlscrt:",
                    cert_data, cert_len,
                    my_sni_kr);
    key_serial_t priv_key =
            add_key("tls_priv", "tlspk:",
                    priv_key_data, pk_len,
                    my_sni_kr);
After the keyring is populated, a normal TCP socket is created and bound to a listening port:

    int sd = socket(AF_INET6, SOCK_STREAM, 0);
    bind(sd, ...);

Next, a TLS context is created for the keyring my_sni_kr, and setsockopt(2) is called to set up the TLS context for the listening socket:

    struct tls12_crypto_info_aes_gcm_128 ci = {
            .versions = TLS_1_2_VERSION | TLS_1_3_VERSION,
            .tls_12_cipher_suites = ECDHE_ECDSA_AES128_GCM_SHA256
                                    | ECDHE_ECDSA_AES256_GCM_SHA384,
            .tls_13_ciphersuites = TLS_AES_128_GCM_SHA256
                                   | TLS_AES_256_GCM_SHA384,
            .ecurves = secp256r1 | x25519,
            .tls_keyring = my_sni_kr,
    };
    setsockopt(sd, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
    setsockopt(sd, SOL_TLS, TLS_HS, &ci, sizeof(ci));

The TLS handshake happens on a listening socket. Polling system calls, e.g. epoll(2), poll(2), and select(2), must return a TLS socket as ready when the TLS handshake is completed. If the kernel implementation lacks some necessary logic to finish the handshake, but there is no error on the TLS layer, then accept(2) returns a descriptor for a new TCP connection socket. The user-space application can pass the socket to a TLS library, which handles the TLS handshake. We introduce the new protocol family PF_TLS to check the status of the socket returned by accept(2) without extra getsockopt(2) calls:

    listen(sd, ...);
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int new_sd = accept(sd, (struct sockaddr *)&sa, &len);
    if (sa.sin_family == PF_TLS) {
            /* Ready to read()/write(). */
    } else {
            /* Fallback to user-space handshake. */
    }

If the kernel fails to establish the TLS handshake, then the user-space application must be able to read the ClientHello message from the socket descriptor, so that a regular TLS library, without any changes, can perform the TLS handshake on its own. It's guaranteed that the first read operation won't block and will contain the full ClientHello message. This also implies that the kernel TLS handshake mechanism cannot fall back to the user space at later handshake phases.

Now the socket is ready to read and send data with decryption and encryption on the kernel layer. The server can close the socket using the normal close(2) or shutdown(2) system calls. With the SHUT_RD shutdown(2) mode, any read operation from the user space ends with an error, but the socket still reads TLS alerts from the peer to properly close the TLS session.

Normally, multiple virtual servers may listen on the same address, and the target virtual host is chosen by the SNI value. A simple function can be used to match SNIs listed in the keyring one-by-one until a match is found.
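The SNI lookup from the last paragraph can be sketched as follows. The keyring is modeled here as a plain list of SNI strings (the root nodes of the per-virtual-server keyrings); walking real kernel keyring nodes is out of scope for this sketch. DNS names are case-insensitive, so both sides are normalized before comparison:

```python
def match_sni(keyring_snis, client_sni):
    """Return the first keyring entry matching the client's SNI, or None.

    keyring_snis models the SNI root nodes of the per-virtual-server
    keyrings as plain strings.  Comparison is case-insensitive and
    ignores a trailing dot, since DNS names are case-insensitive.
    """
    want = client_sni.rstrip('.').lower()
    for sni in keyring_snis:
        if sni.rstrip('.').lower() == want:
            return sni
    return None

assert match_sni(['example.com', 'example.org'], 'EXAMPLE.ORG') == 'example.org'
```

A linear scan is sufficient for a handful of virtual servers; an SNI-keyed hash table, as required for the SoftIRQ-side key storage above, is the natural next step when the number of virtual servers grows.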
Linux kTLS performance issues

The recent work [31] reveals several performance issues with the current kTLS [34] implementation, which are also applicable to Tempesta TLS since both implementations use the same Linux crypto API. In particular, the work discusses a possibility to use precomputations for the Karatsuba algorithm. The authors also mention memory issues with AES-GCM.

We studied the problem [32] of extra memory allocations and copies in the gcmaes_encrypt() Linux crypto API function. The following perf profile was collected for the transmission of a 19KB web page using kTLS:

    11.70%  _encrypt_by_8_new8
     5.21%  scatterwalk_copychunks
     2.38%  skb_release_data
     1.54%  get_page_from_freelist
     1.47%  free_hot_cold_page
     1.46%  __alloc_skb
     1.45%  __kmalloc
     1.31%  tfw_tls_encrypt
     1.26%  aesni_gcm_precomp_avx_gen2

The scatterwalk_copychunks() and __kmalloc() calls can be optimized out.

Linux kTLS works in a process context, so TLS record encryption and transmission happen in different contexts and at different points in time. TLS records formed in this way may exceed the size of the currently allowed TCP transmission, leading to extra delays for TLS decryption on the peer [9]. Tempesta TLS encrypts [17] data in the sk_write_xmit() callback, called by the Linux TCP/IP stack right at transmission time, when we know precisely how much data we can send.
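The effect described above can be modeled with a simple record-sizing function: if records are capped by what TCP may transmit right now, the peer never waits several round trips for the tail of a record before it can decrypt. The sizes and overheads below are a simplified model of TLS 1.2 AES-GCM records (5-byte header, 8-byte explicit nonce, 16-byte tag), not of any particular implementation:

```python
MAX_PLAINTEXT = 16384            # maximum TLS record payload, RFC 8446
TLS12_GCM_OVERHEAD = 5 + 8 + 16  # record header + explicit nonce + GCM tag

def plan_records(pending, send_allowance):
    """Split pending bytes into record payload sizes that fit into what
    TCP can transmit right now, mimicking encryption at transmission
    time.

    send_allowance models the currently allowed TCP transmission
    (congestion/send window); a record larger than it is split across
    several round trips and cannot be decrypted by the peer until its
    last byte arrives.
    """
    budget = min(MAX_PLAINTEXT + TLS12_GCM_OVERHEAD, send_allowance)
    payload = budget - TLS12_GCM_OVERHEAD
    records = []
    while pending > 0:
        n = min(pending, payload)
        records.append(n)
        pending -= n
    return records

# A 19KB page with a 4KB initial allowance: every record, including its
# overhead, fits the allowance instead of spanning round trips.
recs = plan_records(19 * 1024, 4096)
assert all(r + TLS12_GCM_OVERHEAD <= 4096 for r in recs)
assert sum(recs) == 19 * 1024
```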
kTLS may benefit from the same approach, removing unnecessary delays, especially at the beginning of a TLS connection.

Availability

Tempesta TLS is published under GPLv2 and is available as part of the Tempesta FW source code at https://github.com/tempesta-tech/tempesta.

References

[1] Benjamin, D. 2020. TLS 1.3 and TCP interactions. In IETF TLS mailing list. https://www.mail-archive.com/[email protected]/msg12569.html.
[2] Bernstein, D. J., and Yang, B.-Y. 2019. Fast constant-time gcd computation and modular inversion. IACR Transactions on Cryptographic Hardware and Embedded Systems, 340–398.
[3] Bos, J. W. 2010. High-performance modular multiplication on the Cell processor. In Arithmetic of Finite Fields, 7–24. Springer Berlin Heidelberg.
[4] Baseline Requirements Documents (SSL/TLS Server Certificates), section 6.1.5. https://cabforum.org/baseline-requirements-documents/.
[5] Coron, J.-S. 1999. Resistance against differential power analysis for elliptic curve cryptosystems. In CHES.
[6] Farah, N.; Hashim, M.; and Hutter, M. 2012. Improved fixed-base comb method for fast scalar multiplication. 342–359.
[7] Feng, M.; Zhu, B. B.; Xu, M.; and Li, S. 2005. Efficient comb elliptic curve multiplication methods resistant to power analysis. In IACR ePrint archive.
[8] Graham-Cumming, J. 2017. Incident report on memory leak caused by Cloudflare parser bug. In The Cloudflare Blog. https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/.
[9] Grigorik, I. 2013. Optimize TLS Record Size. In High Performance Browser Networking. https://hpbn.co/transport-layer-security-tls/#optimize-tls-record-size.
[10] Gueron, S., and Krasnov, V. 2014. Fast prime field elliptic-curve cryptography with 256-bit primes. Volume 5, 141–151.
[11] IANA. 2020. Transport Layer Security (TLS) Parameters, TLS Supported Groups. https://www.iana.org/assignments/tls-parameters/tls-parameters.xml#tls-parameters-8.
[12] Intel Corporation Developer Zone. Deep dive: Special register buffer data sampling. https://software.intel.com/security-software-guidance/insights/deep-dive-special-register-buffer-data-sampling.
[13] Intel Corporation Developer Zone. Processors affected: Special register buffer data sampling. https://software.intel.com/security-software-guidance/insights/processors-affected-special-register-buffer-data-sampling.
[14] Jain, A. S.; Duyck, A. H.; Sarangam, P.; and Jani, N. 2017. Story of Network Virtualization and its future in Software and Hardware. In Netdev 2.1. https://netdevconf.info/2.1/session.html?jain.
[15] Kernel key retention service. https://www.kernel.org/doc/html/latest/security/keys/core.html.
[16] Page Table Isolation (PTI). https://www.kernel.org/doc/html/latest/x86/pti.html.
[17] Krizhanovsky, A., and Koveshnikov, I. 2018. Kernel TLS handshakes for HTTPS DDoS mitigation. In Netdev 0x12. https://netdevconf.info/0x12/session.html?kernel-tls-handshakes-for-https-ddos-mitigation.
[18] Krizhanovsky, A. 2017. Kernel HTTP/TCP/IP stack for HTTP DDoS mitigation. In Netdev 2.1. https://www.netdevconf.org/2.1/session.html?krizhanovsky.
[19] Benchmark for the Linux kernel SIMD memory routines. In Open source benchmarks and code samples for the Tempesta Technologies blog. https://github.com/tempesta-tech/blog/tree/master/kstrings.
[20] Kernel TLS. https://www.kernel.org/doc/html/latest/networking/tls.html.
[21] Linux kernel TLS, Tempesta FW project task. https://github.com/tempesta-tech/tempesta/issues/1433.
[22] Larabel, M. 2020a. Looking at the Linux performance two years after Spectre/Meltdown mitigations. https://www.phoronix.com/scan.php?page=article&item=spectre-meltdown-2&num=10.
[23] Larabel, M. 2020b. RdRand Performance As Bad As 3% Original Speed With CrossTalk/SRBDS Mitigation. https://www.phoronix.com/scan.php?page=news_item&px=RdRand-3-Percent.
[24] Latency can increase for huge amount of TLS connections. https://github.com/tempesta-tech/tempesta/issues/1434.
[25] Lim, C. H., and Lee, P. J. 1994. More flexible exponentiation with precomputation. In Advances in Cryptology - CRYPTO '94, 95–107.
[26] The mbed TLS project. https://github.com/ARMmbed/mbedtls.
[27] Meltdown and Spectre. Vulnerabilities in modern computers leak passwords and sensitive data. https://meltdownattack.com/.
[28] OpenSSL TLS/SSL and crypto library. https://github.com/openssl/openssl.
[29] Robert, J.-M.; Negre, C.; and Plantard, T. 2019. Efficient fixed-base exponentiation and scalar multiplication based on a multiplicative splitting exponent recoding. In Journal of Cryptographic Engineering.
[30] Schwenke, A. 2018. MyISAM and KPTI Performance Implications From The Meltdown Fix. https://mariadb.org/myisam-table-scan-performance-kpti/.
[31] Szymanski, P., and Deval, M. 2020. TLS performance characterization on modern x86 CPUs. In Netdev 0x14. https://netdevconf.info/0x14/session.html?talk-TLS-performance-characterization-on-modern-x86-CPUs.
[32] TLS: further performance improvements and cleanups. https://github.com/tempesta-tech/tempesta/issues/1064.
[33] TLS handshakes benchmarking tool. https://github.com/tempesta-tech/tls-perf.
[34] Watson, D. 2016. Kernel TLS (Transport Layer Security) Socket. In Netdev 1.2. http://www.netdevconf.org/1.2/session.html?dave-watson.
[35] TLS 1.3 Performance Part 1: Resumption. https://www.wolfssl.com/tls-1-3-performance-resumption/.
[36] wolfSSL (formerly CyaSSL) implementation of TLS/SSL. https://github.com/wolfSSL/wolfssl.
[37] wolfSSL library. 2020. SP ECC Cache Resistance. https://github.com/wolfSSL/wolfssl/pull/3195.
[38] Modern HTTP benchmarking tool. https://github.com/wg/wrk.