QTLS: High-Performance TLS Asynchronous Offload Framework with Intel® QuickAssist Technology

Xiaokang Hu∗ (Shanghai Jiao Tong University), Changzheng Wei (Intel Asia-Pacific R&D Ltd.), Jian Li† (Shanghai Jiao Tong University), Brian Will (Intel Corporation, Chandler, Arizona, USA), Ping Yu (Intel Asia-Pacific R&D Ltd.), Lu Gong (Intel Asia-Pacific R&D Ltd.), Haibing Guan (Shanghai Jiao Tong University)

Abstract

Hardware accelerators are a promising solution to optimize the Total Cost of Ownership (TCO) of cloud datacenters. This paper targets the costly Transport Layer Security (TLS) processing and investigates TLS acceleration for the widely-deployed event-driven TLS servers or terminators. Our study reveals an important fact: the straight offloading of TLS-involved crypto operations suffers from frequent long-lasting blockings in the offload I/O, leading to the underutilization of both CPU and accelerator resources.

To achieve efficient TLS acceleration for the event-driven web architecture, we propose QTLS, a high-performance TLS asynchronous offload framework based on Intel® QuickAssist Technology (QAT). QTLS re-engineers the TLS software stack and divides the TLS offloading into four phases to eliminate blockings. Multiple crypto operations from different TLS connections can then be offloaded concurrently in one process/thread, bringing a performance boost. Moreover, QTLS is built with a heuristic polling scheme to retrieve accelerator responses efficiently and in a timely manner, and a kernel-bypass notification scheme to avoid expensive switches between user mode and kernel mode while delivering async events. The comprehensive evaluation shows that QTLS can provide up to 9x connections per second (CPS) with TLS-RSA (2048-bit), 2x secure data transfer throughput and an 85% reduction of average response time compared to the software baseline.

CCS Concepts • Computer systems organization → Heterogeneous (hybrid) systems; • Networks → Security protocols; • Hardware → Hardware accelerators.

Keywords SSL/TLS, crypto operations, event-driven web architecture, crypto accelerator, asynchronous offload

1 Introduction

Nowadays, cloud datacenters demand continuous performance enhancement to fulfill the requirements of cloud services that are becoming globalized and scaling rapidly [5, 12, 25]. Hardware accelerators, including GPU, FPGA and ASIC, are a promising solution to optimize the Total Cost of Ownership (TCO) as both energy consumption and computation cost can be reduced by offloading similar and recurring jobs [12, 23, 29, 31].

This paper targets an important type of datacenter workload: the SSL/TLS processing in various TLS servers or terminators [30, 56]. As the backbone protocols for Internet security, SSL/TLS have been employed (typically in the form of HTTPS) by 64.3% of the 137,502 most popular websites on the Internet as reported in November 2018 [34]. However, due to the continual involvement of crypto operations, particularly the costly asymmetric encryption, TLS processing incurs significant resource consumption compared to an insecure implementation on the same platform [6, 23, 32, 59].

There have been some studies that made efforts to offload TLS-involved crypto operations to general-purpose accelerators, such as GPU, FPGA and Intel® Xeon Phi™ [18, 22–24, 40, 43, 58, 59]. These studies mainly concentrated on the programming to enable efficient calculation of the needed crypto algorithms (e.g., RSA [55]), without paying much attention to the offload I/O. However, our study reveals that for the widely-deployed event-driven¹ TLS servers or terminators, such as Nginx [36], HAProxy [7] and Squid [8], the straight offloading of crypto operations is not sufficiently competent for TLS acceleration. The main challenge is the frequent long-lasting blockings in the offload I/O, leading to (1) a large amount of CPU cycles spent waiting and (2) a low utilization of the parallel computation units inside the accelerator. As both CPU and accelerator resources are underutilized, the anticipated performance enhancement cannot be reached.

In this paper, we propose QTLS, a high-performance TLS asynchronous offload framework based on Intel® QuickAssist Technology (QAT) [20], to achieve efficient TLS acceleration for the event-driven web architecture. As a modern ASIC-based crypto acceleration solution, QAT is competitive in energy-efficiency and cost-performance compared to the general-purpose accelerators [5, 25]. Moreover, it provides an additional security layer for sensitive data (e.g., private keys), which can be securely protected inside the ASIC hardware and never be exposed to the memory [45].

To eliminate blockings in the offload I/O, QTLS re-engineers the TLS software stack to enable asynchronous support for crypto operations in all the layers. The TLS offloading is divided into four phases: pre-processing, QAT response retrieval, async event notification and post-processing. In the pre-processing phase, the offload jobs are paused after crypto submission to return control to the application process. When QAT responses carrying the crypto results are retrieved, the application process is notified by async events to resume the paused offload jobs and begin the post-processing phase. In this novel framework, CPU resources are fully utilized to handle concurrent connections. Multiple crypto operations from different TLS connections can be offloaded concurrently in one process/thread, which greatly increases the utilization of the parallel computation engines inside the QAT accelerator. To further enhance performance, QTLS is built with (1) a heuristic polling scheme that leverages application-level knowledge to achieve efficient and timely QAT response retrieval, and (2) a kernel-bypass notification scheme that introduces an application-defined async queue to avoid expensive switches between user mode and kernel mode while delivering async events.

QTLS leverages the QAT accelerator to offload TLS-involved crypto operations. Its main idea is also applicable to other types of ASIC crypto accelerators. We have implemented QTLS based on Nginx [36] and OpenSSL [13]. Nginx is an event-driven high-performance HTTP(S) and reverse proxy server, used by 48.5% of the top one million websites [33]. OpenSSL is an SSL/TLS and crypto library, widely used on a great variety of systems [53].

The performance advantages of QTLS have been validated through extensive experiments that covered both the TLS 1.2 and 1.3 protocols, both full and abbreviated (i.e., session resumption) handshakes, and multiple mainstream cipher suites. It is demonstrated that QTLS greatly improves the TLS handshake performance, achieving up to 9x connections per second (CPS) with TLS-RSA (2048-bit) over the software baseline. In addition, the secure data transfer throughput is enhanced by more than 2x and the average response time is reduced by nearly 85%.

In summary, this work makes the following contributions:

1. An important fact is revealed: for the event-driven TLS servers or terminators, the straight offloading of crypto operations suffers from frequent long-lasting blockings in the offload I/O.
2. We propose and design the TLS asynchronous offload framework for the event-driven web architecture. Due to its enormous performance advantages, it has been adopted by a number of service providers, like Alibaba for e-commerce [41] and Wangsu for CDN [21].
3. A heuristic polling scheme and a kernel-bypass notification scheme are designed to further enhance the performance of QTLS.
4. We show that QTLS can be practically implemented with Nginx and OpenSSL, and evaluate its performance with extensive experiments.

∗Work done as a software engineer intern in Intel DCG/NPG.
†Corresponding author.
¹Handling multiple concurrent connections in one process/thread, instead of thread-per-connection.

PPoPP '19, February 16–20, 2019, Washington, DC, USA. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6225-2/19/02. https://doi.org/10.1145/3293883.3295705

Figure 1. Classic RSA-wrapped TLS processing (handshake: Client Hello, Server Hello, encrypted Premaster and Finished messages, with a server-side asymmetric-key calculation and multiple PRF operations for crypto key generation; then secure data transfer with symmetric cipher + MAC).

2 Background and Motivation

This section first gives an overview of the TLS protocols and the event-driven web architecture. Then, we present background knowledge about Intel® QAT. Finally, we highlight the challenges with TLS offloading for the event-driven web architecture.

2.1 TLS Overview

Transport Layer Security (TLS) and its predecessor, Secure Sockets Layer (SSL), have been de-facto standards for Internet security for more than 20 years [57]. The TLS 1.0-1.2 protocols are currently dominant and the TLS 1.3 protocol has recently been ratified and published (as RFC 8446) by the IETF [37].

A TLS connection can be divided into two main phases [10]: (1) the handshake phase, which determines crypto algorithms, performs authentications and negotiates shared keys for data encryption and message authentication code (MAC); (2) the secure data transfer phase, which sends encrypted data over the established TLS connection. The classic RSA-wrapped TLS processing (i.e., the TLS-RSA cipher suite) in the TLS 1.2 protocol is illustrated in Figure 1.


Handshake: The RSA-wrapped handshake requires two network round trips to exchange the Hello messages, the encrypted Premaster and the Finished messages, as shown in Figure 1. At the server side, one asymmetric-key calculation (RSA sign) is involved for authentication, which is the most computing-intensive part of the TLS processing [6, 23]. Besides, multiple pseudo random function (PRF) [54] operations are needed for key derivation. For other cipher suites in the TLS 1.2 protocol with forward secrecy [3], such as the ECDHE-RSA cipher suite [48], the handshake is more complicated and involves two more asymmetric elliptic-curve cryptography (ECC) [47] calculations at the server side. In the TLS 1.3 protocol, although one network round trip can be saved in the handshake phase, the involved crypto operations cannot be omitted [9]. Also, the enhanced security requires more key derivation operations with the new HMAC-based key derivation function (HKDF) [28]. Table 1 summarizes the number of server-side crypto operations needed to perform a full handshake.

Table 1. Server-side crypto operations for full handshake

  TLS Cipher Suite     RSA   ECC   PRF/HKDF
  1.2 TLS-RSA           1     0       4
  1.2 ECDHE-RSA         1     2       4
  1.2 ECDHE-ECDSA       0     3       4
  1.3 ECDHE-RSA         1     2      >4

Secure data transfer: This phase is based on the negotiated crypto keys, including at least two cipher keys and two MAC keys. Each key pair is used for data encryption and MAC calculation in one direction. Prior to encryption and transmission, the data object is fragmented into units of 16 KB if it is larger than this.

Resource consumption: TLS imposes a much higher resource consumption than an insecure implementation due to the costly crypto operations [6, 23, 32, 59]. A recent study [35] reported that a modern CPU core, which is able to serve over 30K HTTP connections per second, cannot handle more than 0.5K TLS handshakes per second with the mainstream ECDHE-RSA (2048-bit) cipher suite. The overhead mainly originates from the heavy asymmetric-key calculations [6, 23]. As the requested web object increases in size, the computation overhead incurred by data encryption and MAC calculation becomes considerable as well. Performance testing [35] showed that when the request size increases from 1 KB to 100 KB, the throughput drop compared to the unencrypted case grows strongly from 45.7% to 85.4%.

Session resumption: TLS provides a shortcut named session resumption (with either session ID or session ticket) that avoids performing the full handshake each time [10]. Specifically, the asymmetric-key calculations can be skipped in the handshake phase on later connections from the same client. Although session resumption is effective in reducing the TLS resource consumption at the server side, the protection afforded by forward secrecy is compromised [42]. In real-life deployment, service providers restrict the lifetime of session IDs or tickets (generally less than an hour [42]) to achieve a trade-off between performance and security.

2.2 Event-driven Web Architecture

There are two competitive architectures for web applications: one is based on threads, while the other is based on events [11]. Thread-based web applications (e.g., Apache [14]) initiate a new thread for each incoming connection. This suffers from costly thread switches and high system resource consumption under high traffic. In contrast, event-driven web applications (e.g., Nginx [36]) have the ability to handle thousands of concurrent connections in one process/thread, bringing significant advances in both performance and scalability [15, 44]. To align with the multi-core architecture, multiple worker processes can be launched simultaneously to handle incoming connections in a balanced manner.

To consolidate multiple connections into a single flow of execution, the event-driven web architecture works with network sockets in an asynchronous (non-blocking) mode and monitors them with an event-based I/O multiplexing mechanism (e.g., epoll [49] or kqueue [52]). In the so-called event loop, the application process waits for events on the monitored sockets. Once a queue of new events is returned, the corresponding event handlers are invoked one by one.

2.3 Intel® QuickAssist Technology

Intel® QuickAssist Technology (QAT) is an ASIC-based solution to enhance security and compression performance for cloud, networking, big data and storage applications [20]. Both a kernel driver and a userspace driver (implemented as a library) are available to meet different requirements.

Figure 2. A QAT endpoint and its usage model (each running process or thread is assigned a crypto instance; requests on the hardware request/response ring pairs are load-balanced across the parallel computation engines).

Usage model: As illustrated in Figure 2, a QAT endpoint possesses multiple parallel computation engines and a number of hardware-assisted request/response ring pairs. Software writes requests onto a request ring and reads responses back from a response ring. The QAT hardware load-balances requests from all rings across all available computation engines. The availability of responses can be indicated using either interrupts or polling.


Several request/response ring pairs (for different crypto types) are grouped into a QAT crypto instance, which is actually a logical unit that can be assigned to a process/thread running on a CPU core. A modern QAT endpoint can support up to 48 crypto instances to scale with the multi-core architecture. If there are multiple QAT endpoints in the server, one process can be assigned multiple QAT instances from different endpoints to employ more computation engines.

Parallelism: On the one hand, crypto requests submitted from different processes to different QAT instances can be calculated in parallel on multiple computation engines. On the other hand, concurrent crypto requests submitted from one process to one QAT instance can also be calculated in parallel as long as there are available computation engines. Here, "concurrent" means submitting the next request without waiting for the completion of previous ones. If there are sufficient concurrent requests, it is possible to fully load all parallel computation engines with only one or two QAT instances.

OpenSSL QAT Engine: QAT integrates its crypto service into OpenSSL (a widely-used crypto library) [13] through a standard engine named QAT Engine, which allows seamless use by applications. When the QAT Engine is loaded, applications can transparently offload crypto operations to the QAT accelerator via the normal OpenSSL API. In general, the QAT Engine supports the offloading of asymmetric cryptography (e.g., RSA, ECDH and ECDSA), symmetric chained ciphers (e.g., AES128-CBC-HMAC-SHA1) and PRF.

2.4 Challenges with TLS Offloading

A straight offload mode for TLS processing is to replace the crypto-related function call with an I/O call that interacts with the hardware accelerator. This mode reuses the existing TLS software stack with only a few modifications. However, for the event-driven web architecture, this kind of straight offloading is not sufficiently competent and cannot reach the anticipated performance enhancement.

In the software-based TLS processing, the crypto-related function call is a synchronous blocking call that uses the CPU resource for crypto calculation. Both the TLS library layer and the application layer cannot move on if a crypto operation has not been completed yet. This kind of design works well with the CPU-only architecture, but in the straight offload mode, it causes the frequent long-lasting blockings in the offload I/O, as illustrated in Figure 3. Although the QAT driver provides an inherently non-blocking interface (i.e., request/response), the QAT Engine cannot return control to the upper layers after it submits a crypto request from the connection C1. That is to say, it has to wait until the QAT accelerator completes this request and generates a response. Only after the response carrying the crypto result is consumed can the application process continue execution to generate the next crypto request (giving rise to the next blocking).

Figure 3. TLS straight offload mode with QAT (the application process, TLS library and QAT Engine all block on each request/response round trip to the QAT instance, so connections C1, C2, ... are served one crypto operation at a time).

For the event-driven web architecture, the blockings in the offload I/O lead to the underutilization of both CPU and QAT resources. First is the large amount of CPU cycles spent waiting. After a crypto request is submitted to the accelerator, the application process (typically running on a dedicated core) is in the waiting state (either busy-looping or sleeping to wait for the QAT response), stopping the cycle of handling events for a long time. Second is the low utilization of the parallel computation engines inside the QAT accelerator. Since the second crypto request can only be submitted after the completion of the first one, for each application process, no more than one computation engine can be employed at the same time.

3 QTLS Design

To achieve efficient TLS acceleration for the event-driven web architecture, we propose QTLS, a high-performance TLS asynchronous offload framework that leverages the QAT accelerator to offload crypto operations. Its main idea is also applicable to other types of ASIC crypto accelerators.

3.1 Overview

An overview of the QTLS framework is shown in Figure 4. To eliminate blockings in the offload I/O, QTLS re-engineers the TLS software stack to enable asynchronous support for crypto operations in all the layers, including the QAT Engine layer, the TLS library layer and the application layer. It divides the TLS offloading into four phases:

• Pre-processing: When a TLS handler (e.g., the handshake handler) encounters a crypto operation, the offload job is paused immediately after the crypto request is submitted to the QAT accelerator. The application process continues execution to handle other connections, instead of being blocked and waiting for the QAT response. Multiple crypto operations from different connections (C1, C2, ...) can be offloaded concurrently in one process.
• QAT Response Retrieval: As more and more concurrent crypto requests are submitted, the QAT responses for them need to be retrieved in time. A heuristic polling scheme is designed here to achieve efficient and timely retrieval.
• Async Event Notification: Each time a new QAT response is retrieved, the pre-registered response callback generates an async event to inform the application about the completion of an async crypto request. A kernel-bypass notification scheme is designed here to avoid expensive switches between user mode and kernel mode while delivering async events.
• Post-processing: According to the captured async event, the corresponding TLS connection along with the same TLS handler will be rescheduled by the application process to resume the paused offload job. Since the crypto calculation result is now ready, this connection can move on to the next step.

Figure 4. An overview of the QTLS framework (the TLS state machine pauses offload jobs on crypto operations; the QAT Engine submits non-blocking requests via the QAT userspace driver; the heuristic polling scheme, guided by application-level knowledge, polls for new responses; response callbacks deliver kernel-bypass notifications that reschedule the paused handlers, e.g., Handshake or TLS Write, via the TLS-ASYNC state).

In this novel framework, CPU resources are fully utilized to handle concurrent connections. Furthermore, the concurrently offloaded crypto requests greatly increase the utilization of the parallel computation engines inside the QAT accelerator.

3.2 Pre-processing and Post-processing

As discussed in §2.4, the root cause of the blockings in the offload I/O is that both the TLS library layer and the application layer cannot move on if a crypto operation has not been completed yet. We circumvent this restriction by introducing asynchronous crypto behaviors into the TLS software stack and splitting the offload I/O into a pre-processing phase (where async offload jobs are paused) and a post-processing phase (where async offload jobs are resumed).

In the TLS library layer, QTLS introduces crypto pause and crypto resumption. The former is leveraged by the QAT Engine to pause the current offload job after a crypto request is submitted and return control to the upper application with a specific error code. The latter is leveraged by the application process to resume a paused offload job to consume the crypto result. The pause and resumption of an offload job require careful management of the TLS context and the crypto state. We have developed two types of implementations based on OpenSSL, which will be elaborated in §4.1.

The application layer is the initiator and terminator of async crypto operations. QTLS introduces a new state named TLS-ASYNC into the application-maintained TLS state machine, which is connected to all other TLS states (e.g., Handshake and TLS-Write) that may involve crypto operations. Suppose that the currently-handled connection C1 is in the Handshake state (i.e., executing the handshake handler), as shown in Figure 4. Once the TLS state machine identifies a paused offload job from the specific error code returned by the TLS library, it changes the TLS state from Handshake to TLS-ASYNC. Each paused offload job corresponds to an async handler. Here, it is set to the handshake handler, indicating that the same handler needs to be rescheduled later to conduct the post-processing phase. In addition, the application layer is responsible for monitoring all the paused offload jobs and waiting for async events that signal the completion of crypto requests. The detailed design of monitoring and notification will be presented together in §3.4.

The QAT Engine layer is a bridge between the TLS library layer and the QAT driver layer. On the one hand, it registers a response callback (used to generate an async event) when it uses the non-blocking API provided by the QAT driver to submit a crypto request. On the other hand, after the submission of a crypto request, it uses the crypto pause provided by the TLS library to return control to the upper application.

A special case is the failure of crypto submission, possibly because the QAT request ring is already full. QTLS resolves this case by reusing the proposed framework. The offload job is paused after an unsuccessful crypto submission. The application is notified about this failure through either an async event or a specific return code. The same TLS handler will be rescheduled later to resume the paused offload job and retry the crypto submission.

3.3 QAT Response Retrieval

QAT responses can be retrieved through either interrupts or polling. QTLS leverages userspace I/O for crypto offloading, where one userspace-based polling operation has much less

162 PPoPP ’19, February 16–20, 2019, Washington, DC, USA Xiaokang Hu, Changzheng Wei, Jian Li, Brian Will et al. overhead than one kernel-based interrupt [19]. Therefore, 3.4 Async Event Notification QTLS selects polling to achieve high throughput, especially For the event-driven web architecture, a natural way for at high-load state. A convenient polling method is to initiate the async event notification is to reuse the existing event- an independent thread for each application process to poll driven approach. Specifically, each time an async offload job the assigned QAT instances at regular intervals, as what is paused, a file descriptor (FD)51 [ ] is allocated to it and the QAT Engine does by default, but it comes with several monitored by the application’s I/O multiplexing mechanism, drawbacks inevitably: together with the network sockets. When a QAT response is retrieved, the response callback function writes an event on • It introduces frequent context switches between applica- the corresponding FD to inform the application. tion processes and polling threads. However, as file descriptors are maintained in the kernel, • It is hard to determine an appropriate polling interval: a this kind of FD-based notification inevitably introduces ex- small interval may lead to a large amount of ineffective pensive switches between user mode and kernel mode. In polling operations while a big interval may incur a long order to avoid this performance penalty, we design a kernel- latency or even impede the throughput. bypass notification scheme. An application-defined async queue is introduced to work as the monitoring and notifi- To achieve better performance and adapt well to varied cation channel. When the TLS state machine identifies a traffic, we design a heuristic polling scheme for QTLS. 
Since paused offload job, it already has the knowledge about its the application is the producer of crypto requests, it has async hander, which is actually the TLS handler needed to be the best knowledge about the proper time to retrieve QAT rescheduled when the application receives the async event. responses. The heuristic polling scheme is integrated into This knowledge can be shared to the lower software stack. the application to (1) avoid an independent polling thread Then, when a QAT response is retrieved, the response call- and (2) leverage the application-level knowledge to guide back function can complete notification just by inserting the the polling behaviors. Sometimes, a polling operation should corresponding async handler into the tail of the async queue. be postponed to coalesce sufficient responses; however, an- This async queue will be processed at the end of the main other time, a polling operation may need to be executed event loop. As long as there exist inflight crypto requests, the immediately to reduce latency. The heuristic polling scheme main event loop keeps execution, instead of sleep-waiting integrates both efficiency and timeliness, with the goal to for events on the I/O multiplexing mechanism. make the response retrieve rate always match the request submission rate. Efficiency: When there are a large number of concurrent 4 Implementation Details TLS connections (i.e., high offload traffic), it is reasonable QTLS can be implemented on various event-driven TLS to coalesce sufficient QAT responses into one polling oper- servers or terminators (e.g. Nginx, HAProxy and Squid). This ation. The number of inflight (submitted but not retrieved section presents the implementation based on Nginx [36] with a response) crypto requests is an appropriate indica- and OpenSSL [13]. 
Nginx uses one master process for man- tor and a polling operation is triggered when this number agement and multiple worker processes for doing the real reaches a pre-defined threshold. A notable fact is that the work. asymmetric-key calculation takes a much longer time than the calculation of other types of crypto requests. Therefore, a bigger threshold is used if there exist inflight asymmetric 4.1 OpenSSL Async Crypto Support crypto requests. We have developed the crypto pause and resumption in Timeliness: During off-hours with few TLS connections, OpenSSL for several years, and there are two kinds of imple- a timely polling is important and necessary. An TLS con- mentations: stack async and fiber async. nection can be identified as (1) active if it is processing the Stack Async: Initially, we implemented the crypto pause handshake, reading an HTTPS request or writing the HTTPS and resumption by altering the normal sequence of program response back; or (2) idle if it is waiting for a request from execution according to the state flag, as illustrated in Figure the end client (including the keepalive case). In the current 5. When the crypto API (e.g., RSA sign) is invoked by the TLS design, each active TLS connection can only have one async API (e.g., handshake) for the first time, it executes normally crypto request at the same time. Therefore, once the num- to submit the crypto request. After a successful submission, ber of inflight async crypto requests equals the number of it sets the state flag to inflight and returns to the caller. The active TLS connections, a polling operation needs to be exe- state flag is set to ready when the QAT response is retrieved. cuted immediately. Otherwise, the application process may Then, the same TLS API is called again to conduct crypto get stuck as all active connections are waiting for QAT re- resumption, which requires careful skipping of some TLS op- sponses. 
This timeliness constraint greatly reduces the query erations (e.g, ssl3_get_message) that have been completed in latency when there are only few end clients. the first call. As the state flag is ready now, the same crypto

163 QTLS: QAT-based TLS Asynchronous Offload Framework PPoPP ’19, February 16–20, 2019, Washington, DC,USA

Figure 5. Workflow of stack async

Figure 6. Workflow of fiber async

API jumps over the crypto submission part to directly consume the crypto result. For a failure of crypto submission, the state flag is set to retry and the application is responsible for calling the same TLS API later to retry the submission. The stack async implementation has good performance, but it is intrusive to OpenSSL as its API is not compatible with the existing synchronous API.

Fiber Async: The intrusive stack async implementation is unacceptable to the OpenSSL community. After discussion with the OpenSSL maintainers, we agreed on a new implementation that unifies the asynchronous and synchronous behaviors with the fiber mechanism [50]. Fibers are cooperative lightweight threads of execution in user space, which, in effect, provide a natural way to organize concurrent code based on asynchronous I/O [27]. By using explicit yielding and cooperative scheduling, the fiber async implementation manages multiple execution paths for async offload jobs in one process/thread, as illustrated in Figure 6.

When the TLS API (e.g., handshake) is called for the first time, it uses ASYNC_start_job to start a new fiber-based ASYNC_JOB that encapsulates (fiber context swap here) the running piece of the TLS connection. The fiber-based ASYNC_JOB can be paused at any point to return control to the main code and resumed later to jump directly to the pause point. If the invoked crypto API (e.g., RSA sign) identifies an async offload job through ASYNC_get_current_job, it first submits a crypto request in non-blocking mode and then leverages ASYNC_pause_job to return control (fiber context swap here). When the same TLS API is called again, it uses ASYNC_start_job with the paused ASYNC_JOB as an argument to jump directly (fiber context swap here) to the pause point and consume the crypto result.

The fiber async implementation has a slight performance penalty due to the fiber management and switches, but it is compatible with the existing API. It has been included in OpenSSL official releases since the 1.1.0 series.

4.2 Nginx Modifications for Async Crypto Support

The crypto pause may occur in the following functions: ngx_ssl_handshake, ngx_ssl_handle_recv, ngx_ssl_write and ngx_ssl_shutdown. These functions are modified to recognize a new error code from OpenSSL named SSL_ERROR_WANT_ASYNC, which indicates that an async crypto request has been submitted. If this new error code is received, the async handler of the paused offload job is set to the current handler (e.g., ngx_ssl_handshake_handler).

Ideally, after an async offload job is paused and execution is returned to Nginx, the only type of event expected for this TLS connection is an async event, which will resume the paused job. However, the possibility exists that a read event (from the network socket) occurs before the expected async event. This read event could be the next message during the TLS handshake or the first HTTP request after connection establishment. This kind of event disorder may cause the HTTP or TLS state machine to go back to a previous state. To address this issue, QTLS clears and saves the handler of the read event when an async event is being expected, and restores it when the async event is being processed.

4.3 Heuristic Polling Scheme

The information about the inflight crypto requests is collected in the QAT Engine layer for accuracy. The different types of TLS crypto operations are counted independently, denoted by Rasym, Rcipher and Rprf respectively. Each time a crypto-related function in the QAT Engine (e.g., qat_rsa_priv_dec) is invoked, the corresponding number (e.g., Rasym) is increased by one. Once a QAT response is retrieved, the corresponding number is decreased by one in the response callback function. The total number of inflight crypto requests (denoted by Rtotal) is calculated by adding these three variables together and shared with the upper application via a new engine command.

Nginx has a stub_status module that collects useful information, including the number of alive connections and the number of idle connections that are waiting for a request. Based on this, QTLS adds a check for TLS-enabled connections and calculates the number of active TLS connections (denoted by TCactive) with the equation: TCactive = TCalive − TCidle.


The timeliness and efficiency constraints can be satisfied either when Rtotal increases or when TCactive decreases. Therefore, in Nginx, wherever a crypto operation may be involved or TCactive may be updated, it is required to check these constraints and determine whether to execute a polling operation. The default thresholds for the efficiency constraint are set to 48 (for Rasym > 0) and 24 (for Rasym = 0). We have exposed the threshold setting in the Nginx configuration file.

In our large number of tests, the heuristic polling scheme works well to retrieve all QAT responses in time. For failover, a timer (e.g., five milliseconds) can be set inside Nginx to periodically check whether at least one heuristic polling operation was triggered during the last interval. If not, but there exist in-flight offload requests, an extra polling operation can be executed at once.

4.4 Async Event Notification

FD-based notification: QAT Engine uses the set-FD API to associate an FD with an async job before crypto pause. The application uses the get-FD API at the proper time to obtain information about added FDs (which need to be added into monitoring) and deleted FDs (which need to be deleted from monitoring). Normally, a new FD should be created for each paused offload job. Considering the fact that each TLS connection can only have one async offload job at a time, an optimization for lower overhead is to share one FD across all async jobs from the same TLS connection.

Kernel-bypass notification: An application-level callback function is introduced to manipulate the async queue. This function takes the async handler information as the argument and inserts it into the tail of the async queue. In OpenSSL, two new members, named callback and callback_arg, are added to the data structure of the fiber-based ASYNC_JOB. Meanwhile, a series of new APIs are introduced for users to set or get them. Each time the TLS state machine identifies a paused offload job, it invokes the SSL_set_async_callback API to set the application-level callback and the async handler. When a QAT response is retrieved, the response callback function leverages the ASYNC_WAIT_CTX_get_callback API to check whether the callback member has been set for the current offload job. If so, it can directly invoke the application-level callback function with callback_arg (i.e., the async-handler information) as the argument to complete the notification.

5 Evaluation

5.1 Experimental Setup

We established an experimental testbed with one tested server and two client servers that were connected back-to-back via Intel® XL710 40GbE NICs. All servers were equipped with two 22-core Intel® Xeon® E5-2699 v4 CPUs (hyper-threading enabled) and 64GB RAM. We ran CentOS 7.3 with Linux Kernel 3.10 on them. QTLS was installed on the tested server with one Intel® DH8970 PCIe QAT card, which contains three independent QAT endpoints.

The tested QTLS was based on Nginx 1.10.3 and OpenSSL 1.1.0g. Nginx supports a number of working modes, such as a web server, a reverse proxy server or a load balancer. Since all these modes share the same TLS processing module, we only configured Nginx as an HTTPS web server to conduct the testing. In addition, the fiber async implementation was used in the evaluation as it has been included in OpenSSL official releases.

In the tested server, multiple Nginx workers were configured to occupy the same number of dedicated hyper-threading (HT) cores. Each worker process was equipped with one QAT instance. The allocated QAT instances were distributed evenly across the three QAT endpoints. The timer-based polling thread (needed in some configurations) for each Nginx worker was pinned to the same core as the worker.

The evaluation was conducted in terms of TLS handshake performance, secure data transfer throughput and average response time. We covered both the TLS 1.2 and 1.3 protocols, both full and abbreviated (i.e., session resumption) handshakes, and multiple mainstream cipher suites. To evaluate all aspects of QTLS, the following five configurations are compared:

• SW: software calculation with modern AES-NI instructions for TLS-involved crypto operations.
• QAT+S: straight offload mode with the timer-based polling thread (10 µs polling interval by default) for QAT response retrieval.
• QAT+A: TLS asynchronous offload framework with the timer-based polling thread and the FD-based notification scheme.
• QAT+AH: replacing the polling thread with the heuristic polling scheme, based on the QAT+A configuration.
• QTLS: the full QTLS with both the heuristic polling scheme and the kernel-bypass notification scheme.

5.2 Full Handshake Performance

The TLS handshake performance was measured with the OpenSSL s_time tool. In each of the two client servers, 1000 s_time processes were simultaneously launched to establish new TLS connections (i.e., full handshakes) with the Nginx running on the tested server. We first evaluated the full handshake performance for three mainstream cipher suites in the TLS 1.2 protocol: TLS-RSA, ECDHE-RSA and ECDHE-ECDSA.

The full handshake performance of TLS-RSA (2048-bit) is shown in Figure 7a. Here "2HT" indicates that two Nginx workers are running on two dedicated hyper-threading (HT) cores (belonging to the same physical core). It can be seen that the connections per second (CPS) value increases linearly for all five configurations when the number of Nginx


Figure 7. Full handshake performance for three mainstream cipher suites in the TLS 1.2 protocol: (a) TLS-RSA (2048-bit); (b) ECDHE-RSA (2048-bit); (c) ECDHE-ECDSA (six NIST curves)

Figure 8. Full handshake performance in TLS 1.3 with ECDHE-RSA (2048-bit)

Figure 9. Session resumption performance in TLS 1.2 with ECDHE-RSA (2048-bit): (a) 100% abbreviated handshakes; (b) Full / Abbreviated = 1 : 9

workers varies from 2 to 24. Taking the 8HT case as an example, the SW configuration gives 4.3K CPS. The straight offloading of RSA and PRF operations (i.e., the QAT+S configuration) improves the CPS by a factor of two. In the QAT+A configuration, the TLS asynchronous offload framework fully utilizes both CPU and QAT resources, bringing a CPS boost to 29.5K, which is a 7x improvement over the SW configuration. In the QAT+AH configuration, an additional 20% CPS enhancement (i.e., 35.8K) was obtained by the heuristic polling for QAT responses. Finally, the kernel-bypass notification scheme further improves the CPS by 8%, leading to a CPS value of 38.8K in the QTLS configuration. In general, the full QTLS provides a 9x CPS improvement over the software baseline. For the 32HT case, both the QAT+AH and the QTLS configurations give about 100K CPS, reaching the upper limit of the DH8970 QAT card.

Figure 7b shows the handshake performance of ECDHE-RSA (2048-bit), which involves two more ECC calculations than the TLS-RSA cipher suite. The OpenSSL default curve, NIST P-256, was used for the ECC computation. The QAT+S configuration shows no CPS improvement over the SW configuration due to the blockings in the offload I/O. The QAT+A configuration with the TLS asynchronous offload framework improves the CPS by a factor of more than four. The heuristic polling scheme and the kernel-bypass notification scheme give an additional 20% and 6% CPS enhancement respectively. The full QTLS provides a 5.5x CPS improvement over the software baseline, with 16 Nginx workers needed to reach the QAT upper limit (i.e., 40K CPS).

For another mainstream cipher suite, ECDHE-ECDSA, which involves three ECC calculations, we evaluated six NIST curves with four Nginx workers, as shown in Figure 7c. A striking phenomenon is that for the P-256 curve, the SW configuration provides an abnormally high CPS and outperforms the QAT+S configuration significantly. This is because the prime of the P-256 curve is "Montgomery Friendly" and the curve can be implemented in the Montgomery domain [17]. This feature makes the ECDSA (P-256) sign 2.33x faster than the traditional implementation. Nevertheless, QTLS enhances the CPS by more than 70% compared to this special software implementation. It is worth noting that not all ECC prime curves can be optimized by using the Montgomery domain. For the stronger P-384 curve, QTLS provides a 14x CPS improvement over the software baseline. Moreover, according to SafeCurves [1], the widely-used P-256 curve has potential security risks. Therefore, we also evaluated four counterpart curves: the NIST binary curves (B-283 and B-409) and the NIST Koblitz curves (K-283 and K-409). For all these four curves, QTLS enhances the CPS by more than 12x compared to the SW configuration.

TLS 1.3: We also evaluated the full handshake performance for the TLS 1.3 protocol (provided by OpenSSL 1.1.1-pre8) with the cipher suite ECDHE-RSA (2048-bit), as shown in Figure 8. QTLS provides a 3.5x CPS improvement over the software baseline, lower than in the TLS 1.2 case. The main reason is that the TLS 1.3 protocol introduces a new key derivation function named HKDF, which cannot currently be offloaded through the QAT Engine.

5.3 Session Resumption Performance

Session resumption is widely used in real-life deployments to reduce the resource consumption at the server side. In the


Figure 10. Secure data transfer with varied files

Figure 11. Latency under different concurrencies

Figure 12. Comparison between the timer-based polling thread and the heuristic polling scheme: (a) Full handshake with TLS-RSA (2048-bit); (b) Secure data transfer with 64 KB file; (c) Average response time

TLS 1.2 protocol, an abbreviated handshake for session resumption involves PRF calculations only. We first evaluated the session resumption performance with 100% abbreviated handshakes. The evaluation environment is identical to the full handshake testing, with the difference that the "reuse" option of the s_time tool was used to establish resumed connections. As shown in Figure 9a, with abbreviated handshakes only, QTLS can provide a 30%-40% CPS enhancement over the software baseline. In contrast, the straight offload mode (i.e., the QAT+S configuration) suffers from the blockings in the offload I/O and gives an obviously lower CPS than the SW configuration without TLS offloading.

As mentioned in §2.1, session resumption compromises the protection afforded by forward secrecy, so the lifetime of session IDs or tickets is typically restricted. Real-life traffic is a mixture of full and abbreviated handshakes, and the ratio between them depends on the security policy. We evaluated an example ratio of 1:9 (i.e., 10% full handshakes) with the cipher suite ECDHE-RSA (2048-bit), as shown in Figure 9b. Compared to the SW configuration, QTLS improves the CPS by more than 2x. For a specific ratio between full and abbreviated handshakes, the CPS enhancement with the cipher suite ECDHE-RSA (2048-bit) ranges between 1.3x and 5.5x. A bigger percentage of full handshakes (i.e., a more stringent security policy) leads to a higher CPS improvement.

5.4 Secure Data Transfer Throughput

We used ApacheBench (ab) to measure the secure data transfer throughput with the requested file size varying from 4 KB to 1024 KB. The cipher suite for secure data transfer was AES128-SHA. In the tested server, eight Nginx workers were launched and the keepalive setting was tuned to avoid the influence of the TLS handshake. In a client server, 400 ab processes were configured to continuously request a fixed file.

The experimental results are shown in Figure 10. For the 4 KB case, QTLS only provides a slightly higher throughput than the SW configuration. As the requested file size increases, the performance benefit of QTLS becomes bigger because the data encryption for the file incurs more cipher operations at the server side. Taking the 128 KB case as an example, one file incurs eight (calculated by 128/16) cipher operations. The QAT+A configuration improves the throughput by 60% compared to the SW configuration. The combination of the heuristic polling scheme and the kernel-bypass notification scheme contributes an additional 25%-30% throughput enhancement. The full QTLS provides more than a 2x throughput improvement over the software baseline.

5.5 Average Response Time

The average response time was evaluated with different concurrencies, i.e., different numbers of concurrent end clients. Only one Nginx worker was launched in the tested server. In a client server, multiple ab processes were launched to simulate end clients and make concurrent requests for a small-size page (less than 100 bytes) with the cipher suite TLS-RSA (2048-bit). A full handshake is needed for each request.

As shown in Figure 11, for a concurrency of 1, the QAT+S configuration has the lowest latency due to the busy-loop waiting for the crypto result. The QTLS configuration with the heuristic polling scheme has the second smallest response

time. This is because once the number of inflight offload requests equals the number of active TLS connections (here, 1), a heuristic polling operation is immediately performed to retrieve at least one QAT response. The QAT+A configuration shows a longer response time since the QAT response retrieval is only triggered after a fixed time interval (here, 10 µs). The SW configuration has the highest latency due to the software calculation of the compute-intensive RSA operation.

With the increase of concurrency, the advantage of the TLS asynchronous offload framework becomes more significant. Since multiple crypto operations from different connections can be offloaded concurrently, the QAT+A configuration achieves about a 75% latency reduction over the software baseline when the concurrency reaches 64. After adding the heuristic polling scheme and the kernel-bypass notification scheme, the full QTLS can decrease the average response time by nearly 85%.

5.6 Polling Thread vs. Heuristic Polling

We evaluated the timer-based polling thread with different polling intervals to demonstrate the advantages of the heuristic polling scheme. Three testing scenarios were designed with the TLS asynchronous offload framework: (1) 10 µs: using the timer-based polling thread with a 10 µs interval; (2) 1 ms: using the timer-based polling thread with a 1 ms interval; (3) heuristic: using the proposed heuristic polling scheme.

The experimental results are shown in Figure 12. The 10 µs configuration brings a relatively low handshake performance (about a 20% gap) because it suffers from frequent context switches and a large number of ineffective polling operations. For the 1 ms configuration, when there is only a small number of concurrent end clients, it not only gives a high latency but also lowers the throughput dramatically, as it has to wait for 1 ms before retrieving QAT responses. The heuristic polling scheme leverages application-level knowledge to make the response retrieval rate match the request submission rate. It achieves the best performance in all the above experiments and adapts well to varied network traffic (i.e., different concurrencies).

6 Related Work

TLS acceleration has been attracting more and more attention due to its increasing importance. The main TLS performance bottleneck is the involved costly crypto operations. A variety of efforts have been made to accelerate TLS from the following aspects.

Software acceleration: Shacham et al. [38] presented an algorithmic approach, batching the SSL handshakes on the web server with Batch RSA, to speed up SSL. Boneh et al. [2] described three fast variants of RSA (Batch RSA, Multi-factor RSA and Rebalanced RSA) and analyzed their security issues. A client-side caching of handshake information was proposed in [39] to reduce the network overhead. Castelluccia et al. [4] examined a technique for re-balancing RSA-based client/server handshakes, which puts more work on clients to achieve better SSL performance on the server.

CPU-level improvement: The AES instruction set, an extension to the instruction set architecture, has been integrated into many processors to improve the speed of encryption and decryption using AES [46]. Kounavis et al. [26] presented a set of processor instructions along with software optimizations that can be used to accelerate end-to-end encryption and authentication. In [16], a novel constant run-time approach was proposed to implement fast modular exponentiation on IA processors.

Hardware offloading: This is a promising solution because accelerators typically provide much faster crypto calculation than the CPU. A number of studies have been conducted to enable TLS-involved crypto algorithms (e.g., RSA, AES and HMAC) on different types of hardware accelerators, including GPUs [18, 23, 43, 58], FPGAs [22, 24, 40] and the Intel® Xeon Phi™ processor [59]. The main focus of these studies is to program general-purpose accelerators to enable efficient crypto calculation. In contrast, the QAT accelerator leveraged in this work inherently supports high-performance and energy-efficient calculation for mainstream crypto algorithms. Our focus is to eliminate blockings in the offload I/O and achieve high-performance TLS offloading for event-driven TLS servers/terminators.

7 Conclusion

In this paper, we proposed and designed QTLS, a QAT-based high-performance TLS asynchronous offload framework for the popular event-driven web architecture. QTLS re-engineers the TLS software stack and divides the TLS offloading into four phases: pre-processing, QAT response retrieval, async event notification and post-processing. This novel framework eliminates blockings in the offload I/O to fully utilize both CPU and QAT resources. Moreover, a heuristic polling scheme and a kernel-bypass notification scheme were designed to further enhance the system performance. We have implemented QTLS based on the widely-deployed Nginx and OpenSSL. The comprehensive performance evaluation demonstrated that QTLS outperforms software-based TLS processing by a significant margin, especially in the handshake phase.

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and thank the PC/AE chairs for their assistance in solving the encountered problems. This work is supported in part by the National Key Research and Development Program of China (No. 2016YFB1000502) and the National Science Fund for Distinguished Young Scholars (No. 61525204).


References

[1] Daniel J. Bernstein and Tanja Lange. 2017. SafeCurves: choosing safe curves for elliptic-curve cryptography. Retrieved November 30, 2018 from https://safecurves.cr.yp.to/
[2] Dan Boneh and Hovav Shacham. 2002. Fast variants of RSA. CryptoBytes 5, 1 (2002), 1–9.
[3] Ran Canetti, Shai Halevi, and Jonathan Katz. 2003. A Forward-Secure Public-Key Encryption Scheme. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT). 255–271.
[4] Claude Castelluccia, Einar Mykletun, and Gene Tsudik. 2006. Improving secure server performance by re-balancing SSL/TLS handshakes. In Proceedings of the ACM Symposium on Information, Computer and Communications Security (ASIACCS). 26–34.
[5] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-Scale Acceleration Architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 7:1–7:13.
[6] Cristian Coarfa, Peter Druschel, and Dan S. Wallach. 2006. Performance Analysis of TLS Web Servers. ACM Transactions on Computer Systems (TOCS) 24, 1 (2006), 39–69.
[7] HAProxy Community. 2018. HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer. Retrieved November 30, 2018 from http://www.haproxy.org/
[8] Squid Community. 2018. Squid: Optimising Web Delivery. Retrieved November 30, 2018 from http://www.squid-cache.org/
[9] Cas Cremers, Marko Horvat, Jonathan Hoyland, Sam Scott, and Thyla van der Merwe. 2017. A Comprehensive Symbolic Analysis of TLS 1.3. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS). 1773–1788.
[10] Tim Dierks. 2008. The Transport Layer Security (TLS) Protocol Version 1.2. Technical Report.
[11] Benjamin Erb. 2012. Concurrent Programming for Scalable Web Architectures. (2012).
[12] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 51–66.
[13] OpenSSL Software Foundation. 2018. OpenSSL: Cryptography and SSL/TLS Toolkit. Retrieved November 30, 2018 from https://www.openssl.org/
[14] The Apache Software Foundation. 2018. The Apache HTTP Server Project. Retrieved November 30, 2018 from https://httpd.apache.org/
[15] Owen Garrett. 2015. NGINX vs. Apache: Our View of a Decade-Old Question. Retrieved November 30, 2018 from https://www.nginx.com/blog/nginx-vs-apache-our-view/
[16] Vinodh Gopal, James Guilford, Erdinc Ozturk, Wajdi Feghali, Gil Wolrich, and Martin Dixon. 2009. Fast and constant-time implementation of modular exponentiation. (2009).
[17] Shay Gueron and Vlad Krasnov. 2015. Fast prime field elliptic-curve cryptography with 256-bit primes. Journal of Cryptographic Engineering 5, 2 (2015), 141–151.
[18] Owen Harrison and John Waldron. 2008. Practical Symmetric Key Cryptography on Modern Graphics Hardware. In Proceedings of the 17th USENIX Security Symposium (Security). 195–210.
[19] Intel. 2014. Intel® QuickAssist Technology Performance Optimization Guide. Technical Report. https://01.org/sites/default/files/page/330687_qat_perf_opt_guide_rev_1.0.pdf
[20] Intel. 2018. Intel® QuickAssist Technology (Intel® QAT). Retrieved November 30, 2018 from https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html
[21] Intel and Wangsu. 2018. Working Together to Build a High Efficiency CDN System for HTTPS. Technical Report. https://01.org/sites/default/files/downloads/intelr-quickassist-technology/i12036-casestudy-intelqatcdnaccelerationen337190-001us.pdf
[22] Takashi Isobe, Satoshi Tsutsumi, Koichiro Seto, Kenji Aoshima, and Kazutoshi Kariya. 2010. 10 Gbps Implementation of TLS/SSL Accelerator on FPGA. In Proceedings of the 18th International Workshop on Quality of Service (IWQoS). 1–6.
[23] Keon Jang, Sangjin Han, Seungyeop Han, Sue B. Moon, and KyoungSoo Park. 2011. SSLShader: Cheap SSL Acceleration with Commodity Processors. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI).
[24] Zia-Uddin-Ahamed Khan and Mohammed Benaissa. 2015. Throughput/area-efficient ECC processor using Montgomery point multiplication on FPGA. IEEE Transactions on Circuits and Systems II: Express Briefs 62, 11 (2015), 1078–1082.
[25] Moein Khazraee, Lu Zhang, Luis Vega, and Michael Bedford Taylor. 2017. Moonwalk: NRE Optimization in ASIC Clouds. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 511–526.
[26] Michael E. Kounavis, Xiaozhu Kang, Ken Grewal, Mathew Eszenyi, Shay Gueron, and David Durham. 2010. Encrypting the Internet. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM). 135–146.
[27] Oliver Kowalke. 2018. Boost.Fiber 1.68.0 Overview. Retrieved November 30, 2018 from https://www.boost.org/doc/libs/1_68_0/libs/fiber/doc/html/fiber/overview.html
[28] Hugo Krawczyk and Pasi Eronen. 2010. HMAC-based Extract-and-Expand Key Derivation Function (HKDF). Technical Report.
[29] Yang Liu, Jianguo Wang, and Steven Swanson. 2018. Griffin: Uniting CPU and GPU in Information Retrieval Systems for Intra-Query Parallelism. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 327–337.
[30] Peter Membrey, David Hows, and Eelco Plugge. 2012. SSL Load Balancing. In Practical Load Balancing. Springer, 175–192.
[31] Rui Miao, Hongyi Zeng, Changhoon Kim, Jeongkeun Lee, and Minlan Yu. 2017. SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM). 15–28.
[32] David Naylor, Alessandro Finamore, Ilias Leontiadis, Yan Grunenberger, Marco Mellia, Maurizio Munafò, Konstantina Papagiannaki, and Peter Steenkiste. 2014. The Cost of the "S" in HTTPS. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT). 133–140.
[33] Q-Success. 2018. Usage of Nginx broken down by ranking. Retrieved November 30, 2018 from https://w3techs.com/technologies/breakdown/ws-nginx/ranking
[34] Qualys. 2018. SSL Pulse. Retrieved November 30, 2018 from https://www.ssllabs.com/ssl-pulse/
[35] Amir Rawdat. 2017. Testing the Performance of NGINX and NGINX Plus Web Servers. Retrieved November 30, 2018 from https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/
[36] Will Reese. 2008. Nginx: the High-Performance Web Server and Reverse Proxy. Linux Journal 2008, 173, Article 2 (2008).

QTLS: QAT-based TLS Asynchronous Offload Framework. PPoPP '19, February 16–20, 2019, Washington, DC, USA

[37] Eric Rescorla. 2018. The transport layer security (TLS) protocol version 1.3. Technical Report.
[38] Hovav Shacham and Dan Boneh. 2001. Improving SSL Handshake Performance via Batching. In Proceedings of the Cryptographer's Track at RSA Conference (CT-RSA). 28–43.
[39] Hovav Shacham, Dan Boneh, and Eric Rescorla. 2004. Client-side caching for TLS. ACM Transactions on Information and System Security (TISSEC) 7, 4 (2004), 553–575.
[40] Mostafa I Soliman and Ghada Y Abozaid. 2011. FPGA implementation and performance evaluation of a high throughput crypto coprocessor. Journal of Parallel and Distributed Computing (JPDC) 71, 8 (2011), 1075–1084.
[41] Alibaba Open Source. 2018. tengine qat ssl. Retrieved November 30, 2018 from http://tengine.taobao.org/document/tengine_qat_ssl.html
[42] Drew Springall, Zakir Durumeric, and J Alex Halderman. 2016. Measuring the Security Harm of TLS Crypto Shortcuts. In Proceedings of the Internet Measurement Conference (IMC). 33–47.
[43] Robert Szerwinski and Tim Güneysu. 2008. Exploiting the Power of GPUs for Asymmetric Cryptography. In Proceedings of the 10th International Workshop on Cryptographic Hardware and Embedded Systems (CHES).
[44] LiteSpeed Technologies. 2018. Event-Driven vs. Process-Based Web Servers. Retrieved November 30, 2018 from https://www.litespeedtech.com/products/litespeed-web-server/features/event-driven-architecture
[45] Changzheng Wei, Jian Li, Weigang Li, Ping Yu, and Haibing Guan. 2017. STYX: A Trusted and Accelerated Hierarchical SSL Key Management and Distribution System for Cloud Based CDN Application. In Proceedings of the ACM Symposium on Cloud Computing (SoCC). 201–213.
[46] Wikipedia. 2018. AES instruction set. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/AES_instruction_set
[47] Wikipedia. 2018. Elliptic-curve cryptography. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Elliptic-curve_cryptography
[48] Wikipedia. 2018. Elliptic-curve Diffie–Hellman. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Elliptic-curve_Diffie-Hellman
[49] Wikipedia. 2018. Epoll. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Epoll
[50] Wikipedia. 2018. Fiber (computer science). Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Fiber_(computer_science)
[51] Wikipedia. 2018. File descriptor. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/File_descriptor
[52] Wikipedia. 2018. Kqueue. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Kqueue
[53] Wikipedia. 2018. OpenSSL. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/OpenSSL
[54] Wikipedia. 2018. Pseudorandom function family. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Pseudorandom_function_family
[55] Wikipedia. 2018. RSA (cryptosystem). Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/RSA_(cryptosystem)
[56] Wikipedia. 2018. TLS termination proxy. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/TLS_termination_proxy
[57] Wikipedia. 2018. Transport Layer Security. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Transport_Layer_Security
[58] Jason Yang and James Goodman. 2007. Symmetric Key Cryptography on Modern Graphics Hardware. In Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security (ASIACRYPT). 249–264.
[59] Shun Yao and Dantong Yu. 2017. PhiOpenSSL: Using the Xeon Phi Coprocessor for Efficient Cryptographic Calculations. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). 565–574.


A Artifact Appendix

A.1 Abstract
This artifact contains the source codes of the two important components of QTLS, i.e., the asynchronous offload mode and the heuristic polling scheme. At least one QAT acceleration device is required to conduct the performance evaluation.

A.2 Artifact check-list (meta-information)
• Program: OpenSSL, QAT Engine, Async Mode Nginx
• Compilation: gcc
• Binary: Linux executables
• Run-time environment: CentOS 7.3 or later
• Hardware: QAT acceleration device is required
• Execution: Shell commands
• Metrics: CPS (connections per second) for handshake and throughput (Mbps) for secure data transfer
• Experiments: Full handshake, session resumption and data transfer throughput
• Publicly available?: Yes, in both Zenodo and Github
• Workflow frameworks used?: No

A.3 Description

A.3.1 How delivered
Most of our work has been available as open source. QTLS includes three components: the asynchronous offload mode (the most important part), the heuristic polling scheme (an important optimization) and the kernel-bypass notification scheme (a further optimization). The source codes of the former two have been placed on Zenodo (DOI: 10.5281/zenodo.2008661, web link: https://zenodo.org/record/2008661).
It is recommended to refer to our Github repositories for the latest codes and guidelines:
• Intel® QAT OpenSSL Engine: https://github.com/intel/QAT_Engine
• Intel® QAT Async Mode Nginx: https://github.com/intel/asynch_mode_nginx
The fiber async infrastructure has been available in official OpenSSL releases since the 1.1.0 series.

A.3.2 Hardware dependencies
Two connected physical servers are required: one as the tested server and one as the client. At least one of the following QAT acceleration devices is required in the tested server:
• Intel® C62X Series Chipset
• Intel® Communications Chipset 8925 to 8955 Series
• Intel® Communications Chipset 8900 to 8920 Series

A.3.3 Software dependencies
• GNU C Library version 2.23 or later
• OpenSSL 1.1.0e or later
• Validated in CentOS 7.3 with kernel 3.10

A.4 Installation
A complete software environment in the tested server involves the installation of the QAT Driver, OpenSSL, the QAT Engine and Async Mode Nginx:
1. Intel® QAT Driver
   • download the newest driver and guideline from the QAT website
   • install the driver according to the guideline
2. OpenSSL (1.1.0e or later)
   • download it from the OpenSSL website
   • install it according to the README
3. QAT Engine (v0.5.37 or later)
   • download it from our Github repository
   • install it and configure the qat service according to the README
4. Async Mode Nginx
   • download it from our Github repository
   • install it according to the README

A.5 Experiment workflow
• Start Nginx in the tested server. One can modify the Nginx conf file for different configurations (refer to the README of Async Mode Nginx for more details).
• Launch benchmarks (e.g., OpenSSL s_time or ApacheBench) in the client server to evaluate the TLS performance of the running Nginx in the tested server.

A.6 Evaluation and expected result
Four configurations (SW, QAT+S, QAT+A and QAT+AH) can be evaluated with this artifact. A few notes on comparing different configurations:
• Pay attention to hyper-threading, which has an obvious influence on the performance.
• Set CPU core affinity for both the Nginx workers and the polling threads. Make sure that different configurations employ the same number of cores.
• Multiple benchmark processes may be needed in the client server to fully load the running Nginx.

After each QAT-related test, one can check how many crypto requests have been processed by the QAT accelerator:

[root@server]# cat /sys/kernel/debug/qat*/fw_counters

It is expected to see the following performance differences between the configurations:
• QAT+S: in some cases, it provides higher performance compared to the SW configuration; in other cases, it provides similar or even lower performance.
• QAT+A: for full handshake, it provides a significant performance boost compared to the SW configuration; for most other cases, it can also provide an obvious performance enhancement.


• QAT+AH: benefiting from the heuristic polling scheme, it brings additional performance enhancement (about 20% for most cases) over the QAT+A configuration.
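The fw_counters check described in A.6 produces a per-device dump of firmware counters. As an illustration only, the small helper below totals the request counters across such a dump; the sample text is a hypothetical layout, since the exact field names and formatting of fw_counters vary across QAT driver versions:

```python
# Hypothetical sample of a /sys/kernel/debug/qat*/fw_counters dump.
# The real layout depends on the QAT driver version; only the
# "Firmware Requests" lines matter to this sketch.
SAMPLE = """\
+------------------------------------------------+
| FW Statistics for Qat Device                   |
+------------------------------------------------+
| Firmware Requests [AE  0]:            1024     |
| Firmware Responses[AE  0]:            1024     |
| Firmware Requests [AE  1]:             512     |
| Firmware Responses[AE  1]:             512     |
"""

def total_requests(text: str) -> int:
    """Sum all 'Firmware Requests' counter values in a fw_counters dump."""
    total = 0
    for line in text.splitlines():
        if "Firmware Requests" in line:
            # The counter value is the last purely numeric token on the line.
            tokens = line.replace("|", " ").split()
            digits = [tok for tok in tokens if tok.isdigit()]
            if digits:
                total += int(digits[-1])
    return total

print(total_requests(SAMPLE))  # 1536
```

A nonzero, growing total after a QAT+S/QAT+A/QAT+AH run confirms that crypto requests were actually served by the accelerator rather than falling back to software.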

A.7 Experiment customization
We extended the simple engine setting scheme in Nginx to an SSL Engine Framework, which provides flexible and powerful accelerator configuration directly in the Nginx conf file. One can customize the QAT offload behaviors through this framework, such as which crypto algorithms to offload, which polling method to use and the software/hardware switches. Besides, the number of Nginx workers and the TLS cipher suite to use can also be set in the Nginx conf file.
An example customization of the QAT offload behaviors is as follows:

    worker_processes 8;
    load_module modules/ngx_ssl_engine_qat_module.so;
    ...
    ssl_engine {
        use qat_engine;
        default_algorithm RSA,EC,DH,PKEY_CRYPTO;
        qat_engine {
            qat_offload_mode async;
            qat_notify_mode poll;
            qat_poll_mode heuristic;
            qat_heuristic_poll_asym_threshold 48;
            qat_heuristic_poll_sym_threshold 24;
        }
    }
    ...
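The two threshold directives in the configuration above bound how many in-flight offload requests may accumulate before the engine polls the accelerator for responses. The following is a minimal sketch of that threshold-triggered decision, assuming a simple per-type counter model; it is an illustrative model, not the actual QAT Engine implementation:

```python
# Sketch of heuristic, threshold-triggered polling, mirroring the
# qat_heuristic_poll_{asym,sym}_threshold directives. Illustrative
# model only; the real QAT Engine logic is more involved.

ASYM_THRESHOLD = 48  # qat_heuristic_poll_asym_threshold
SYM_THRESHOLD = 24   # qat_heuristic_poll_sym_threshold

class HeuristicPoller:
    def __init__(self):
        self.pending_asym = 0  # in-flight asymmetric-crypto requests
        self.pending_sym = 0   # in-flight symmetric-crypto requests

    def submit(self, kind: str) -> bool:
        """Record an offloaded request; return True if a poll should fire."""
        if kind == "asym":
            self.pending_asym += 1
        else:
            self.pending_sym += 1
        return self.should_poll()

    def should_poll(self) -> bool:
        # Poll only once enough requests of either type have accumulated,
        # amortizing the polling cost over a batch of responses.
        return (self.pending_asym >= ASYM_THRESHOLD
                or self.pending_sym >= SYM_THRESHOLD)

    def poll(self) -> int:
        """Retrieve all completed responses and reset the counters."""
        done = self.pending_asym + self.pending_sym
        self.pending_asym = self.pending_sym = 0
        return done

poller = HeuristicPoller()
fired = [poller.submit("sym") for _ in range(24)]
print(fired[-1])      # True: the 24th symmetric request reaches the threshold
print(poller.poll())  # 24
```

Raising the thresholds batches more responses per poll (fewer polls, higher latency per request); lowering them does the opposite, which is why the asymmetric and symmetric thresholds are tuned separately.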
