QTLS: High-Performance TLS Asynchronous Offload Framework with Intel® QuickAssist Technology

Xiaokang Hu∗ (Shanghai Jiao Tong University), Changzheng Wei (Intel Asia-Pacific R&D Ltd.), Jian Li† (Shanghai Jiao Tong University), Brian Will (Intel Corporation, Chandler, Arizona, USA), Ping Yu (Intel Asia-Pacific R&D Ltd.), Lu Gong (Intel Asia-Pacific R&D Ltd.), Haibing Guan (Shanghai Jiao Tong University)

Abstract

Hardware accelerators are a promising solution to optimize the Total Cost of Ownership (TCO) of cloud datacenters. This paper targets the costly Transport Layer Security (TLS) processing and investigates TLS acceleration for the widely-deployed event-driven TLS servers or terminators. Our study reveals an important fact: the straight offloading of TLS-involved crypto operations suffers from frequent long-lasting blockings in the offload I/O, leading to the underutilization of both CPU and accelerator resources.

To achieve efficient TLS acceleration for the event-driven web architecture, we propose QTLS, a high-performance TLS asynchronous offload framework based on Intel® QuickAssist Technology (QAT). QTLS re-engineers the TLS software stack and divides the TLS offloading into four phases to eliminate blockings. Multiple crypto operations from different TLS connections can then be offloaded concurrently in one process/thread, bringing a performance boost. Moreover, QTLS is built with a heuristic polling scheme to retrieve accelerator responses efficiently and in a timely manner, and a kernel-bypass notification scheme to avoid expensive switches between user mode and kernel mode while delivering async events. The comprehensive evaluation shows that QTLS can provide up to 9x connections per second (CPS) with TLS-RSA (2048-bit), 2x secure data transfer throughput and an 85% reduction of average response time compared to the software baseline.

CCS Concepts • Computer systems organization → Heterogeneous (hybrid) systems; • Networks → Security protocols; • Hardware → Hardware accelerators.

Keywords SSL/TLS, crypto operations, event-driven web architecture, crypto accelerator, asynchronous offload

1 Introduction

Nowadays, cloud datacenters demand continuous performance enhancement to fulfill the requirements of cloud services that are becoming globalized and scaling rapidly [5, 12, 25]. Hardware accelerators, including GPU, FPGA and ASIC, are a promising solution to optimize the Total Cost of Ownership (TCO) as both energy consumption and computation cost can be reduced by offloading similar and recurring jobs [12, 23, 29, 31].

This paper targets an important type of datacenter workload: the SSL/TLS processing in various TLS servers or terminators [30, 56]. As the backbone protocols for Internet security, SSL/TLS have been employed (typically in the form of HTTPS) by 64.3% of the 137,502 most popular websites on the Internet as reported in November 2018 [34]. However, due to the continual involvement of crypto operations, particularly the costly asymmetric encryption, TLS processing incurs significant resource consumption compared to an insecure implementation on the same platform [6, 23, 32, 59].

There have been some studies that made efforts to offload TLS-involved crypto operations to general-purpose accelerators, such as GPU, FPGA and Intel® Xeon Phi™ [18, 22–24, 40, 43, 58, 59]. These studies mainly concentrated on the programming to enable efficient calculation of the needed crypto algorithms (e.g., RSA [55]), without paying much attention to the offload I/O. However, our study reveals that for the widely-deployed event-driven¹ TLS servers or terminators, such as Nginx [36], HAProxy [7] and Squid [8], the straight offloading of crypto operations is not sufficiently competent for TLS acceleration. The main challenge is the frequent long-lasting blockings in the offload I/O, leading to (1) a large amount of CPU cycles spent waiting and (2) a low utilization of the parallel computation units inside the accelerator. As both CPU and accelerator resources are underutilized, the anticipated performance enhancement cannot be reached.

In this paper, we propose QTLS, a high-performance TLS asynchronous offload framework based on Intel® QuickAssist Technology (QAT) [20], to achieve efficient TLS acceleration for the event-driven web architecture. As a modern ASIC-based crypto acceleration solution, QAT is competitive in energy-efficiency and cost-performance compared to the general-purpose accelerators [5, 25]. Moreover, it provides an additional security layer for sensitive data (e.g., private keys), which can be securely protected inside the ASIC hardware and never be exposed to the memory [45].

To eliminate blockings in the offload I/O, QTLS re-engineers the TLS software stack to enable asynchronous support for crypto operations in all the layers. The TLS offloading is divided into four phases: pre-processing, QAT response retrieval, async event notification and post-processing. In the pre-processing phase, the offload jobs are paused after crypto submission to return control to the application process. When QAT responses carrying the crypto results are retrieved, the application process is notified by async events to resume the paused offload jobs and begin the post-processing phase. In this novel framework, CPU resources are fully utilized to handle concurrent connections. Multiple crypto operations from different TLS connections can be offloaded concurrently in one process/thread, which greatly increases the utilization of the parallel computation engines inside the QAT accelerator. To further enhance performance, QTLS is built with (1) a heuristic polling scheme that leverages application-level knowledge to achieve efficient and timely QAT response retrieval, and (2) a kernel-bypass notification scheme that introduces an application-defined async queue to avoid expensive switches between user mode and kernel mode while delivering async events.

QTLS leverages the QAT accelerator to offload TLS-involved crypto operations. Its main idea is also applicable to other types of ASIC crypto accelerators. We have implemented QTLS based on Nginx [36] and OpenSSL [13]. Nginx is an event-driven high-performance HTTP(S) and reverse proxy server, used by 48.5% of the top one million websites [33]. OpenSSL is an SSL/TLS and crypto library, widely used on a great variety of systems [53].

The performance advantages of QTLS have been validated through extensive experiments that covered both the TLS 1.2 and 1.3 protocols, both full and abbreviated (i.e., session resumption) handshakes, and multiple mainstream cipher suites. It is demonstrated that QTLS greatly improves the TLS handshake performance, achieving up to 9x connections per second (CPS) with TLS-RSA (2048-bit) over the software baseline. In addition, the secure data transfer throughput is enhanced by more than 2x and the average response time is reduced by nearly 85%.

In summary, this work makes the following contributions:

1. An important fact is revealed: for the event-driven TLS servers or terminators, the straight offloading of crypto operations suffers from frequent long-lasting blockings in the offload I/O.
2. We propose and design the TLS asynchronous offload framework for the event-driven web architecture. Due to its enormous performance advantages, it has been adopted by a number of service providers, like Alibaba for e-commerce [41] and Wangsu for CDN [21].
3. A heuristic polling scheme and a kernel-bypass notification scheme are designed to further enhance the performance of QTLS.
4. We show that QTLS can be practically implemented with Nginx and OpenSSL, and evaluate its performance with extensive experiments.

∗Work done as a software engineer intern in Intel DCG/NPG.
†Corresponding author.
¹Handling multiple concurrent connections in one process/thread, instead of thread-per-connection.

PPoPP '19, February 16–20, 2019, Washington, DC, USA. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6225-2/19/02. https://doi.org/10.1145/3293883.3295705

Figure 1. Classic RSA-wrapped TLS processing (handshake: Client Hello, Server Hello, encrypted Premaster and Finished messages, with a server-side asymmetric-key calculation and multiple PRF operations for crypto key generation; then secure data transfer with symmetric cipher + MAC).

2 Background and Motivation

This section first gives an overview of the TLS protocols and the event-driven web architecture. Then, we present background knowledge about Intel® QAT. Finally, we highlight the challenges with TLS offloading for the event-driven web architecture.

2.1 TLS Overview

Transport Layer Security (TLS) and its predecessor, Secure Sockets Layer (SSL), have been de-facto standards for Internet security for more than 20 years [57]. The TLS 1.0-1.2 protocols are currently dominant and the TLS 1.3 protocol has recently been ratified and published (as RFC 8446) by the IETF [37].

A TLS connection can be divided into two main phases [10]: (1) the handshake phase, which determines crypto algorithms, performs authentications and negotiates shared keys for data encryption and message authentication code (MAC); (2) the secure data transfer phase, which sends encrypted data over the established TLS connection. The classic RSA-wrapped TLS processing (i.e., the TLS-RSA cipher suite) in the TLS 1.2 protocol is illustrated in Figure 1.


Handshake: The RSA-wrapped handshake requires two network round trips to exchange the Hello messages, the encrypted Premaster and the Finished messages, as shown in Figure 1. At the server side, one asymmetric-key calculation (RSA sign) is involved for authentication, which is the most computing-intensive part of the TLS processing [6, 23]. Besides, multiple pseudo random function (PRF) [54] operations are needed for key derivation. For other cipher suites in the TLS 1.2 protocol with forward secrecy [3], such as the ECDHE-RSA cipher suite [48], the handshake is more complicated and involves two more asymmetric elliptic-curve cryptography (ECC) [47] calculations at the server side. In the TLS 1.3 protocol, although one network round trip can be saved in the handshake phase, the involved crypto operations cannot be omitted [9]. Also, the enhanced security requires more key derivation operations with the new HMAC-based key derivation function (HKDF) [28]. Table 1 summarizes the number of server-side crypto operations needed to perform a full handshake.

Table 1. Server-side crypto operations for full handshake

  TLS Cipher Suite     RSA   ECC   PRF/HKDF
  1.2 TLS-RSA           1     0       4
  1.2 ECDHE-RSA         1     2       4
  1.2 ECDHE-ECDSA       0     3       4
  1.3 ECDHE-RSA         1     2      >4

Secure data transfer: This phase is based on the negotiated crypto keys, including at least two cipher keys and two MAC keys. Each key pair is used for data encryption and MAC calculation in one direction. Prior to encryption and transmission, the data object is fragmented into units of 16 KB if it is larger than this.

Resource consumption: TLS imposes a much higher resource consumption than an insecure implementation due to the costly crypto operations [6, 23, 32, 59]. A recent study [35] reported that a modern CPU core, which is able to serve over 30K HTTP connections per second, cannot handle more than 0.5K TLS handshakes per second with the mainstream ECDHE-RSA (2048-bit) cipher suite. The overhead mainly originates from the heavy asymmetric-key calculations [6, 23]. As the requested web object increases in size, the computation overhead incurred by data encryption and MAC calculation becomes considerable as well. Performance testing [35] showed that when the request size increases from 1 KB to 100 KB, the throughput drop compared to the unencrypted case grows strongly from 45.7% to 85.4%.

Session resumption: TLS provides a shortcut named session resumption (with either session ID or session ticket) that avoids performing the full handshake each time [10]. Specifically, the asymmetric-key calculations can be skipped in the handshake phase on later connections from the same client. Although session resumption is effective in reducing the TLS resource consumption at the server side, the protection afforded by forward secrecy is compromised [42]. In real-life deployment, service providers restrict the lifetime of session IDs or tickets (generally less than an hour [42]) to achieve a trade-off between performance and security.

2.2 Event-driven Web Architecture

There are two competitive architectures for web applications: one is based on threads, while the other is based on events [11]. Thread-based web applications (e.g., Apache [14]) initiate a new thread for each incoming connection. This suffers from costly thread switches and high system resource consumption under high traffic. In contrast, event-driven web applications (e.g., Nginx [36]) have the ability to handle thousands of concurrent connections in one process/thread, bringing significant advances in both performance and scalability [15, 44]. To align with the multi-core architecture, multiple worker processes can be launched simultaneously to handle incoming connections in a balanced manner.

To consolidate multiple connections into a single flow of execution, the event-driven web architecture works with network sockets in an asynchronous (non-blocking) mode and monitors them with an event-based I/O multiplexing mechanism (e.g., epoll [49] or kqueue [52]). In the so-called event loop, the application process waits for events on the monitored sockets. Once a queue of new events is returned, the corresponding event handlers are invoked one by one.

2.3 Intel® QuickAssist Technology

Intel® QuickAssist Technology (QAT) is an ASIC-based solution to enhance security and compression performance for cloud, networking, big data and storage applications [20]. Both a kernel driver and a userspace driver (implemented as a library) are available to meet different requirements.

Figure 2. A QAT endpoint and its usage model (each running process or thread is assigned a crypto instance; requests on the hardware request/response ring pairs are load-balanced across the parallel computation engines).

Usage model: As illustrated in Figure 2, a QAT endpoint possesses multiple parallel computation engines and a number of hardware-assisted request/response ring pairs. Software writes requests onto a request ring and reads responses back from a response ring. The QAT hardware load-balances requests from all rings across all available computation engines. The availability of responses can be indicated using either interrupts or polling.


Several request/response ring pairs (for different crypto types) are grouped into a QAT crypto instance, which is actually a logical unit that can be assigned to a process/thread running on a CPU core. A modern QAT endpoint can support up to 48 crypto instances to scale with the multi-core architecture. If there are multiple QAT endpoints in the server, one process can be assigned multiple QAT instances from different endpoints to employ more computation engines.

Parallelism: On the one hand, crypto requests submitted from different processes to different QAT instances can be calculated in parallel on multiple computation engines. On the other hand, concurrent crypto requests submitted from one process to one QAT instance can also be calculated in parallel as long as there are available computation engines. Here, "concurrent" means submitting the next request without waiting for the completion of previous ones. If there are sufficient concurrent requests, it is possible to fully load all parallel computation engines with only one or two QAT instances.

OpenSSL QAT Engine: QAT integrates its crypto service into OpenSSL (a widely-used crypto library) [13] through a standard engine named QAT Engine, which allows seamless use by applications. When the QAT Engine is loaded, applications can transparently offload crypto operations to the QAT accelerator via the normal OpenSSL API. In general, the QAT Engine supports the offloading of asymmetric cryptography (e.g., RSA, ECDH and ECDSA), symmetric chained ciphers (e.g., AES128-CBC-HMAC-SHA1) and PRF.

2.4 Challenges with TLS Offloading

A straight offload mode for TLS processing is to replace the crypto-related function call with an I/O call that interacts with the hardware accelerator. This mode reuses the existing TLS software stack with only a few modifications. However, for the event-driven web architecture, this kind of straight offloading is not sufficiently competent and cannot reach the anticipated performance enhancement.

In the software-based TLS processing, the crypto-related function call is a synchronous blocking call that uses the CPU resource for crypto calculation. Both the TLS library layer and the application layer cannot move on if a crypto operation has not been completed yet. This kind of design works well with the CPU-only architecture, but in the straight offload mode, it causes the frequent long-lasting blockings in the offload I/O, as illustrated in Figure 3. Although the QAT driver provides an inherently non-blocking interface (i.e., request/response), the QAT Engine cannot return control to the upper layers after it submits a crypto request from the connection C1. That is to say, it has to wait until the QAT accelerator completes this request and generates a response. Only after the response carrying the crypto result is consumed can the application process continue execution to generate the next crypto request (giving rise to the next blocking).

Figure 3. TLS straight offload mode with QAT (the application process, TLS library and QAT Engine all block on each request/response round trip to the QAT instance, so connections C1, C2, ... are served one crypto operation at a time).

For the event-driven web architecture, the blockings in the offload I/O lead to the underutilization of both CPU and QAT resources. First is the large amount of CPU cycles spent waiting. After a crypto request is submitted to the accelerator, the application process (typically running on a dedicated core) is in the waiting state (either busy-looping or sleeping to wait for the QAT response), stopping the cycle of handling events for a long time. Second is the low utilization of the parallel computation engines inside the QAT accelerator. Since the second crypto request can only be submitted after the completion of the first one, for each application process, no more than one computation engine can be employed at the same time.

3 QTLS Design

To achieve efficient TLS acceleration for the event-driven web architecture, we propose QTLS, a high-performance TLS asynchronous offload framework that leverages the QAT accelerator to offload crypto operations. Its main idea is also applicable to other types of ASIC crypto accelerators.

3.1 Overview

An overview of the QTLS framework is shown in Figure 4. To eliminate blockings in the offload I/O, QTLS re-engineers the TLS software stack to enable asynchronous support for crypto operations in all the layers, including the QAT Engine layer, the TLS library layer and the application layer. It divides the TLS offloading into four phases:

• Pre-processing: When a TLS handler (e.g., the handshake handler) encounters a crypto operation, the offload job is paused immediately after the crypto request is submitted to the QAT accelerator. The application process continues execution to handle other connections, instead of being blocked and waiting for the QAT response. Multiple crypto operations from different connections (C1, C2, ...) can be offloaded concurrently in one process.
• QAT Response Retrieval: As more and more concurrent crypto requests are submitted, the QAT responses for them need to be retrieved in time. A heuristic polling scheme is designed here to achieve efficient and timely retrieval.
• Async Event Notification: Each time a new QAT response is retrieved, the pre-registered response callback generates an async event to inform the application about the completion of an async crypto request. A kernel-bypass notification scheme is designed here to avoid expensive switches between user mode and kernel mode while delivering async events.
• Post-processing: According to the captured async event, the corresponding TLS connection along with the same TLS handler will be rescheduled by the application process to resume the paused offload job. Since the crypto calculation result is now ready, this connection can move on to the next step.

Figure 4. An overview of the QTLS framework (the TLS state machine pauses offload jobs on crypto operations; the QAT Engine submits non-blocking requests via the QAT userspace driver; the heuristic polling scheme, guided by application-level knowledge, polls for new responses; response callbacks deliver kernel-bypass notifications that reschedule the paused handlers, e.g., Handshake or TLS Write, via the TLS-ASYNC state).

In this novel framework, CPU resources are fully utilized to handle concurrent connections. Furthermore, the concurrently offloaded crypto requests greatly increase the utilization of the parallel computation engines inside the QAT accelerator.

3.2 Pre-processing and Post-processing

As discussed in §2.4, the root cause of the blockings in the offload I/O is that both the TLS library layer and the application layer cannot move on if a crypto operation has not been completed yet. We circumvent this restriction by introducing asynchronous crypto behaviors into the TLS software stack and splitting the offload I/O into a pre-processing phase (where async offload jobs are paused) and a post-processing phase (where async offload jobs are resumed).

In the TLS library layer, QTLS introduces crypto pause and crypto resumption. The former is leveraged by the QAT Engine to pause the current offload job after a crypto request is submitted and return control to the upper application with a specific error code. The latter is leveraged by the application process to resume a paused offload job to consume the crypto result. The pause and resumption of an offload job require careful management of the TLS context and the crypto state. We have developed two types of implementations based on OpenSSL, which will be elaborated in §4.1.

The application layer is the initiator and terminator of async crypto operations. QTLS introduces a new state named TLS-ASYNC into the application-maintained TLS state machine, which is connected to all other TLS states (e.g., Handshake and TLS-Write) that may involve crypto operations. Suppose that the currently-handled connection C1 is in the Handshake state (i.e., executing the handshake handler), as shown in Figure 4. Once the TLS state machine identifies a paused offload job from the specific error code returned by the TLS library, it changes the TLS state from Handshake to TLS-ASYNC. Each paused offload job corresponds to an async handler. Here, it is set to the handshake handler, indicating that the same handler needs to be rescheduled later to conduct the post-processing phase. In addition, the application layer is responsible for monitoring all the paused offload jobs and waiting for async events that signal the completion of crypto requests. The detailed design of monitoring and notification will be presented together in §3.4.

The QAT Engine layer is a bridge between the TLS library layer and the QAT driver layer. On the one hand, it registers a response callback (used to generate an async event) when it uses the non-blocking API provided by the QAT driver to submit a crypto request. On the other hand, after the submission of a crypto request, it uses the crypto pause provided by the TLS library to return control to the upper application.

A special case is the failure of crypto submission, possibly because the QAT request ring is already full. QTLS resolves this case by reusing the proposed framework. The offload job is paused after an unsuccessful crypto submission. The application is notified about this failure through either an async event or a specific return code. The same TLS handler will be rescheduled later to resume the paused offload job and retry the crypto submission.

3.3 QAT Response Retrieval

QAT responses can be retrieved through either interrupts or polling. QTLS leverages userspace I/O for crypto offloading, where one userspace-based polling operation has much less

162 PPoPP ’19, February 16–20, 2019, Washington, DC, USA Xiaokang Hu, Changzheng Wei, Jian Li, Brian Will et al. overhead than one kernel-based interrupt [19]. Therefore, 3.4 Async Event Notification QTLS selects polling to achieve high throughput, especially For the event-driven web architecture, a natural way for at high-load state. A convenient polling method is to initiate the async event notification is to reuse the existing event- an independent thread for each application process to poll driven approach. Specifically, each time an async offload job the assigned QAT instances at regular intervals, as what is paused, a file descriptor (FD)51 [ ] is allocated to it and the QAT Engine does by default, but it comes with several monitored by the application’s I/O multiplexing mechanism, drawbacks inevitably: together with the network sockets. When a QAT response is retrieved, the response callback function writes an event on • It introduces frequent context switches between applica- the corresponding FD to inform the application. tion processes and polling threads. However, as file descriptors are maintained in the kernel, • It is hard to determine an appropriate polling interval: a this kind of FD-based notification inevitably introduces ex- small interval may lead to a large amount of ineffective pensive switches between user mode and kernel mode. In polling operations while a big interval may incur a long order to avoid this performance penalty, we design a kernel- latency or even impede the throughput. bypass notification scheme. An application-defined async queue is introduced to work as the monitoring and notifi- To achieve better performance and adapt well to varied cation channel. When the TLS state machine identifies a traffic, we design a heuristic polling scheme for QTLS. 
Since paused offload job, it already has the knowledge about its the application is the producer of crypto requests, it has async hander, which is actually the TLS handler needed to be the best knowledge about the proper time to retrieve QAT rescheduled when the application receives the async event. responses. The heuristic polling scheme is integrated into This knowledge can be shared to the lower software stack. the application to (1) avoid an independent polling thread Then, when a QAT response is retrieved, the response call- and (2) leverage the application-level knowledge to guide back function can complete notification just by inserting the the polling behaviors. Sometimes, a polling operation should corresponding async handler into the tail of the async queue. be postponed to coalesce sufficient responses; however, an- This async queue will be processed at the end of the main other time, a polling operation may need to be executed event loop. As long as there exist inflight crypto requests, the immediately to reduce latency. The heuristic polling scheme main event loop keeps execution, instead of sleep-waiting integrates both efficiency and timeliness, with the goal to for events on the I/O multiplexing mechanism. make the response retrieve rate always match the request submission rate. Efficiency: When there are a large number of concurrent 4 Implementation Details TLS connections (i.e., high offload traffic), it is reasonable QTLS can be implemented on various event-driven TLS to coalesce sufficient QAT responses into one polling oper- servers or terminators (e.g. Nginx, HAProxy and Squid). This ation. The number of inflight (submitted but not retrieved section presents the implementation based on Nginx [36] with a response) crypto requests is an appropriate indica- and OpenSSL [13]. 
Nginx uses one master process for man- tor and a polling operation is triggered when this number agement and multiple worker processes for doing the real reaches a pre-defined threshold. A notable fact is that the work. asymmetric-key calculation takes a much longer time than the calculation of other types of crypto requests. Therefore, a bigger threshold is used if there exist inflight asymmetric 4.1 OpenSSL Async Crypto Support crypto requests. We have developed the crypto pause and resumption in Timeliness: During off-hours with few TLS connections, OpenSSL for several years, and there are two kinds of imple- a timely polling is important and necessary. An TLS con- mentations: stack async and fiber async. nection can be identified as (1) active if it is processing the Stack Async: Initially, we implemented the crypto pause handshake, reading an HTTPS request or writing the HTTPS and resumption by altering the normal sequence of program response back; or (2) idle if it is waiting for a request from execution according to the state flag, as illustrated in Figure the end client (including the keepalive case). In the current 5. When the crypto API (e.g., RSA sign) is invoked by the TLS design, each active TLS connection can only have one async API (e.g., handshake) for the first time, it executes normally crypto request at the same time. Therefore, once the num- to submit the crypto request. After a successful submission, ber of inflight async crypto requests equals the number of it sets the state flag to inflight and returns to the caller. The active TLS connections, a polling operation needs to be exe- state flag is set to ready when the QAT response is retrieved. cuted immediately. Otherwise, the application process may Then, the same TLS API is called again to conduct crypto get stuck as all active connections are waiting for QAT re- resumption, which requires careful skipping of some TLS op- sponses. 
This timeliness constraint greatly reduces the query erations (e.g, ssl3_get_message) that have been completed in latency when there are only few end clients. the first call. As the state flag is ready now, the same crypto

163 QTLS: QAT-based TLS Asynchronous Offload Framework PPoPP ’19, February 16–20, 2019, Washington, DC,USA

Figure 5. Workflow of stack async

Figure 6. Workflow of fiber async

API jumps over the crypto submission part to directly consume the crypto result. For a failure of crypto submission, the state flag is set to retry and the application is responsible for calling the same TLS API later to retry the submission. The stack async implementation has good performance, but it is intrusive to OpenSSL as its API is not compatible with the existing synchronous API.

Fiber Async: The intrusive stack async implementation is unacceptable to the OpenSSL community. After discussion with the OpenSSL maintainers, we agreed on a new implementation that unifies the asynchronous and synchronous behaviors with the fiber mechanism [50]. Fibers are cooperative lightweight threads of execution in user space, which, in effect, provide a natural way to organize concurrent code based on asynchronous I/O [27]. By using explicit yielding and cooperative scheduling, the fiber async implementation manages multiple execution paths for async offload jobs in one process/thread, as illustrated in Figure 6.

When the TLS API (e.g., handshake) is called for the first time, it uses ASYNC_start_job to start a new fiber-based ASYNC_JOB that encapsulates (fiber context swap here) the running piece of the TLS connection. The fiber-based ASYNC_JOB can be paused at any point to return control to the main code and resumed later to jump directly to the pause point. If the invoked crypto API (e.g., RSA sign) identifies an async offload job through ASYNC_get_current_job, it first submits a crypto request in non-blocking mode and then leverages ASYNC_pause_job to return control (fiber context swap here). When the same TLS API is called again, it uses ASYNC_start_job with the paused ASYNC_JOB as an argument to jump directly (fiber context swap here) to the pause point and consume the crypto result.

The fiber async implementation has a slight performance penalty due to the fiber management and switches, but it is compatible with the existing API. It has been included in OpenSSL official releases since the 1.1.0 series.

4.2 Nginx Modifications for Async Crypto Support

The crypto pause may occur in the following functions: ngx_ssl_handshake, ngx_ssl_handle_recv, ngx_ssl_write and ngx_ssl_shutdown. These functions are modified to recognize a new error code from OpenSSL named SSL_ERROR_WANT_ASYNC, which indicates that an async crypto request has been submitted. If this new error code is received, the async handler of the paused offload job is set to the current handler (e.g., ngx_ssl_handshake_handler).

Ideally, after an async offload job is paused and execution is returned to Nginx, the only type of event expected for this TLS connection is an async event, which will resume the paused job. However, the possibility exists that a read event (from the network socket) occurs before the expected async event. This read event could be the next message during the TLS handshake or the first HTTP request after connection establishment. This kind of event disorder may cause the HTTP or TLS state machine to go back to a previous state. To address this issue, QTLS clears and saves the handler of the read event when an async event is being expected, and restores it when the async event is being processed.

4.3 Heuristic Polling Scheme

The information about the inflight crypto requests is collected in the QAT Engine layer for accuracy. The different types of TLS crypto operations are counted independently, denoted by Rasym, Rcipher and Rprf respectively. Each time a crypto-related function in the QAT Engine (e.g., qat_rsa_priv_dec) is invoked, the corresponding number (e.g., Rasym) is increased by one. Once a QAT response is retrieved, the corresponding number is decreased by one in the response callback function. The total number of inflight crypto requests (denoted by Rtotal) is calculated by adding these three variables together and shared with the upper application via a new engine command.

Nginx has a stub_status module that collects useful information, including the number of alive connections and the number of idle connections that are waiting for a request. Based on this, QTLS adds a check for TLS-enabled connections and calculates the number of active TLS connections (denoted by TCactive) with the equation: TCactive = TCalive − TCidle.


The timeliness and efficiency constraints can be satisfied either when Rtotal increases or when TCactive decreases. Therefore, in Nginx, wherever a crypto operation may be involved or TCactive may be updated, it is required to check these constraints and determine whether to execute a polling operation. The default thresholds for the efficiency constraint are set to 48 (for Rasym > 0) and 24 (for Rasym = 0). We have exposed the threshold setting in the Nginx configuration file.

In our large number of tests, the heuristic polling scheme works well to retrieve all QAT responses in time. For failover, a timer (e.g., five milliseconds) can be set inside Nginx to periodically check whether at least one heuristic polling operation was triggered during the last interval. If not, but there exist in-flight offload requests, an extra polling operation can be executed at once.

4.4 Async Event Notification

FD-based notification: QAT Engine uses the set-FD API to associate an FD with an async job before crypto pause. The application uses the get-FD API at the proper time to obtain information about added FDs (which need to be added into monitoring) and deleted FDs (which need to be deleted from monitoring). Normally, a new FD should be created for each paused offload job. Considering the fact that each TLS connection can only have one async offload job at a time, an optimization for lower overhead is to share one FD across all async jobs from the same TLS connection.

Kernel-bypass notification: An application-level callback function is introduced to manipulate the async queue. This function takes the async handler information as the argument and inserts it into the tail of the async queue. In OpenSSL, two new members, named callback and callback_arg, are added to the data structure of the fiber-based ASYNC_JOB. Meanwhile, a series of new APIs are introduced for users to set or get them. Each time the TLS state machine identifies a paused offload job, it invokes the SSL_set_async_callback API to set the application-level callback and the async handler. When a QAT response is retrieved, the response callback function leverages the ASYNC_WAIT_CTX_get_callback API to check whether the callback member has been set for the current offload job. If so, it can directly invoke the application-level callback function with callback_arg (i.e., the async-handler information) as the argument to complete the notification.

5 Evaluation

5.1 Experimental Setup

We established an experimental testbed with one tested server and two client servers that were connected back-to-back via Intel® XL710 40GbE NICs. All servers were equipped with two 22-core Intel® Xeon® E5-2699 v4 CPUs (hyper-threading enabled) and 64GB RAM. We ran CentOS 7.3 with Linux Kernel 3.10 on them. QTLS was installed on the tested server with one Intel® DH8970 PCIe QAT card, which contains three independent QAT endpoints.

The tested QTLS was based on Nginx 1.10.3 and OpenSSL 1.1.0g. Nginx supports a number of working modes, such as a web server, a reverse proxy server or a load balancer. Since all these modes share the same TLS processing module, we only configured Nginx as an HTTPS web server to conduct the testing. In addition, the fiber async implementation was used in the evaluation as it has been included in OpenSSL official releases.

In the tested server, multiple Nginx workers were configured to occupy the same number of dedicated hyper-threading (HT) cores. Each worker process was equipped with one QAT instance. The allocated QAT instances were distributed evenly across the three QAT endpoints. The timer-based polling thread (needed in some configurations) for each Nginx worker was pinned to the same core as the worker.

The evaluation was conducted in terms of TLS handshake performance, secure data transfer throughput and average response time. We covered both the TLS 1.2 and 1.3 protocols, both full and abbreviated (i.e., session resumption) handshakes, and multiple mainstream cipher suites. To evaluate all aspects of QTLS, the following five configurations are compared:

• SW: software calculation with modern AES-NI instructions for TLS-involved crypto operations.
• QAT+S: straight offload mode with the timer-based polling thread (10 µs polling interval by default) for QAT response retrieval.
• QAT+A: TLS asynchronous offload framework with the timer-based polling thread and the FD-based notification scheme.
• QAT+AH: replacing the polling thread with the heuristic polling scheme, based on the QAT+A configuration.
• QTLS: the full QTLS with both the heuristic polling scheme and the kernel-bypass notification scheme.

5.2 Full Handshake Performance

The TLS handshake performance was measured with the OpenSSL s_time tool. In each of the two client servers, 1000 s_time processes were simultaneously launched to establish new TLS connections (i.e., full handshakes) with the Nginx running on the tested server. We first evaluated the full handshake performance for three mainstream cipher suites in the TLS 1.2 protocol: TLS-RSA, ECDHE-RSA and ECDHE-ECDSA.

The full handshake performance of TLS-RSA (2048-bit) is shown in Figure 7a. Here "2HT" indicates that two Nginx workers are running on two dedicated hyper-threading (HT) cores (belonging to the same physical core). It can be seen that the connections per second (CPS) value increases linearly for all five configurations when the number of Nginx


Figure 7. Full handshake performance for three mainstream cipher suites in the TLS 1.2 protocol: (a) TLS-RSA (2048-bit); (b) ECDHE-RSA (2048-bit); (c) ECDHE-ECDSA (six NIST curves)

Figure 8. Full handshake performance in TLS 1.3 with ECDHE-RSA (2048-bit)

Figure 9. Session resumption performance in TLS 1.2 with ECDHE-RSA (2048-bit): (a) 100% abbreviated handshakes; (b) Full / Abbreviated = 1 : 9

workers varies from 2 to 24. Taking the 8HT case as an example, the SW configuration gives 4.3K CPS. The straight offloading of RSA and PRF operations (i.e., the QAT+S configuration) improves the CPS by a factor of two. In the QAT+A configuration, the TLS asynchronous offload framework fully utilizes both CPU and QAT resources, bringing a CPS boost to 29.5K, which is a 7x improvement over the SW configuration. In the QAT+AH configuration, an additional 20% CPS enhancement (i.e., 35.8K) was obtained by the heuristic polling for QAT responses. Finally, the kernel-bypass notification scheme further improves the CPS by 8%, leading to a CPS value of 38.8K in the QTLS configuration. In general, the full QTLS provides a 9x CPS improvement over the software baseline. For the 32HT case, both the QAT+AH and the QTLS configurations give about 100K CPS, reaching the upper limit of the DH8970 QAT card.

Figure 7b shows the handshake performance of ECDHE-RSA (2048-bit), which involves two more ECC calculations than the TLS-RSA cipher suite. The OpenSSL default curve, NIST P-256, was used for the ECC computation. The QAT+S configuration shows no CPS improvement over the SW configuration due to the blockings in the offload I/O. The QAT+A configuration with the TLS asynchronous offload framework improves the CPS by a factor of more than four. The heuristic polling scheme and the kernel-bypass notification scheme give an additional 20% and 6% CPS enhancement respectively. The full QTLS provides a 5.5x CPS improvement over the software baseline, with 16 Nginx workers needed to reach the QAT upper limit (i.e., 40K CPS).

For another mainstream cipher suite, ECDHE-ECDSA, which involves three ECC calculations, we evaluated six NIST curves with four Nginx workers, as shown in Figure 7c. A striking phenomenon is that for the P-256 curve, the SW configuration provides an abnormally high CPS and outperforms the QAT+S configuration significantly. This is because the prime of the P-256 curve is "Montgomery Friendly" and the curve can be implemented in the Montgomery domain [17]. This feature makes the ECDSA (P-256) sign 2.33x faster than the traditional implementation. Nevertheless, QTLS enhances the CPS by more than 70% compared to this special software implementation. It is worth noting that not all ECC prime curves can be optimized by using the Montgomery domain. For the stronger P-384 curve, QTLS provides a 14x CPS improvement over the software baseline. Moreover, according to SafeCurves [1], the widely-used P-256 curve has potential security risks. Therefore, we also evaluated four counterpart curves: the NIST binary curves (B-283 and B-409) and the NIST Koblitz curves (K-283 and K-409). For all these four curves, QTLS enhances the CPS by more than 12x compared to the SW configuration.

TLS 1.3: We also evaluated the full handshake performance for the TLS 1.3 protocol (provided by OpenSSL 1.1.1-pre8) with the cipher suite ECDHE-RSA (2048-bit), as shown in Figure 8. QTLS provides a 3.5x CPS improvement over the software baseline, lower than in the TLS 1.2 case. The main reason is that the TLS 1.3 protocol introduces a new key derivation function named HKDF, which cannot currently be offloaded through the QAT Engine.

5.3 Session Resumption Performance

Session resumption is widely used in real-life deployments to reduce the resource consumption at the server side. In the


Figure 10. Secure data transfer with varied files

Figure 11. Latency under different concurrencies

Figure 12. Comparison between the timer-based polling thread and the heuristic polling scheme: (a) Full handshake with TLS-RSA (2048-bit); (b) Secure data transfer with 64 KB file; (c) Average response time

TLS 1.2 protocol, an abbreviated handshake for session resumption involves PRF calculations only. We first evaluated the session resumption performance with 100% abbreviated handshakes. The evaluation environment is identical to the full handshake testing, with the difference that the "reuse" option of the s_time tool was used to establish resumed connections. As shown in Figure 9a, with abbreviated handshakes only, QTLS can provide a 30%-40% CPS enhancement over the software baseline. In contrast, the straight offload mode (i.e., the QAT+S configuration) suffers from the blockings in the offload I/O and gives an obviously lower CPS than the SW configuration without TLS offloading.

As mentioned in §2.1, session resumption compromises the protection afforded by forward secrecy, so the lifetime of session IDs or tickets is typically restricted. Real-life traffic is a mixture of full and abbreviated handshakes, and the ratio between them depends on the security policy. We evaluated an example ratio of 1:9 (i.e., 10% full handshakes) with the cipher suite ECDHE-RSA (2048-bit), as shown in Figure 9b. Compared to the SW configuration, QTLS improves the CPS by more than 2x. For a specific ratio between full and abbreviated handshakes, the CPS enhancement with the cipher suite ECDHE-RSA (2048-bit) ranges between 1.3x and 5.5x. A bigger percentage of full handshakes (i.e., a more stringent security policy) leads to a higher CPS improvement.

5.4 Secure Data Transfer Throughput

We used ApacheBench (ab) to measure the secure data transfer throughput with the requested file size varying from 4 KB to 1024 KB. The cipher suite for secure data transfer was AES128-SHA. In the tested server, eight Nginx workers were launched and the keepalive setting was tuned to avoid the influence of the TLS handshake. In a client server, 400 ab processes were configured to continuously request a fixed file.

The experimental results are shown in Figure 10. For the 4 KB case, QTLS only provides a slightly higher throughput than the SW configuration. As the requested file size increases, the performance benefit of QTLS becomes bigger because the data encryption for the file incurs more cipher operations at the server side. Taking the 128 KB case as an example, one file incurs eight (calculated by 128/16) cipher operations. The QAT+A configuration improves the throughput by 60% compared to the SW configuration. The combination of the heuristic polling scheme and the kernel-bypass notification scheme contributes an additional 25%-30% throughput enhancement. The full QTLS provides more than a 2x throughput improvement over the software baseline.

5.5 Average Response Time

The average response time was evaluated with different concurrencies, i.e., different numbers of concurrent end clients. Only one Nginx worker was launched in the tested server. In a client server, multiple ab processes were launched to simulate end clients and make concurrent requests for a small-size page (less than 100 bytes) with the cipher suite TLS-RSA (2048-bit). A full handshake is needed for each request.

As shown in Figure 11, for a concurrency of 1, the QAT+S configuration has the lowest latency due to the busy-loop waiting for the crypto result. The QTLS configuration with the heuristic polling scheme has the second smallest response

time. This is because once the number of inflight offload requests equals the number of active TLS connections (here, 1), a heuristic polling operation is immediately performed to retrieve at least one QAT response. The QAT+A configuration shows a longer response time since the QAT response retrieval is only triggered after a fixed time interval (here, 10 µs). The SW configuration has the highest latency due to the software calculation of the compute-intensive RSA operation.

With the increase of concurrency, the advantage of the TLS asynchronous offload framework becomes more significant. Since multiple crypto operations from different connections can be offloaded concurrently, the QAT+A configuration achieves about a 75% latency reduction over the software baseline when the concurrency reaches 64. After adding the heuristic polling scheme and the kernel-bypass notification scheme, the full QTLS can decrease the average response time by nearly 85%.

5.6 Polling Thread vs. Heuristic Polling

We evaluated the timer-based polling thread with different polling intervals to demonstrate the advantages of the heuristic polling scheme. Three testing scenarios were designed with the TLS asynchronous offload framework: (1) 10 µs: using the timer-based polling thread with a 10 µs interval; (2) 1 ms: using the timer-based polling thread with a 1 ms interval; (3) heuristic: using the proposed heuristic polling scheme.

The experimental results are shown in Figure 12. The 10 µs configuration brings a relatively low handshake performance (about a 20% gap) because it suffers from frequent context switches and a large number of ineffective polling operations. For the 1 ms configuration, when there is only a small number of concurrent end clients, it not only gives a high latency but also lowers the throughput dramatically, as it has to wait for 1 ms before retrieving QAT responses. The heuristic polling scheme leverages application-level knowledge to make the response retrieval rate match the request submission rate. It achieves the best performance in all the above experiments and adapts well to varied network traffic (i.e., different concurrencies).

6 Related Work

TLS acceleration has been attracting more and more attention due to its increasing importance. The main TLS performance bottleneck is the involved costly crypto operations. A variety of efforts have been made to accelerate TLS from the following aspects.

Software acceleration: Shacham et al. [38] presented an algorithmic approach, batching the SSL handshakes on the web server with Batch RSA, to speed up SSL. Boneh et al. [2] described three fast variants of RSA (Batch RSA, Multi-factor RSA and Rebalanced RSA) and analyzed their security issues. A client-side caching of handshake information was proposed in [39] to reduce the network overhead. Castelluccia et al. [4] examined a technique for re-balancing RSA-based client/server handshakes, which puts more work on clients to achieve better SSL performance on the server.

CPU-level improvement: The AES instruction set, an extension to the instruction set architecture, has been integrated into many processors to improve the speed of encryption and decryption using AES [46]. Kounavis et al. [26] presented a set of processor instructions along with software optimizations that can be used to accelerate end-to-end encryption and authentication. In [16], a novel constant run-time approach was proposed to implement fast modular exponentiation on IA processors.

Hardware offloading: This is a promising solution because accelerators typically provide much faster crypto calculation than the CPU. A number of studies have been conducted to enable TLS-involved crypto algorithms (e.g., RSA, AES and HMAC) on different types of hardware accelerators, including GPUs [18, 23, 43, 58], FPGAs [22, 24, 40] and the Intel® Xeon Phi™ processor [59]. The main focus of these studies is to program general-purpose accelerators to enable efficient crypto calculation. In contrast, the QAT accelerator leveraged in this work inherently supports high-performance and energy-efficient calculation for mainstream crypto algorithms. Our focus is to eliminate blockings in the offload I/O and achieve high-performance TLS offloading for event-driven TLS servers/terminators.

7 Conclusion

In this paper, we proposed and designed QTLS, a QAT-based high-performance TLS asynchronous offload framework for the popular event-driven web architecture. QTLS re-engineers the TLS software stack and divides the TLS offloading into four phases: pre-processing, QAT response retrieval, async event notification and post-processing. This novel framework eliminates blockings in the offload I/O to fully utilize both CPU and QAT resources. Moreover, a heuristic polling scheme and a kernel-bypass notification scheme were designed to further enhance the system performance. We have implemented QTLS based on the widely-deployed Nginx and OpenSSL. The comprehensive performance evaluation demonstrated that QTLS outperforms software-based TLS processing by a significant margin, especially in the handshake phase.

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and thank the PC/AE chairs for their assistance in solving the encountered problems. This work is supported in part by the National Key Research and Development Program of China (No. 2016YFB1000502) and the National Science Fund for Distinguished Young Scholars (No. 61525204).


References

[1] Daniel J. Bernstein and Tanja Lange. 2017. SafeCurves: choosing safe curves for elliptic-curve cryptography. Retrieved November 30, 2018 from https://safecurves.cr.yp.to/
[2] Dan Boneh and Hovav Shacham. 2002. Fast variants of RSA. CryptoBytes 5, 1 (2002), 1–9.
[3] Ran Canetti, Shai Halevi, and Jonathan Katz. 2003. A Forward-Secure Public-Key Encryption Scheme. In Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT). 255–271.
[4] Claude Castelluccia, Einar Mykletun, and Gene Tsudik. 2006. Improving secure server performance by re-balancing SSL/TLS handshakes. In Proceedings of the ACM Symposium on Information, Computer and Communications Security (ASIACCS). 26–34.
[5] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A Cloud-Scale Acceleration Architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). 7:1–7:13.
[6] Cristian Coarfa, Peter Druschel, and Dan S. Wallach. 2006. Performance Analysis of TLS Web Servers. ACM Transactions on Computer Systems (TOCS) 24, 1 (2006), 39–69.
[7] HAProxy Community. 2018. HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer. Retrieved November 30, 2018 from http://www.haproxy.org/
[8] Squid Community. 2018. Squid: Optimising Web Delivery. Retrieved November 30, 2018 from http://www.squid-cache.org/
[9] Cas Cremers, Marko Horvat, Jonathan Hoyland, Sam Scott, and Thyla van der Merwe. 2017. A Comprehensive Symbolic Analysis of TLS 1.3. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS). 1773–1788.
[10] Tim Dierks. 2008. The Transport Layer Security (TLS) Protocol Version 1.2. Technical Report.
[11] Benjamin Erb. 2012. Concurrent Programming for Scalable Web Architectures. (2012).
[12] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. 2018. Azure Accelerated Networking: SmartNICs in the Public Cloud. In Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI). 51–66.
[13] OpenSSL Software Foundation. 2018. OpenSSL: Cryptography and SSL/TLS Toolkit. Retrieved November 30, 2018 from https://www.openssl.org/
[14] The Apache Software Foundation. 2018. The Apache HTTP Server Project. Retrieved November 30, 2018 from https://httpd.apache.org/
[15] Owen Garrett. 2015. NGINX vs. Apache: Our View of a Decade-Old Question. Retrieved November 30, 2018 from https://www.nginx.com/blog/nginx-vs-apache-our-view/
[16] Vinodh Gopal, James Guilford, Erdinc Ozturk, Wajdi Feghali, Gil Wolrich, and Martin Dixon. 2009. Fast and constant-time implementation of modular exponentiation. (2009).
[17] Shay Gueron and Vlad Krasnov. 2015. Fast prime field elliptic-curve cryptography with 256-bit primes. Journal of Cryptographic Engineering 5, 2 (2015), 141–151.
[18] Owen Harrison and John Waldron. 2008. Practical Symmetric Key Cryptography on Modern Graphics Hardware. In Proceedings of the 17th USENIX Security Symposium (Security). 195–210.
[19] Intel. 2014. Intel® QuickAssist Technology Performance Optimization Guide. Technical Report. https://01.org/sites/default/files/page/330687_qat_perf_opt_guide_rev_1.0.pdf
[20] Intel. 2018. Intel® QuickAssist Technology (Intel® QAT). Retrieved November 30, 2018 from https://www.intel.com/content/www/us/en/architecture-and-technology/intel-quick-assist-technology-overview.html
[21] Intel and Wangsu. 2018. Working Together to Build a High Efficiency CDN System for HTTPS. Technical Report. https://01.org/sites/default/files/downloads/intelr-quickassist-technology/i12036-casestudy-intelqatcdnaccelerationen337190-001us.pdf
[22] Takashi Isobe, Satoshi Tsutsumi, Koichiro Seto, Kenji Aoshima, and Kazutoshi Kariya. 2010. 10 Gbps Implementation of TLS/SSL Accelerator on FPGA. In Proceedings of the 18th International Workshop on Quality of Service (IWQoS). 1–6.
[23] Keon Jang, Sangjin Han, Seungyeop Han, Sue B. Moon, and KyoungSoo Park. 2011. SSLShader: Cheap SSL Acceleration with Commodity Processors. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI).
[24] Zia-Uddin-Ahamed Khan and Mohammed Benaissa. 2015. Throughput/area-efficient ECC processor using Montgomery point multiplication on FPGA. IEEE Transactions on Circuits and Systems II: Express Briefs 62, 11 (2015), 1078–1082.
[25] Moein Khazraee, Lu Zhang, Luis Vega, and Michael Bedford Taylor. 2017. Moonwalk: NRE Optimization in ASIC Clouds. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 511–526.
[26] Michael E. Kounavis, Xiaozhu Kang, Ken Grewal, Mathew Eszenyi, Shay Gueron, and David Durham. 2010. Encrypting the Internet. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM). 135–146.
[27] Oliver Kowalke. 2018. Boost.Fiber 1.68.0 Overview. Retrieved November 30, 2018 from https://www.boost.org/doc/libs/1_68_0/libs/fiber/doc/html/fiber/overview.html
[28] Hugo Krawczyk and Pasi Eronen. 2010. HMAC-based Extract-and-Expand Key Derivation Function (HKDF). Technical Report.
[29] Yang Liu, Jianguo Wang, and Steven Swanson. 2018. Griffin: Uniting CPU and GPU in Information Retrieval Systems for Intra-Query Parallelism. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 327–337.
[30] Peter Membrey, David Hows, and Eelco Plugge. 2012. SSL Load Balancing. In Practical Load Balancing. Springer, 175–192.
[31] Rui Miao, Hongyi Zeng, Changhoon Kim, Jeongkeun Lee, and Minlan Yu. 2017. SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM). 15–28.
[32] David Naylor, Alessandro Finamore, Ilias Leontiadis, Yan Grunenberger, Marco Mellia, Maurizio Munafò, Konstantina Papagiannaki, and Peter Steenkiste. 2014. The Cost of the "S" in HTTPS. In Proceedings of the 10th ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT). 133–140.
[33] Q-Success. 2018. Usage of Nginx broken down by ranking. Retrieved November 30, 2018 from https://w3techs.com/technologies/breakdown/ws-nginx/ranking
[34] Qualys. 2018. SSL Pulse. Retrieved November 30, 2018 from https://www.ssllabs.com/ssl-pulse/
[35] Amir Rawdat. 2017. Testing the Performance of NGINX and NGINX Plus Web Servers. Retrieved November 30, 2018 from https://www.nginx.com/blog/testing-the-performance-of-nginx-and-nginx-plus-web-servers/
[36] Will Reese. 2008. Nginx: the High-Performance Web Server and Reverse Proxy. Linux Journal 2008, 173, Article 2 (2008).

QTLS: QAT-based TLS Asynchronous Offload Framework. PPoPP '19, February 16–20, 2019, Washington, DC, USA

[37] Eric Rescorla. 2018. The transport layer security (TLS) protocol version 1.3. Technical Report.
[38] Hovav Shacham and Dan Boneh. 2001. Improving SSL Handshake Performance via Batching. In Proceedings of the Cryptographer's Track at RSA Conference (CT-RSA). 28–43.
[39] Hovav Shacham, Dan Boneh, and Eric Rescorla. 2004. Client-side caching for TLS. ACM Transactions on Information and System Security (TISSEC) 7, 4 (2004), 553–575.
[40] Mostafa I Soliman and Ghada Y Abozaid. 2011. FPGA implementation and performance evaluation of a high throughput crypto coprocessor. Journal of Parallel and Distributed Computing (JPDC) 71, 8 (2011), 1075–1084.
[41] Alibaba Open Source. 2018. tengine qat ssl. Retrieved November 30, 2018 from http://tengine.taobao.org/document/tengine_qat_ssl.html
[42] Drew Springall, Zakir Durumeric, and J Alex Halderman. 2016. Measuring the Security Harm of TLS Crypto Shortcuts. In Proceedings of the Internet Measurement Conference (IMC). 33–47.
[43] Robert Szerwinski and Tim Güneysu. 2008. Exploiting the Power of GPUs for Asymmetric Cryptography. In Proceedings of the 10th International Workshop on Cryptographic Hardware and Embedded Systems (CHES).
[44] LiteSpeed Technologies. 2018. Event-Driven vs. Process-Based Web Servers. Retrieved November 30, 2018 from https://www.litespeedtech.com/products/litespeed-web-server/features/event-driven-architecture
[45] Changzheng Wei, Jian Li, Weigang Li, Ping Yu, and Haibing Guan. 2017. STYX: A Trusted and Accelerated Hierarchical SSL Key Management and Distribution System for Cloud Based CDN Application. In Proceedings of the ACM Symposium on Cloud Computing (SoCC). 201–213.
[46] Wikipedia. 2018. AES instruction set. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/AES_instruction_set
[47] Wikipedia. 2018. Elliptic-curve cryptography. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Elliptic-curve_cryptography
[48] Wikipedia. 2018. Elliptic-curve Diffie–Hellman. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Elliptic-curve_Diffie-Hellman
[49] Wikipedia. 2018. Epoll. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Epoll
[50] Wikipedia. 2018. Fiber (computer science). Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Fiber_(computer_science)
[51] Wikipedia. 2018. File descriptor. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/File_descriptor
[52] Wikipedia. 2018. Kqueue. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Kqueue
[53] Wikipedia. 2018. OpenSSL. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/OpenSSL
[54] Wikipedia. 2018. Pseudorandom function family. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Pseudorandom_function_family
[55] Wikipedia. 2018. RSA (cryptosystem). Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/RSA_(cryptosystem)
[56] Wikipedia. 2018. TLS termination proxy. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/TLS_termination_proxy
[57] Wikipedia. 2018. Transport Layer Security. Retrieved November 30, 2018 from https://en.wikipedia.org/wiki/Transport_Layer_Security
[58] Jason Yang and James Goodman. 2007. Symmetric Key Cryptography on Modern Graphics Hardware. In Proceedings of the International Conference on the Theory and Application of Cryptology and Information Security (ASIACRYPT). 249–264.
[59] Shun Yao and Dantong Yu. 2017. PhiOpenSSL: Using the Xeon Phi Coprocessor for Efficient Cryptographic Calculations. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). 565–574.


A Artifact Appendix

A.1 Abstract
This artifact contains the source codes of the two important components of QTLS, i.e., the asynchronous offload mode and the heuristic polling scheme. At least one QAT acceleration device is required to conduct the performance evaluation.

A.2 Artifact check-list (meta-information)
• Program: OpenSSL, QAT Engine, Async Mode Nginx
• Compilation: gcc
• Binary: Linux executables
• Run-time environment: CentOS 7.3 or later
• Hardware: QAT acceleration device is required
• Execution: Shell commands
• Metrics: CPS (connections per second) for handshake and throughput (Mbps) for secure data transfer
• Experiments: Full handshake, session resumption and data transfer throughput
• Publicly available?: Yes, in both Zenodo and Github
• Workflow frameworks used?: No

A.3 Description

A.3.1 How delivered
Most of our work has been available as open source. QTLS includes three components: the asynchronous offload mode (the most important part), the heuristic polling scheme (an important optimization) and the kernel-bypass notification scheme (a further optimization). The source codes of the former two have been placed on Zenodo (DOI: 10.5281/zenodo.2008661, web link: https://zenodo.org/record/2008661).
It is recommended to refer to our Github repositories for the latest codes and guidelines:
• Intel® QAT OpenSSL Engine: https://github.com/intel/QAT_Engine
• Intel® QAT Async Mode Nginx: https://github.com/intel/asynch_mode_nginx
The fiber async infrastructure has been available in official OpenSSL releases since the 1.1.0 series.

A.3.2 Hardware dependencies
Two connected physical servers are required: one as the tested server and one as the client. At least one of the following QAT acceleration devices is required in the tested server:
• Intel® C62X Series Chipset
• Intel® Communications Chipset 8925 to 8955 Series
• Intel® Communications Chipset 8900 to 8920 Series

A.3.3 Software dependencies
• GNU C Library version 2.23 or later
• OpenSSL 1.1.0e or later
• Validated in CentOS 7.3 with kernel 3.10

A.4 Installation
A complete software environment in the tested server involves the installation of the QAT Driver, OpenSSL, the QAT Engine and Async Mode Nginx:
1. Intel® QAT Driver
   • download the newest driver and guideline from the QAT website
   • install the driver according to the guideline
2. OpenSSL (1.1.0e or later)
   • download it from the OpenSSL website
   • install it according to the README
3. QAT Engine (v0.5.37 or later)
   • download it from our Github repository
   • install it and configure the qat service according to the README
4. Async Mode Nginx
   • download it from our Github repository
   • install it according to the README

A.5 Experiment workflow
• Start Nginx in the tested server. One can modify the Nginx conf file for different configurations (refer to the README of Async Mode Nginx for more details).
• Launch benchmarks (e.g., OpenSSL s_time or ApacheBench) in the client server to evaluate the TLS performance of the running Nginx in the tested server.

A.6 Evaluation and expected result
Four configurations (SW, QAT+S, QAT+A and QAT+AH) can be evaluated with this artifact. A few notes on comparing different configurations:
• Pay attention to hyper-threading, which has an obvious influence on the performance.
• Set CPU core affinity for both the Nginx workers and the polling threads. Make sure that different configurations employ the same number of cores.
• Multiple benchmark processes may be needed in the client server to fully load the running Nginx.

After each QAT-related test, one can check how many crypto requests have been processed by the QAT accelerator:

[root@server]# cat /sys/kernel/debug/qat*/fw_counters

It is expected to see the following performance differences between the configurations:
• QAT+S: in some cases, it provides higher performance compared to the SW configuration; in other cases, it provides similar or even lower performance.
• QAT+A: for full handshake, it provides a significant performance boost compared to the SW configuration; for most other cases, it can also provide an obvious performance enhancement.


• QAT+AH: benefiting from the heuristic polling scheme, it brings additional performance enhancement (about 20% for most cases) over the QAT+A configuration.
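The fw_counters check described in A.6 produces a per-device dump of firmware counters. As an illustration only, the small helper below totals the request counters across such a dump; the sample text is a hypothetical layout, since the exact field names and formatting of fw_counters vary across QAT driver versions:

```python
# Hypothetical sample of a /sys/kernel/debug/qat*/fw_counters dump.
# The real layout depends on the QAT driver version; only the
# "Firmware Requests" lines matter to this sketch.
SAMPLE = """\
+------------------------------------------------+
| FW Statistics for Qat Device                   |
+------------------------------------------------+
| Firmware Requests [AE  0]:            1024     |
| Firmware Responses[AE  0]:            1024     |
| Firmware Requests [AE  1]:             512     |
| Firmware Responses[AE  1]:             512     |
"""

def total_requests(text: str) -> int:
    """Sum all 'Firmware Requests' counter values in a fw_counters dump."""
    total = 0
    for line in text.splitlines():
        if "Firmware Requests" in line:
            # The counter value is the last purely numeric token on the line.
            tokens = line.replace("|", " ").split()
            digits = [tok for tok in tokens if tok.isdigit()]
            if digits:
                total += int(digits[-1])
    return total

print(total_requests(SAMPLE))  # 1536
```

A nonzero, growing total after a QAT+S/QAT+A/QAT+AH run confirms that crypto requests were actually served by the accelerator rather than falling back to software.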

A.7 Experiment customization
We extended the simple engine setting scheme in Nginx to an SSL Engine Framework, which provides flexible and powerful accelerator configuration directly in the Nginx conf file. One can customize the QAT offload behaviors through this framework, such as which crypto algorithms to offload, which polling method to use and the software/hardware switches. Besides, the number of Nginx workers and the TLS cipher suite to use can also be set in the Nginx conf file.
An example customization of the QAT offload behaviors is as follows:

    worker_processes 8;
    load_module modules/ngx_ssl_engine_qat_module.so;
    ...
    ssl_engine {
        use qat_engine;
        default_algorithm RSA,EC,DH,PKEY_CRYPTO;
        qat_engine {
            qat_offload_mode async;
            qat_notify_mode poll;
            qat_poll_mode heuristic;
            qat_heuristic_poll_asym_threshold 48;
            qat_heuristic_poll_sym_threshold 24;
        }
    }
    ...
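The two threshold directives in the configuration above bound how many in-flight offload requests may accumulate before the engine polls the accelerator for responses. The following is a minimal sketch of that threshold-triggered decision, assuming a simple per-type counter model; it is an illustrative model, not the actual QAT Engine implementation:

```python
# Sketch of heuristic, threshold-triggered polling, mirroring the
# qat_heuristic_poll_{asym,sym}_threshold directives. Illustrative
# model only; the real QAT Engine logic is more involved.

ASYM_THRESHOLD = 48  # qat_heuristic_poll_asym_threshold
SYM_THRESHOLD = 24   # qat_heuristic_poll_sym_threshold

class HeuristicPoller:
    def __init__(self):
        self.pending_asym = 0  # in-flight asymmetric-crypto requests
        self.pending_sym = 0   # in-flight symmetric-crypto requests

    def submit(self, kind: str) -> bool:
        """Record an offloaded request; return True if a poll should fire."""
        if kind == "asym":
            self.pending_asym += 1
        else:
            self.pending_sym += 1
        return self.should_poll()

    def should_poll(self) -> bool:
        # Poll only once enough requests of either type have accumulated,
        # amortizing the polling cost over a batch of responses.
        return (self.pending_asym >= ASYM_THRESHOLD
                or self.pending_sym >= SYM_THRESHOLD)

    def poll(self) -> int:
        """Retrieve all completed responses and reset the counters."""
        done = self.pending_asym + self.pending_sym
        self.pending_asym = self.pending_sym = 0
        return done

poller = HeuristicPoller()
fired = [poller.submit("sym") for _ in range(24)]
print(fired[-1])      # True: the 24th symmetric request reaches the threshold
print(poller.poll())  # 24
```

Raising the thresholds batches more responses per poll (fewer polls, higher latency per request); lowering them does the opposite, which is why the asymmetric and symmetric thresholds are tuned separately.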
