Common Counters: Compressed Counters for Secure GPU Memory

Seonjin Na, Sunho Lee, Yeonjae Kim, Jongse Park, Jaehyuk Huh
School of Computing, KAIST
{sjna, shlee, yjkim, jspark, jhhuh}@casys.kaist.ac.kr

Abstract—Hardware-based trusted execution has opened a promising new opportunity for enabling secure cloud computing. Nevertheless, the current trusted execution environments are limited to traditional CPU-based computation, while more and more critical workloads such as machine learning require GPUs. For secure GPU computing, recent studies proposed to isolate critical GPU operations from the operating system by providing trusted execution environments for GPUs. However, a missing component in the prior GPU trusted execution studies is a hardware-based memory protection technique optimized for GPUs. State-of-the-art memory protection techniques for CPUs use counter-mode encryption, where per-block counters are used as timestamps for generating one-time pads. The integrity of counters is protected by a counter integrity tree. This study investigates hardware memory encryption for GPUs and identifies counter cache misses as one of the performance bottlenecks. To mitigate the overheads of counter cache misses, this paper proposes a novel technique, called common counters, for efficient counter-mode encryption and counter integrity protection. The proposed technique exploits unique characteristics of common GPU applications. In GPU applications, a large portion of application memory is written only once by the initial data transfer from the CPU memory, or the number of writes by GPU applications to major data structures tends to be uniform. Exploiting the uniform write property, this study proposes a common counter representation to complement the existing per-block encryption counters. With the common counters, the proposed technique can almost eliminate the performance overheads caused by counter cache misses, reducing the degradation to 2.9%.

I. INTRODUCTION

Hardware-based trusted execution environments (TEEs) such as Intel Software Guard Extension (SGX) [17] and ARM TrustZone [1] provide secure execution of user applications on untrusted remote cloud servers. Such hardware TEEs open a new direction for trusted clouds by isolating and protecting user execution environments from compromised privileged software and physical attacks. Nevertheless, a critical limitation of current TEE support is the lack of consideration for widely used GPU-based computation. Along with the wide deployment of machine learning (ML) workloads, GPU computation has become an essential part of computing systems, especially in clouds providing ML-based services. To protect this essential GPU computation, the scope of TEEs must be extended to GPUs.

To address the lack of GPU TEEs, recent studies proposed hardware-based hardening techniques for GPU execution environments [19], [50]. Graviton detaches the critical GPU management operations from the CPU-side driver and isolates them in the GPU-side command processing unit, protecting critical GPU operations from malicious privileged software [50]. It requires changes to the GPU command processing subsystem to provide such isolated GPU management inside a GPU. Alternatively, HIX proposed to protect the PCIe I/O subsystem from privileged software and to protect the GPU driver with a CPU TEE [19]. While HIX does not require changing GPUs, its threat model is weaker than Graviton's, since HIX does not provide protection from direct physical attacks.

Although the prior studies proposed GPU system designs for trusted GPU computing, a critical unaddressed component is a hardware-based memory protection technique optimized for GPUs. Graviton relies on the availability of 3D stacked DRAM for GPUs, and assumes that the memory contents in the 3D stacked DRAM cannot be leaked or compromised. However, not only do many discrete GPUs use GDDRx memory, but integrated GPUs also use conventional DDRx memory. Therefore, to provide general support for GPU TEEs, hardware memory protection must be available for GPU memory as well. However, the existing hardware-based memory protection schemes require costly encryption and integrity verification. Figure 1(a) depicts the secure GPU execution system proposed in Graviton [50] with trusted stacked memory, and Figure 1(b) shows our target GPU with conventional untrusted memory.

Fig. 1: (a) Trusted GPU execution in Graviton [50], exclusively designed for GPUs equipped with trusted stacked memory, and (b) the proposed memory protection technique that provides confidentiality and integrity for untrusted conventional GDDRx memory in more general systems.

Recently, there have been significant improvements in hardware-based memory protection for trusted CPUs [11], [37], [38], [40], [46], [53], [54]. The state-of-the-art memory encryption uses counter-based encryption, which generates a one-time pad (OTP) for a given encryption key, address, and counter. Each cacheline unit of memory has a separate encryption counter, which is incremented for every data write-back to the memory, guaranteeing its freshness. The counters are cached in a counter cache, and as long as the counter for data missed in the last-level cache (LLC) is available in the on-chip counter cache, the OTP for the missed data can be prepared before the data returns from the memory, requiring only a simple XOR operation for decryption with the OTP. The integrity validation uses a message authentication code (MAC) for each cacheline of data, and the verification of counter values employs a tree of counters.

However, even with the recent improvements in processor memory encryption and integrity protection, the protection of GPU memory is yet to be investigated. In our study, we employ the trusted execution model for GPUs proposed by Graviton [50], but the memory is untrusted conventional GDDRx. To the best of our knowledge, this study is the first to assess the performance impact of hardware GPU memory protection and to propose a novel technique to mitigate the performance degradation. Our investigation shows that counter cache misses can cause significant performance degradation for GPU applications, which have low spatial locality for counter blocks. As GPUs are employed for their high performance, the performance overheads for security might delay the fast and wide adoption of trusted computing for GPUs.

To mitigate the performance degradation by memory protection, this paper proposes a novel counter representation technique for GPUs, called COMMONCOUNTER, exploiting unique characteristics of GPU applications. For GPU applications, a large portion of the application memory is written once by the initial data transfer from the CPU memory. In addition, the execution of many GPU kernels tends to generate a uniform number of writes to each element of major data structures, and thus after a kernel execution, the counter value of each memory block of a common data structure tends to have the same value. Therefore, when a GPU application consists of multiple executions of kernels with data dependence, those counter values are uniformly incremented across the kernel execution flow.

Exploiting the write-once and uniform progress property of counter values in GPU applications, this paper proposes a common counter representation, which augments the existing per-cacheline counter management. The proposed mechanism efficiently identifies which memory chunks have the same counter values for their cacheline units of data after a kernel execution or a data transfer from the CPU-side code. During a kernel execution, if the counter for a missed cacheline can be served from a few known common counter values, the counter cache is bypassed, and the requested counter is directly provided by the common counters. Since the corresponding counter value is obtained from a small number of on-chip common counters for the majority of LLC misses, counter cache misses are reduced significantly.

To enable common counters, the GPU context for each application must use a different memory encryption key, since all the counters for the application memory must be reset during the context creation. Once a GPU context is created by the trusted GPU command processor, memory pages for the context are encrypted by a per-context encryption key, and counter values for allocated memory blocks of the context are reset. This counter resetting does not violate the security requirement, since the memory pages of a new context are encrypted by a new key.

Our evaluation based on simulation with GPGPU-Sim shows that the proposed counter management technique can almost eliminate the performance overheads caused by counter cache misses. When combined with the prior MAC optimization [39], COMMONCOUNTER reduces the performance degradation by memory protection to 2.9% for a set of GPU applications, while the latest improvement using Morphable counters [38] can impose 11.5% degradation with the same MAC optimization.

As the first study of hardware-based GPU memory protection, the contributions of this paper are as follows:
• This paper applies the latest existing CPU-based memory protection techniques to the trusted GPU execution model. It identifies that counter cache misses are one of the key sources of performance degradation in secure GPU execution.
• To reduce the overheads of counter cache misses, the paper exploits read-only data abundant in GPU applications.
• The paper identifies a unique update behavior of counter values in GPU applications, where the majority of counter values often progress with a few common values within an application.
• The paper proposes an efficient mechanism to identify common counters and to provide common counters for LLC miss handling, which allows effective bypassing of the counter cache.

The rest of the paper is organized as follows. Section II describes trusted GPU execution and the latest improvements for memory protection in CPU memory. Section III investigates the unique characteristics of GPU memory writes. Section IV proposes our new common counter architecture, and Section V evaluates the proposed technique with experimental results. Section VI discusses potential extensions. Section VII presents the related work, and Section VIII concludes the paper.

II. BACKGROUND
A. Trusted Execution for GPUs

Trusted execution environments (TEEs) for CPUs provide isolated execution environments, protected from malicious privileged software and physical attacks. Intel Software Guard Extension (SGX) allows a user application to create a TEE called an enclave, and its context, including the execution environment and memory pages, is protected by the hardware mechanism. To support TEEs, first, the access isolation mechanism prevents any privileged or unprivileged non-enclave code from accessing the enclave execution environments and pages. Second, the hardware memory protection mechanism supports the confidentiality and integrity of enclave memory pages, under potential physical attacks on the external DRAM. Third, the attestation mechanism allows remote or local users to validate the TEE environments and the code running in an enclave.

To provide TEEs for GPUs, two recent studies proposed two different approaches with trade-offs. First, Graviton proposed a complete hardware-based isolation of a GPU from the rest of the system [50]. Each GPU is embedded with a unique private-public key pair to identify itself. In addition, the critical GPU management operations such as creating a context and allocating memory pages must be done by the command processor inside the GPU. By conducting such critical operations inside the GPU, a user application can directly and securely send commands to the GPU, bypassing any interference from the operating system. To protect the user application, the CPU-side user application runs in an enclave, and the CPU enclave and GPU establish trust during a context creation by attesting the GPU environment and by sharing a common key.

An alternative way is not to require any GPU change. HIX proposed to provide GPU TEEs without changing commodity GPUs [19]. Instead of requiring GPUs to be modified to embed a unique key and to conduct management operations within the GPU, HIX proposed to secure the PCIe I/O subsystem. The GPU driver is isolated from the operating system by running it in a special GPU enclave. The PCIe routing table is locked down, so that a compromised operating system cannot arbitrarily re-route packets intended for the GPU. Although HIX allows a TEE for commodity GPUs and can easily support other PCIe-connected accelerators, it does not provide protection against physical attacks, since the PCIe interconnects are exposed on the motherboard.

Since this study proposes GPU memory protection against physical attacks, it is based on the approach taken by Graviton [50]. A GPU is modified to isolate itself from its driver in the operating system. However, one critical component of TEEs missing in the prior approaches is how to protect the memory of the GPU. In CPU TEEs, there have been significant improvements to provide the confidentiality and integrity of memory data residing in the external unsecure DRAM. However, Graviton assumes that 3D stacked DRAM such as HBM (High Bandwidth Memory) is available for GPUs, and that the data in the 3D stacked DRAM is always safe even from any possible physical attacks. Many commodity GPUs will still rely on much less expensive GDDRx memory [33]–[35], vulnerable to physical attacks. In addition, embedded GPUs share the same DDRx memory with CPUs [3], [9], [10]. Considering the wide-spread use of GDDRx and its vulnerability to physical attacks, it is essential to investigate the performance implication of hardware-based memory protection for GPUs.

B. Threat Model

The trusted computing base (TCB) of this study is the secure GPU chip and the GPU software running on it. In addition, as assumed by the prior study [50], the CPU-side user application initiating the GPU application is running in a CPU enclave. Therefore, the CPU chip and the user software running in the enclave are also included in the TCB. The study assumes that attackers can have full control over the operating system and other privileged software, and can wield physical attacks, such as probing buses, on exposed system components including the CPU memory bus, PCIe interconnects, and GPU memory interconnects.

However, this study does not provide availability of GPU computation, as a compromised OS can block the execution of the CPU-side application. In addition, side channel attacks [16], [23], [24], [32], [52] on GPUs or CPUs are out of the scope of this paper. The threat model of this paper is in general identical to that of Graviton [50], except for the vulnerability of GPU GDDRx memory.
C. Memory Protection for Trusted Execution

Memory protection requires supporting both the confidentiality and the integrity of data resident in the untrusted external DRAM. When a memory block is brought into the on-chip caches, the block is decrypted and its integrity must be verified. When a modified cacheline is evicted, the memory block is encrypted, and integrity meta-data is also written along with the encrypted data. The common hardware encryption for confidentiality employs counter-based encryption using a one-time pad (OTP) [42], [53], [54].

Fig. 2: Counter mode encryption.

As shown by Figure 2, the encryption is done by 1) creating an OTP from the encryption key, address, and counter value, and 2) XOR'ing the OTP with the evicted cacheline. A symmetric cipher such as AES is used to generate OTPs. For each cacheline unit of memory block, a separate counter must be maintained, and the counter value is incremented whenever a dirty eviction from the on-chip cache updates the memory block. The counter for each block guarantees the freshness of the encrypted data even though the same encryption key is used.
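To make the counter-mode flow above concrete, the following minimal Python sketch models the per-block counters and OTP generation. HMAC-SHA256 from the standard library stands in for the AES-based OTP generator the text describes, and the 64-byte cacheline size and all function names are illustrative assumptions, not details from the paper.

```python
import hmac, hashlib

CACHELINE = 64  # bytes per cacheline (illustrative)

def make_otp(key: bytes, addr: int, counter: int) -> bytes:
    # OTP = PRF(key, address || counter). The hardware uses AES;
    # HMAC-SHA256 is only a stand-in to keep the sketch self-contained.
    msg = addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    otp = b""
    i = 0
    while len(otp) < CACHELINE:
        otp += hmac.new(key, msg + bytes([i]), hashlib.sha256).digest()
        i += 1
    return otp[:CACHELINE]

def xor_pad(data: bytes, otp: bytes) -> bytes:
    # Encryption and decryption are the same XOR with the OTP, so
    # decryption only needs the OTP ready when the data arrives.
    return bytes(d ^ o for d, o in zip(data, otp))

counters = {}  # per-cacheline counters, reset when the key changes

def write_back(key: bytes, addr: int, plaintext: bytes) -> bytes:
    # Each dirty write-back increments the block's counter first, so a
    # (key, address, counter) triple is never reused: freshness.
    counters[addr] = counters.get(addr, 0) + 1
    return xor_pad(plaintext, make_otp(key, addr, counters[addr]))
```

On a read miss, the engine looks up the same counter (normally from the counter cache) and regenerates the identical OTP while the data fetch is in flight.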

The integrity verification detects any compromise of data. For the integrity verification, variants of the Merkle tree are employed in prior studies and commercial processors [30]. A straightforward design creates a tree of hash values over the entire memory blocks [43]. The root of the tree never leaves the secure processor, but the rest of the hash tree nodes can exist in the unsecure memory. For performance, part of the hash tree can be cached in a hash cache inside the processor. When an LLC miss brings a memory block into the processor, the computed hash value of the memory block is validated against the stored hash value in the tree. If the corresponding hash value does not exist in the hash cache, the hash node must be read from the memory and validated first by the parent node of the hash value.

Improvements for integrity trees have been investigated by combining the hash tree with counter-mode encryption. The height of the original Merkle tree can be high because the entire data memory region must be covered by the leaf nodes of the tree. An improved Bonsai Merkle tree (BMT) creates a hash tree only over the counter values of memory blocks [37]. For each memory block, a keyed hash value or message authentication code (MAC) is generated for integrity verification. In addition, the integrity tree of counters (the Bonsai Merkle tree) guarantees the freshness of counters to prevent potential replay attacks. Since BMT covers a much smaller memory region, just the counters, than the original Merkle tree, its height is much shorter.

Fig. 3: Structure of integrity tree for replay-attack protection.

There have been further improvements on BMT to pack more counters in a single block of intermediate tree nodes, as shown by Figure 3. The counter block is often organized with the same size as a data cacheline. As more counters can be stored in a single counter block, not only is the efficiency of the counter cache improved, but the height of the counter integrity tree is also reduced. Split counters decompose a counter value into a minor counter and a major counter [53]. Within a counter block, each counter has a minor counter, and all the counters in the same counter block share a single major counter. If a minor counter hits its limit, the major counter is incremented, causing re-encryption of the corresponding data blocks. VAULT improves the counter integrity tree by adopting a different arity for each level of the tree, to balance between the compact representation of counters and the cost of re-encryption due to minor counter overflows [46]. Morphable counters proposed a further compact representation of counters, packing 128 counters per 64B counter block [38]. It dynamically changes the counter representation to match the counter update behaviors of applications.
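The split-counter layout can be sketched as follows; the 7-bit minor width and the overflow policy shown are common choices from the split-counter literature, assumed here for illustration rather than specified by this paper.

```python
MINOR_BITS = 7                      # assumed minor-counter width
MINOR_MAX = (1 << MINOR_BITS) - 1

class SplitCounterBlock:
    """One counter block: many minor counters share one major counter,
    so a 128-ary block covers 128 cachelines (16KB of data)."""
    def __init__(self, arity: int = 128):
        self.major = 0
        self.minor = [0] * arity

    def counter_of(self, i: int) -> int:
        # Logical counter fed to the OTP generator for cacheline i.
        return (self.major << MINOR_BITS) | self.minor[i]

    def increment(self, i: int) -> bool:
        # Returns True when a minor overflow forces the shared major
        # counter to advance, which re-encrypts every covered block.
        if self.minor[i] == MINOR_MAX:
            self.major += 1
            self.minor = [0] * len(self.minor)
            return True
        self.minor[i] += 1
        return False
```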

III. MOTIVATION

A. Performance Implications of Secure GPU Memory

We first perform empirical studies via GPU simulations and investigate the performance overheads of existing memory protection schemes when they are deployed on GPUs.

Methodology: We perform the performance characterization of memory-secured GPUs, in comparison with a vanilla GPU having no security features as the baseline. We present the detailed methodology for the simulation infrastructure in Section V-A. The evaluated memory protection techniques are: (1) Bonsai Merkle Tree (BMT) [37], (2) split counters with 128 counters per 128B cacheline (SC_128) [53], and (3) Morphable counters (Morphable) [38]. All three of the memory-secured GPUs are equipped with a 16KB counter cache, where BMT and SC_128 pack 128 encryption counters per cacheline, while Morphable packs 256 counters.

Fig. 4: Performance of the SC_128 technique on GPU, normalized to the vanilla GPU without memory protection.

Performance implication of counter-based memory protection techniques on GPU: To better understand the performance-degrading factors, we first look into the SC_128-enabled GPU and examine what causes the performance loss. Figure 4 shows the performance degradation of the SC_128-enabled GPU, compared to the baseline non-secure GPU. In the figure, the first bar (Ctr+MAC) shows the performance of SC_128 with a 16KB counter cache and MAC accesses. The second bar (Ctr+Ideal MAC) shows the performance when MAC accesses are not issued to idealize the MAC implementation, while the counter cache is still modeled. The third bar (Ideal Ctr+MAC) shows the performance when all counter cache accesses are assumed to be hits, but MAC accesses are still issued. All results are normalized to those of the GPU without memory protection.

When the vanilla SC_128 is implemented on the GPU without any idealization (Ctr+MAC), the GPU experiences a significant performance loss on the memory-intensive benchmarks, including ges, atax, mvt, bicg, sc, bfs, and srad_v2, ranging from 45.2% for srad_v2 to 77.6% for ges. When only MAC reads and writes are eliminated without idealizing the counter cache (Ctr+Ideal MAC), the performance improvements over Ctr+MAC are minor. The results indicate that eliminating MAC traffic alone does not effectively reduce the performance degradation, as counter cache misses are still on the critical path.

If the GPU never experiences counter cache misses but generates MAC accesses (Ideal Ctr+MAC), we observe a substantial performance improvement for applications with high performance degradation by memory protection (123.9% for ges, 45.8% for atax, 47.1% for mvt, 42.7% for bicg, 51.0% for sc, 90.2% for bfs, and 51.9% for srad_v2). However, there are still significant performance gaps compared to the unsecure GPU. This is because once counter cache misses are reduced, the bandwidth consumption of MAC traffic becomes the next performance bottleneck, calling for the optimization of the MAC implementation on top of the improved counter cache technique. A recent prior work, Synergy [39], proposes to embed the MAC in the ECC chip of memory, which enables the parallel read of data and MAC and effectively eliminates the MAC-induced performance overhead. In this paper, we will report the results with the combined architecture of the proposed common counters and the Synergy-based MAC improvement.

Fig. 5: Counter cache miss rates with three prior schemes.

Figure 5 compares the counter cache miss rates of the three prior studies. Since the counter arity is the same for BMT and SC_128 at 128, their counter cache miss rates are the same, which engenders similar performance implications. Morphable suffers less from counter cache misses since its arity is 256; however, its performance is still highly affected, as we will show in Section V.

B. Write Characteristics of GPU Applications

Motivated by the observation in Section III-A, the main objective of this paper is to reduce the performance overhead incurred by counter cache misses. To achieve this goal, we leverage a unique characteristic of GPU applications in memory writes and develop an efficient encryption engine for secure GPUs. The observation that GPUs commonly move data in a streaming manner motivated us to identify a unique property of GPU applications, leveraged to design the memory-secured GPU architecture without compromising performance significantly. GPU applications tend to issue memory writes to the allocated memory regions in a considerably uniform manner. Much of the allocated region in a GPU context often receives the same number of writes, either 1) once for the initial memory copy from the host, or 2) more than once yet an equal number of writes if the application writes data, sweeping through the allocated memory.

Methodology: For this analysis, we perform experiments on real GPUs, using NVBit [49] to perform binary instrumentation of GPU applications and collect memory traces. NVBit allows capturing all the load and store memory accesses and recording the accessed virtual addresses. However, NVBit lacks the capability of 1) profiling whether an access misses at the L2 cache, and 2) identifying whether a write-back of a dirty cacheline occurs to serve a cache miss. Therefore, we analyze the write behavior with memory access traces from the GPU cores, not with L2 miss traces.

To investigate how much uniformity exists in the memory access counts, we divide the memory space of a context into a set of fixed-size data chunks. For each chunk, we trace how the cachelines in the chunk are updated. If the cachelines in a chunk are updated in a uniform manner, the chunk is denoted as a uniformly updated chunk. In this section, we analyze the ratios of uniformly updated chunks over the total memory size. We use two types of scenarios. First, the GPU benchmark analysis uses the benchmark suite used in the main evaluation of this paper (Section V). In this section, we run them on a real system, instead of using the simulator. Second, we evaluate seven real-world applications for more complex setups.
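The chunk classification described above can be reproduced offline with a short script over a write trace. The trace format (a flat list of written byte addresses) and the helper names below are our assumptions; NVBit itself only supplies the instrumented addresses.

```python
from collections import Counter, defaultdict

CACHELINE = 64

def classify_chunks(write_addrs, chunk_size=32 * 1024):
    """Mark a chunk as 'uniformly updated' when every touched cacheline
    in it received the same number of writes (untouched cachelines are
    ignored in this simplified sketch)."""
    per_line = Counter(addr // CACHELINE for addr in write_addrs)
    per_chunk = defaultdict(set)
    for line, n_writes in per_line.items():
        per_chunk[line * CACHELINE // chunk_size].add(n_writes)
    uniform = {c: counts.pop() for c, counts in per_chunk.items()
               if len(counts) == 1}
    # The distinct write counts of the uniform chunks correspond to the
    # candidate common counter values (cf. Figures 7 and 9).
    return uniform, sorted(set(uniform.values()))
```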
and SC_128 as 128, their counter cache miss rates are the same, which engenders the similar performance implications. GPU benchmark analysis: Figure6 reports the ratios of Morphable suffers less from counter cache misses since the arity uniformly updated chunks over all chunks, by varying the is 256; however, the performance is still highly influenced as chunk size from 32KB to 2MB. Each bar is decomposed into we will show in SectionV. two categories; 1) Read-only (solid) represents the memory chunks updated only by the initial writes from the host, and 2) B. Write Characteristics of GPU Applications Non read-only (dashed) shows the chunks with more than one Motivated by the observation in Section III-A, the main writes for each cacheline in a chunk. On average, when 32KB objective of this paper is to reduce the performance over- chunk size is used, 61.6% of all the chunks are uniformly GPU Read Only Non Read Only 32KB 128KB 512KB 2MB 1.0 Key Compute Engine Common-ctr set Generator Seg 0 Ctr 1 0.8 SM SM . . . SM Ctr 2 0.6 ... Seg 1

Real-world application analysis: To demonstrate that the property holds for real-world GPU applications as well, we choose seven full-fledged GPU applications: 1) two DNN inference runs for GoogLeNet [45] and ResNet50 [15], 2) an iteration of DNN training for ScratchGAN [7], 3) the Dijkstra [8] algorithm to find the shortest path, 4) conversion from a 2D map to a quad-tree (CDP_QTree) [41] using CUDA dynamic parallelism [22], 5) the SobelFilter [25] algorithm for edge detection, and 6) a 3D fluid simulation of a fat cloud (FS_FatCloud).

Fig. 8: Ratio of uniformly updated data chunks for the real-world GPU applications.

Figure 8 shows the ratio of uniformly updated data chunks over all data chunks for the real-world applications, as the size of the data chunk is increased from 32KB to 2MB. While the real-world applications tend to exhibit lower ratios of uniformly updated data chunks than the GPU benchmarks, a significant portion of data chunks still exhibits the uniform-write property. In GoogLeNet, 34.5% to 84.4% of all chunks are uniformly updated chunks, depending on the chunk size. ResNet-50 and ScratchGAN have more complex model structures than GoogLeNet, and thus have lower ratios of uniformly updated memory chunks. On average, when the 32KB chunk size was used, 59.6% of all chunks are uniformly updated. When using the 2MB chunk size, 29.3% are uniformly updated. While GoogLeNet, ResNet-50, ScratchGAN, Dijkstra, and SobelFilter are mostly read-only, CDP_QTree and FS_FatCloud are mostly non-read-only.

Fig. 9: Number of distinct common counters in the uniformly updated data chunks for the real-world GPU applications.

Figure 9 presents the number of distinct counter values for uniformly updated chunks after the program execution for the real-world applications. Compared to the GPU benchmarks, there are more distinct counter values, up to 5. With the GPU benchmark and real-world application analyses, we conclude that common GPU applications tend to write memory uniformly and the number of distinct counter values is small, so they can be stored in a small amount of on-chip storage. In Section IV, we will introduce an efficient on-chip counter representation that leverages the discussed property and significantly curtails the footprint of counter values needed to reside in the on-chip counter cache.

IV. ARCHITECTURE

A. Overview

To reduce the performance loss from counter cache misses in GPUs, this paper exploits the uniform memory write behaviors of common GPU applications, as discussed in the prior section. The majority of a GPU application's memory has one of a few possible counter values, if the counters for the application are reset during the application initialization. The proposed technique, dubbed COMMONCOUNTER, provides an efficient mapping of memory addresses to common counter values shared by many memory chunks. The GPU maintains a few common counters (the common counter set) for each context. For an LLC miss, the encryption engine can use one of the common counter values for the miss request, if the requested address is in a memory chunk whose counter is one of the common counters. Therefore, if the requested miss can be served by a common counter, it does not access the counter cache.

Fig. 10: Proposed memory-secured GPU architecture.

To support such shared common counters for many memory chunks, several changes are necessary. The necessary hardware modifications for COMMONCOUNTER are delineated in Figure 10. First, each context should have a separate memory encryption key. When the secure GPU command processor allocates new pages for a context, the processor re-encrypts the memory pages with the key and resets the counter values to 0. Second, the GPU hardware maintains a status table, called the Common Counter Status Map (CCSM), which tells the encryption engine whether the requested address can be served by a common counter.

Note that the common counter representation complements the existing per-cacheline encryption counters, and thus the per-cacheline counters and counter integrity trees must still be maintained. Common counters simply return the counter value of a memory block with a highly compact counter representation, when many pages have the same counter values. Therefore, if the GPU kernel updates a memory block, the corresponding memory chunk can no longer be served by the common counters, since the counter values in the memory chunk diverge from that moment. Thus, for an LLC eviction with updated data, the original counter value must be updated as well. In addition, the CCSM marks that the memory chunk can no longer use common counters.

Fig. 11: Execution flow of the common counter mechanism.

Figure 11 depicts the execution flow of the proposed COMMONCOUNTER protection scheme. Common counters are identified at two events. The first event is the initial data transfer from the host memory. When the new data are copied from the host memory, they arrive at the GPU encrypted by the shared key between the CPU application and the GPU. The GPU decrypts the encrypted data, and the GPU memory is updated with the new data. Many memory pages of a GPU application may not be further updated after the initial update, which we call initial write once. After the data transfer, the common counters are identified by scanning the actual counter values of the updated memory chunks. In such cases, the counter values are commonly all incremented by one, as the memory copy updates the memory pages only once. Second, when a kernel execution is completed, all the counter values of the memory pages updated by the kernel are checked. By scanning counters, the memory chunks with the same counters are identified, and the status map (CCSM) is updated for the next kernel execution accessing the updated memory.

Security guarantee: Using common counters does not affect the confidentiality and integrity of the memory protection. The counter-based encryption and integrity tree maintenance are conducted in the same way as the conventional mechanism. COMMONCOUNTER only finds the common counter values set by the existing mechanism and provides a compressed representation for them. It requires resetting counters for each new context. To avoid potential leaks by counter resets, a new encryption key is used for each reset.

To support such shared common counters for the majority of memory pages, the following extra meta-data is required (a sketch of these structures follows the list):
• Per-context encryption key: When a new GPU application creates a context, a memory encryption key is generated for the new context. The hardware memory encryption engine changes the memory encryption key depending on the context for the addresses of the requested data.
• GPU-wide Common Counter Status Map (CCSM): For each chunk of memory, the status map tracks whether the memory chunk uses a common counter. For the rest of the paper, the unit of mapping managed by the CCSM is denoted as a segment. Note that the segment size does not need to match the page size for virtual memory support. This mapping table is addressed by physical address. In this paper, we use a 128KB segment size, and 4 bits per segment for the CCSM. If the value of a CCSM entry is invalid, the segment does not use a common counter.
• Per-context common counter set: For each context, a set of common counter values is maintained. To minimize the storage overheads, we use only 15 common counters. When the GPU executes a context, the common counter set must be loaded into on-chip storage for quick access. The value of an entry in the CCSM is an index into the common counter set of the context.
• Updated memory region map: This map records the updated memory regions during data transfers from the CPU or kernel executions. It simply reduces the overhead of scanning counter values after a data transfer or kernel execution. We adopt a coarse-grained map, using 1 bit per 2MB region.
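A behavioral sketch of these structures, using the parameters above (128KB segments, 4-bit CCSM entries, 15 common counters, 1 bit per 2MB region); the class and field names are ours, chosen for illustration.

```python
SEGMENT = 128 * 1024        # CCSM mapping granularity
REGION = 2 * 1024 * 1024    # updated-memory-region map granularity
INVALID = 0xF               # all-1's 4-bit entry: no common counter

class ContextMetadata:
    def __init__(self, mem_bytes: int, key: bytes):
        self.key = key                      # per-context encryption key
        self.common_ctrs = [None] * 15      # per-context common counter set
        # 4 bits per 128KB segment: 8192 entries (4KB) per 1GB of memory.
        self.ccsm = [INVALID] * (mem_bytes // SEGMENT)
        # Coarse-grained updated-memory-region map, 1 bit per 2MB.
        self.updated = [False] * (mem_bytes // REGION)

    def common_counter(self, paddr: int):
        # Returns the common counter for the address, or None when the
        # segment's entry is invalid and the counter cache must be used.
        idx = self.ccsm[paddr // SEGMENT]
        return None if idx == INVALID else self.common_ctrs[idx]
```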
B. Trusted GPU Execution

Our trusted GPU model follows the same design as proposed by Graviton [50], although our model assumes vulnerable external DRAM. In the trusted GPU model, the CPU-side user application runs in an SGX enclave and establishes trust with the GPU. During the initialization, the user application attests the GPU itself by verifying the signature used by the GPU with a remote CA (Certificate Authority). Once the attestation is completed, the user enclave and GPU share a common key. The common key will be used for the subsequent encrypted communication between the user enclave and the GPU. The GPU-side command processor is responsible for conducting the critical management operations, and it is part of the TCB (trusted computing base). Part of the GPU memory (hidden memory) is reserved for the security meta-data, which is visible and accessible exclusively by the secure command processor and crypto engine.

Context initialization: After the initial setup, the user enclave can safely request the creation of a new GPU context. The GPU-side command processor creates and initializes the context. Note that the GPU-side command processor works independently from any privileged CPU software, including the operating system. Thus, the operating system cannot maliciously intervene in the operation. For memory allocations, the GPU command processor updates the GPU page table of the application. During page table updates, the command processor ensures that different GPU contexts do not share physical pages, enforcing memory isolation among contexts.

For the proposed memory protection technique, additional steps are necessary. For the context creation, a new memory encryption key is generated. When the context is scheduled to run on the GPU, its memory must be encrypted and decrypted with the key. When a memory page is allocated for the context, the memory is re-encrypted with the new key. However, this initial re-encryption step does not incur any additional cost, since even in current GPUs, the memory contents of newly allocated pages must be scrubbed for security. In our scheme, the scrubbing step generates the same memory writes with zero values encrypted by the hardware engine. In addition, the corresponding CCSM entries for newly allocated pages are reset not to use common counters.

C. Updating Common Counter Status

To enable common counters, the counter values being commonly used for the memory must be identified. During this step, the CCSM and the common counter set must be updated. The command processor initiates the common counter update task for two events: 1) the completion of a data transfer from CPU to GPU memory, and 2) the completion of a kernel execution. For the two types of events, the counter values of the updated memory segments are scanned. For new common counters found during the step, the common counter set is updated to insert or modify the corresponding entry. In addition, the CCSM entry for the segment is updated to point to the new one.

Tracking updated memory regions: Since scanning all the counter blocks for the entire physical memory causes significant overheads, the data transfer or kernel execution maintains a coarse-grained updated memory map with 1 bit per 2MB region. For 32GB memory, only 16KB of memory is used, and its spatial locality is very high since only a small portion of the memory is updated. The map entries are cached in the last-level cache. Although further optimized tracking of updated memory regions could be adopted, this study uses this naive approach as it does not cause any significant overheads.

During the scanning step of counter blocks, the updated memory region map is first checked. Only if a region has been updated are the corresponding counter blocks scanned to find common counters. If all counter values are equal in a segment, the corresponding CCSM entry is updated to point to the new common counter value. If the new value does not exist in the common counter set, the new value is added to the set.
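The scanning step can be summarized behaviorally as below, reusing the ContextMetadata sketch from Section IV-A. segment_counters() stands in for the command processor reading the per-cacheline counter blocks of a segment, and the policy for a full common counter set is our assumption, not specified by the paper.

```python
def rescan(ctx, mem_bytes: int, segment_counters):
    """Run after a CPU-to-GPU data transfer or a kernel completion.
    segment_counters(seg) returns the per-cacheline counter values of
    segment seg (a stand-in for reading the actual counter blocks)."""
    for seg in range(mem_bytes // SEGMENT):
        # Skip segments whose enclosing 2MB region was never written.
        if not ctx.updated[seg * SEGMENT // REGION]:
            continue
        ctrs = segment_counters(seg)
        if len(set(ctrs)) != 1:
            continue                 # counters diverged: entry stays invalid
        value = ctrs[0]
        if value not in ctx.common_ctrs:
            if None not in ctx.common_ctrs:
                continue             # all 15 slots taken: use regular path
            ctx.common_ctrs[ctx.common_ctrs.index(None)] = value
        ctx.ccsm[seg] = ctx.common_ctrs.index(value)
    ctx.updated = [False] * len(ctx.updated)  # clear the region map
```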
counter set of the context, the common counter value from GPU Core Configuration the set is used, since it is guaranteed that the common counter System Overview 28 cores, 32 execution units per core value is equal to the actual counter value in the counter Shader Core 1417MHz, 32 threads per warp, GTO Scheduler block. If the CCSM entry marks the corresponding segment as Cache and Memory Configuration invalid (all 1’s), the counter cache is accessed to find the Private L1 Cache 48KB, 6-way associative, LRU corresponding counter entry. If a counter cache miss occurs, it Shared L2 Cache 3MB, 16-way associative, LRU Counter Cache 16KB , 8-way associative, LRU should be handled as the baseline memory encryption engine Hash Cache 16KB, 8-way associative, LRU does. Once the counter value becomes available either from CCSM Cache 1KB, 8-way associative, LRU the common counter set or the counter cache, OTP for the DRAM GDDR5X 1251 MHz, 12 channels, 16 banks per rank miss is generated. When the data arrives from the memory, the data is decrypted with the generated OTP. TABLE II: Evaluated benchmarks. Access Benchmark Handling writes: We discuss the write flow assuming write- Workload (Abbr.) back/write-allocate LLCs, although other write policies can be Patterns Suite supported. For a write, two meta data updates are necessary. Polybench ges, atax, mvt, bicg 1) First, similar to the conventional counter-mode encryption, Memory Divergent Pannotia fw, bc the corresponding counter entry must be updated, when the ISPASS mum dirty cacheline is eventually evicted from the last-level cache Polybench gemm, fdtd-2d, 3dconv to memory. The common counters are a shortcut for counter Rodinia backprop (bp), hotspot, streamcluster (sc), values, but the actual counter for each memory block is still Memory bfs, heartwall, gaussian (gaus), srad_v2, lud maintained. 2) Second, the CCSM entry is now updated to Coherent Pannotia sssp, pagerank (pr), mis, color invalid, so that any subsequent reads/writes to the same address are not allowed to use the common counter, since ISPASS nn, sto, lib, ray, lps, nqu the counter value is now incremented. Note that the CCSM entry for the updated memory segment will be reset to a V. EVALUATION common counter, once the kernel execution is completed and A. Methodology the common counter status update procedure finds the uniform counter value for the segment, as discussed in Section IV-C. Simulator: We develop the common counter architecture on a GPU simulator, GPGPU-Sim [2]. TableI elaborates the E. Hardware Overheads detailed GPU configuration which models NVIDIA TITAN X Meta-data size: CCSM must be allocated in the fixed location Pascal (GP102) [18]. The evaluated GPU model is equipped of GPU memory, along with other security meta-data. The with GDDR5X DRAM, not with 3D stacked memory such size of CCSM depends on the size of GPU memory, the as HBM2. We modify the GPGPU-sim simulator to model number of common counters per context, segment size. For the memory protection techniques, which consist of encryp- our evaluation, the segment size is set to 128KB, and the tion/decryption and integrity verification logics. We develop number of common counters per context is 15, requiring 4 the COMMONCOUNTER scheme on top of SC_128 as the bits per segment. For each 1GB GPU memory, 4KB of CCSM baseline system. In addition, we evaluate Morphable counters capacity is necessary. with a higher arity of 256 counters per cacheline [38]. 
E. Hardware Overheads

Meta-data size: The CCSM must be allocated at a fixed location of GPU memory, along with the other security meta-data. The size of the CCSM depends on the size of GPU memory, the number of common counters per context, and the segment size. For our evaluation, the segment size is set to 128KB, and the number of common counters per context is 15, requiring 4 bits per segment. For each 1GB of GPU memory, 4KB of CCSM capacity is necessary.

On-chip storage: A common counter set requires 15×32 bits when the number of common counters per context is 15. If multiple contexts can run simultaneously on a GPU, multiple common counter sets need to be added to the on-chip storage. Unless two contexts are running simultaneously, the common counter set is saved in the context meta-data memory and restored by the GPU scheduler.

The proposed architecture requires additional on-chip caches, including a 1KB CCSM cache, a 16KB counter cache, and a 16KB hash cache. Note that the counter cache and hash cache are also included in the prior memory protection techniques, such as SC_128 and Morphable counters. We use CACTI 6.5 [31] to estimate the area and power overhead of the additional on-chip caches. We find that all of the on-chip caches amount to an area of 0.11 mm², which is 0.02% of the die area of the TITAN X Pascal (GP102), with 11.28 mW of leakage power consumption.

V. EVALUATION

A. Methodology

Simulator: We develop the common counter architecture on a GPU simulator, GPGPU-Sim [2]. Table I elaborates the detailed GPU configuration, which models an NVIDIA TITAN X Pascal (GP102) [18]. The evaluated GPU model is equipped with GDDR5X DRAM, not with 3D stacked memory such as HBM2. We modify the GPGPU-Sim simulator to model the memory protection techniques, which consist of encryption/decryption and integrity verification logic. We develop the COMMONCOUNTER scheme on top of SC_128 as the baseline system. In addition, we evaluate Morphable counters with a higher arity of 256 counters per cacheline [38].

TABLE I: Configuration of simulated GPU system.

GPU Core Configuration
  System Overview  : 28 cores, 32 execution units per core
  Shader Core      : 1417MHz, 32 threads per warp, GTO scheduler
Cache and Memory Configuration
  Private L1 Cache : 48KB, 6-way associative, LRU
  Shared L2 Cache  : 3MB, 16-way associative, LRU
  Counter Cache    : 16KB, 8-way associative, LRU
  Hash Cache       : 16KB, 8-way associative, LRU
  CCSM Cache       : 1KB, 8-way associative, LRU
  DRAM             : GDDR5X 1251 MHz, 12 channels, 16 banks per rank

Workloads: Table II lists the GPU workloads we use for evaluation. We use a wide range of benchmarks, provided by assorted GPU benchmark suites including ISPASS [2], Polybench [13], Rodinia [6], and Pannotia [5]. We evaluate various types of GPU workloads since they have diverse memory access patterns. The column "Access Patterns" shows the overall memory access pattern of the corresponding applications: either 1) memory divergent, where the memory accesses of warps are not well coalesced into a smaller number of accesses, or 2) memory coherent, where the accesses are well coalesced and a smaller number of accesses is requested from the off-chip memory. We run each workload either to completion or up to 1B instructions [21], [48].

TABLE II: Evaluated benchmarks.

Access Patterns  | Benchmark Suite | Workload (Abbr.)
Memory Divergent | Polybench | ges, atax, mvt, bicg
                 | Pannotia  | fw, bc
                 | ISPASS    | mum
Memory Coherent  | Polybench | gemm, fdtd-2d, 3dconv
                 | Rodinia   | backprop (bp), hotspot, streamcluster (sc), bfs, heartwall, gaussian (gaus), srad_v2, lud
                 | Pannotia  | sssp, pagerank (pr), mis, color
                 | ISPASS    | nn, sto, lib, ray, lps, nqu

Data MAC verification: We consider two different approaches for data MAC verification. The first approach is to bring the data along with its MAC from the memory, which creates more pressure on the off-chip memory bandwidth. The second approach is the technique proposed in an inspiring prior work, Synergy [39], which proposes to use the ECC chip as a carrier for the MAC and virtually offers the complete elimination of the MAC-induced performance overhead. COMMONCOUNTER is not designed for MAC-induced overhead reduction, which makes the use of Synergy favorable to our technique. We evaluate the performance with these two different assumptions with respect to the data MAC verification approaches and report the results for both cases.

Fig. 13: Performance of the two existing memory protection schemes (SC_128 and Morphable) and the proposed COMMONCOUNTER technique with a 16KB counter cache: (a) separate MACs, and (b) Synergy-proposed in-line MACs. Performance is normalized to the GPU without memory protection.

B. Experimental Results

Performance improvement: Figure 13 shows the normalized IPC of three different memory protection schemes: (1) the split counter technique (SC_128) [53], (2) Morphable counters (Morphable) [38], and (3) the COMMONCOUNTER technique (COMMONCOUNTER). Figure 13(a) provides the results for the case where both data and MAC are retrieved from off-chip memory, while Figure 13(b) shows the results when the MAC read is inlined with the ECC read, as Synergy suggested.

The gap between Figure 13(a) and (b) represents the performance loss incurred by the data MAC verification overhead, when the MAC is separately read from memory. COMMONCOUNTER reduces the performance degradation of secure memory in both cases, but the performance improvements of COMMONCOUNTER over SC_128 and Morphable are higher with the Synergy MAC design. Compared to the unsecure baseline, COMMONCOUNTER reduces the performance degradation from 13.9% in (a) to 2.9% in (b) on average. The non-negligible jump from (a) to (b) suggests that the integration of our COMMONCOUNTER technique with efficient data MAC verification techniques such as Synergy would potentially offer significant performance improvement. As discussed in Section III-A, it shows the need to reduce both counter cache misses and MAC traffic to improve overall performance.

In Figure 13(b), we see that the SC_128 and Morphable techniques exhibit, on average, 20.7% and 11.5% performance degradations, while COMMONCOUNTER incurs only 2.9% performance overhead. The significant reduction comes from the fact that the proposed cache subsystem effectively serves the per-cacheline counter requests using the common counters. The benchmark applications with high performance degradation (ges, atax, mvt, bicg, sc, and srad_v2) are improved significantly. The performance improvements range from 46.4% and 42.4% for srad_v2 over SC_128 and Morphable respectively, to 326.2% and 156.4% for ges. Although COMMONCOUNTER provides better performance than SC_128 and Morphable for most of the workloads, COMMONCOUNTER exhibits lower performance than Morphable for lib and bfs. For these two applications, there are many LLC misses not served by common counters due to their update patterns, and the 256-arity counter block of Morphable provides better performance than the 128-arity block used for COMMONCOUNTER. However, COMMONCOUNTER can be improved by adding common counters on top of Morphable, increasing the base arity of its counter block.

Fig. 14: Ratio of LLC misses served by common counters.

The performance improvement discussed above can be attributed to the effective use of common counters. Figure 14 shows the ratio of off-chip memory accesses served by common counters over all counter requests. Each bar is decomposed into read-only data (light grey) and non-read-only data (dark grey). We notice that for benchmarks that see large performance benefits in Figure 13 (i.e., ges, atax, mvt, bicg, and sc), the memory access coverage by common counters is close to 100%, which shows the correlation between the effectiveness of common counters and the performance gains.

Sensitivity to counter cache size: Figure 15 shows how the performance varies as the counter cache size changes. In general, COMMONCOUNTER is less influenced by the counter cache size change compared to SC_128, since many LLC miss requests are served by the common counters even when the associated counters are unavailable in the counter cache. For instance, when COMMONCOUNTER is used, sc experiences almost no performance loss in comparison with the non-secure GPU as the counter cache size decreases. In contrast, in the case of SC_128, sc exhibits 43.6% degradation when the counter cache size is 32KB, and the loss drastically increases up to 53.7% when the size is shrunk to 4KB. However, we observe that for some benchmarks, COMMONCOUNTER is also sensitive to the change of counter cache size. For instance, lib experiences significant performance losses when the counter cache size decreases, because this benchmark has very few opportunities to use common counters, as shown in Figure 14.

Fig. 15: Performance variations as the counter cache size ranges from 4KB to 32KB. The baseline is the non-secure vanilla GPU. This result is based on the configuration that employs Synergy for data MAC verification.

TABLE III: Scanning overhead. VI.DISCUSSION

# of Executed Total Workload Ratio* Integrated GPUs: Although this paper investigated a discrete Kernels Scan Size GPU with its own untrusted memory, COMMONCOUNTER can 3dconv 254 32,256 MB 0.372 % be extended to integrated GPU models with moderate changes. gemm 1 32 MB 0.090 % In the integrated GPU model, since CPU cores and a GPU bfs 24 4,108 MB 0.004 % share the memory through shared memory controllers, the bp 2 390 MB 0.372 % CPU cores and GPU can share not only the protected memory, color 28 5,650 MB 0.081 % fw 255 2,040 MB 0.114 % but also the memory encryption and integrity protection en- gine. To enable common counters for such an integrated GPU *Scanning overhead over total kernel execution time model, a separate encryption key is needed for each context individually for CPU and GPU. In addition, the counters for each context are separately managed, being reset in the initial- requests are served by the common counters, even though the ization of a context, unlike the common memory encryption associated counters are unavailable in the counter cache. For and counter managements in current secure processors. Such instance, when COMMONCOUNTER is used, sc experiences an approach was proposed by Rogers et. al [37] to integrate almost no performance loss in comparison with the non- the hardware memory encryption with virtual memory support. secure GPU, as the counter cache size decreases. In contrary, We plan to investigate the memory protection design of in case of SC_128, sc exhibits 43.6% degradation when the integrated GPUs as our future work. counter cache size is 32KB, and the loss drastically increases up to 53.7% when the size is shrunk to 4KB. However, we Overhead for secure CPU-GPU communication: For secure observe that for some benchmarks, COMMONCOUNTER is communication between CPU and GPU, data transfers must also sensitive to the change of counter cache size. For instance, be encrypted. Prior work suggested that memory copy and lib experiences significant performance losses when the counter authentication can be effectively parallelized, which minimizes cache size decreases, which is because this benchmark has the overhead [19], [50]. Additionally, Ghosh et al. proposed a very few opportunities to use common counters, as shown in hardware acceleration technique that significantly lowers the Figure 14. encryption and decryption overhead [12]. In this work, the overhead of encrypted communication is not evaluated, but Scanning overhead: To identify the segments capable of using with HW-based acceleration for encryption [12], the overhead common counters instead of the regular counters, the proposed for the CPU-GPU communication is expected to be small mechanism needs to scan the updated memory regions at the compared to the kernel execution time. boundary of every kernel execution. If the scanning overhead is significant, the performance benefits that come from the Support for multiple GPUs with shared memory: If mul- use of common counters would diminish. To roughly estimate tiple GPUs running a single application context are sharing the scanning overhead, we run simulations to identify the their memory, the GPUs must share the same encryption key size of updated memory regions. For the identified updated for the context. In addition, the coherence of counter caches memory sizes, we measure memory region scanning latencies must be maintained as different GPUs can potentially write to by using a real machine equipped with a GTX 1080. 
Support for multiple GPUs with shared memory: If multiple GPUs running a single application context share their memory, the GPUs must share the same encryption key for the context. In addition, the coherence of counter caches must be maintained, as different GPUs can write to shared data and update the corresponding counters. Although such support is also required for any multi-chip secure processor with protected shared memory, investigating it on GPUs is left as future work.

Concurrent kernel execution: Supporting concurrent kernel execution is possible with common counters. Assuming that GPUs provide context isolation with virtual memory, each context needs a separate encryption key along with its common counters. The hardware-based mechanism with CCSM and update-scanning finds common counters without considering contexts, since it operates on physical addresses and is therefore unaffected by multi-kernel execution.

VII. RELATED WORK

Secure processors: Prior to commercial trusted execution support, there were extensive studies on hardware-based secure processor designs [4], [28], [43], [47]. XOM investigated trusted execution in remote systems with hardware-based encryption and integrity protection [47]. AEGIS also proposed an architecture for tamper-evident and tamper-resistant execution environments with or without a trusted kernel [43], and it proposed an effective defense against replay attacks. The Secret-Protected (SP) design proposed hardware-based isolation for a security-critical module of user applications [28]. Bastion investigated multi-domain isolation with hardware-based memory protection [4]. For cloud systems running virtual machines, several studies have isolated virtual machines with hardware-based mechanisms, defending user virtual machines from potentially malicious hypervisors [20], [44], [51].

Memory protection: For the confidentiality and integrity protection of memory, there have been significant improvements over the past two decades. All hardware-based fine-grained integrity protection schemes resilient to replay attacks employ a variant of Merkle trees [30]. However, since integrity protection is the major source of performance degradation in memory protection, there have been extensive studies to mitigate its overheads. Suh et al. proposed a log-hash integrity verification technique to reduce the bandwidth consumption of Merkle trees, along with One-Time-Pad (OTP) encryption to mitigate decryption overheads [42]. Gassend et al. proposed a caching scheme for intermediate tree nodes [11]. To reduce counter access latencies, a separate counter cache structure was investigated for OTP encryption [54]. Shi et al. investigated a prediction scheme for counter values to hide counter access latencies [40]. Yan et al. proposed split counters with major and minor counters to increase the arity of integrity tree nodes [53]. The Bonsai Merkle Tree (BMT) significantly reduced the size of the integrity tree by constructing the tree over counters [37]. Lee et al. proposed memory scheduling and type-aware cache insertion schemes for multi-cores with protected memory [27]. Recent studies further reduce the overheads of counter cache misses and integrity tree management: VAULT uses a different arity in each level of the counter integrity tree [46], and Morphable Counters provide a compact counter representation, packing more counters per counter node with changeable bit arrangements [38].
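For reference, the counter-mode encryption that these designs build on can be stated compactly: the one-time pad for a memory block is the block cipher applied to the block address concatenated with its write counter, so pad generation can start as soon as the counter is known and overlap the data access. The PyCryptodome sketch below is a simplified model of the schemes in [40], [42], [54], with a demo-only key rather than any specific hardware design.

    from Crypto.Cipher import AES

    KEY = bytes(16)  # demo-only key; real designs derive and guard keys in hardware

    def one_time_pad(addr: int, counter: int) -> bytes:
        # Pad = AES_K(address || counter): unique for every (address, counter) pair.
        seed = addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
        return AES.new(KEY, AES.MODE_ECB).encrypt(seed)

    def xor_block(data16: bytes, addr: int, counter: int) -> bytes:
        pad = one_time_pad(addr, counter)
        return bytes(a ^ b for a, b in zip(data16, pad))

    block = b"sixteen-byte blk"
    ct = xor_block(block, addr=0x4000, counter=7)          # encrypt on write-back
    assert xor_block(ct, addr=0x4000, counter=7) == block  # decrypt on fetch

Since the pad must never repeat, the counter increments on every write to the same address; common counters compress exactly these counters for regions whose values stay uniform.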
GPU security: Several studies analyzed the vulnerabilities of GPU computation. CUDA Leaks [36] and Zhu et al. [56] investigated potential leaks through various architectural resources. Several studies showed potential problems of GPU computation through residual information in deallocated GPU memory pages [14], [26], [29], [55]. Recent studies analyzed side-channel attacks on GPUs [16], [23], [24], [32], [52]. Naghibijouybari et al. [32] demonstrated the vulnerability of GPUs to covert- and side-channel attacks. Kadam et al. proposed randomized logic for the coalescing unit [23] and optimized the coalescing mechanism to mitigate correlation timing attacks on GPUs [24]. Xu et al. proposed a decision-tree-based detection and hierarchical defense mechanism to prevent contention-based covert- and side-channel attacks [52]. Hunt et al. proposed an oblivious CPU-GPU communication protocol to mitigate side-channel attacks in GPU TEEs [16].

VIII. CONCLUSION

This study investigated the performance implications of supporting hardware-based memory protection for GPU computation. Based on the analysis of the uniform write characteristics of GPU applications, the paper proposed the use of common counters, almost entirely eliminating counter cache misses. Exploiting this GPU-specific memory behavior, the study showed that hardware-based memory protection is feasible for high-performance GPU computation.

IX. ACKNOWLEDGMENT

This work was supported by the National Research Foundation of Korea (NRF-2019R1A2B5B01069816) and the Institute for Information & communications Technology Promotion (IITP2017-0-00466). Both grants are funded by the Ministry of Science and ICT, Korea. This work was also partly supported by Samsung Electronics Co., Ltd. (IO201209-07864-01).
REFERENCES

[1] ARM Limited, "Security Technology Building a Secure System using TrustZone Technology (White Paper)," 2009.
[2] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009.
[3] D. Boggs, G. Brown, N. Tuck, and K. Venkatraman, "Denver: Nvidia's First 64-bit ARM Processor," IEEE Micro, 2015.
[4] D. Champagne and R. B. Lee, "Scalable Architectural Support for Trusted Software," in International Symposium on High Performance Computer Architecture (HPCA), 2010.
[5] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron, "Pannotia: Understanding Irregular GPGPU Graph Applications," in International Symposium on Workload Characterization (IISWC), 2013.
[6] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A Benchmark Suite for Heterogeneous Computing," in International Symposium on Workload Characterization (IISWC), 2009.
[7] C. de Masson d'Autume, S. Mohamed, M. Rosca, and J. Rae, "Training Language GANs from Scratch," in Neural Information Processing Systems (NeurIPS), 2019.
[8] E. W. Dijkstra et al., "A Note on Two Problems in Connexion with Graphs," Numerische Mathematik, 1959.
[9] M. Ditty, A. Karandikar, and D. Reed, "Nvidia's Xavier SoC," in Hot Chips: A Symposium on High Performance Chips, 2018.
[10] D. Franklin, "NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge," NVIDIA Accelerated Computing, Parallel Forall, 2017.
[11] B. Gassend, G. E. Suh, D. Clarke, M. van Dijk, and S. Devadas, "Caches and Hash Trees for Efficient Memory Integrity Verification," in International Symposium on High-Performance Computer Architecture (HPCA), 2003.
[12] S. Ghosh, L. S. Kida, S. J. Desai, and R. Lal, "A > 100 Gbps Inline AES-GCM Hardware Engine and Protected DMA Transfers between SGX Enclave and FPGA Accelerator Device," IACR Cryptol. ePrint Arch., 2020.
[13] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-Tuning a High-Level Language Targeted to GPU Codes," in Innovative Parallel Computing (InPar), 2012.
[14] A. B. Hayes, L. Li, M. Hedayati, J. He, E. Z. Zhang, and K. Shen, "GPU Taint Tracking," in USENIX Annual Technical Conference (ATC), 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in Computer Vision and Pattern Recognition (CVPR), 2016.
[16] T. Hunt, Z. Jia, V. Miller, A. Szekely, Y. Hu, C. J. Rossbach, and E. Witchel, "Telekine: Secure Computing with Cloud GPUs," in USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2020.
[17] Intel Corp., "Intel(R) Software Guard Extensions for Linux* OS, linux-sgx," 2016. [Online]. Available: https://github.com/intel/linux-sgx
[18] A. Jain, M. Khairy, and T. G. Rogers, "A Quantitative Evaluation of Contemporary GPU Simulation Methodology," ACM on Measurement and Analysis of Computing Systems (SIGMETRICS), 2018.
[19] I. Jang, A. Tang, T. Kim, S. Sethumadhavan, and J. Huh, "Heterogeneous Isolated Execution for Commodity GPUs," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
[20] S. Jin, J. Ahn, S. Cha, and J. Huh, "Architectural Support for Secure Virtualization under a Vulnerable Hypervisor," in International Symposium on Microarchitecture (MICRO), 2011.
[21] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, O. Mutlu, R. Iyer, and C. R. Das, "Orchestrated Scheduling and Prefetching for GPGPUs," in International Symposium on Computer Architecture (ISCA), 2013.
[22] S. Jones, "Introduction to Dynamic Parallelism," in GPU Technology Conference Presentation S, 2012.
[23] G. Kadam, D. Zhang, and A. Jog, "RCoal: Mitigating GPU Timing Attack via Subwarp-based Randomized Coalescing Techniques," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[24] G. Kadam, D. Zhang, and A. Jog, "BCoal: Bucketing-based Memory Coalescing for Efficient and Secure GPUs," in International Symposium on High Performance Computer Architecture (HPCA), 2020.
[25] N. Kanopoulos, N. Vasanthavada, and R. L. Baker, "Design of an Image Edge Detection Filter Using the Sobel Operator," IEEE Journal of Solid-State Circuits, 1988.
[26] M. Kerrisk, "XDC2012: Graphics Stack Security," 2012. [Online]. Available: https://lwn.net/Articles/517375/
[27] J. Lee, T. Kim, and J. Huh, "Reducing the Memory Bandwidth Overheads of Hardware Security Support for Multi-Core Processors," IEEE Transactions on Computers (TC), vol. 65, no. 11, 2016.
[28] R. B. Lee, P. C. Kwan, J. P. McGregor, J. Dwoskin, and Z. Wang, "Architecture for Protecting Critical Secrets in Microprocessors," in International Symposium on Computer Architecture (ISCA), 2005.
[29] S. Lee, Y. Kim, J. Kim, and J. Kim, "Stealing Webpages Rendered on Your Browser by Exploiting GPU Vulnerabilities," in IEEE Symposium on Security and Privacy (S&P), 2014.
[30] R. C. Merkle, "Protocols for Public Key Cryptosystems," in IEEE Symposium on Security and Privacy (S&P), 1980.
[31] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A Tool to Model Large Caches," HP Laboratories, 2009.
[32] H. Naghibijouybari, A. Neupane, Z. Qian, and N. Abu-Ghazaleh, "Rendered Insecure: GPU Side Channel Attacks are Practical," in ACM SIGSAC Conference on Computer and Communications Security (CCS), 2018.
[33] NVIDIA Corp., "NVIDIA GeForce GTX 1080," 2017. [Online]. Available: https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf
[34] NVIDIA Corp., "NVIDIA Turing Architecture," 2018. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
[35] NVIDIA Corp., "NVIDIA Ampere GA102 GPU Architecture," 2020. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
[36] R. D. Pietro, F. Lombardi, and A. Villani, "CUDA Leaks: A Detailed Hack for CUDA and a (Partial) Fix," ACM Transactions on Embedded Computing Systems (TECS), 2016.
[37] B. Rogers, S. Chhabra, M. Prvulovic, and Y. Solihin, "Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly," in International Symposium on Microarchitecture (MICRO), 2007.
[38] G. Saileshwar, P. J. Nair, P. Ramrakhyani, W. Elsasser, J. A. Joao, and M. K. Qureshi, "Morphable Counters: Enabling Compact Integrity Trees for Low-Overhead Secure Memories," in International Symposium on Microarchitecture (MICRO), 2018.
[39] G. Saileshwar, P. J. Nair, P. Ramrakhyani, W. Elsasser, and M. K. Qureshi, "Synergy: Rethinking Secure-Memory Design for Error-Correcting Memories," in International Symposium on High Performance Computer Architecture (HPCA), 2018.
[40] W. Shi, H.-H. S. Lee, M. Ghosh, C. Lu, and A. Boldyreva, "High Efficiency Counter Mode Security Architecture via Prediction and Precomputation," in International Symposium on Computer Architecture (ISCA), 2005.
[41] J. Smith and S.-F. Chang, "Quad-Tree Segmentation for Texture-based Image Query," in ACM International Conference on Multimedia, 1994.
[42] G. E. Suh, D. Clarke, B. Gassend, M. van Dijk, and S. Devadas, "Efficient Memory Integrity Verification and Encryption for Secure Processors," in International Symposium on Microarchitecture (MICRO), 2003.
[43] G. E. Suh, D. Clarke, B. Gassend, M. van Dijk, and S. Devadas, "AEGIS: Architecture for Tamper-evident and Tamper-resistant Processing," in International Conference on Supercomputing (ICS), 2003.
[44] J. Szefer and R. B. Lee, "Architectural Support for Hypervisor-Secure Virtualization," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2012.
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," in Computer Vision and Pattern Recognition (CVPR), 2015.
[46] M. Taassori, A. Shafiee, and R. Balasubramonian, "VAULT: Reducing Paging Overheads in SGX with Efficient Integrity Verification Structures," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2018.
[47] D. L. C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, "Architectural Support for Copy and Tamper Resistant Software," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2000.
[48] N. Vijaykumar, E. Ebrahimi, K. Hsieh, P. B. Gibbons, and O. Mutlu, "The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality in GPUs," in International Symposium on Computer Architecture (ISCA), 2018.
[49] O. Villa, M. Stephenson, D. Nellans, and S. W. Keckler, "NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs," in International Symposium on Microarchitecture (MICRO), 2019.
[50] S. Volos, K. Vaswani, and R. Bruno, "Graviton: Trusted Execution Environments on GPUs," in Operating Systems Design and Implementation (OSDI), 2018.
[51] Y. Xia, Y. Liu, and H. Chen, "Architecture Support for Guest-transparent VM Protection from Untrusted Hypervisor and Physical Attacks," in International Symposium on High Performance Computer Architecture (HPCA), 2013.
[52] Q. Xu, H. Naghibijouybari, S. Wang, N. Abu-Ghazaleh, and M. Annavaram, "GPUGuard: Mitigating Contention based Side and Covert Channel Attacks on GPUs," in ACM International Conference on Supercomputing (ICS), 2019.
[53] C. Yan, D. Englender, M. Prvulovic, B. Rogers, and Y. Solihin, "Improving Cost, Performance, and Security of Memory Encryption and Authentication," in International Symposium on Computer Architecture (ISCA), 2006.
[54] J. Yang, Y. Zhang, and L. Gao, "Fast Secure Processor for Inhibiting Software Piracy and Tampering," in International Symposium on Microarchitecture (MICRO), 2003.
[55] Z. Zhou, W. Diao, X. Liu, Z. Li, K. Zhang, and R. Liu, "Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU," Proceedings on Privacy Enhancing Technologies (PoPETs), 2017.
[56] Z. Zhu, S. Kim, Y. Rozhanski, Y. Hu, E. Witchel, and M. Silberstein, "Understanding the Security of Discrete GPUs," in General Purpose GPUs (GPGPU), 2017.