
NVCool: When Non-Volatile Caches Meet Cold Boot Attacks*

Xiang Pan∗, Anys Bacha†, Spencer Rudolph∗, Li Zhou∗, Yinqian Zhang∗, Radu Teodorescu∗
∗The Ohio State University, {panxi, rudolph, zholi, yinqian, teodores}@cse.ohio-state.edu
†University of Michigan, [email protected]

Abstract—Non-volatile memories (NVMs) are expected to replace traditional DRAM and SRAM for both off-chip and on-chip storage. It is therefore crucial to understand their security vulnerabilities before they are deployed widely. This paper shows that NVM caches are vulnerable to so-called "cold boot" attacks, which involve physical access to the processor's cache. SRAM caches have generally been assumed invulnerable to cold boot attacks, because SRAM data is only persistent for a few milliseconds even at cold temperatures. Our study explores cold boot attacks on NVM caches and defenses against them. In particular, this paper demonstrates that hard disk encryption keys can be extracted from the NVM cache in multiple attack scenarios. We demonstrate a reproducible attack with very high probability of success. This paper also proposes an effective software-based countermeasure that can completely eliminate the vulnerability of NVM caches to cold boot attacks with a reasonable performance overhead.

*This work was supported in part by the National Science Foundation under grants CCF-1253933 and CCF-1629392, and by The Ohio Supercomputer Center.

I. INTRODUCTION

Non-volatile memory (NVM) such as Spin-Transfer Torque Random Access Memory (STT-RAM), Phase Change Memory (PCM), and Resistive Random Access Memory (ReRAM) is a promising candidate for replacing traditional DRAM and SRAM memories for both off-chip and on-chip storage [1]. NVMs in general have several desirable characteristics, including non-volatility, high density, better scalability at small feature sizes, and low leakage power [2]–[4]. Prior work has examined the performance, energy, and reliability implications of NVM register files, caches, and main memories [2]–[8]. The semiconductor industry is investing heavily in NVM technologies; companies such as Everspin [9] and Crossbar [10] are focused exclusively on NVM technologies and have produced NVM chips that are being sold today. A joint effort from Intel and Micron has yielded 3D XPoint [11], a new generation of NVM devices with very low access latency and high endurance, expected to come to the market this year. Hewlett Packard Enterprise's ongoing "The Machine" project [12] is also set to release computers equipped with memristor technology (also known as ReRAM) as part of enabling highly scalable memory subsystems. The industry expects non-volatile memory to replace DRAM off-chip storage in the near future, and SRAM on-chip storage in the medium term.

Despite all their expected benefits, non-volatile memories will also introduce new security vulnerabilities, because data stored in these memories persists even after power is removed. In particular, non-volatile memory is especially vulnerable to "cold boot" attacks. Cold boot attacks, as first proposed by Halderman et al. [13], use a cooling agent to lower the temperature of DRAM chips before physically removing them from the targeted system. Electrical characteristics of DRAM capacitors at very low temperatures cause data to persist in the chips for a few minutes even in the absence of power. This allows the attacker to plug the chips into a different machine and scan the memory image in an attempt to extract secret information. When the memory is implemented using NVM, cold boot attacks become much simpler and more likely to succeed, because data persists through power cycles.

To protect against cold boot attacks on main memory, one approach is to encrypt sensitive data [14]–[21]. Another approach is to keep secret keys in SRAM-based CPU registers, caches, and other internal storage during system execution [22]–[31]. The rationale behind this design philosophy is that cold boot attacks against on-chip SRAM structures are deemed to be extremely difficult, because SRAM data persistence at cold temperatures is limited to a few milliseconds [32].

The security implications of implementing main memory and microprocessor caches with NVM have received little attention. While memory encryption schemes are feasible, cache encryption is not practical due to low access latency requirements. Cold boot attacks on unencrypted NVM caches will be a serious concern in practice, especially given the ubiquity of smart mobile and Internet of Things (IoT) devices, which are more exposed to physical tampering, a typical setup for cold boot attacks.

This work examines the security vulnerabilities of microprocessors with NVM caches. In particular, we show that encryption keys can be retrieved from NVM caches if an attacker gains physical access to the device. Since removing the processor from a system no longer erases on-chip memory content, sensitive information can be leaked. This paper demonstrates that AES keys can be identified in the NVM caches of a simulated ARM-based system running the Ubuntu Linux OS.

We have examined multiple attack scenarios to evaluate the probability of a successful attack depending on the system activity, attack timing, and methodology. In order to search for AES keys in cache images, we adapted the key search algorithm presented in [13] for main memories, making the necessary modifications to target caches, which cover non-contiguous subsets of the memory space. We find that the probability of identifying an intact AES key when the processor is stopped at a random point during execution ranges between 5% and 100%, depending on the workload and cache size. We also demonstrate a reproducible attack with 100% probability of success for the system we study.

To counter such threats, this paper proposes an effective software-based countermeasure. We patch the Linux kernel to allocate sensitive information into designated memory pages that we mark as uncacheable in their page table entries (PTEs). This way, secret information is never loaded into vulnerable NVM caches, but is only stored in main memory and/or on the hard disk, which can be encrypted at a reasonable performance cost. The performance overhead of this countermeasure ranges between 2% and 45% depending on the hardware configuration.

Overall, this paper makes the following main contributions:
• The first work to examine cold boot attacks on non-volatile caches.
• Two types of cold boot attacks have been performed and shown to be effective on non-volatile caches.
• A software-based countermeasure has been developed and proven to be effective.
• An algorithm for identifying AES keys in cache images has been implemented.

The rest of this paper is organized as follows: Section II provides background information and discusses related work. Section III explains the threat model. Section IV illustrates our cache-based AES key search algorithm. Section V describes our experimental methodology. Section VI presents and evaluates two types of cold boot attacks. Section VII presents countermeasures and evaluates their effectiveness and performance overhead. Finally, Section VIII concludes.
II. BACKGROUND AND RELATED WORK

In this section we provide background information relevant to our study and discuss related work in the context of cold boot attacks.

A. Advanced Encryption Standard (AES)

The Advanced Encryption Standard (AES) [33] is a symmetric block cipher that encrypts or decrypts a block of 16 bytes of data at a time. Both encryption and decryption use the same secret key. The commonly used AES key sizes are 128-bit (AES-128), 192-bit (AES-192), and 256-bit (AES-256). Before performing any encryption/decryption operations, the secret key must be expanded into a key schedule consisting of individual subkeys that are used in the different internal rounds of the AES (Rijndael) algorithm. The key expansion process is shown in Figure 1. The key schedule starts with the original secret key, which is treated as the initial subkey. The following subkeys (also known as round keys) are computed from the previously generated subkeys using publicly known functions such as Rot-Word, Sub-Word, Rcon, EK, and K defined in the key expansion algorithm [34]. In each round of the key generation, the same sequence of functions is executed. Each newly generated subkey has the same size, e.g., 16 bytes in AES-128. This process repeats for a fixed number of rounds (e.g., 10 rounds in AES-128) until the expanded key is completely generated. Each subkey from the expanded key is then used in a separate round of the AES encryption or decryption algorithm.

Fig. 1: AES-128 key schedule with details on key expansion.
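For AES-128, the word-level form of this expansion (standard FIPS-197 notation [33], where w0 through w3 hold the original key and Rcon denotes the round constants) is: w[i] = w[i-4] XOR SubWord(RotWord(w[i-1])) XOR Rcon[i/4] when i is a multiple of 4, and w[i] = w[i-4] XOR w[i-1] otherwise, for 4 <= i < 44. It is precisely this deterministic relationship between consecutive round keys that the key search algorithm in Section IV exploits to recognize a key schedule inside a cache image.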
B. ARM's Cryptographic Acceleration

Many of today's processors include vectorization support in the form of Single Instruction Multiple Data (SIMD) engines. In ARM processors, the SIMD engine, called NEON, consists of 32 128-bit registers (v0 - v31). A single NEON register can conveniently hold a 128-bit AES key or a full round of the AES key schedule, obviating the need for multiple accesses to the cache or the memory subsystem. In addition, ARMv8 introduces cryptographic extensions that include new instructions that can be used in conjunction with NEON for the AES, SHA1, and SHA2-256 algorithms. In the case of AES, the available instructions are: AESD for single-round decryption, AESE for single-round encryption, AESIMC for inverse mix columns, and VMULL for polynomial multiply long. In this paper, we explore the use of the NEON engine, in addition to the ARMv8 cryptographic extensions.

C. Cold Boot Attacks and Defenses

The idea of cold boot attacks on modern systems was first explored by Halderman et al. [13]. Their work consisted of extracting disk encryption keys from information present in main memory (DRAM) images preserved across a power cycle. This type of attack builds on the premise that under low temperature conditions, DRAM chips preserve their content for extended time durations. The attack also relies on the fact that AES keys can be inferred by examining relationships between subkeys, which involves computing Hamming distance information. The AES key search algorithm was proposed in [13] as well. Muller et al. [35] later extended cold boot attacks to mobile devices. When volatile memories such as SRAM and DRAM are replaced by non-volatile ones (e.g., STT-RAM, PCM, and ReRAM) in future computers, cold boot attacks will become much easier to perform, since data will be preserved for several years after the power supply is cut off, without the need for any special techniques such as cooling. Our work is the first to study cold boot attacks in the context of non-volatile caches.
Prior work has proposed encrypting memory data in order to prevent attackers from easily extracting secret information from main memory [14]–[21]. Although this approach is effective for main memory, encryption techniques are challenging to apply to caches because of their large performance overhead. Other researchers have proposed storing secret keys away from main memory, in CPU registers, caches, and other internal storage, during system execution [22]–[28], [30], [31]. However, these approaches have not considered the vulnerability of data stored in CPU caches, especially since keys stored in CPU registers and other internal storage can still be fetched into caches during execution [23]–[25], [28], [30].

III. THREAT MODEL

The threat model assumed in this study is consistent with prior work on "cold boot" attacks. In particular, we assume the attacker gains physical access to the target device (e.g., an IoT device, a smartphone, a laptop, or a desktop computer). Further, the attacker is assumed to have the ability to extract the microprocessor or system from the device and install it into a debugging platform, which allows cache data to be accessed. In practice, such a platform is not hard to obtain. Many microprocessor manufacturers offer debugging and development platforms that allow a variety of access functions, including functionality to retrieve the cache content.

For example, for the ARM platform, the manufacturer offers the DS-5 development software [36] and the associated DSTREAM Debug and Trace hardware unit [37]. These tools enable debugging and introspection into ARM processor-based hardware. The attacked microprocessor can be plugged into a development board such as the Juno ARM Development Platform [38], either directly or through the JTAG debugging interface. In the DS-5 software, the Cache Data View can be used to examine the contents of all levels of caches and TLBs. Information such as cache tags, flags, and data associated with each cache line, as well as the index of each cache set, can be read and then exported to a file for further processing.

IV. CACHE-AWARE AES KEY SEARCH

In this paper, we study the security of NVM-based microprocessor caches by demonstrating AES key extraction attacks under the aforementioned threat model.

A. Technical Challenges

An algorithm for identifying AES keys in a main memory image has been presented by Halderman et al. in their seminal work on cold boot attacks [13]. Its application to caches, however, is not straightforward. In particular, the AES key search algorithm in [13] assumes that a complete AES key schedule is stored in a physically contiguous memory region. This is a relatively safe assumption in their case, since memory pages on modern computers are at least 4KB and a complete AES key schedule is 176 bytes (128-bit key, AES-128) to 240 bytes (256-bit key, AES-256). The algorithm proposed by Halderman et al. can, therefore, simply scan the entire memory image sequentially to search for potential AES keys [13]. However, neither the completeness of the key schedule nor the contiguity of the memory space can be assumed in the case of caches.

Non-contiguous memory space. Caches only capture a small, non-contiguous subset of the memory space. Since cache lines are typically only 32-128 bytes, data that was originally stored in physically contiguous memory is not necessarily stored in contiguous cache regions. Therefore, the logically sequential AES key schedule, typically 176 to 240 bytes, can be split across multiple physically disjoint cache lines, as shown in Figure 2(a).

Incomplete key schedules. Another relevant cache property is that data stored in the cache is subject to frequent replacements. Parts of a complete AES key schedule can be missing from the cache, which makes the key search more difficult to conduct. Examples of these situations are shown in Figures 2(b) and 2(c). In particular, in Figure 2(b), the cache line that holds RK-2, RK-3, RK-4, and RK-5 has been evicted from the cache.

Fig. 2: Cache view with (a) complete and (b, c) incomplete AES key schedules stored in disjoint 64-byte lines.

B. Search Algorithm Design

To address these issues, our algorithm first reconstructs the cache image by sorting cache lines by their physical addresses, which we extract from the cache tags and indexes, and then feeds the reconstructed cache image to the AES key search function. In this way, logically contiguous data remains contiguous in our reconstructed cache image regardless of the cache architecture.
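As an illustration, the fragment below is a minimal sketch of this reconstruction step in C. The record layout, the field names, and the 8MB/16-way/64-byte geometry (Table I) are assumptions about an already-parsed cache export; they are not the exact format produced by the DS-5 tools.

    #include <stdint.h>
    #include <stdlib.h>

    #define LINE_BYTES   64   /* L2 block size (Table I)           */
    #define OFFSET_BITS   6   /* log2(64)                          */
    #define SET_BITS     13   /* 8MB / 64B / 16 ways = 8192 sets   */

    /* One cache line as parsed from the exported cache image (assumed). */
    struct cache_line {
        uint64_t tag;               /* physical tag read from the dump   */
        uint32_t set;               /* set index of the line             */
        uint8_t  data[LINE_BYTES];  /* line contents                     */
        uint64_t paddr;             /* reconstructed physical address    */
    };

    /* Rebuild the physical address of a line from its tag and set index. */
    static uint64_t line_paddr(const struct cache_line *l)
    {
        return (l->tag << (SET_BITS + OFFSET_BITS)) |
               ((uint64_t)l->set << OFFSET_BITS);
    }

    static int cmp_paddr(const void *a, const void *b)
    {
        const struct cache_line *x = a, *y = b;
        return (x->paddr > y->paddr) - (x->paddr < y->paddr);
    }

    /* Sort the lines by physical address so that logically contiguous data,
     * such as an AES key schedule, becomes contiguous again. */
    void reconstruct_image(struct cache_line *lines, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            lines[i].paddr = line_paddr(&lines[i]);
        qsort(lines, n, sizeof(lines[0]), cmp_paddr);
    }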
QuickSearch. Our key search algorithm runs through each key schedule candidate (all cache words) and first attempts to validate the first round of the key expansion (16 bytes). If there is a match between the first-round expansion of the candidate key and the data stored in the cache, the candidate key is validated. As long as the key itself, followed by one round (16 bytes) of the expanded key, exists in the cache, our algorithm can successfully detect the key, as shown in Figure 2(b). We call this variant of the key search algorithm QuickSearch.

DeepSearch. The AES key schedule is stored on multiple cache lines, since it is larger (at least 176 bytes for AES-128) than the typical cache block (32-128 bytes). Cache evictions can displace parts of the AES key schedule from the cache, including the first round of the key expansion, which our QuickSearch algorithm uses to validate the key. These cases are rare, since they require the memory alignment to be such that the encryption key falls at the end of a cache line and the first round of the key expansion is on a different line. To deal with these cases, we designed a more in-depth algorithm (which we call DeepSearch) that considers multiple rounds of the key expansion. In this implementation, as long as the key itself is inside the cache and there exist two consecutive rounds of expanded keys, our algorithm can find the key, as shown in Figure 2(c). The downsides of DeepSearch are that it runs considerably slower than QuickSearch and that the search produces some false positives.
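To make the QuickSearch validation step concrete, the sketch below checks whether the 16 bytes following a candidate key in the reconstructed image satisfy the AES-128 key-schedule relations. To stay short it only tests the three relations that do not require the S-box (w5 = w1 XOR w4, and so on); the full algorithm also validates the first word of the next round through RotWord/SubWord/Rcon, and DeepSearch applies the same test to two consecutive rounds. This is a simplified sketch, not the exact implementation used in our experiments.

    #include <stdint.h>
    #include <string.h>

    /* Read one 32-bit key-schedule word as raw bytes.  Because the relations
     * below are pure XORs, comparing raw words is valid regardless of the
     * byte order used to store the schedule. */
    static uint32_t load_word(const uint8_t *p)
    {
        uint32_t w;
        memcpy(&w, p, sizeof(w));
        return w;
    }

    /* Partial QuickSearch check: do bytes [16..31] look like round 1 of the
     * key schedule whose round 0 is bytes [0..15]?  Returns 1 if the three
     * S-box-free relations of the FIPS-197 recurrence hold, 0 otherwise. */
    int quick_candidate_check(const uint8_t *cand)
    {
        uint32_t w[8];
        for (int i = 0; i < 8; i++)
            w[i] = load_word(cand + 4 * i);

        /* For i not divisible by 4: w[i] = w[i-4] ^ w[i-1]. */
        return (w[5] == (w[1] ^ w[4])) &&
               (w[6] == (w[2] ^ w[5])) &&
               (w[7] == (w[3] ^ w[6]));
    }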
C. Implementation-Specific Considerations

We demonstrate the AES key extraction attack against dm-crypt, a cryptographic module that is used in mainstream Linux kernels. As will be explained in Section V, the specific target of our demonstrated attacks is the disk encryption/decryption application of the Ubuntu OS, LUKS (Linux Unified Key Setup), which by default invokes the dm-crypt kernel module for disk encryption/decryption using the AES-XTS mode [39].

In the AES implementation of dm-crypt, the decryption key schedule is different from the encryption one. We illustrate the key schedule for the decryption process in Figure 3(a). The decryption key schedule is first reversed in rounds relative to the encryption key schedule. An inverse mix columns operation is then applied to rounds 1 through 9 of the key schedule. As a result, we need to search for encryption keys and decryption keys separately. Specifically, before searching for decryption keys, we first convert the candidate schedules back to the encryption key schedule format.

Fig. 3: Details on AES implementation-dependent modifications to the key search algorithm.

Another artifact that affects our key search algorithm is specific to little-endian machines, which store the least significant byte at the lowest address. dm-crypt adopts an implementation which stores the AES key schedule as an array of words (i.e., 4 bytes each) instead of bytes. This leads to a mismatch in the representation on little-endian architectures, as shown in the first line of Figure 3(b). Our search algorithm takes this artifact into account and converts the little-endian representation to big-endian before conducting the key search.
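A minimal sketch of this normalization is shown below: it byte-swaps every 32-bit word of a candidate schedule region so that the search can operate on a single canonical byte order. The function names are ours, for illustration, and are not taken from dm-crypt.

    #include <stddef.h>
    #include <stdint.h>

    /* Swap the bytes of one 32-bit word. */
    static uint32_t bswap32(uint32_t x)
    {
        return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
               ((x << 8) & 0x00ff0000u) | (x << 24);
    }

    /* Convert a key-schedule region, stored by dm-crypt as an array of
     * little-endian 32-bit words, to the big-endian byte order expected by
     * our key search routines.  nwords is the region size in 32-bit words. */
    void schedule_to_big_endian(uint32_t *words, size_t nwords)
    {
        for (size_t i = 0; i < nwords; i++)
            words[i] = bswap32(words[i]);
    }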
We can see from the results in Figure 4 We ran the SPEC CPU2006 benchmark suite with both that when running compute-bound benchmarks the probability binaries and data stored in the encrypted hard drive to of finding AES keys in the cache is higher than when running simulate applications that run on the target system. SPEC memory bound benchmarks. On average, there is a 76% CPU2006 includes integer and floating-point single-threaded probability of finding AES keys in the system without NEON benchmarks, with a mix of computation-bound and memory- support when running compute-bound benchmarks and a 26% bound applications [42]. probability when running memory-bound benchmarks. To keep the simulation time reasonable, we use the check- When the system is configured with NEON support, the point functionality provided by gem5 [40] to bypass the OS probability of finding the key in the cache drops for both boot-up phase and ran each benchmark with up to 1 billion classes of benchmarks to 41% and 14% respectively. This is instructions. For experiments which require periodically taking because the NEON technology stores encryption keys in vector LLC image snapshots we use a sampling interval of 1 million registers that are large enough to hold the entire key schedule. instructions. To further test our attack scenario and coun- These registers are also infrequently used for other functions termeasure approach in a multi-programmed/multi-threaded which means they don’t have to be spilled to memory (and environment, we also ran several groups of mixed benchmarks cache). As a result, in processors with NEON support the from SPEC CPU2006 - mixC, mixM, and mixCM. As detailed encryption key is read from memory much less frequently, in Table II, mixC contains 8 computation-bound benchmarks, leading to better resilience to this type of attack. A typical mixM contains 8 memory-bound benchmarks, and mixCM case is seen in perlbench with 96% for NoNEON and 49% for contains 4 benchmarks from mixC and another 4 benchmarks NEON. However there are also exceptions as seen in povray from mixM. (100% for both systems) and gobmk (97% for NoNEON and 4% for NEON). On average, the probability of finding the key VI.VULNERABILITY ANALYSIS in this random attack is 40% for the system without NEON We examine the probability of successfully retrieving disk support and 22% for the system with NEON support. encryption keys from a processor’s last level cache under two There are two principal factors that affect the probability the attack scenarios. encryption key is found in the cache at random points during execution. The first is how recent the last disk transaction A. Random Information Harvesting was, since encrypted disk access requires the AES key to be We first investigate an attack scenario in which the attacker brought into the cache. The second factor is the cache miss gains access to the target machine and disconnects the pro- rate, since the higher the churn of the data stored in the cache cessor from the power supply at an arbitrary point during the the sooner an unused key will be evicted. Computation bound execution. We make no assumptions that the attacker has a way benchmarks in general have a smaller memory footprint so to force a certain code sequence to execute. This is typical, for their cache miss rates are lower, allowing keys to reside in the instance, when a defective device that stores sensitive data is cache for longer. Memory bound benchmarks, on the other discarded without proper security measures. 
VI. VULNERABILITY ANALYSIS

We examine the probability of successfully retrieving disk encryption keys from a processor's last-level cache under two attack scenarios.

A. Random Information Harvesting

We first investigate an attack scenario in which the attacker gains access to the target machine and disconnects the processor from the power supply at an arbitrary point during execution. We make no assumption that the attacker has a way to force a certain code sequence to execute. This is typical, for instance, when a defective device that stores sensitive data is discarded without proper security measures. Another example is when an attacker steals a device and physically disconnects its battery or power supply before removing the processor.

To study the probability of success for such an attack, we take periodic snapshots of the LLC at 1 million instruction intervals. We then run the QuickSearch key search algorithm on each cache snapshot. Figure 4 shows the probability of finding the AES keys in the 8MB last-level cache for different benchmarks. We examine systems with and without ARM's cryptographic acceleration (NEON) support. For easy reference, Table III summarizes the labels we use for the different experiments.

To better analyze the results, we classify the SPEC CPU2006 benchmarks into two categories: compute-intensive and memory-intensive. We can see from the results in Figure 4 that when running compute-bound benchmarks, the probability of finding AES keys in the cache is higher than when running memory-bound benchmarks. On average, there is a 76% probability of finding AES keys in the system without NEON support when running compute-bound benchmarks and a 26% probability when running memory-bound benchmarks.

When the system is configured with NEON support, the probability of finding the key in the cache drops for both classes of benchmarks, to 41% and 14% respectively. This is because the NEON technology stores encryption keys in vector registers that are large enough to hold the entire key schedule. These registers are also infrequently used for other functions, which means they do not have to be spilled to memory (and cache). As a result, in processors with NEON support the encryption key is read from memory much less frequently, leading to better resilience to this type of attack. A typical case is seen in perlbench, with 96% for NoNEON and 49% for NEON. However, there are also exceptions, as seen in povray (100% for both systems) and gobmk (97% for NoNEON and 4% for NEON). On average, the probability of finding the key in this random attack is 40% for the system without NEON support and 22% for the system with NEON support.

There are two principal factors that affect the probability that the encryption key is found in the cache at random points during execution. The first is how recent the last disk transaction was, since encrypted disk access requires the AES key to be brought into the cache. The second factor is the cache miss rate, since the higher the churn of the data stored in the cache, the sooner an unused key will be evicted. Computation-bound benchmarks in general have a smaller memory footprint, so their cache miss rates are lower, allowing keys to reside in the cache for longer. Memory-bound benchmarks, on the other hand, have a larger memory footprint associated with a higher cache miss rate, and therefore evict keys more frequently.

Fig. 4: Probability of finding AES keys in the 8MB LLC for various types of benchmarks (computation-bound and memory-bound; NoNEON and NEON systems).

Figure 5 illustrates these effects for selected benchmarks running on systems without NEON support, showing for each cache snapshot over time whether the key was found (1) or not (0). The figure also shows the cumulative miss rate of the LLC over the same time interval.

Benchmark dealII, shown in Figure 5a, is a good illustration of the behavior of a compute-bound application. The disk encryption key is brought into the cache early in the execution, as the application accesses the disk to read input data. The miss rate is low throughout the execution of the application, which means the key is never evicted, and the probability of finding the key while running this application is 100%.

Figure 5b shows the behavior of a memory-bound application, bzip2. Keys are brought into the cache early in the execution and remain in the cache for a period of time while the miss rate is low. The miss rate, however, spikes as the application begins processing large data blocks for compression. This evicts the key from the cache. A disk operation causes the key to be brought into the cache again, but the consistently high miss rate causes the key to be evicted shortly after that. Even though later in the execution the cache miss rate drops, the lack of disk accesses keeps the keys away from the cache for the rest of this run. Note that for clarity we only show a fraction of the total execution.

Fig. 5: AES key search trace showing the outcome of the key search and the cumulative LLC miss rate for (a) dealII and (b) bzip2 running on systems without ARM's cryptographic acceleration support (NoNEON).

We also collect results for multi-program mixed workloads to increase system and disk utilization and cache contention. The results are included in Figure 4 as mixC, mixM, and mixCM. The benchmark applications included in each mix are listed in Table II. As expected, when the system is fully utilized, with one application running on each core (for a total of 8), the probability of finding the key increases. This is because each application accesses the disk and those accesses occur at different times, causing the encryption key to be read more frequently. The compute-bound mixC shows a 100% probability of finding the key for both systems (with and without NEON support).

While a system with high utilization is clearly more vulnerable, a mitigating factor is that cache contention is also higher when many threads are running. As a result, cache miss rates are also higher and the key may be evicted more frequently. This is apparent when we examine the memory-bound mixM workload, which shows a 66% success rate without NEON support and 76% with NEON support. Even with the higher miss rate, the fully-loaded system is clearly more vulnerable than a lightly-loaded system, as seen in the single-threaded experiments. When a mix of both compute- and memory-bound applications is used (mixCM), the probability of finding the key is 85% for NoNEON and 82% for NEON.

We also note that the NEON-accelerated system is almost as vulnerable as the system without hardware acceleration when the system is running the mix workloads. This is likely caused by the more frequent spills of the vector registers when context switching between the kernel thread running the encryption/decryption process and the user threads. Spilling the vector registers holding the encryption key increases reads and writes of the key to and from memory, exposing it to the cache more frequently.

Figure 6 shows the overall probability of finding AES keys in systems with different LLC sizes. As expected, larger caches increase the system's vulnerability to this type of attack. We can see that as the cache size increases, the probability of success also increases across all the benchmarks. With a 2MB cache, the average probability of finding AES keys ranges from 4.6% to 85% depending on the system and application; for a 128MB LLC, the probability of a successful attack ranges from 70% to 100%. Increasing the cache size reduces capacity misses, therefore increasing the fraction of time the key spends in the cache.

Fig. 6: Overall probability of finding AES keys in LLCs of various sizes (2MB, 4MB, 8MB, and 128MB).
B. Targeted Power-off Attack

The second attack scenario we consider is one in which the attacker is able to trigger a graceful or forced power-off sequence before physically removing the processor. In this attack scenario, the attacker aims to use the power-off sequence to ensure the disk encryption keys are brought into the cache. Since the power-off sequence involves unmounting the disk, it results in a series of encryption/decryption transactions that bring the encryption key into the cache. During the system shutdown process, the attacker can stop the execution at any time to search for secret keys, or simply wait until the device is completely powered off and then examine cache images for secret keys. The goal of this attack is to find a reproducible and reliable way to obtain the encryption key from a compromised system.

Figure 7 shows the sequence of operations executed after running the poweroff command. There are two operations in the power-off sequence (highlighted in green) that bring disk encryption keys into the cache. The first (operation 2) is the operating system asking all remaining processes to terminate. In this operation, the process in charge of disk encryption is terminated. This invokes the sync system call to flush data from the page cache to the encrypted disk, which requires reading the AES keys. Before the system is actually powered off, all filesystems must be unmounted, as shown in step 5. The encryption keys are used in unmounting the encrypted disk drive and will again appear in the cache.

    root@aarch64-gem5:/# poweroff
    Session terminated, terminating shell...exit
    ...terminated.
     * Stopping rsync daemon rsync                      [ OK ]   // 1
     * Asking all remaining processes to terminate...   [ OK ]   // 2
     * All processes ended within 1 seconds...          [ OK ]   // 3
     * Deactivating swap...                             [ OK ]   // 4
     * Unmounting local filesystems...                  [ OK ]   // 5
     * Stopping early crypto disks...                   [ OK ]   // 6
     * Will now halt                                             // 7
    [  604.955626] System halted

Fig. 7: The sequence of events triggered by the poweroff command.

We experimented with two power-off methods in the evaluation: normal and forced. We examine the probability of successfully identifying the key under the two scenarios. Table IV summarizes the results of our power-off attacks for the various LLC sizes.

    Mode               Command         Keys in cache after power-off?
                                       2MB    4MB    8MB    128MB
    Normal Power-off   poweroff (-p)   N      N      Y      Y
    Forced Power-off   poweroff -f     Y      Y      Y      Y

TABLE IV: Summary of targeted power-off attack results.

We can see from the results that, starting from an LLC size of 8MB, keys remain in the cache no matter which power-off method is used. For the smaller cache sizes, such as 2MB or 4MB, after the keys are brought into the cache, other operations which do not involve encrypted disk accesses may evict the AES keys. Therefore we do not see the keys after the system is powered off. However, for the larger 8MB and 128MB caches, the keys stay in the cache after system shutdown.

Forced power-off is different from normal power-off in that it does not power off the system in a graceful way. This means that forced power-off only performs the action of powering off the system. However, in order to power off the system, the local filesystems are still unmounted to prevent data loss. In this process the keys are brought into the cache and stay in the cache after the system is powered off, for all cache sizes we examined in the experiments. Forced power-off attacks virtually guarantee that the system we investigate, in all configurations, will expose the secret keys in the NVM cache. This shows that a potential attacker has a reliable and reproducible mechanism to compromise disk encryption keys.

Figure 8 shows the presence of the disk encryption keys in the cache throughout the normal power-off sequence for the NoNEON and NEON systems, for different LLC sizes. We can see that the encryption key appears in the cache at roughly the same time, following operation no. 2 in the power-off sequence (Figure 7). It is then quickly evicted in the 2MB LLC system, but persists for increasingly longer time intervals as the size of the LLC increases. For the 128MB cache, the key is never evicted before the system halts. The key is again read into the cache following operation no. 5 (unmounting the file system). In the 2MB and 4MB cases the key is again evicted before the system halts. Even for these systems, an attacker could force the key to remain in the cache in a predictable way. The attacker would simply have to trigger the power-off sequence and then disconnect the processor from the power supply after a predetermined time period, before the key is evicted. Since the power-off sequence is fairly deterministic, this approach has a high probability of success.

Fig. 8: AES key search sequence of a normal power-off from start to completion, for LLC sizes of 2MB, 4MB, 8MB, and 128MB (NoNEON and NEON).

VII. COUNTERMEASURE

In order to mitigate threats of cold boot attacks against NVM caches, we propose a simple and effective software-based countermeasure. Our countermeasure is designed to force secret keys to be stored only in encrypted main memory and to bypass the NVM cache throughout the execution. We develop a software API for declaring memory blocks as secrets, to inform the system that their storage in the cache is not allowed. While our countermeasure applies to any secret information stored by the system, we use the disk encryption example as a case study to illustrate the concept.

A. Countermeasure Design

The process of decrypting an encrypted storage device in a Linux system typically involves using the cryptsetup command-line utility, which in turn calls the dm-crypt kernel module. This process is illustrated in Figure 9. The kernel establishes within its internal cryptographic structures the key to be used for accessing the encrypted device that has been selected via the cryptsetup utility. Although the process of establishing the key inside the kernel entails generating multiple copies of the key, the relevant block cipher routines in the crypto module dutifully use memset() to wipe the key after a new copy is created. As such, we only focus on the final memory location where the key is stored, which is tracked by the crypto_aes_ctx structure. In our solution, we devise a countermeasure that is applicable to kernels that are configured to utilize hardware acceleration, as well as to default kernels configured for environments where such acceleration support is unavailable.

Systems with hardware cryptographic support. Modern systems usually make use of hardware acceleration for cryptographic operations. We assume the kernel is built with the AArch64 accelerated cryptographic algorithms that make use of the NEON and AES cryptographic extensions defined in the ARMv8 instruction set. This is done by including the CONFIG_CRYPTO_AES_ARM64_*, CONFIG_ARM64_CRYPTO, and KERNEL_MODE_NEON kernel parameters as part of the build. This translates to using the architecture-specific cryptographic libraries defined in /arch/arm64/crypto of the Linux source. In order to eliminate the presence of cryptographic keys in the cache, our solution involves marking the page associated with the address of the crypto_aes_ctx structure as uncacheable. We implement the necessary changes for this approach within the xts_set_key() routine located in aes-glue.c, where we walk the page table in search of the page table entry that maps the designated page. Once we locate the correct PTE, we set the L_PTE_MT_UNCACHED flag to label the page as uncacheable.

Systems without hardware cryptographic support. If the kernel lacks support for accelerated cryptographic hardware, we use the default cryptographic library defined in the /crypto directory of the Linux source. This boils down to modifying crypto_aes_set_key() in aes_generic.c. We use a similar approach to the one described previously, marking the page which contains the crypto_aes_ctx structure as uncacheable. The primary difference is that the encryption and decryption routines used in aes_encrypt() and aes_decrypt(), respectively, do not make use of the 128-bit NEON registers. As such, the performance impact with this approach is higher, since multiple fetches of the expanded key from memory are needed for each round of encryption or decryption.
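The fragment below sketches one possible shape of this PTE lookup for a kernel virtual address. It is a simplified illustration under several assumptions (a classic 4-level page-table walk, the key page mapped at 4KB PTE granularity rather than by a block mapping, and no locking shown) and is not the actual patch; pgprot_noncached() stands in here for the architecture-specific uncached memory attribute.

    #include <linux/errno.h>
    #include <linux/mm.h>
    #include <asm/pgtable.h>
    #include <asm/tlbflush.h>

    /* Mark the page containing 'addr' (the crypto_aes_ctx key schedule) as
     * uncacheable so that the key never lands in the NVM cache.  Sketch only:
     * assumes 'addr' is a kernel virtual address mapped at PTE granularity. */
    static int mark_key_page_uncached(unsigned long addr)
    {
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *ptep;

        pgd = pgd_offset_k(addr);            /* walk the kernel page table */
        if (pgd_none(*pgd) || pgd_bad(*pgd))
            return -EINVAL;
        pud = pud_offset(pgd, addr);
        if (pud_none(*pud) || pud_bad(*pud))
            return -EINVAL;
        pmd = pmd_offset(pud, addr);
        if (pmd_none(*pmd) || pmd_bad(*pmd)) /* block mappings not handled */
            return -EINVAL;

        ptep = pte_offset_kernel(pmd, addr);
        /* Rewrite the memory attributes of this mapping as non-cacheable. */
        set_pte(ptep, pte_modify(*ptep, pgprot_noncached(PAGE_KERNEL)));
        flush_tlb_kernel_range(addr & PAGE_MASK,
                               (addr & PAGE_MASK) + PAGE_SIZE);
        return 0;
    }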
Fig. 9: Countermeasure deployment in a system with encrypted storage.

B. Countermeasure Effectiveness

Table V summarizes the effectiveness of our countermeasure. We can see that by marking the AES key structure uncacheable, our countermeasure completely eliminates the security vulnerability of NVM caches. All attack scenarios we examined are now unable to find the disk encryption keys in the cache, regardless of the benchmarks running on the system. The targeted power-off attacks also fail to identify any AES keys in the cache once the countermeasure is deployed.

                                 NoNEON      NEON        Countermeasure
    Single-threaded Benchmark    23 - 70%    5 - 77%     0%
    mixC                         85 - 100%   80 - 100%   0%
    mixM                         26 - 100%   20 - 100%   0%
    mixCM                        38 - 100%   34 - 100%   0%
    Normal Power-off             0 - 100%    0 - 100%    0%
    Forced Power-off             100%        100%        0%

TABLE V: Probability of finding AES keys in systems with and without the countermeasure.

C. Performance Overhead

The effectiveness of our countermeasure comes at the cost of some performance overhead. Figure 10 shows the performance overhead for different types of benchmarks executed in the context of our random attack.


Fig. 10: Average countermeasure performance overhead (normalized execution time) across all LLC sizes, for computation-bound and memory-bound benchmarks and the workload mixes (Countermeasure-NoNEON and Countermeasure-NEON).

In general, the overhead for the system with NEON support is very low, averaging 2% for the single-threaded benchmarks. The overhead increases substantially if the system has no NEON acceleration, reaching up to 45% for single-threaded benchmarks. The performance overhead of the countermeasure correlates directly with the number of encryption/decryption transactions. Since the encryption key is uncacheable, every access to the key results in a slow memory transaction. The NEON hardware support helps alleviate this overhead substantially by storing the key in vector registers and reducing the need for memory accesses.

The performance overheads are higher, as expected, for the multi-programmed workloads because they perform more encryption/decryption transactions overall. Overheads for the three workload mixes are 64% in NoNEON and 14% in NEON for mixC, 55% in NoNEON and 3% in NEON for mixM, and 142% in NoNEON and 12% in NEON for mixCM.

Performance Optimization. One observation we make is that when the authenticated user is currently using the system, it is unnecessary to keep the sensitive data uncacheable, since physical cold boot attacks are unlikely to be useful while the user is logged in. The attacker can extract sensitive data in more straightforward ways under those circumstances. One possible performance optimization is therefore to support two modes of handling secret information: cacheable and uncacheable. When the user is logged in, the cacheable mode is enabled so that the user does not experience any performance degradation. Only when the system is locked is secret information erased from the caches and the uncacheable mode turned on to protect against cold boot attacks.

VIII. CONCLUSION

This paper demonstrates that non-volatile caches are very vulnerable to cold boot attacks. We successfully conducted two attacks on disk encryption keys: random attacks and targeted power-off attacks. Our study shows that the probability of finding the secret AES keys in NVM caches ranges from 5% to 100% with varying workloads and cache configurations in random attacks, and always reaches 100% in targeted power-off attacks. To defend computer systems against these attacks, we developed a software-based countermeasure that allocates sensitive information into uncacheable memory pages. Our proposed countermeasure completely mitigates cold boot attacks against NVM caches. We hope this work will serve as a starting point for future studies on the security vulnerabilities of NVM caches and their countermeasures.
REFERENCES

[1] J. S. Meena, S. M. Sze, U. Chand, and T.-Y. Tseng, "Overview of Emerging Non-Volatile Memory Technologies," Nanoscale Research Letters, vol. 9, pp. 1–33, September 2014.
[2] X. Guo, E. Ipek, and T. Soyata, "Resistive Computation: Avoiding the Power Wall with Low-Leakage, STT-MRAM Based Computing," in International Symposium on Computer Architecture (ISCA), pp. 371–382, June 2010.
[3] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology," in International Symposium on Computer Architecture (ISCA), pp. 14–23, June 2009.
[4] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, "Overcoming the Challenges of Crossbar Resistive Memory Architectures," in International Symposium on High Performance Computer Architecture (HPCA), pp. 476–488, February 2015.
[5] X. Pan and R. Teodorescu, "NVSleep: Using Non-Volatile Memory to Enable Fast Sleep/Wakeup of Idle Cores," in International Conference on Computer Design (ICCD), pp. 400–407, October 2014.
[6] X. Pan, A. Bacha, and R. Teodorescu, "Respin: Rethinking Near-Threshold Multiprocessor Design with Non-Volatile Memory," in International Parallel and Distributed Processing Symposium (IPDPS), pp. 265–275, May 2017.
[7] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A Novel Architecture of the 3D Stacked MRAM L2 Cache for CMPs," in International Symposium on High Performance Computer Architecture (HPCA), pp. 239–249, February 2009.
[8] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," in International Symposium on Computer Architecture (ISCA), pp. 2–13, June 2009.
[9] Everspin. https://www.everspin.com.
[10] Crossbar. https://www.crossbar-inc.com.
[11] Micron-Intel, "3D XPoint™ Technology." https://www.micron.com.
[12] Hewlett-Packard-Enterprise, "The Machine." https://www.hpe.com.
[13] J. A. Halderman, S. D. Schoen, N. Heninger, W. Clarkson, W. Paul, J. A. Calandrino, A. J. Feldman, J. Appelbaum, and E. W. Felten, "Lest We Remember: Cold Boot Attacks on Encryption Keys," in USENIX Security Symposium, pp. 45–60, July 2008.
[14] M. Henson and S. Taylor, "Memory Encryption: A Survey of Existing Techniques," ACM Computing Surveys (CSUR), vol. 46, April 2014.
[15] S. Chhabra and Y. Solihin, "i-NVMM: A Secure Non-Volatile Main Memory System with Incremental Encryption," in International Symposium on Computer Architecture (ISCA), pp. 177–188, June 2011.
[16] S. F. Yitbarek, M. T. Aga, R. Das, and T. Austin, "Cold Boot Attacks are Still Hot: Security Analysis of Memory Scramblers in Modern Processors," in International Symposium on High Performance Computer Architecture (HPCA), pp. 313–324, February 2017.
[17] G. E. Suh, D. Clarke, B. Gassend, M. van Dijk, and S. Devadas, "Efficient Memory Integrity Verification and Encryption for Secure Processors," in International Symposium on Microarchitecture (MICRO), pp. 339–350, December 2003.
[18] R. Lee, P. Kwan, J. McGregor, J. Dwoskin, and Z. Wang, "Architecture for Protecting Critical Secrets in Microprocessors," in International Symposium on Computer Architecture (ISCA), pp. 2–13, June 2005.
[19] D. Lie, C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, "Architectural Support for Copy and Tamper Resistant Software," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 168–177, November 2000.
[20] B. Rogers, S. Chhabra, M. Prvulovic, and Y. Solihin, "Using Address Independent Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and Performance-Friendly," in International Symposium on Microarchitecture (MICRO), pp. 183–196, December 2007.
[21] V. Young, P. J. Nair, and M. K. Qureshi, "DEUCE: Write-Efficient Encryption for Non-Volatile Memories," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 33–44, March 2015.
[22] P. Colp, J. Zhang, J. Gleeson, S. Suneja, E. D. Lara, H. Raj, S. Saroiu, and A. Wolman, "Protecting Data on Smartphones and Tablets from Memory Attacks," in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 177–189, March 2015.
[23] T. Muller, A. Dewald, and F. C. Freiling, "AESSE: A Cold-Boot Resistant Implementation of AES," in European Workshop on System Security, pp. 42–47, April 2010.
[24] T. Muller, F. C. Freiling, and A. Dewald, "TRESOR Runs Encryption Securely Outside RAM," in USENIX Security Symposium, pp. 17–32, August 2011.
[25] J. Gotzfried and T. Muller, "ARMORED: CPU-Bound Encryption for Android-Driven ARM Devices," in International Conference on Availability, Reliability and Security (ARES), pp. 161–168, September 2013.
[26] L. Guan, J. Lin, B. Luo, and J. Jing, "Copker: Computing with Private Keys without RAM," in Annual Network and Distributed System Security Symposium (NDSS), pp. 1–15, February 2014.
[27] N. Zhang, K. Sun, W. Lou, and Y. T. Hou, "CaSE: Cache-Assisted Secure Execution on ARM Processors," in IEEE Symposium on Security and Privacy (SP), pp. 72–90, May 2016.
[28] J. Gotzfried, T. Muller, S. Nurnberger, and M. Backes, "RamCrypt: Kernel-Based Address Space Encryption for User-Mode Processes," in Asia Conference on Computer and Communications Security, pp. 919–924, May 2016.
[29] A. Bacha and R. Teodorescu, "Authenticache: Harnessing Cache ECC for System Authentication," in International Symposium on Microarchitecture (MICRO), pp. 1–12, December 2015.
[30] P. Simmons, "Security through Amnesia: A Software-Based Solution to the Cold Boot Attack on Disk Encryption," in Annual Computer Security Applications Conference, pp. 73–82, December 2011.
[31] J. Pabel, "Frozen Cache." http://frozencache.blogspot.com.
[32] N. A. Anagnostopoulos, S. Katzenbeisser, M. Rosenstihl, A. Schaller, S. Gabmeyer, and T. Arul, "Low-Temperature Data Remanence Attacks Against Intrinsic SRAM PUFs." Cryptology ePrint Archive, Report 2016/769, 2016.
[33] "Advanced Encryption Standard," National Institute of Standards and Technology, Federal Information Processing Standards Publication 197, November 2001.
[34] A. Berent, "Advanced Encryption Standard by Example." http://www.infosecwriters.com/Papers/ABerent_AESbyExample.pdf.
[35] T. Muller and M. Spreitzenbarth, "FROST: Forensic Recovery of Scrambled Telephones," in International Conference on Applied Cryptography and Network Security, pp. 373–388, June 2013.
[36] ARM, "DS-5 Development Studio." https://developer.arm.com.
[37] ARM, "DSTREAM High Performance Debug and Trace." https://developer.arm.com.
[38] ARM, "Juno ARM Development Platform." https://developer.arm.com.
[39] "The XTS-AES Tweakable Block Cipher," IEEE Std 1619-2007, May 2007.
[40] N. Binkert et al., "The gem5 Simulator," ACM SIGARCH Computer Architecture News, vol. 39, pp. 1–7, May 2011.
[41] J. Lell, "Practical Malleability Attack Against CBC-Encrypted LUKS Partitions." http://www.jakoblell.com/blog/2013/12/22/practical-malleability-attack-against-cbc-encrypted-luks-partitions.
[42] A. Jaleel, "A Pin-Based Memory Characterization of the SPEC CPU2000 and SPEC CPU2006 Benchmark Suites." http://www.jaleels.org/ajaleel/publications/SPECanalysis.pdf.