
Strong and Efficient Cache Side-Channel Protection using Hardware Transactional Memory Daniel Gruss, Graz University of Technology, Graz, Austria; Julian Lettner, University of California, Irvine, USA; Felix Schuster, Olya Ohrimenko, Istvan Haller, and Manuel Costa, Microsoft Research, Cambridge, UK https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/gruss

This paper is included in the Proceedings of the 26th USENIX Security Symposium August 16–18, 2017 • Vancouver, BC, Canada ISBN 978-1-931971-40-9

Open access to the Proceedings of the 26th USENIX Security Symposium is sponsored by USENIX

Strong and Efficient Cache Side-Channel Protection using Hardware Transactional Memory

Daniel Gruss,∗ Julian Lettner,† Felix Schuster, Olga Ohrimenko, Istvan Haller, Manuel Costa Microsoft Research

Abstract

Cache-based side-channel attacks are a serious problem in multi-tenant environments, for example, modern cloud data centers. We address this problem with Cloak, a new technique that uses hardware transactional memory to prevent adversarial observation of cache misses on sensitive code and data. We show that Cloak provides strong protection against all known cache-based side-channel attacks with low performance overhead. We demonstrate the efficacy of our approach by retrofitting vulnerable code with Cloak and experimentally confirming immunity against state-of-the-art attacks. We also show that by applying Cloak to code running inside Intel SGX enclaves we can effectively block information leakage through cache side channels from enclaves, thus addressing one of the main weaknesses of SGX.

1 Introduction

Hardware-enforced isolation of virtual machines and containers is a pillar of modern cloud computing. While the hardware provides isolation at a logical level, physical resources such as caches are still shared amongst isolated domains, to support efficient multiplexing of workloads. This enables different forms of side-channel attacks across isolation boundaries. Particularly worrisome are cache-based attacks, which have been shown to be potent enough to allow for the extraction of sensitive information in realistic scenarios, e.g., between co-located cloud tenants [56].

In the past 20 years, cache attacks have evolved from theoretical attacks [38] on implementations of cryptographic algorithms [4] to highly practical generic attack primitives [43,62]. Today, attacks can be performed in an automated manner on a wide range of algorithms [24].

Many countermeasures have been proposed to mitigate cache side-channel attacks. Most of these countermeasures either try to eliminate resource sharing [12,18,42,52,58,68,69], or they try to mitigate attacks after detecting them [9,53,65]. However, it is difficult to identify all possible leakage through shared resources [34,55], and eliminating sharing always comes at the cost of efficiency. Similarly, the detection of cache side-channel attacks is not always sufficient, as recently demonstrated attacks may, for example, recover the entire secret after a single run of a vulnerable cryptographic algorithm [17,43,62]. Furthermore, attacks on singular sensitive events are in general difficult to detect, as these can operate at low attack frequencies [23].

In this paper, we present Cloak, a new efficient defensive approach against cache side-channel attacks that allows resource sharing. At its core, our approach prevents cache misses on sensitive code and data. This effectively conceals cache access patterns from attackers and keeps the performance impact low. We ensure permanent cache residency of sensitive code and data using widely available hardware transactional memory (HTM), which was originally designed for high-performance concurrency.

HTM allows potentially conflicting threads to execute transactions optimistically in parallel: for the duration of a transaction, a thread works on a private memory snapshot. In the event of conflicting concurrent memory accesses, the transaction aborts and all corresponding changes are rolled back. Otherwise, changes become visible atomically when the transaction completes. Typically, HTM implementations use the CPU caches to keep track of transactional changes. Thus, current implementations like TSX require that all accessed memory remains in the CPU caches for the duration of a transaction. Hence, transactions abort not only on real conflicts but also whenever transactional memory is evicted prematurely to DRAM. This behavior makes HTM a powerful tool to mitigate cache-based side channels.

The core idea of Cloak is to execute leaky algorithms in HTM-backed transactions while ensuring that all sensitive data and code reside in transactional memory for the duration of the execution. If a transaction succeeds, secret-dependent control flows and data accesses are guaranteed to stay within the CPU caches. Otherwise, the corresponding transaction would abort. As we show and discuss, this simple property can greatly raise the bar for contemporary cache side-channel attacks or even prevent them completely. The Cloak approach can be implemented on top of any HTM that provides the aforementioned basic properties. Hence, compared to other approaches [11,42,69] that aim to provide isolation, Cloak does not require any changes to the operating system (OS) or kernel.

In this paper, we focus on Intel TSX as the HTM implementation for Cloak. This choice is natural, as TSX is available in many recent professional and consumer Intel CPUs. Moreover, we show that we can design a highly secure execution environment by using Cloak inside Intel SGX enclaves. SGX enclaves provide a secure execution environment that aims to protect against hardware attackers and attacks from malicious OSs. However, code inside SGX enclaves is as vulnerable to cache attacks as normal code [7,20,46,57] and, when running on a malicious OS, is prone to other memory access-based leakage including page faults [10,61]. We demonstrate and discuss how Cloak can reliably defend against such side-channel attacks on enclave code.

We provide a detailed evaluation of Intel TSX as available in recent CPUs and investigate how different implementation specifics in TSX lead to practical challenges, which we then overcome. For a range of proof-of-concept applications, we show that Cloak's runtime overhead is small—between −0.8% and +1.2% for low-memory tasks and up to +248% for memory-intense tasks in SGX—while state-of-the-art cache attacks are effectively mitigated. Finally, we also discuss limitations of Intel TSX, specifically negative side effects of the aggressive and sparsely documented hardware prefetcher.

The key contributions of this work are:

• We describe Cloak, a universal HTM-based approach for the effective mitigation of cache attacks.

• We investigate the peculiarities of Intel TSX and show how Cloak can be implemented securely and efficiently on top of it.

• We propose variants of Cloak as a countermeasure against cache attacks in realistic environments.

• We discuss how SGX and TSX in concert can provide very high security in hostile environments.

Outline. The remainder of this paper is organized as follows. In Section 2, we provide background on software-based side-channel attacks and hardware transactional memory. In Section 3, we define the attacker model. In Section 4, we describe the fundamental idea of Cloak. In Section 5, we show how Cloak can be instantiated with Intel TSX. In Section 6, we provide an evaluation of Cloak on state-of-the-art attacks in local and cloud environments. In Section 7, we show how Cloak makes SGX a highly secure execution environment. In Section 8, we discuss limitations of Intel TSX with respect to Cloak. In Section 9, we discuss related work. Finally, we provide conclusions in Section 10.

∗Work done during internship at Microsoft Research; affiliated with Graz University of Technology.
†Work done during internship at Microsoft Research; affiliated with University of California, Irvine.

2 Background

We now provide background on cache side-channel attacks and hardware transactional memory.

2.1 Caches

Modern CPUs have a hierarchy of caches that store and efficiently retrieve frequently used instructions and data, thereby often avoiding the latency of main memory accesses. The first-level cache is usually the smallest and fastest cache, limited to several KB. It is typically a private cache which cannot be accessed by other cores. The last-level cache (LLC) is typically unified and shared among all cores. Its size is usually limited to several MB. On modern architectures, the LLC is typically inclusive to the lower-level caches like the L1 caches. That is, a cache line can only be in an L1 cache if it is in the LLC as well. Each cache is organized in cache sets, and each cache set consists of multiple cache lines or cache ways. Since more addresses map to the same cache set than there are ways, the CPU employs a cache replacement policy to decide which way to replace. Whether data is cached or not is visible through the memory access latency. This is a root cause of the side channel introduced by caches.

2.2 Cache Side-Channel Attacks

Cache attacks have been studied for two decades with an initial focus on cryptographic algorithms [4,38,51]. More recently, cache attacks have been demonstrated in realistic cross-core scenarios that can deduce information about single memory accesses performed in other programs (i.e., access-driven attacks). We distinguish between the following access-driven cache attacks: Evict+Time, Prime+Probe, and Flush+Reload. While most attacks directly apply one of these techniques, there are many variations to match specific capabilities of the hardware and software environment.

In Evict+Time, the victim computation is invoked repeatedly by the attacker. In each run, the attacker selectively evicts a cache set and measures the victim's execution time. If the eviction of a cache set results in longer execution time, the attacker learns that the victim likely accessed it. Evict+Time attacks have been extensively studied on different cache levels and exploited in various scenarios [51,60]. Similarly, in Prime+Probe, the attacker fills a cache set with their own lines. After waiting for a certain period, the attacker measures if all their lines are still cached. The attacker learns whether another process—possibly the victim—accessed the selected cache set in the meantime. While the first Prime+Probe attacks targeted the L1 cache [51,54], more recent attacks have also been demonstrated on the LLC [43,50,56].
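The priming step above relies on knowing which attacker-controlled addresses are congruent to the target, i.e., map to the same cache set. The mapping can be sketched as follows; this is an illustrative toy helper, not code from the paper, and it ignores the slice hashing that modern Intel LLCs additionally apply to physical addresses:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy Prime+Probe helper: a cache with 64 B lines and num_sets sets maps an
// address to set (addr / 64) % num_sets. Addresses congruent to the target
// are candidates for filling (priming) the target's cache set.
constexpr uint64_t kLineSize = 64;

constexpr uint64_t set_index(uint64_t addr, uint64_t num_sets) {
    return (addr / kLineSize) % num_sets;
}

std::vector<uint64_t> eviction_candidates(uint64_t target,
                                          const std::vector<uint64_t>& pool,
                                          uint64_t num_sets) {
    std::vector<uint64_t> out;
    for (uint64_t a : pool)
        if (set_index(a, num_sets) == set_index(target, num_sets))
            out.push_back(a);  // congruent: competes for the same cache set
    return out;
}
```

With 1,024 sets, addresses that differ by a multiple of 1,024 · 64 B are congruent, which is why an attacker can build an eviction set from their own memory alone.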

Flush+Reload [62] is a powerful but also constrained technique; it requires attacker and victim to share memory pages. The attacker selectively flushes a shared line from the cache and, after some waiting, checks if it was brought back through the victim's execution. Flush+Reload attacks have been studied extensively in different variations [2,41,66].

Apart from the CPU caches, the shared nature of other system resources has also been exploited in side-channel attacks. This includes different parts of the CPU's branch-prediction facility [1,15,40], the DRAM row buffer [5,55], the page-translation caches [21,28,36], and other microarchitectural elements [14].

This paper focuses on mitigating Prime+Probe and Flush+Reload. However, Cloak conceptually also thwarts other memory-based side-channel attacks such as those that exploit the shared nature of the DRAM.

2.3 Hardware Transactional Memory

HTM allows for the efficient implementation of parallel algorithms [27]. It is commonly used to elide expensive software synchronization mechanisms [16,63]. Informally, for a CPU thread executing a hardware transaction, all other threads appear to be halted; whereas, from the outside, a transaction appears as an atomic operation. A transaction fails if the CPU cannot provide this atomicity due to resource limitations or conflicting concurrent memory accesses. In this case, all transactional changes need to be rolled back. To be able to detect conflicts and revert transactions, the CPU needs to keep track of transactional memory accesses. Therefore, transactional memory is typically divided into a read set and a write set. A transaction's read set contains all read memory locations. Concurrent read accesses by other threads to the read set are generally allowed; however, concurrent writes are problematic and—depending on the actual HTM implementation and circumstances—likely lead to transactional aborts. Further, any concurrent accesses to the write set necessarily lead to a transactional abort. Figure 1 visualizes this exemplarily for a simple transaction with one conflicting concurrent thread.

Figure 1: HTM ensures that no concurrent modifications influence the transaction, either by preserving the old value or by aborting and reverting the transaction. (Shown: Thread 1 runs a transaction reading 0x20 and writing 0x40; a concurrent write to 0x40 by Thread 2 causes a write conflict.)

Commercial Implementations. Implementations of HTM can be found in different commercial CPUs, among others, in many recent professional and consumer Intel CPUs. Nakaike et al. [48] investigated four commercial HTM implementations from Intel and other vendors. They found that all processors provide comparable functionality to begin, end, and abort transactions and that all implement HTM within the existing CPU cache hierarchy. The reason for this is that only caches can be held in a consistent state by the CPU itself. If data is evicted to DRAM, transactions necessarily abort in these implementations. Nakaike et al. [48] found that all four implementations detected access conflicts at cache-line granularity and that failed transactions were reverted by invalidating the cache lines of the corresponding write sets. Depending on the implementation, read and write sets can have different sizes, and set sizes range from multiple KB to multiple MB of HTM space.

Due to HTM usually being implemented within the CPU cache hierarchy, HTM has been proposed as a means for optimizing cache maintenance and performing security-critical on-chip computations: Zacharopoulos [64] uses HTM combined with prefetching to reduce the system energy consumption. Guan et al. [25] designed a system that uses HTM to keep RSA private keys encrypted in memory and only decrypt them temporarily inside transactions. Jang et al. [36] used hardware transaction aborts upon page faults to defeat kernel address-space layout randomization.

3 Attacker Model

We consider multi-tenant environments where tenants do not trust each other, including local and cloud environments, where malicious tenants can use shared resources to extract information about other tenants. For example, they can influence and measure the state of caches via the attacks described in Section 2.2. In particular, an attacker can obtain a high-resolution trace of its own memory access timings, which are influenced by operations of the victim process. More abstractly, the attacker can obtain a trace where at each time frame the attacker learns whether the victim has accessed a particular memory location. We consider the above attacker in three realistic environments which give her different capabilities:

Cloud We assume that the processor, the OS, and the hypervisor are trusted in this scenario while other cloud tenants are not. This enables the attacker to launch cross-VM Prime+Probe attacks.

Local This scenario is similar to the Cloud scenario, but we assume the machine is not hosted in a cloud environment. Therefore, the tenants share the machine in a traditional time-sharing fashion and the OS is trusted to provide isolation between tenants. Furthermore, we assume that there are shared libraries between the victim and the attacker, since this is a common optimization performed by OSs. This enables the attacker to launch Flush+Reload attacks, in addition to Prime+Probe attacks.

SGX In this scenario, the processor is trusted but the adversary has full control over the OS, the hypervisor, and all other code running on the system, except the victim's code. This scenario models an SGX-enabled environment, where the victim's code runs inside an enclave. While the attacker has more control over the software running on the machine, the SGX protections prevent sharing of memory pages between the enclave and untrusted processes, which renders Flush+Reload attacks ineffective in this setting.

All other side channels, including power analysis, and channels based on shared microarchitectural elements other than caches are outside our scope.

4 Hardware Transactional Memory as a Side-Channel Countermeasure

The foundation of all cache side-channel attacks are the timing differences between cache hits and misses, which an attacker tries to measure. The central idea behind Cloak is to instrument HTM to prevent any cache misses on the victim's sensitive code and data. In Cloak, all sensitive computation is performed in HTM. Crucially, in Cloak, all security-critical code and data is deterministically preloaded into the caches at the beginning of a transaction. This way, security-critical memory locations become part of the read or write set and all subsequent, possibly secret-dependent, accesses are guaranteed to be served from the CPU caches. Otherwise, in case any preloaded code or data is evicted from the cache, the transaction necessarily aborts and is reverted. (See Listing 1 for an example that uses the TSX instructions xbegin and xend to start and end a transaction.)

Listing 1: A vulnerable crypto operation protected by Cloak instantiated with Intel TSX; the AES_encrypt function makes accesses into lookup_tables that depend on key. Preloading the tables and running the encryption code within an HTM transaction ensures that eviction of table entries from the LLC will terminate the code before it may cause a cache miss.

    if ((status = xbegin()) == XBEGIN_STARTED) {
        for (auto p : lookup_tables)
            *(volatile size_t*)p;
        AES_encrypt(plaintext, ciphertext, &key);
        xend();
    }

Given an ideal HTM implementation, Cloak thus prevents that an attacker can obtain a trace that shows whether the victim has accessed a particular memory location. More precisely, in the sense of Cloak, ideal HTM has the following properties:

R1 Both data and code can be added to a transaction as transactional memory and thus are included in the HTM atomicity guarantees.

R2 A transaction aborts immediately when any part of the transactional memory leaves the cache hierarchy.

R3 All pending transactional memory accesses are purged during a transactional abort.

R4 Prefetching decisions outside of transactions are not influenced by transactional memory accesses.

R1 ensures that all sensitive code and data can be added to the transactional memory in a deterministic and leakage-free manner. R2 ensures that any cache line evictions are detected implicitly by the HTM and the transaction aborts before any non-cached access is performed. R3 and R4 ensure that there is no leakage after a transaction has succeeded or aborted.

Unfortunately, commercially available HTM implementations and specifically Intel TSX do not precisely provide R1–R4. In the following Section 5 we discuss how Cloak can be instantiated on commercially available (and not ideal) HTM, what leakage remains in practice, and how this can be minimized.

5 Cloak based on Intel TSX

Cloak can be built using an HTM that satisfies R1–R4 established in the previous section. We propose Intel TSX as an instantiation of HTM for Cloak to mitigate the cache side channel. In this section, we evaluate how far Intel TSX meets R1–R4 and devise strategies to address those cases where it falls short. All experiments we report on in this section were executed on Intel Core i7 CPUs of the Skylake generation (i7-6600U, i7-6700, i7-6700K) with 4 MB or 8 MB of LLC. The source code of these experiments will be made available at http://aka.ms/msr-cloak.
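The preloading loop in Listing 1 dereferences one pointer per lookup table; more generally, every 64 B cache line of a sensitive buffer must be touched so that the whole buffer enters the read set. A minimal helper in that spirit (an illustrative sketch, not the paper's code; it only establishes transactional membership when called inside a transaction, but the stride arithmetic can be checked on its own):

```cpp
#include <cstddef>
#include <cstdint>

// Touch every 64 B cache line of [buf, buf+len) so that, when executed
// inside an HTM transaction, the whole buffer joins the read set.
// Returns the number of cache lines touched. (Illustrative helper in the
// spirit of Listing 1; not code from the paper.)
size_t preload_read_set(const void* buf, size_t len) {
    const volatile uint8_t* p = static_cast<const volatile uint8_t*>(buf);
    size_t lines = 0;
    for (size_t off = 0; off < len; off += 64, ++lines)
        (void)p[off];  // volatile read pulls the line into the cache
    return lines;
}
```

A 4 KB table thus requires 64 accesses, one per line, rather than a single access to its base address.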

5.1 Meeting Requirements with Intel TSX

We summarize our findings and then describe the methodology.

R1 and R2 hold for data. TSX supports read-only data that does not exceed the size of the LLC (several MB) and write data that does not exceed the size of the L1 cache (several KB);

R1 and R2 hold for code that does not exceed the size of the LLC;

R3 and R4 hold in the cloud and SGX attacker scenarios from Section 3, but not in general for local attacker scenarios.

5.1.1 Requirements 1&2 for Data

Our experiments and previous work [19] find the read set size to be ultimately constrained by the size of the LLC: Figure 2 shows the failure rate of a simple TSX transaction depending on the size of the read set. The abort rate reaches 100% as the read set size approaches the limits of the LLC (4 MB in this case). In a similar experiment, we observed 100% aborts when the size of data written in a transaction exceeded the capacity of the L1 data cache (32 KB per core). This result is also confirmed in Intel's Optimization Reference Manual [30] and in previous work [19,44,64].

Figure 2: A TSX transaction over a loop reading an array of increasing size (0–5,000 KB). The failure rate reveals how much data can be read in a transaction. Measured on an i7-6600U with 4 MB LLC.

Conflicts and Aborts. We always observed aborts when read or write set cache lines were actively evicted from the caches by concurrent threads. That is, evictions of write set cache lines from the L1 cache and read set cache lines from the LLC are sufficient to cause aborts. We also confirmed that transactions abort shortly after cache line evictions: using concurrent clflush instructions on the read set, we measured abort latencies in the order of a few hundred cycles (typically with an upper bound of around 500 cycles). In case varying abort times should prove to be an issue, the attacker's ability to measure them, e.g., via Prime+Probe on the abort handler, could be thwarted by randomly choosing one out of many possible abort handlers and rewriting the xbegin instruction accordingly,1 before starting a transaction.

Tracking of the Read Set. We note that the data structure that is used to track the read set in the LLC is unknown. The Intel manual states that "an implementation-specific second level structure" may be available, which probabilistically keeps track of the addresses of read-set cache lines that were evicted from the L1 cache. This structure is possibly an on-chip bloom filter, which tracks the read-set membership of cache lines in a probabilistic manner that may give false positives but no false negatives.2 There may exist so far unknown leaks through this data structure. If this is a concern, all sensitive data (including read-only data) can be kept in the write set in L1. However, this limits the working set to the L1 and also requires all data to be stored in writable memory.

L1 Cache vs. LLC. By adding all data to the write set, we can make R1 and R2 hold for data with respect to the L1 cache. This is important in cases where victim and attacker potentially share the L1 cache through hyper-threading.3 Shared L1 caches are not a concern in the cloud setting, where it is usually ensured by the hypervisor that corresponding hyper-threads are not scheduled across different tenants. The same can be ensured by the OS in the local setting. However, in the SGX setting a malicious OS may misuse hyper-threading for an L1-based attack. To not be constrained to the small L1 in SGX nonetheless, we propose solutions to detect and prevent such attacks later on in Section 7.2.

We conclude that Intel TSX sufficiently fulfills R1 and R2 for data if the read and write sets are used appropriately.

1The 16-bit relative offset to a transaction's abort handler is part of the xbegin instruction. Hence, for each xbegin instruction, there is a region of 1,024 cache lines that can contain the abort handler code.
2In Intel's Software Development Emulator [29] the read set is tracked probabilistically using bloom filters.
3Context switches may also allow the attacker to examine the victim's L1 cache state "postmortem". While such attacks may be possible, they are outside our scope. TSX transactions abort on context switches.

5.1.2 Requirements 1&2 for Code

We observed that the amount of code that can be executed in a transaction seems not to be constrained by the sizes of the caches. Within a transaction with strictly no reads and writes we were reliably able to execute more than 20 MB of nop instructions or more than 13 MB of arithmetic instructions (average success rate ~10%) on a CPU with 8 MB LLC. This result strongly suggests that executed code does not become part of the read set and is in general not explicitly tracked by the CPU.
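A structure with the read-set tracking properties hypothesized in Section 5.1.1, namely possible false positives but never false negatives, can be sketched as a toy bloom filter. This is purely illustrative: the real TSX structure and its parameters are undocumented, and the hash functions below are arbitrary choices.

```cpp
#include <bitset>
#include <cstdint>

// Toy bloom filter over cache-line numbers: membership tests may report
// false positives but never false negatives. (Illustrative only; the actual
// TSX second-level read-set structure is undocumented.)
struct ToyReadSetFilter {
    std::bitset<256> bits;

    static uint32_t h1(uintptr_t line) { return (line * 2654435761u) % 256; }
    static uint32_t h2(uintptr_t line) { return (line ^ (line >> 7)) % 256; }

    void insert(uintptr_t line) {
        bits.set(h1(line));
        bits.set(h2(line));
    }
    // True for every inserted line (no false negatives); may also be true
    // for lines that merely collide with inserted ones (false positives).
    bool may_contain(uintptr_t line) const {
        return bits.test(h1(line)) && bits.test(h2(line));
    }
};
```

The no-false-negative property is what makes such a filter safe for abort decisions: an evicted read-set line is always flagged, at the cost of occasional spurious aborts from collisions.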

Figure 3: A TSX transaction over a nop-sled with increasing length (0–64 KB). A second thread waits and then flushes the first cache line once before the transaction ends. The failure rate starts at 100% for small transaction sizes. If the transaction self-evicts the L1 instruction cache line, e.g., when executing more than 32 KB of instructions, the transaction succeeds despite the flush. Measured on an i7-6600U with 32 KB L1 cache.

To still achieve R1 and R2 for code, we attempted to make code part of the read or write set by accessing it through load/store operations. This led to mixed results: even with considerable effort, it does not seem possible to reliably execute cache lines in the write set without aborting the transaction.4 In contrast, it is generally possible to make code part of the read set through explicit loads. This gives the same benefits and limitations as using the read set for data.

4In line with our observation, Intel's documentation [31] states that "executing self-modifying code transactionally may also cause transactional aborts".

Code in the L1 Cache. Still, as discussed in the previous Section 5.1.1, it can be desirable to achieve R1 and R2 for the L1 cache depending on the attack scenario. Fortunately, we discovered undocumented microarchitectural effects that reliably cause transactional aborts in case a recently executed cache line is evicted from the cache hierarchy. Figure 3 shows how the transactional abort rate relates to the amount of code that is executed inside a transaction. This experiment suggests that a concurrent (hyper-)thread can cause a transactional abort by evicting a transactional code cache line currently in the L1 instruction cache. We verified that this effect exists for direct evictions through the clflush instruction as well as indirect evictions through cache set conflicts. However, self-evictions of L1 code cache lines (that is, when a transactional code cache line is replaced by another one) do not cause transactional aborts. Hence, forms of R1 and R2 can also be ensured for code in the L1 instruction cache without it being part of the write set.

In summary, we can fulfill requirements R1 and R2 by moving code into the read set or, using undocumented microarchitectural effects, by limiting the amount of code to the L1 instruction cache and preloading it via execution.

5.1.3 Requirements 3&4

As modern processors are highly parallelized, it is difficult to guarantee that memory fetches outside a transaction are not influenced by memory fetches inside a transaction. For precisely timed evictions, the CPU may still enqueue a fetch in the memory controller, i.e., a race condition. Furthermore, the hardware prefetcher is triggered if multiple cache misses occur on the same physical page within a relatively short time. This is known to introduce noise in cache attacks [24,62], but also to introduce side-channel leakage [6].

In an experiment with cycle-accurate alignment between attacker and victim, we investigated the potential leakage of Cloak instantiated with Intel TSX. To make the observable leakage as strong as possible, we opted to use Flush+Reload as the attack primitive. We investigated how a delay between the transaction start and the flush operation and a delay between the flush and the reload operations influence the probability that an attacker can observe a cache hit against code or data placed into transactional memory. The victim in this experiment starts the transaction by placing data and code into transactional memory (using either reads, writes, or execution). The victim then simulates meaningful program flow, followed by an access to one of the sensitive cache lines, and terminates the transaction. The attacker "guesses" which cache line the victim accessed and probes it. Ideally, the attacker should not be able to distinguish between correct and wrong guesses.

Figure 4 shows two regions where an attacker could observe cache hits on a correct guess. The left region corresponds to the preloading of sensitive code/data at the beginning of the transaction. As expected, cache hits in this region were observed to be identical to runs where the attacker had the wrong guess. On the other hand, the right region is unique to instances where the attacker made a correct guess. This region thus corresponds to a window of around 250 cycles where an attacker could potentially obtain side-channel information. We explain the existence of this window by the fact that Intel did not design TSX to be a side-channel-free primitive; thus R3 and R4 are not guaranteed to hold and a limited amount of leakage remains. We observed identical high-level results for all forms of preloading (reading, writing, executing) and all forms of secret accesses (reading, writing, executing).

Figure 4: Cache-hit ratio observed by a Flush+Reload attacker with the ability to overlap the attack with different segments of the victim's transaction (flush delay and reload delay each varied from 0 to 1,000 cycles). Cache hits can be observed both in the region where the victim prepares its transactional memory, as well as in a small window around a secret access. The Z axis represents the success rate of the attacker observing a cache hit.

To exploit the leakage we found, the attacker has to be able to determine whether the CPU reloaded a secret-dependent memory location. This is only possible if the attacker shares the memory location with the victim, i.e., only in local attacks but not in other scenarios. Furthermore, it is necessary to align execution between attacker and victim to trigger the eviction in exactly the right cycle range in the transaction. While these properties might be met in a Flush+Reload attack with fast eviction using clflush and shared memory, it is rather unlikely that an attack is possible using Prime+Probe, due to the low frequency of the attack [22] and the cache replacement policy. Thus, we conclude that requirements R3 and R4 are likely fulfilled in all scenarios where the attacker can only perform Prime+Probe, but not Flush+Reload, i.e., the cloud and SGX scenarios. Furthermore, requirements R3 and R4 are likely to be fulfilled in scenarios where an attacker can perform Flush+Reload but cannot align with the victim on a cycle basis nor measure the exact execution time of a TSX transaction, i.e., the local scenario.

5.2 Memory Preloading

Using the right memory preloading strategy is crucial for the effectiveness of Cloak when instantiated on top of TSX. In the following, we describe preloading techniques for various scenarios. The different behavior of read-only data, writable data, and code makes it necessary to preload these memory types differently.

5.2.1 Data Preloading

As discussed, exclusively using the write set for preloading has the benefit that sensitive data is guaranteed to stay within the small L1 cache, which is the most secure option. To extend the working set beyond L1, sensitive read-only data can also be kept in the LLC as described in Section 5.1.1. However, when doing so, special care has to be taken. For example, naïvely preloading a large (> 32 KB) sequential read set after the write set leads to assured abortion during preloading, as some write set cache lines are inevitably evicted from L1. Reversing the preloading order, i.e., read set before write set, partly alleviates this problem, but, depending on the concrete read set access patterns, one is still likely to suffer from aborts during execution caused by read/write set conflicts in the L1 cache. In the worst case, such self-eviction aborts may leak information.

To prevent such conflicts, in Cloak, we reserve certain cache sets in L1 entirely for the write set. This is possible as the L1 cache-set index only depends on the virtual address, which is known at runtime. For example, reserving the L1 cache sets with indexes 0 and 1 gives a conflict-free write set of size 2 · 8 · 64 B = 1 KB. For this allocation, it needs to be ensured that the same 64 B cache lines of any 4 KB page are not part of the read set (see Figure 5 for an illustration). Conversely, the write set is placed in the same 64 B cache lines in up to eight different 4 KB pages. (Recall that an L1 cache set comprises eight ways.) Each reserved L1 cache set thus blocks 1/64th of the entire virtual memory from being used in the read set.

While this allocation strategy plays out nicely in theory, we observed that the CPU's data prefetcher [30] apparently often pulled in unwanted cache lines that conflicted with our write set. This can be mitigated by ensuring that sequentially accessed read cache lines are separated by a page boundary from write cache lines and by adding "safety margins" between read and write cache lines on the same page.
and R4 are likely to be fulfilled in scenarios where an at- In general, we observed benefits from performing tacker can perform Flush+Reload, but not align with a preloading similar to recent Prime+Probe attacks [22, victim on a cycle base nor measure the exact execution 45], where a target address is accessed multiple times time of a TSX transaction, i.e., the local scenario. and interleaved with accesses to other addresses. Fur- ther, we observed that periodic “refreshing” of the write set, e.g., using the prefetchw instruction, reduced the chances of write set evictions in longer transactions. 5.2 Memory Preloading 5.2.2 Code Preloading Using right the memory preloading strategy is crucial for the effectiveness of Cloak when instantiated on top As described in Section 5.1.2, we preload code into the of TSX. In the following, we describe preloading tech- read set and optionally into the L1 instruction cache. To niques for various scenarios. The different behavior for preload it into the read set, we use the same approach as read-only data, writable data, and code, makes it neces- for data. However, to preload the code into the L1 in- sary to preload these memory types differently. struction cache we cannot simply execute the function,

Figure 5: Allocation of read and write sets in memory to avoid conflicts in the L1 data cache (32 KB, 64 cache sets of 8 ways, 64 B cache lines CL 0–CL 63, 4 KB memory pages). The write set occupies 2 out of 64 cache sets, i.e., 2 × 8 × 64 bytes = 1024 bytes; the read set resides in the rest of memory, constrained by the LLC.

5.2.2 Code Preloading

As described in Section 5.1.2, we preload code into the read set and optionally into the L1 instruction cache. To preload it into the read set, we use the same approach as for data. However, to preload the code into the L1 instruction cache we cannot simply execute the function, as this would have unwanted side effects. Instead, we insert a multi-byte nop instruction into every cache line, as shown in Figure 6. This nop instruction does not change the behavior of the code during actual function execution and only has a negligible effect on execution time. However, the multi-byte nop instruction allows us to incorporate a byte c3, which is the opcode of retq. Cloak jumps to this return instruction, loading the cache line into the L1 instruction cache but not executing the actual function. In the preloading phase, we perform a call to each such retq instruction in order to load the corresponding cache lines into the L1 instruction cache. The retq instruction immediately returns to the preloading function. Instead of retq instructions, equivalent jmp reg instructions can be inserted to avoid touching the stack.

Figure 6: Cache lines are augmented with a multi-byte nop instruction. The nop contains a byte c3, which is the opcode of retq. By jumping directly to the retq byte, we preload each cache line into the L1 instruction cache. (Each shown cache line starts with the bytes 0f 1f c3, a three-byte nop whose last byte is the retq opcode, followed by the function's regular instructions.)

5.2.3 Splitting Transactions

In case a sensitive function has greater capacity requirements than those provided by TSX, the function needs to be split into a series of smaller transactional units. To prevent leakage, the control flow between these units and their respective working sets needs to be input-independent. For example, consider a function f() that iterates over a fixed-size array, e.g., in order to update certain elements. By reducing the number of loop iterations in f() and invoking it separately on fixed parts of the target array, the working set for each individual transaction is reduced and the chances for transactional aborts decrease. Ideally, the splitting would be done in an automated manner by a compiler. In a context similar to ours, though not directly applicable to Cloak, Shih et al. [59] report on an extension of the Clang compiler that automatically splits transactions into smaller units with TSX-compatible working sets. Their approach is discussed in more detail in Section 9.

5.3 Toolset

We implemented the read-set preloading strategy from Section 5.2.1 in a small C++ container template library. The library provides generic read-only and writable arrays, which are allocated in "read" or "write" cache lines respectively. The programmer is responsible for arranging data in the specialized containers before invoking a Cloak-protected function. Further, the programmer decides which containers to preload. Most local variables and input and output data should reside in the containers. Further, all sub-function calls should be inlined, because each call instruction performs an implicit write of a return address. Avoiding this is important for large read sets, as even a single unexpected cache line in the write set can greatly increase the chances for aborts.

We also extended the Microsoft C++ compiler version 19.00. For programmer-annotated functions on Windows, the compiler adds code for starting and ending transactions, ensures that all code cache lines are preloaded (via read or execution according to Section 5.2.2) and, to not pollute the write set, refrains from unnecessarily spilling registers onto the stack after preloading. Both library and compiler are used in the SGX experiments in Section 7.1.

6 Retrofitting Leaky Algorithms

To evaluate Cloak, we apply it to existing weak implementations of different algorithms. We demonstrate that in all cases, in the local setting (Flush+Reload) as well as the cloud setting (Prime+Probe), Cloak is a practical countermeasure to prevent state-of-the-art attacks. All experiments in this section were performed on a mostly idle system equipped with an Intel i7-6700K CPU with 16 GB DDR4 RAM, running a default-configured Ubuntu Linux 16.10. The applications were run as regular user programs, not pinned to CPU cores, but sharing CPU cores with other threads in the system.

6.1 OpenSSL AES T-Tables

As a first application of Cloak, we use the AES T-table implementation of OpenSSL, which is known to be susceptible to cache attacks [4, 24, 26, 33, 35, 51]. In this implementation, AES performs 16 lookups to 4 different T-tables for each of the 10 rounds and combines the values using xor. The table lookups in the first round are T_j[x_i], where x_i = p_i ⊕ k_i, i ≡ j mod 4, p_i is a plaintext byte, and k_i a key byte. A typical attack scenario is a known-plaintext attack. By learning the cache line of the lookup index x_i, an attacker learns the upper 4 bits of x_i = p_i ⊕ k_i and thus the upper 4 bits of the secret key byte k_i. We wrap the entire AES computation together with the preloading step into a single TSX transaction. The preloading step fetches the 4 T-tables, i.e., it adds 4 KB of data to the read set. Figure 7 is a color matrix showing the cache hits per cache line and plaintext-byte value, measured with an asynchronous attack on the T-table cache lines using Prime+Probe and Flush+Reload. When protecting the T-tables with Cloak, the leakage from Figure 7 is not present anymore (cf. Figure 8).

Figure 7: Color matrix showing cache hits on an AES T-table (cache lines 0–15 over plaintext-byte values 00–f8). Darker means more cache hits. Measurement performed over roughly 2 billion encryptions. Prime+Probe depicted on the left, Flush+Reload on the right.

Figure 8: Color matrix showing cache hits on an AES T-table implementation protected by Cloak. Darker means more cache hits. Measurement performed over roughly 3 billion transactions (500 million encryptions) for Prime+Probe (left) and 4 million encryptions for Flush+Reload (right). The side-channel leakage is not visible in both cases.

We fixed the time for which the fully-asynchronous known-plaintext attack is run. The amount of time corresponds to roughly 2 billion encryptions in the baseline T-table AES implementation. While protected with Cloak, we observed a significant performance difference based on whether an attack is running simultaneously or not. This is due to the TSX transactions failing more often if under attack. While not under attack, only 0.1% of the transactions failed. This is not surprising, as the execution time of the protected algorithm is typically below 500 cycles. Hence, preemption or interruption of the transaction is very unlikely. Furthermore, cache evictions are unlikely because of the small read set size and the optimized preloading (cf. Section 5.2.1). Taking the execution time into account, the implementation protected with Cloak was 0.8% faster than the baseline implementation. The slight performance gain of Cloak while not being actively attacked is not unexpected, as Zacharopoulos [64] already found that TSX can improve performance.

Next, we measured the number of transactions failing under Prime+Probe and Flush+Reload. We observed 82.7% and 99.97% of the transactions failing for each attack, respectively. Failing transactions do not consume the full amount of time of one encryption, as they abort earlier in the function execution. Thus, the protected implementation started over 37% more encryptions than the baseline when under attack using Prime+Probe, and around 2% more when under attack using Flush+Reload. However, of these almost 3 billion transactions, only 500 million succeeded in the case of Prime+Probe, and only 4 million in the case of Flush+Reload. Thus, in total, the performance of our protected implementation under a Prime+Probe attack is only around 23% of the performance of the baseline implementation, and only 0.06% under a Flush+Reload attack.

This shows that deploying our countermeasure for this use case does not only eliminate cache side-channel leakage but can also be beneficial for performance. The lower performance while being actively attacked is still sufficient, especially given that the leakage is eliminated and that the attacker has to keep one whole CPU core busy, which uses up a significant amount of hardware resources whether or not Cloak is deployed.

6.2 Secret-dependent execution paths

Powerful attacks exploit secret-dependent execution paths in cryptographic libraries, e.g., to recover keys [62] and generated random numbers [67] by monitoring the execution paths. In this example, we model such functions based on a scenario in which the attacker tries to learn a secret value by observing which one of 16 functions is executed—adapted from previous Flush+Reload attacks [62, 67]. The attacker monitors the function addresses for cache hits and thus derives which function has been called. Each of the 16 functions only runs a small loop counting from 0 to 10 000 and is called from a code snippet consisting of a switch-case. We wrap the 16 functions, each spanning two cache lines, together with the preloading step into a single TSX transaction. The preloading step fetches the entire code of the 16 functions, i.e., 2 KB of code cache lines are added to the read set.

Figure 9: Color matrix showing cache hits on code (cache lines over functions 0–15). Darker means more cache hits. Measurement performed over roughly 100 million function executions for Prime+Probe (left) and 10 million function executions for Flush+Reload (right).

Figure 10: Color matrix showing cache hits on function code protected using Cloak. Darker means more function executions. Measurement performed over roughly 1 billion transactions (77314 function executions) for Prime+Probe (left) and 2 billion transactions (135211 function executions) for Flush+Reload (right). Side-channel function leakage is not visible in both cases.

As in the previous example, Cloak eliminates all leakage (cf. Figure 10). While not under attack, the program protected with Cloak started 0.7% more function executions than the unprotected victim program, and only 0.1% of the transactions failed, leading to an overall performance penalty of less than 1%. Under attack, many more function executions were started, but only a tiny fraction of the transactions succeeded, e.g., only 0.005% in the case of Flush+Reload; the functions are effectively more than 20 times slower. Thus, the overall performance under attack is reduced to 0.14% of the baseline performance in the case of Prime+Probe and 0.03% in the case of Flush+Reload.

6.3 RSA Square-and-Multiply example

We now demonstrate an attack against a square-and-multiply exponentiation and how Cloak allows to protect it. Square-and-multiply is commonly used in cryptographic implementations of algorithms such as RSA and is known to be vulnerable to cache side-channel attacks [54, 62]. Though cryptographic libraries move to constant-time exponentiations that are intended not to leak any information through the cache, we demonstrate our attack and protection on a vulnerable schoolbook implementation. The square-and-multiply algorithm takes 100 000 cycles to complete. Thus, wrapping the whole algorithm in one TSX transaction by itself only has a very low chance of success. Instead, we split the square-and-multiply algorithm into one small TSX transaction per exponent bit, i.e., adding the xbegin and xend instructions and the preloading step to the loop. This way, we increase the success rate of the TSX transactions significantly, while still leaking no information on the secret exponent bits. The preloading step fetches 1 cache line per function, i.e., 128 B of code are added to the read set.

Figure 11 shows a Flush+Reload cache trace for the multiply routine as used in the exponentiation in RSA. The plot is generated over 10000 traces. Each trace is aligned by the first cache hit measured on the multiply routine. The traces are then summed to produce the functions that are plotted. The baseline implementation has a clear peak for each 1 bit in the secret exponent. The same implementation protected with Cloak shows no significant changes in the cache hits over the full execution time.

Figure 11: Cache traces for the multiply routine (as used in RSA) over 10000 exponentiations. The secret exponent is depicted as a bit sequence along the time axis (cache hits over time in cycles; Cloak and baseline traces). The variant protected with Cloak does not have visual patterns that correspond to the secret exponent.

While not under attack, the performance of the implementation protected with Cloak is only slightly lower than the performance of the unprotected baseline implementation. To evaluate the performance in this case, we performed 1 million exponentiations. During these 1 million exponentiations, only 0.5% of the transactions failed, and the total runtime overhead we observed was about 1%. Unsurprisingly, while under attack we observed a significantly higher overhead, i.e., a factor of 982. This is because 99.95% of the transactions failed, i.e., the transactions for almost every single bit failed and had to be repeated. The high failure rate is not unexpected, as there is more time for unexpected cache evictions caused by other processes than in the AES encryptions from the previous example. It is important to note that the performance under attack is not essential, as the attacker keeps one or more CPU cores at full load, accounting for a significant performance loss with and without Cloak.

6.4 GTK keystroke example

In order to demonstrate the general applicability of Cloak, we reproduced the attack by Gruss et al. [24] on a recent version of the GDK library (3.18.9), which comes with Ubuntu 16.10. We investigated leakage in the GTK framework, which performs a binary search to translate raw keyboard inputs to platform-independent key names and key values. Gruss et al. [24] demonstrated that this leaks significant information on single keys typed by a user, in an automated cache template attack. Their attack on library version 3.10.8 has been partially resolved on current Linux systems with GDK library version 3.18.9. Instead of multiple binary searches, we identified only one binary search that is still performed and leaks information upon every keystroke. We attack this binary search in gdk_keyval_from_name, which is executed upon every keystroke in a GTK window.

As shown in Figure 12, the cache template matrix of the unprotected search reveals the pattern of the binary search, narrowing down on the darker area where the letter keys are and where the search ends. With the keystroke information protected by Cloak, the search pattern is disguised.

Figure 12: Cache template matrix showing cache hits on the binary search in gdk_keyval_from_name without protection (left) and with Cloak (right); cache lines 0x80700–0x83200 over keys A–Z. Darker means more cache hits. All measurements were performed with Flush+Reload. The pattern of the binary search is clearly visible for the unprotected implementation and not visible anymore when protected with Cloak.

With the Cloak-protected gdk_keyval_from_name, we could neither measure a difference in the perceived keyboard latency, nor measure a difference in the overall system load or the execution time of other processes when typing. The reason for this is that keystroke processing involves hundreds of thousands of CPU cycles spent in drivers and other functions. Furthermore, keystrokes are rate-limited by the OS and constrained by the speed of the user typing. Thus, the overhead we introduce is negligible for the overall latency and performance. We conclude that Cloak can be used as a practical countermeasure to prevent cache template attacks on fine-grained information such as keystrokes.

7 Side-Channel Protection for SGX

Intel SGX provides an isolated execution environment called enclave. All code and data inside an enclave is shielded from the rest of the system and is even protected against hardware attacks by means of strong memory encryption. However, SGX enclaves use the regular cache hierarchy and are thus vulnerable to cache side-channel attacks. Further, as enclaves are meant to run on untrusted hosts, they are also susceptible to a range of other side-channel attacks such as OS-induced page faults [61] and hardware attacks on the memory bus. In this section, we first retrofit a common machine learning algorithm with Cloak and evaluate its performance in SGX. Afterwards, we explore the special challenges that enclave code faces with regard to side channels and design extended countermeasures on top of Cloak. Specifically, we augment sensitive enclave code with Cloak and require that the potentially malicious OS honors a special service contract while this code is running.

7.1 Secure Decision Tree Classification

To demonstrate Cloak's applicability to larger working sets and its capability to support the SGX environment, we adapted an existing C++ implementation of a decision tree classification algorithm [49] using the toolset described in Section 5.3. The algorithm traverses a decision tree for an input record. Each node of the tree contains a predicate which is evaluated on features of the input record. As a result, observations of unprotected tree traversal can leak information about the tree and the input record. In this particular case, several trees in a so-called decision forest are traversed for each input record. Our Cloak-enhanced implementation of the algorithm contains three programmer-annotated functions, which translates into three independent transactions.

The most complex of these traverses a preloaded tree for a batch of preloaded input records. The batching of input records is crucial here for performance, as it amortizes the cost of preloading a tree. We give a detailed explanation and a code sample of tree traversal with Cloak in Appendix A.

Evaluation. We compiled our implementation for SGX enclaves using the extended compiler and a custom SGX software stack. We used a pre-trained decision forest for the Covertype data set from the UCI Machine Learning Repository⁵. Each tree in the forest consists of 30497–32663 nodes and has a size of 426 KB–457 KB. Each input record is a vector of 54 floating point values. We chose the Covertype data set because it produces large trees and was also used in previous work by Ohrimenko et al. [49], which also mitigates side channel leakage for enclave code.

We report on experiments executed on a mostly idle system equipped with a TSX- and SGX-enabled Intel Core i7-6700 CPU and 16 GB DDR4 RAM running Windows Server 2016 Datacenter. In our container library, we reserved eight L1 cache sets for writable arrays, resulting in an overall write set size of 4 KB. Figure 13 shows the cycles spent inside the enclave (including entering and leaving the enclave) per input record averaged over ten runs for differently sized input batches. These batches were randomly drawn from the data set. The sizes of the batches ranged from 5 to 260. For batches larger than 260, we observed capacity aborts with high probability. Nonetheless, seemingly random capacity aborts could also be observed frequently even for small batch sizes. The number of aborts also increased with higher system load. The cost for restarting transactions on aborts is included in Figure 13.

Figure 13: Average number of cycles per query for decision forest batch runs of different sizes. (Average cycles per query for Cloak and baseline, and average overhead (Cloak / baseline), over 0–300 queries per batch.)

As baseline, we ran inside SGX the same algorithm without special containers and preloading and without transactions. The baseline was compiled with the unmodified version of the Microsoft C++ compiler at the highest optimization level. As can be seen, the number of cycles per query greatly decreases with the batch size. Batching is particularly important for Cloak, because it enables the amortization of cache preloading costs. Overall, the overhead ranges between +79% (batch size 5) and +248% (batch size 260). The overhead increases with the batch size, because the baseline also profits from batching (i.e., "cache warming" effects and amortization of costs for entering/leaving the enclave), while the protected version experiences more transactional aborts for larger batches. We also ran a similar micro-benchmark outside SGX with more precise timings. Here, the effect of batching was even clearer: for a batch size of 5, we observed a very high overhead of +3078%, which gradually decreased to +216% for a batch size of 260.

Even though the experimental setting in Ohrimenko et al. [49] is not the same as ours (for instance, they used the official Intel SGX SDK, an older version of the compiler, and their input data was encrypted) and they provide different guarantees, we believe that their reported overhead of circa 6200% for a single query to SGX highlights the potential efficiency of Cloak.

⁵ https://archive.ics.uci.edu/ml

7.2 Service Contracts with the OS

Applying the basic Cloak techniques to sensitive enclave code reduces the risk of side-channel attacks. However, enclave code is especially vulnerable as the corresponding attacker model (see Section 3) includes malicious system software and hardware attacks. In particular, malicious system software, i.e., the OS, can amplify side-channel attacks by concurrently (A1) interrupting and resuming enclave threads [40], (A2) unmapping enclave pages [61], (A3) taking control of an enclave thread's sibling hyper-thread (HT) [11], or (A4) repeatedly resetting an enclave. A3 is of particular concern in Cloak, as TSX provides requirement R2 (see Section 4) only for the LLC. Hence, code and data in the read set are not protected against a malicious HT, which can perform attacks over the L1 and L2 caches from outside the enclave. In the following, we describe how Cloak-protected enclave code can ensure that the OS is honest and does not mount attacks A1–A4.

7.2.1 Checking the Honesty of the OS

While SGX does not provide functionality to directly check for A1 and A2 or to prevent them, it is simple with Cloak: our experiments showed, in line with Intel's documentation [31], that transactions abort with code OTHER (no bits set in the abort code) in case of interrupts or exceptions. In case unexpected aborts of this type occur, the enclave may terminate itself as a countermeasure.

Preventing A3 is more involved and requires several steps: before executing a transaction, we demand that (i) both HTs of a CPU core enter the enclave and (ii) remain there. To enforce (ii), the two threads write a unique marker to each thread's State Save Area (SSA) [32] inside the enclave. Whenever a thread leaves an enclave asynchronously (e.g., because of an interrupt), its registers are saved in its SSA [32]. Hence, every unexpected exception or interrupt necessarily overwrites our markers in the SSAs. By inspecting the markers, we can thus ensure that neither of the threads was interrupted (and potentially maliciously migrated to a different core by the OS). One thread now enters a Cloak transaction and verifies the two markers, making them part of its read set. Thus, as we confirmed experimentally, any interruption of the threads would overwrite an SSA marker in the read set and cause an immediate transactional abort with code CONFLICT (bit three set in the abort code).

Unfortunately, for (i), there is no direct way for enclave code to tell if two threads are indeed two corresponding HTs. However, after writing the SSA markers and before starting the transaction, the enclave code can initially conduct a series of experiments to check that, with a certain confidence, the two threads indeed share an L1 cache. One way of doing so is to transmit a secret (derived using the rdrand instruction inside the enclave) over a timing-less L1-based TSX covert channel: for each bit in the secret, the receiver starts a transaction, fills a certain L1 cache set with write-set cache lines, and busy-waits within the transaction for a certain time; if the current bit is 1, the sender aborts the receiver's transaction by touching conflicting cache lines of the same cache set. Otherwise, it touches non-conflicting cache lines. After the transmission, both threads compare their versions of the secret. In case bit-errors are below a certain threshold, the two threads are assumed to be corresponding HTs. In our experiments, the covert channel achieved a raw capacity of 1 MB/s at an error rate of 1.6% between two HTs. For non-HTs, the error rate was close to 50% in both cases, showing that no cross-core transmission is possible.⁶ While a malicious OS could attempt to eavesdrop on the sender and replay for the receiver to spoil the check, a range of additional countermeasures exists that would mitigate this attack. For example, the two threads could randomly choose a different L1 cache set (out of the 64 available) for each bit to transmit.

To protect against A4, the enclave may use SGX's trusted monotonic counters [3] or require an online connection to its owner on restart.

Finally, the enclave may demand a private LLC partition, which could be provided by the OS via Intel's recent Cache Allocation Technology (CAT) feature [32] or "cache coloring" [11, 37, 58]. A contract violation would become evident to the enclave through increased numbers of aborts with code CONFLICT.

⁶ Using the read set instead yields a timing-less cross-core covert channel with a raw capacity of 335 KB/s at an error rate of 0.4%.

8 Limitations and Future Work

Cache attacks are just one of many types of side-channel attacks and Cloak naturally does not mitigate all of them. Especially, an adversary able to measure the execution time of a transaction might still derive secret information. Beyond this, Cloak instantiated with Intel TSX may be vulnerable to additional side channels that have not yet been explored. We identified five potential side channels that should be investigated in more detail: First, the interaction of the read set and the "second level structure" (i.e., the bloom filter) is not documented. Second, other caches, such as translation-lookaside buffers and branch-prediction tables, may still leak information. Third, the Intel TSX abort codes may provide side-channel information if accessible to an attacker. Fourth, variants of Prime+Probe that deliberately evict read-set cache lines from L1 to the LLC but not to DRAM could potentially obtain side-channel information without causing transaction aborts. Fifth, the execution time of transactions, including in particular the timings of aborts, may leak information. Finally, it is important to note that Cloak is limited by the size of the CPU's caches, since code and data that have secret-dependent accesses must fit in the caches. TSX runtime behavior can also be difficult to predict and control for the programmer.

9 Related Work

Using HTM for Security and Safety. The Mimosa system [25] uses TSX to protect cryptographic keys in the Linux kernel against different forms of memory disclosure attacks. Mimosa builds upon the existing TRESOR system [47], which ensures that a symmetric master key is always kept in the CPU's debug registers. Mimosa extends this protection to an arbitrary number of (asymmetric) keys. Mimosa always only writes protected keys to memory within TSX transactions. It ensures that these keys are wiped before the corresponding transaction commits. This way, the protected keys are never written to RAM. However, Mimosa does not prevent cache side-channel attacks. Instead, for AES computations it uses AES-NI, which does not leak information through the cache. However, a cache attack on the square-and-multiply routine of RSA in the presence of Mimosa would still be possible. To detect hardware faults, the HAFT system [39] inserts redundant instructions into programs and compares their behavior at runtime. HAFT uses TSX to efficiently roll back state in case a fault was encountered.

Probably closest related to Cloak is the recent T-SGX approach [59]. It employs TSX to protect SGX enclave code against the page-fault side channel [61], which can be exploited by a malicious OS that unmaps an enclave's memory pages (cf. Section 7). At its core, T-SGX leverages the property that exceptions within TSX transactions cause transactional aborts and are not delivered to the OS. T-SGX ensures that virtually all enclave code is executed in transactions. To minimize transactional aborts, e.g., due to cache-line evictions, T-SGX's extension of the Clang compiler automatically splits enclave code into small execution blocks according to a static over-approximation of L1 usage. At runtime, a springboard dispatches control flow between execution blocks, wrapping each into a separate TSX transaction. Thus, only page faults related to the springboard can be (directly) observed from the outside. All transactional aborts are handled by the springboard, which may terminate the enclave when an attack is suspected. For T-SGX, Shih et al. [59] reported performance overheads of 4%–108% across a range of algorithms and, due to the strategy of splitting code into small execution blocks, caused only very low rates of transactional aborts.

Preventing resource sharing in multi-tenant systems can either be implemented through hardware modifications [12, 52], or by dynamically separating resources. Shi et al. [58] and Kim et al. [37] propose to use cache coloring to isolate different tenants in cloud environments. Zhang et al. [68] propose cache cleansing as a technique to remove information leakage from time-shared caches. Godfrey et al. [18] propose temporal isolation through scheduling and resource isolation through cache coloring. More recently, Zhou et al. [69] propose a more dynamic approach where pages are duplicated when multiple processes access them simultaneously. Their approach can make attacks significantly more difficult to mount, but not impossible. Liu et al. [42] propose to use Intel CAT to split the LLC, avoiding the fundamental resource sharing that is exploited in many attacks. In contrast to Cloak, all these approaches require changes on the OS level.

Detecting Cache Side-Channel Leakage. Other defenses aim at detecting potential side-channel leakage and attacks, e.g., by means of static source code analysis [13] or by performing dynamic anomaly detection using CPU performance counters. Gruss et al. [23] explore the latter approach and devise a variant of Flush+Reload that evades it. Chiappetta et al. [9] combine performance counter-based detection with machine learning to detect yet unknown attacks. Zhang et al. [65] show how performance counters can be used in cloud environments to detect cross-VM side-channel attacks. In contrast, Cloak follows the arguably stronger approach of mitigating attacks before they happen. Many attacks require only a small number of traces or even work with single measurements [17, 43, 62]. Thus, Cloak can provide protection where detection mechanisms fail due to the inherent detection delays or too coarse heuristics.

The strategy employed by T-SGX cannot be generally transferred to Cloak, as—for security—one would need
Further, reliable to reload the code and data of a sensitive function when- performance counters are not available in SGX enclaves. ever a new block is executed. Hence, this strategy is not likely to reduce cache conflicts, which is the main reason for transactional aborts in Cloak, but rather increase per- 10 Conclusions formance overhead. Like T-SGX, the recent Dej´ a` Vu [8] approach also attempts to detect page-fault side-channel We presented Cloak, a new technique that defends attacks from within SGX enclaves using TSX: an en- against cache side-channel attacks using hardware trans- clave thread emulates an non-interruptible clock through actional memory. Cloak enables the efficient retrofitting busy waiting within a TSX transaction and periodically of existing algorithms with strong cache side-channel updating a counter variable. Other enclave threads use protection. We demonstrated the efficacy of our ap- this counter for approximate measuring of their execu- proach by running state-of-the-art cache side-channel at- tion timings along certain control-flow paths. In case tacks on existing vulnerable implementations of algo- these timings exceed certain thresholds, an attack is as- rithms. Cloak successfully blocked all attacks in every sumed. Both T-SGX and Dej´ a` Vu conceptually do not attack scenario. We investigated the imperfections of In- protect against common cache side-channel attacks. tel TSX and discussed the potentially remaining leakage. Finally, we showed that one of the main limitations of Prevention of Resource Sharing. One branch of de- Intel SGX, the lack of side-channel protections, can be fenses against cache attacks tries to reduce resource shar- overcome by using Cloak inside Intel SGX enclaves.

References

[1] Acıiçmez, O., Gueron, S., and Seifert, J.-P. New branch prediction vulnerabilities in OpenSSL and necessary software countermeasures. In IMA International Conference on Cryptography and Coding (2007).
[2] Allan, T., Brumley, B. B., Falkner, K., van de Pol, J., and Yarom, Y. Amplifying side channels through performance degradation. In Annual Computer Security Applications Conference (ACSAC) (2016).
[3] Anati, I., Gueron, S., Johnson, S., and Scarlata, V. Innovative technology for CPU based attestation and sealing. In International Workshop on Hardware and Architectural Support for Security and Privacy (HASP) (2013).
[4] Bernstein, D. J. Cache-timing attacks on AES. Tech. rep., Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, 2005.
[5] Bhattacharya, S., and Mukhopadhyay, D. Curious case of Rowhammer: Flipping secret exponent bits using timing analysis. In Conference on Cryptographic Hardware and Embedded Systems (CHES) (2016).
[6] Bhattacharya, S., Rebeiro, C., and Mukhopadhyay, D. Hardware prefetchers leak: A revisit of SVF for cache-timing attacks. In IEEE/ACM International Symposium on Microarchitecture (MICRO) (2012).
[7] Brasser, F., Müller, U., Dmitrienko, A., Kostiainen, K., Capkun, S., and Sadeghi, A.-R. Software Grand Exposure: SGX cache attacks are practical. arXiv:1702.07521 (2017).
[8] Chen, S., Zhang, X., Reiter, M. K., and Zhang, Y. Detecting privileged side-channel attacks in shielded execution with Déjà Vu. In ACM Symposium on Information, Computer and Communications Security (ASIACCS) (2017).
[9] Chiappetta, M., Savas, E., and Yilmaz, C. Real time detection of cache-based side-channel attacks using hardware performance counters. Cryptology ePrint Archive, Report 2015/1034, 2015.
[10] Costan, V., and Devadas, S. Intel SGX explained. Cryptology ePrint Archive, Report 2016/086.
[11] Costan, V., Lebedev, I., and Devadas, S. Sanctum: Minimal hardware extensions for strong software isolation. In USENIX Security Symposium (2016).
[12] Domnitser, L., Jaleel, A., Loew, J., Abu-Ghazaleh, N., and Ponomarev, D. Non-monopolizable caches: Low-complexity mitigation of cache side channel attacks. ACM Transactions on Architecture and Code Optimization (TACO) (2011).
[13] Doychev, G., Köpf, B., Mauborgne, L., and Reineke, J. CacheAudit: A tool for the static analysis of cache side channels. ACM Transactions on Information and System Security (TISSEC) (2015).
[14] Evtyushkin, D., and Ponomarev, D. Covert channels through random number generator: Mechanisms, capacity estimation and mitigations. In ACM Conference on Computer and Communications Security (CCS) (2016).
[15] Evtyushkin, D., Ponomarev, D., and Abu-Ghazaleh, N. Jump over ASLR: Attacking branch predictors to bypass ASLR. In IEEE/ACM International Symposium on Microarchitecture (MICRO) (2016).
[16] Ferri, C., Bahar, R. I., Loghi, M., and Poncino, M. Energy-optimal synchronization primitives for single-chip multi-processors. In ACM Great Lakes Symposium on VLSI (2009).
[17] Ge, Q., Yarom, Y., Cock, D., and Heiser, G. A survey of microarchitectural timing attacks and countermeasures on contemporary hardware. Journal of Cryptographic Engineering (2016).
[18] Godfrey, M. M., and Zulkernine, M. Preventing cache-based side-channel attacks in a cloud environment. IEEE Transactions on Cloud Computing (2014).
[19] Goel, B., Titos-Gil, R., Negi, A., McKee, S. A., and Stenström, P. Performance and energy analysis of the restricted transactional memory implementation on Haswell. In International Conference on Parallel Processing (ICPP) (2014).
[20] Götzfried, J., Eckert, M., Schinzel, S., and Müller, T. Cache attacks on Intel SGX. In European Workshop on System Security (EuroSec) (2017).
[21] Gruss, D., Maurice, C., Fogh, A., Lipp, M., and Mangard, S. Prefetch side-channel attacks: Bypassing SMAP and kernel ASLR. In ACM Conference on Computer and Communications Security (CCS) (2016).
[22] Gruss, D., Maurice, C., and Mangard, S. Rowhammer.js: A remote software-induced fault attack in JavaScript. In Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA) (2016).
[23] Gruss, D., Maurice, C., Wagner, K., and Mangard, S. Flush+Flush: A fast and stealthy cache attack. In Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA) (2016).
[24] Gruss, D., Spreitzer, R., and Mangard, S. Cache template attacks: Automating attacks on inclusive last-level caches. In USENIX Security Symposium (2015).
[25] Guan, L., Lin, J., Luo, B., Jing, J., and Wang, J. Protecting private keys against memory disclosure attacks using hardware transactional memory. In IEEE Symposium on Security and Privacy (S&P) (2015).
[26] Gullasch, D., Bangerter, E., and Krenn, S. Cache Games – bringing access-based cache attacks on AES to practice. In IEEE Symposium on Security and Privacy (S&P) (2011).
[27] Herlihy, M., and Moss, J. E. B. Transactional memory: Architectural support for lock-free data structures. In International Symposium on Computer Architecture (ISCA) (1993).
[28] Hund, R., Willems, C., and Holz, T. Practical timing side channel attacks against kernel space ASLR. In IEEE Symposium on Security and Privacy (S&P) (2013).
[29] Intel Corp. Software Development Emulator v. 7.49. https://software.intel.com/en-us/articles/intel-software-development-emulator/ (retrieved 19/01/2017).
[30] Intel Corp. Intel 64 and IA-32 Architectures Optimization Reference Manual, June 2016.
[31] Intel Corp. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, September 2016.
[32] Intel Corp. Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3 (3A, 3B & 3C): System Programming Guide, September 2016.
[33] Irazoqui, G., Eisenbarth, T., and Sunar, B. S$A: A shared cache attack that works across cores and defies VM sandboxing – and its application to AES. In IEEE Symposium on Security and Privacy (S&P) (2015).
[34] Irazoqui, G., Eisenbarth, T., and Sunar, B. Cross processor cache attacks. In ACM Symposium on Information, Computer and Communications Security (ASIACCS) (2016).
[35] Irazoqui, G., Inci, M. S., Eisenbarth, T., and Sunar, B. Wait a minute! A fast, cross-VM attack on AES. In Symposium on Recent Advances in Intrusion Detection (RAID) (2014).
[36] Jang, Y., Lee, S., and Kim, T. Breaking kernel address space layout randomization with Intel TSX. In ACM Conference on Computer and Communications Security (CCS) (2016).
[37] Kim, T., Peinado, M., and Mainar-Ruiz, G. STEALTHMEM: System-level protection against cache-based side channel attacks in the cloud. In USENIX Security Symposium (2012).
[38] Kocher, P. C. Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems. In Advances in Cryptology – CRYPTO (1996).
[39] Kuvaiskii, D., Faqeh, R., Bhatotia, P., Felber, P., and Fetzer, C. HAFT: Hardware-assisted fault tolerance. In European Conference on Computer Systems (EuroSys) (2016).
[40] Lee, S., Shih, M.-W., Gera, P., Kim, T., Kim, H., and Peinado, M. Inferring fine-grained control flow inside SGX enclaves with branch shadowing. arXiv:1611.06952 (2016).
[41] Lipp, M., Gruss, D., Spreitzer, R., Maurice, C., and Mangard, S. ARMageddon: Cache attacks on mobile devices. In USENIX Security Symposium (2016).
[42] Liu, F., Ge, Q., Yarom, Y., McKeen, F., Rozas, C., Heiser, G., and Lee, R. B. CATalyst: Defeating last-level cache side channel attacks in cloud computing. In International Symposium on High Performance Computer Architecture (HPCA) (2016).
[43] Liu, F., Yarom, Y., Ge, Q., Heiser, G., and Lee, R. B. Last-level cache side-channel attacks are practical. In IEEE Symposium on Security and Privacy (S&P) (2015).
[44] Liu, Y., Xia, Y., Guan, H., Zang, B., and Chen, H. Concurrent and consistent virtual machine introspection with hardware transactional memory. In International Symposium on High Performance Computer Architecture (HPCA) (2014).
[45] Maurice, C., Weber, M., Schwarz, M., Giner, L., Gruss, D., Alberto Boano, C., Mangard, S., and Römer, K. Hello from the other side: SSH over robust cache covert channels in the cloud. In Symposium on Network and Distributed System Security (NDSS) (2017).
[46] Moghimi, A., Irazoqui, G., and Eisenbarth, T. CacheZoom: How SGX amplifies the power of cache attacks. arXiv:1703.06986 (2017).
[47] Müller, T., Freiling, F. C., and Dewald, A. TRESOR runs encryption securely outside RAM. In USENIX Security Symposium (2011).
[48] Nakaike, T., Odaira, R., Gaudet, M., Michael, M. M., and Tomari, H. Quantitative comparison of hardware transactional memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8. In International Symposium on Computer Architecture (ISCA) (2015).
[49] Ohrimenko, O., Schuster, F., Fournet, C., Mehta, A., Nowozin, S., Vaswani, K., and Costa, M. Oblivious multi-party machine learning on trusted processors. In USENIX Security Symposium (2016).
[50] Oren, Y., Kemerlis, V. P., Sethumadhavan, S., and Keromytis, A. D. The spy in the sandbox: Practical cache attacks in JavaScript and their implications. In ACM Conference on Computer and Communications Security (CCS) (2015).
[51] Osvik, D. A., Shamir, A., and Tromer, E. Cache attacks and countermeasures: The case of AES. In RSA Conference Cryptographer's Track (CT-RSA) (2006).
[52] Page, D. Partitioned cache architecture as a side-channel defence mechanism. Cryptology ePrint Archive, Report 2005/280.
[53] Payer, M. HexPADS: A platform to detect "stealth" attacks. In International Symposium on Engineering Secure Software and Systems (ESSoS) (2016).
[54] Percival, C. Cache missing for fun and profit. In Proceedings of BSDCan (2005).
[55] Pessl, P., Gruss, D., Maurice, C., Schwarz, M., and Mangard, S. DRAMA: Exploiting DRAM addressing for cross-CPU attacks. In USENIX Security Symposium (2016).
[56] Ristenpart, T., Tromer, E., Shacham, H., and Savage, S. Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds. In ACM Conference on Computer and Communications Security (CCS) (2009).
[57] Schwarz, M., Gruss, D., Weiser, S., Maurice, C., and Mangard, S. Malware Guard Extension: Using SGX to conceal cache attacks. In Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA) (2017).
[58] Shi, J., Song, X., Chen, H., and Zang, B. Limiting cache-based side-channel in multi-tenant cloud using dynamic page coloring. In IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W) (2011).
[59] Shih, M.-W., Lee, S., Kim, T., and Peinado, M. T-SGX: Eradicating controlled-channel attacks against enclave programs. In Symposium on Network and Distributed System Security (NDSS) (2017).
[60] Spreitzer, R., and Plos, T. Cache-access pattern attack on disaligned AES T-Tables. In Constructive Side-Channel Analysis and Secure Design (COSADE) (2013).
[61] Xu, Y., Cui, W., and Peinado, M. Controlled-channel attacks: Deterministic side channels for untrusted operating systems. In IEEE Symposium on Security and Privacy (S&P) (2015).
[62] Yarom, Y., and Falkner, K. Flush+Reload: A high resolution, low noise, L3 cache side-channel attack. In USENIX Security Symposium (2014).
[63] Yoo, R. M., Hughes, C. J., Lai, K., and Rajwar, R. Performance evaluation of Intel transactional synchronization extensions for high-performance computing. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2013).
[64] Zacharopoulos, G. Employing hardware transactional memory in prefetching for energy efficiency. Uppsala Universitet (report), 2015. http://www.diva-portal.org/smash/get/diva2:847611/FULLTEXT01.pdf (retrieved 20/02/2017).
[65] Zhang, T., Zhang, Y., and Lee, R. B. CloudRadar: A real-time side-channel attack detection system in clouds. In Symposium on Recent Advances in Intrusion Detection (RAID) (2016).
[66] Zhang, X., Xiao, Y., and Zhang, Y. Return-oriented flush-reload side channels on ARM and their implications for Android devices. In ACM Conference on Computer and Communications Security (CCS) (2016).
[67] Zhang, Y., Juels, A., Reiter, M. K., and Ristenpart, T. Cross-tenant side-channel attacks in PaaS clouds. In ACM Conference on Computer and Communications Security (CCS) (2014).
[68] Zhang, Y., and Reiter, M. K. Düppel: Retrofitting commodity operating systems to mitigate cache side channels in the cloud. In ACM Conference on Computer and Communications Security (CCS) (2013).
[69] Zhou, Z., Reiter, M. K., and Zhang, Y. A software approach to defeating side channels in last-level caches. In ACM Conference on Computer and Communications Security (CCS) (2016).

A Cloak Code Example

Listing 2 gives an example of the original code for tree traversal and its Cloak-protected counterpart. In the original code, a tree is stored in a Nodes array where each node contains a feature, fdim, and a threshold, fthresh. Access to a node determines which feature is used to make a split, and its threshold on the value of this feature indicates whether the traversal continues left or right. For every record batched in Queries, the code traverses the tree according to the feature values in the record. Once a leaf is reached, its value is written as the output of this query in LeafIds. The following features of Cloak are used to protect code and data accesses of the tree traversal. First, it uses Cloak data types to allocate Nodes and Queries in the read set and LeafIds in the write set. This ensures that data is allocated as described in Section 5.2.1, obliviously to the programmer. The parameters NCS_R and NCS_W indicate the number of cache sets to be used for the read and write sets. Second, the programmer indicates to the compiler which function should be run within a transaction by using the tsx_protected annotation. The programmer calls preload (lines 11, 12, and 15) on sensitive data structures. The repeated preloading of the writable array leafids in line 15 refreshes the write set to prevent premature evictions.

Listing 2: Decision tree classification before and after Cloak: the code in black is shared by both versions, the code before Cloak is in dark gray (lines 1-3), and Cloak-specific additions are in blue (lines 5-7, 11, 12, 15).

 1 using Nodes = nelem_t*;
 2 using Queries = Matrix;
 3 using LeafIds = uint16_t*;
 4
 5 using Nodes = ReadArray;
 6 using Queries = ReadMatrix;
 7 using LeafIds = WriteArray;
 8
 9 void tsx_protected lookup_leafids(
10     Nodes& nodes, Queries& queries, LeafIds& leafids) {
11   nodes.preload();
12   queries.preload();
13
14   for (size_t q = 0; q < queries.entries(); q++) {
15     if (!(q % 8)) leafids.preload();
16     size_t idx = 0, left, right;
17     for (;;) {
18       auto& node = nodes[idx];
19       left = node.left;
20       right = node.right_or_leafid;
21       if (left == node) {
22         leafids[q] = right;
23         break;
24       }
25       if (queries.item(q, node.fdim) <= node.fthresh)
26         idx = left;
27       else
28         idx = right;
29     }
30   }
31 }
