STBPU: A Reasonably Safe Branch Predictor Unit

Tao Zhang, Timothy Lesch, Kenneth Koltermann, Dmitry Evtyushkin
William & Mary

arXiv:2108.02156v1 [cs.CR] 4 Aug 2021

Abstract: Modern processors have suffered a deluge of dangerous side channel and speculative execution attacks that exploit vulnerabilities rooted in branch predictor units (BPU). Many such attacks exploit the shared use of the BPU between unrelated processes, which allows malicious processes to retrieve sensitive data or enable speculative execution attacks. Attacks that exploit collisions between different branch instructions inside the BPU are among the most dangerous. Various protections and mitigations have been proposed, such as CPU microcode updates, secured designs, fencing mechanisms, and invisible speculation. While some effectively mitigate speculative execution attacks, they overlook the BPU as an attack vector, leaving it prone to malicious collisions and to resulting critical penalties such as advanced micro-op cache attacks. Furthermore, some mitigations severely hamper the accuracy of the BPU, resulting in increased CPU performance overhead. To address these, we present the secret token branch predictor unit (STBPU), a branch predictor design that mitigates collision-based speculative execution attacks and BPU side channels whilst incurring little to no performance overhead. STBPU achieves this by customizing internal data representations for each software entity requiring isolation. To prevent more advanced attacks, STBPU monitors hardware events and preemptively changes how STBPU data is stored and interpreted.

I. INTRODUCTION

Over the last few decades, a large number of protection techniques against software attacks have been introduced, making exploitation of traditional attack vectors such as code injection or return-oriented programming challenging. With a decreasing number of available targets for software attacks, the attention of adversaries is more frequently drawn to exploitable weaknesses in hardware. Although hardware attacks such as microarchitectural side channels [1], [3], [6], [19], [26], [44], [56], [57], covert channels [17], [24], [49], [54], and power analysis [2], [34], [48], [50], [55] attacks have been known for a long time, only recently did researchers demonstrate the true power of microarchitectural attacks with newly discovered speculative execution attacks, such as Meltdown [42], [73] and Spectre [13], [32], [33], [35], [47], [67]. These attacks are based on speculative (transient) execution, a performance optimization technique present in nearly all of today's processors. While this technique improves CPU performance, with a carefully crafted exploit it completely undermines memory protection, giving unauthorized users the ability to read arbitrary memory [33], [42] and disable crucial protections [32].

Microarchitectural attacks are possible because performance optimizations such as caches, prefetchers, and various predictors were not traditionally designed with security in mind. For example, data structures used to implement these mechanisms are commonly shared, making various conflicts possible. Some conflicts result in leakage of sensitive data. One such mechanism is the branch predictor unit (BPU). Substructures within the BPU are typically shared, and the stored data is compressed, making it prone to various branch collisions [15], [64]. This enables attacks such as side channels [2], [16], [18] that are capable of leaking encryption keys or bypassing ASLR, and the recently introduced speculative execution attacks [32], [33]. At the same time, shared BPUs are beneficial for performance. They allow high utilization of hardware structures to reduce the cost and enable efficient branch history accumulation [52]. Therefore, naïve protections which disable sharing or flush BPU structures have high performance overhead. Recent microcode updates introduced as a countermeasure against Spectre attacks [27] demonstrated that the overhead from naïve protections can be as high as 440% [58], [71].

Despite significant efforts directed towards designing other secure microarchitectural components, e.g., caches [14], [31], [43], [63], [76], [78], [80] and memory buses [4], [40], [66], secure BPU designs remain limited to a handful of attempts [20], [38], [77]. More importantly, none of the existing approaches completely eliminates BPU vulnerabilities such as the recent µop cache attacks [65]. We propose the Secret-Token Branch Prediction Unit (STBPU), a safe BPU design aimed to protect against collision-based BPU attacks and eliminate BPU side channels.

STBPU prevents collision-based side channel and speculative execution attacks by preventing software entities from creating controlled branch instruction collisions and thus affecting each other in an unsafe way. This is done by customizing the branch representation for each software entity in the form of address mappings and by encrypting data stored in the BPU. In STBPU, each software entity is provided with a unique, randomly generated secret token (ST) that customizes data representations. STBPU constantly monitors hardware events for signs of active attacks and re-randomizes STs to prevent attackers from reverse-engineering them.

This paper makes the following contributions:
• We propose STBPU, a safe BPU design that protects against speculative execution attacks, including fast algorithm attacks, provides strong isolation guarantees to eliminate BPU side channels, and incurs low overheads.
• We provide insight on how to adapt the STBPU design to invalidate the recent µop cache attacks, which can bypass mitigations such as invisible speculation and fencing.
• We design an automated framework to create lightweight ST-dependent remapping functions and validate STBPU with in-depth security analysis.

• We evaluate STBPU performance on advanced prediction mechanisms, e.g., TAGE-SC-L and Perceptron, and show low overhead even with extreme security settings.

Fig. 1: STBPU protection region vs. the others. [Diagram of three attack surfaces (history-based, collision-based, and µop cache) with example attacks placed on them: Spectre-v1, Spectre-v2, BranchScope, advanced algorithm attacks such as GEM, transient trojan attacks, "I See Dead µops" variants 1 and 2, and future attacks. Three protection regions are overlaid: non-BPU, other BPU, and STBPU. Colored areas denote attack groups/surfaces and hollow shapes denote protection types/regions. A region that crosses a variant indicates incomplete protection; a variant that crosses multiple surfaces uses multiple attack primitives.]

II. BACKGROUND & RELATED WORKS

A. STBPU's motivation and scope

Figure 1 lays out two main groups of BPU-centered speculative execution attacks, collision-based (e.g., Spectre-v2) and history-based (e.g., Spectre-v1), which utilize different BPU properties.

The first group exploits branch collisions (aliasing) that appear when two branch instructions located at different addresses map into the same BPU entry and affect each other's behavior (target). As depicted by the dotted circles in Figure 1, existing secure BPU designs and enhancements tend to resolve these attacks, but all have incomplete protection coverage or limited performance support. We provide details in §II-D.

The second group exploits the most fundamental property of the BPU: it records branch statistics and then activates speculative execution based on those statistics. This makes the BPU trainable by attackers who supply branch execution history. These attacks are often mitigated by non-BPU protections, including secured cache designs [60], [79], [81]; shadow structures to hide the effects of erroneous speculative executions [30], [82]; tracking of speculative data flow for invisible speculation [41], [61], [84]; and offline gadget detectors [21], [53].

Depicted as solid squares in Figure 1, non-BPU protections that try to cover both groups of attacks often introduce high performance and resource costs. More importantly, they overlook the BPU as an attack vector, i.e., the BPU is still prone to collisions, resulting in a critical security penalty. For instance, researchers recently discovered that the micro-op cache (also called the decoded stream buffer, DSB) can serve as a disclosure primitive [65]. Depicted as the lightest grey area in Figure 1, this new timing channel can leverage the BPU for more powerful transient execution attacks that do not require implicit or explicit data access, bypassing several existing mitigations such as STT [84] and context-based fencing [72] and voiding their security guarantees for both Spectre-v1 and Spectre-v2.

This incompleteness in protection motivates us to propose STBPU, a safe BPU design that is immune to collision-based attacks and BPU side channels and that reduces the inherent overhead of non-BPU protections on the way to all-around security. As depicted by the dash-dotted circle, STBPU focuses on protecting against collision-based attacks but is not limited to them. The aforementioned new µop cache attacks are interesting cases, showing that the STBPU design can mitigate a more powerful Spectre-v1 variant.

In the "I See Dead µops" attack variant 2 [65], an attacker first primes certain micro-op cache sets and performs Spectre-v1-style mis-training on an authorization check and a subsequent transmitting instruction in a victim method. Next, the attacker supplies untrusted input, so the authorization check will fail. If enabled, non-BPU mitigations such as fencing or speculative data flow control will restrict the following instructions from being dispatched or becoming visible. However, they do not prevent the secret-dependent indirect branch from being speculatively fetched, which leaves a footprint in the micro-op cache. As a result, the attacker can probe the micro-op cache set and infer the secret. Such attacks become inapplicable with the STBPU design of the DSB, because ST remapping makes DSB indexing non-deterministic. As a result, in both the priming and the probing stage, the attacker loses control of both the entry branch address and the DSB set mapping, even within the same address space.

The DSB caches streams of decoded micro-ops from multiple decoders after the branch prediction stage, so it can naturally use the ST-remapped address for its own indexing without additional latency. As the DSB is enabled only through branches [28], consistent isolation can be maintained via ST remapping and ST re-randomization. We analyze a similar mechanism with more complicated BPU attacks in §VI. Since the STBPU substructure modifications are detailed in §IV, we omit the details of the DSB discussion, as it is similar and simpler.

History-based attacks such as Spectre-v1 leverage the correlation between the natural consequence of prediction and the unsafe follow-up front-end acceleration. Thus, we believe Spectre-v1 mitigation should live outside the BPU, for example in existing non-BPU protections, an orthogonal safe speculative execution control, or a safe loop stream detection (LSD) unit. STBPU benefits these protections by largely reducing their surface of enforcement, granting better performance.

The rest of this paper focuses on collision-based attacks and BPU side channels. As evidenced by a large number of well-documented dangerous exploits, their consequences range from critical data leakage to speculative execution of arbitrary code. We categorize dangerous exploits and their mechanisms in §II-C and analyze the security of STBPU against them in §VI.

B. BPU Baseline Model

Below we present a baseline BPU as an example to illustrate our STBPU design in practice. The baseline reflects the branch predictor (including structure sizes) used in Intel Skylake. Derived from recent reverse-engineering works [16], [18], [33], [36], [46], [85], it represents a generalization of mechanisms used in modern Intel processors. STBPU can be applied to different branch predictors, e.g., [68], [69], as it adds security guarantees without altering any prediction principles. We demonstrate this by adapting the STBPU to protect several advanced predictors such as

TAGE-SC-L [70] and Perceptron [29], and show negligible performance overheads in their STBPU-protected alternatives (with ST_ prefixes) in §VII.

Fig. 2: BPU with STBPU components highlighted. [Diagram of the BPU lookup paths by branch type: returns (ip: ret (%rsp)) use the RSB; indirect jumps/calls (ip: jmp ($addr)) use the BHB and BTB; direct jumps/calls (ip: jmp + n) use the BTB for target prediction; conditional jumps/calls (ip: jcc + n) use the PHT and GHR for taken/not-taken prediction. Functions are numbered 1 to 5. STBPU components are marked ψ (remapping) and φ (encryption).]

The BPU consists of the following main structures: shift registers such as the global history register (GHR) and branch history buffer (BHB), the branch target buffer (BTB), the pattern history table (PHT), and the return stack buffer (RSB). Different structures are used in combination to generate predictions depending on the branch type. Figure 2 depicts how these structures are utilized during a BPU lookup and highlights the components that are modified by STBPU. The figure also shows several important functions which are referenced later.

Shift registers such as the GHR and BHB are used in the BPU as a low-cost way of retaining complex branch history. The GHR stores the global history of taken/not-taken branches. The BHB is used by the indirect predictor to store compressed branch address history for indirect branches. It is used as part of the BTB access to predict targets that depend on both the branch virtual address and the branch context (of previously executed branches). When a branch is executed, its virtual address is folded using XOR and mixed with the current state of the BHB [33].

The BTB caches target addresses of branch instructions. It is implemented as an 8-way, 4096-entry table. Each entry stores a truncated target address consisting of its 32 least significant bits. Function 5 is then used to convert a 32-bit entry into a 48-bit virtual address during prediction by combining the 16 higher bits of the branch instruction pointer with the 32 lower bits from the BTB. While the BTB is used to store targets for all branch types, it has two addressing modes. In mode one, the virtual address of a branch instruction is used to compute an index and tag. In mode two, the BHB is used in addition to the virtual address to perform a lookup. Mode two is only used when predicting indirect branches, and serves as a fall-back mechanism for predicting returns. This addressing mode enables storing multiple targets for a single indirect branch depending on the context [16], [33], [85].

The PHT is a large (16k-entry) table of n-bit (e.g., 2-bit) saturating counters; each implements a simple FSM with states ranging from strongly not-taken to strongly taken. This structure is used as a base predictor for the direction of conditional branches. Previous studies [7], [18], [25], [85] indicated the presence of a mechanism similar to gshare [83] with two modes of addressing: a simple 1-level mode, where the virtual address of the branch is used to find a PHT entry, and a more complex 2-level mode, where the branch virtual address is hashed with the global history register (GHR), enabling accurate prediction of complex patterns.

The RSB is used to predict return instructions. It is implemented as a fixed-size (16-entry) hardware stack [36], [46]. A call instruction pushes a return address onto the RSB and a return instruction pops it. Similarly to the BTB, the RSB stores only 32 bits of the target. Due to its limited capacity, the RSB can underflow. In this case, returns are treated as indirect branches, and the indirect predictor is used for prediction.

C. BPU Centered Attacks

We detail the entire collision-based attack surface in Table I. First, we classify attacks by where the adversarial effect takes place, either within the attacker's code (home effect) or in the victim's code (away effect). Second, we classify by the kind of effect. A collision in BPU structures results either in data placed by another software entity being reused, or in such data being evicted and replaced. We refer to these as reuse-based and eviction-based attacks. The table summarizes attacks caused by BPU collisions and their steps. Note that the same type of collision can enable different adversarial effects. For instance, a collision in the BTB between two different branches can result in 1) BTB-data reuse, 2) BTB eviction, and 3) activation of malicious speculative execution. While 1) and 2) result in side channel leakage of branch-related information, 3) is used as part of a speculative execution attack to reveal the victim's memory contents. As can be seen from the table, there is a diverse range of dangerous collision-based attacks. By eliminating collision-based BPU attacks, STBPU can substantially improve the security properties of the CPU.

Two BPU features that are present in nearly all CPUs make collision-based attacks possible. First, the BPU data structures are shared among all software executed on a CPU core, enabling branch collisions between different processes. Second, the BPU operates with compressed virtual addresses. For instance, out of the 48 bits of a branch virtual address, only 30 are utilized, and these bits are then further compressed [33]. This allows branch collisions within the same virtual address space, such as collisions between different branches in the kernel and in user space [16]. The deterministic nature of the BPU makes it possible for an attacker to trigger collisions in a controlled way. Our proposed solution aims at eliminating such determinism to prevent attacks.

D. Existing Protections Against BPU Attacks

Branch predictor collision-based attacks can be mitigated in several ways. While some protections are more specific and cover only certain attack scenarios, others provide protection from multiple attacks. For instance, to avoid leaking secret key bits from an RSA secret key operation, one can rewrite the exponentiation code to avoid secret-key-dependent branches [5]. If the branch predictor is flushed on context switches, attacks such as Spectre-v2, Jump-over-ASLR, and others can be fully mitigated.
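The PHT description in §II-B can be made concrete with a small software model. The following is a minimal sketch, not the reverse-engineered Skylake logic: the index functions (low address bits, optionally XORed with the GHR) are assumptions modeled on gshare, and the class and method names are illustrative.

```python
PHT_ENTRIES = 1 << 14   # 16k entries, as in the baseline model
GHR_BITS = 18           # history length is an assumption for this sketch


class PatternHistoryTable:
    """Toy PHT of 2-bit saturating counters with 1-level and 2-level indexing."""

    def __init__(self):
        # Counter states: 0 = strongly not-taken ... 3 = strongly taken
        self.counters = [1] * PHT_ENTRIES
        self.ghr = 0

    def index(self, addr, two_level=False):
        idx = addr & (PHT_ENTRIES - 1)              # 1-level: address bits only
        if two_level:
            idx ^= self.ghr & (PHT_ENTRIES - 1)     # 2-level: mix in global history
        return idx

    def predict(self, addr, two_level=False):
        return self.counters[self.index(addr, two_level)] >= 2

    def update(self, addr, taken, two_level=False):
        i = self.index(addr, two_level)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the global history register
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << GHR_BITS) - 1)


pht = PatternHistoryTable()
branch = 0x401234
alias = branch + PHT_ENTRIES   # differs only in bits the 1-level index ignores
```

Because the 1-level index ignores the upper address bits, training `branch` also changes the prediction for `alias`, which is the kind of deterministic aliasing that the attacks in §II-C rely on.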

TABLE I: Attack surface classification for BPU collision-based attacks by event and adversarial effect types

Reuse-based (RB), home effect (HE)
  BTB: 1. V: jmp s → d; BTB ← (s, d)   2. A: jmp s' → d'; (s, d) reused   3. A sees misprediction
  PHT: 1. V: jt s → d; PHT ← (s, t)   2. A: jnt s → s + 1; (s, t) reused   3. A sees misprediction
  RSB: 1. V: call s → d; RSB ← (s + 1)   2. A: ret → s'; (s + 1) reused   3. A sees misprediction
  Adversarial effects: source and target branch addresses and calls, taken/nontaken patterns [3], [16], [18], [36], [39]

Reuse-based (RB), away effect (AE)
  BTB: 1. A: jmp s → d; BTB ← (s, d)   2. V: jmp s' → d' | H(s) = H(s')   3. V speculatively executes d
  PHT: 1. A: jnt s → d; PHT ← (s, nt)   2. V: jt s → d; (s, nt) reused   3. V speculatively executes s + 1
  RSB: 1. A: call s → d; RSB ← (s + 1)   2. V: ret → s'; (s + 1) reused   3. V speculatively executes s + 1
  Adversarial effects: timing channel due to A controlling predictions in V [3], speculative execution attacks [13], [33], [36], [46], [67], [85]

Eviction-based (EB), home effect (HE)
  BTB: 1. A: jmp s → d   2. V: jmp s' → d'; BTB ← (s', d') | H(s) = H(s'), (s, d) is evicted   3. A sees s mispredicted
  PHT: PHT entries are not evicted
  RSB: 1. A: call s → d; RSB ← (s + 1), then fills RSB   2. V: call s' → d'; RSB ← (s' + 1), evicting (s + 1)   3. A sees misprediction
  Adversarial effects: V's jmp taken/nontaken [3] and call patterns, branch instruction virtual address [33]

Eviction-based (EB), away effect (AE)
  BTB: 1. V: jmp s → d; BTB ← (s, d)   2. A: jmp s' → d'; BTB ← (s', d') | H(s) = H(s')   3. V: CPU uses static prediction
  PHT: PHT entries are not evicted
  RSB: 1. V: call s → d; RSB ← (s + 1)   2. A: overflows RSB by looping call s' → d'   3. V: CPU uses static prediction
  Adversarial effects: timing channel due to A forcing static default predictions [3], and speculative execution of a gadget at the static prediction address [12]

Notation: A: attacker; V: victim; jmp s → d: jump from s to d; call s → d: call function d from callsite s; BTB/PHT/RSB ← (s, d): store target d for branch s in BTB/PHT/RSB; H(): BTB/PHT hash function; s + 1: next instruction after s.
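The collision condition H(s) = H(s') in Table I is easy for an attacker to satisfy when the hash is deterministic and ignores part of the address. The sketch below illustrates this under stated assumptions: `toy_hash` is a hypothetical stand-in for H() (truncation to 30 bits followed by XOR folding), not the actual hardware function, and the 9-bit index width is chosen only for illustration.

```python
USED_BITS = 30    # the text states that only 30 of 48 address bits are utilized
INDEX_BITS = 9    # illustrative index width for this sketch


def toy_hash(addr: int) -> int:
    """Hypothetical H(): truncate to USED_BITS, then XOR-fold to INDEX_BITS."""
    x = addr & ((1 << USED_BITS) - 1)
    h = 0
    while x:
        h ^= x & ((1 << INDEX_BITS) - 1)
        x >>= INDEX_BITS
    return h


def find_alias(victim_addr: int) -> int:
    """Return s' != s with H(s') == H(s) by flipping a bit the hash ignores."""
    return victim_addr ^ (1 << 40)


victim = 0x7F12_3456_789A
alias = find_alias(victim)
print(hex(victim), hex(alias), toy_hash(victim) == toy_hash(alias))
```

No search is needed here: any address that differs only above bit 29 collides, which is why a keyed remapping over the full 48-bit address removes this shortcut.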

Intel has proposed a set of microcode-based protections which aim to mitigate speculative execution attacks on legacy CPUs by restricting BPU structure sharing. These protections include Indirect Branch Restricted Speculation (IBRS), Indirect Branch Prediction Barrier (IBPB), and Single Threaded Indirect Branch Prediction (STIBP) [27]. IBRS prevents higher level processes from speculating with entries provided by lower level processes by flushing. IBPB provides protection by flushing the contents of the BPU on context switches. While effectively removing the interference of branch prediction with other processes, flushing the BPU removes useful history, resulting in a large performance reduction [58], [71]. Additionally, recent research demonstrated exploitable branch collisions within the same address space [65], [85]. Therefore, enforcing security only during context switches is not complete. STIBP logically segments the BPU so that branch predictions on the same physical core do not interfere with each other. In contrast, STBPU works regardless of context switch activity. We choose to compare the performance of STBPU to protection mechanisms derived from such protections because, when properly activated, they can provide strong BPU isolation.

Figure 1 shows that existing secure BPU designs are limited in protection surface and performance support. BRB [77] stores and reloads the entire history of the directional predictor for each process, which only defends against BranchScope. The Exynos security enhancement [20] also utilizes XOR-based encryption in the branch predictor. However, this approach only protects (encrypts) target information, and only for indirect branches and returns, which still allows both high-efficiency branch poisoning and BPU side-channel attacks. Although Exynos generates CONTEXT_HASH from multiple entropy sources, the computation is deterministic and requires several levels of hashing and iterative entropy spreading, which not only leaves room for offline replay attacks but also introduces a latency of multiple cycles. BSUP [38] first encrypts the PC and then encrypts the entries of the BPU. This design is not applicable to the SMT scenario, and it differs from our work in that we create new remapping functions to prevent separate software entities from interfering with each other. In addition, the key re-randomization schemes in BSUP and [...] lack consideration of critical kinds of adversarial effects, making both of them vulnerable to many attacks [2], [12], [59], [61], [65].

III. THREAT MODEL

We assume a powerful attacker that has a complete understanding of all hardware components and structures in the STBPU. The attacker has access to normal reverse-engineering resources, such as time measurements and performance counters, and has access to a wide variety of hardware covert channels. The STBPU design calls for new special-purpose registers, as detailed in §IV; the adversary is assumed to be unable to read or modify the contents of these registers. That role is delegated to a privileged software entity (OS, hypervisor) which the attacker does not control.

We consider the attacks presented in Table I, including both side channel attacks, in which the victim executes a branch that depends on sensitive data, and speculative execution attacks, in which the victim is forced to speculatively execute leakage gadget code. We assume the following two attack scenarios:

Sensitive Process as Victim. In this scenario, an attacker tries to learn sensitive data from a victim process by manipulating the BPU state and recording observations. The attacker controls a user-level process co-located on the same CPU core and is capable of performing activities that are normally allowed to untrusted processes, such as accessing fine-grained hardware timers via rdtscp instructions. We assume the victim and attacker can either execute on two virtual cores within the same physical core or share the same logical core with time-slicing. This scenario also includes the recently introduced transient trojans [85], where collisions occurring within the same memory segments are exploited.

Kernel/VMM as Victim. The attacker takes the form of a software entity with a lower privilege level, i.e., an untrusted user process. The attacker tries to learn sensitive data owned by a higher privileged entity (OS kernel or VMM) by manipulating the BPU state and recording observations. Here, the victim and

attacker share the same contiguous virtual address space. The attacker is restricted from executing privileged instructions.

IV. STBPU DESIGN

As discussed in §II, BPU attacks are possible due to deterministic mapping mechanisms, which allow attackers to create branch collisions. STBPU aims to stop these attacks by replacing these deterministic mechanisms with keyed remapping mechanisms which prevent branch collision construction. The design philosophy of STBPU is to create different data representations for separate software entities inside BPU data structures. Each software entity requiring isolation is assigned a unique secret token (ST), which is a random integer that controls how branch virtual addresses are mapped into BPU structures. This ST is also used to encrypt and decrypt stored data. Compared to naïve protections based on flushing or partitioning, our approach has a number of benefits.

Consider a protection scheme where branch target poisoning is prevented by flushing the BTB on context switches. Invalidating the entire branch target history will negatively affect performance in cases where context switches are frequent. Similarly, to protect from target collisions between kernel and user branches, the BTB must be flushed on mode switches (e.g., syscalls). Partitioning hardware resources reduces the effective capacity of BPU structures, resulting in a higher miss rate and lower prediction accuracy. Instead, a customized mapping approach allows separate software entities to co-exist in the BPU with minimal performance overhead; performance is evaluated in §VII. STBPU utilizes two key approaches to enable safe resource sharing. First, we make collision creation challenging by ensuring that all remapping functions depend on both the branch address and the ST. Second, by monitoring hardware events, STBPU detects when a potential attacker process has recovered sufficient information to enable deterministic collision creation. The ST of the current process in the BPU is re-randomized once certain (OS-controlled) thresholds are reached. Note that in the STBPU design, the OS is trusted and is responsible for setting parameters such as the re-randomization threshold. This is a common assumption for systems protecting against microarchitectural attacks, since compromising the OS gives the attacker full control over the system, making a side channel attack unnecessary. STBPU can also be adapted to systems where the OS is not trusted (e.g., SGX); in that case another system component, such as the enclave entry routine, can be given access to the ST. Alternatively, the simple logic of ST management in STBPU should also enable a hardware-only implementation. Re-randomizing the ST effectively resets the customization of the BPU data representation for that process. Although branch history is lost in this case (by making it unusable), our analysis indicates that such events are infrequent. Re-randomizing the ST of one process does not remove the stored history of a process with a different ST. This is the key difference compared to flushing-based approaches. We derive the re-randomization thresholds through the analysis in §VI.

While potentially dangerous, branch history sharing between programs can benefit performance. Consider a server application which spawns a new process for each incoming connection. Since each process executes the same code, data accumulated in the BPU can be used by the newly spawned process; this allows the new process to avoid costly BPU training because the history of the parent process was retained. STBPU permits selective history sharing by allowing the OS to let multiple threads of the same program use the same ST value. However, when sharing is not desired, each thread (or child process) can be assigned its own ST.

A. Hardware Mechanisms and Interfaces

To design a secure BPU with comparable performance and hardware cost, we restrict ourselves to only modifying BPU mapping mechanisms, adding registers, and encrypting stored targets. The minimalistic nature of the proposed changes will facilitate adoption of our approach. In our design, each logical (hyperthreaded) core is provided with an extra register to store the ST of the current process. Only the OS is allowed to read or modify these registers, and they are inaccessible in unprivileged CPU mode. As such, the OS facilitates history retention across context/mode switches by loading the appropriate STs. We also add several MSRs that store thresholds and counters for automatic ST re-randomization. These MSRs monitor the events that indicate an active attacker process. We monitor two events: 1. branch mispredictions, which include incorrectly predicted directions of conditional branches and incorrectly predicted targets of any branch, and 2. BTB evictions. In §VI, we explain how these events are utilized to deter BPU attacks. Initially, the counter values are set to their respective threshold values. When an event is observed, the corresponding counter is decremented. When a counter reaches zero, the current ST is re-randomized, and the CPU resets the counter to the threshold value. The OS treats these registers as part of the software context, saving and restoring their values on context/mode switches. Modern CPUs have embedded high-performance pseudo-random number generators [45]; we propose using these for efficient ST re-randomization.

The ST register is a 64-bit register divided into two 32-bit chunks, ψ and ϕ. The first chunk, ψ, acts as a key for the keyed remapping functions, making the BPU mapping unique for each process. We replace functions 1, 2, 3, and 4 in Figure 2 with STBPU remappings R1..4 accordingly. We add functions Rt and Rp, which are used for the STBPU implementation with the TAGE and Perceptron predictors. Both baseline and STBPU remapping functions reduce input data (address, BHB, GHR bits) into the fixed-size index, tag, and offset used by the BPU to perform lookups. §V-B describes how R1..4,t,p were selected.

TABLE II: I/O bits for baseline and STBPU functions

ID  Baseline input   STBPU input           Output                 Function
1   32 s             32 ψ, 48 s            9 ind, 8 tag, 5 offs   R1(80 → 22)
2   58 BHB           32 ψ, 58 BHB          8 tag                  R2(90 → 8)
3   32 s             32 ψ, 48 s            14 ind                 R3(80 → 14)
4   18 GHR, 32 s     32 ψ, 16 GHR, 48 s    14 ind                 R4(96 → 14)
t   48 s, L(GHR)     32 ψ, 48 s, L(GHR)    10/13 ind, 8/12 tag    Rt(80↑ → 25)
p   48 s             32 ψ, 48 s            10 ind                 Rp(80 → 10)

L(GHR) represents a geometric series of global history lengths.

Additionally, these functions utilize the entire 48-bit virtual address, unlike legacy functions that use truncated address bits as inputs. This is crucial in order to prevent same-address-space attacks. Table II summarizes all input/output bit changes between the baseline and STBPU models.

We use an XOR scheme to encrypt data stored in BPU structures to stop attackers from redirecting execution to a desired speculative gadget even if collisions occur. Instead, speculative execution will be redirected to a random (and likely invalid) address. In STBPU, every entry stored in the BTB and RSB is XORed with the ϕ of the current process. Note that the baseline BPU stores only 32 bits of target addresses, so the 32-bit ϕ is sufficient for encrypting all stored bits. We use simple XOR encryption for two reasons: 1. XOR operations are extremely fast with a trivial hardware implementation, and 2. re-randomization of STs makes simple XOR encryption an effective attack mitigation (discussed in §VI). To decrypt data in the BTB and RSB, we modify function 5, which XORs the target bits with ϕ before extending them to a 48-bit address.

... and need to reverse-engineer the rest of the address bits. Besides, knowing their own STs does not give attackers immediate access to collision creation or simplify collision-based attacks. In §VI, we show that the number of mispredictions and evictions attackers must incur to successfully infer an ST (to deterministically poison the BTB) far exceeds the thresholds that trigger ST re-randomization. Thus, encrypting with a more advanced cipher would not increase the level of security. 2) On the other hand, more sophisticated encryption schemes introduce significant delays in the CPU frontend. For instance, we explored using PRINCE-64 [11] to encrypt stored targets. While fast, PRINCE-64 will still consume multiple clock cycles and more energy due to its higher number of gates compared to a simple sub-single-cycle XOR operation. 3) Moreover, different input/output bit requirements for various BPU mechanisms make it challenging to utilize a single hash.

Therefore, we propose an automated remapping function generation and optimization approach, enabling STBPU to compose R1..4,t,p with all suitable properties.

V. DEVELOPMENTAND OPTIMIZATION OF REMAPPING B. Automation of Remapping Development

In §IV, we defined remapping functions R1..4,t,p which Automated Remap Generation Algorithm. Designing these replace the methods of calculating indices, tags, and offsets remapping mechanisms is a multi-variable optimization prob- for lookup purposes in the baseline BPU model. Remapping lem. Therefore, we use a weighted randomization algorithm to functions R1..4,t,p can be thought of as non-cryptographic hash generate possible remapping functions. Our algorithm takes functions. Given the size constraints of the BPU structures, in a list of hardware constraints, and generates remappings collisions between different inputs to functions R1..4,t,p will that satisfied the supplied constraints. The algorithm contains occur; this fact prevents functions R1..4,t,p from providing a predetermined pool of primitives to use for construction of cryptographic security, regardless of implementation. This potential remapping mechanisms. Each remapping function is inherent weakness is remedied with periodic re-randomization iteratively generated and tested one layer at a time, where a of STs; the security of such re-randomizations are discussed in layer is a block of these primitives. After a layer is added, §VI. The mapping functions used in the baseline model are not the current function is tested against the supplied constraints. fully reverse engineered, but we can safely assume some fast There are three possible scenarios that occur during each compression functions are used with delays of no more than 1 round of testing. 1) The current design satisfies all constraints, clock cycle. Using performance and security as our guides, we and subsequently stored for later optimization. 2) The current placed several important constraints upon functions R1..4,t,p: design violates one or more constraints, and is discarded. 3)

C1 The compute delay for R1..4,t,p must not exceed C clock The current design does not outright violate the constraints, but cycles, where C may vary from CPU to CPU. For our is incomplete. In case 3, our algorithm changes the weights purposes, we chose C to be 1 clock cycle. We enforce used for primitive selection during the creation of the next this by limiting transistor critical path of each function. layer to better improve the current design. C2 The outputs of R1..4,t,p should be uniformly distributed Constraint Selection of C1. Our algorithm requires an input across their respective output spaces. (uniformity) of several variable constraints for the generated remapping C3 The outputs R1..4,t,p must appear to be pseudo-random, functions to satisfy C1. These constraints are: the maximum and the relationship between inputs and outputs should count of transistors along the critical path, the maximum be non-linear. (avalanche effect [23]) number of transistors in parallel (breadth), the maximum number of total transistors for the design, the number of input A. STBPU Remapping v.s. Existing Ciphers and output pins, the maximum number of functional layers We analyzed existing hardware supported hashing mecha- (blocks) the design can have, and the maximum number of nisms, but found none that satisfied our specific requirements wires an arbitrary wire can cross over. for three main reasons: 1) Using stronger ciphers does not Modern processors are designed to perform 15-20 gate directly translate into better security. Existing ciphers are operations in a single cycle [60], which translates to roughly primarily designed to withstand known plaintext/ciphertext 30-45 transistors along the critical path. The delay incurred attacks. 
However, STBPU threat model is much different by each transistor in the critical path is relatively independent as attackers never observe encrypted addresses (ciphertext) of the CPU clock cycle; therefore, the faster the CPU’s nor partially matched plaintext/ciphertext. They only observe clock cycle, the smaller the number of transistors that can collisions (not knowing with their own or victim’s branch) be completed within 1 clock cycle. Therefore, 45 transistors

is the absolute maximum number of transistors we allow in the critical path with preference set for shorter critical paths.

Primitive Selection. Much research has been conducted into cryptographic hash primitives [9], [10], [37], [86], [87] that provide building blocks for hash functions with strong properties. We leverage these primitives from SPONGENT [10] and PRESENT [9] hashes. Out of those, S-boxes (establishing non-linearity by substitutions) are perhaps most critical. To increase the simplicity of remapping function generation, we separate primitives into two categories: non-invertible compression primitives and mixing primitives.

Non-invertible primitives tend to employ XOR logic gates to obfuscate the relationship between input and output. For many such primitives, multiple inputs generate the same, smaller output which makes reverse-engineering difficult. Combining multiple non-invertible layers increases complexity of attacks aiming to pair a known output to an unknown input. These primitives compress input size |m| to an output size |n| where |m| > |n|. Table II shows the disparity between the input and output sizes for R1..4,t,p functions, and indicates the need for optimized compression primitives. Mixing primitives are primarily used to introduce non-linearity to a hash design which makes deterministically changing the output by varying the input difficult. These primitives are primarily composed of |m| ↦ |m| sized S-boxes and P-boxes (performing permutations). Since the hardware complexity of S-boxes increases superlinearly with the size of |m|, we limit our S-boxes to a maximum of 4 input/outputs. These S-boxes can be implemented efficiently with combinatorial logic or transistor/diode matrices. P-boxes are constrained by the maximum wire crossover set for the algorithm.

Validation of Uniformity (C2) and Avalanche Effect (C3). Remappings that satisfy the hardware constraints are then tested against constraints C2 and C3. We first employ the balls and bins analysis and compute the coefficient of variation (CV) of bins to approximate the uniformity (C2) of the output space [62]. C3 is satisfied when a remapping adheres to a strict avalanche criterion. To quantify the avalanche effect of F, for each input λ, we generate a set of unique inputs, S, where each input in S differs from λ by a single bit flip. We then compute the hamming distance between F(λ) and F(Si), for all inputs in S. Using these hamming distances, we determine the CV of the hamming distances for a particular λ. We test each F with 1 million random inputs and compute the average hamming distance for all inputs. The ideal case occurs when: i) the average hamming distance over 1 million random inputs is roughly 50%. ii) For all inputs, the CV of the average hamming distance for each input is 0. iii) For all bit positions of an output of F, the difference between the min and max hamming distances for a bit flip in any bit position is 0.

C. Optimization and Remapping Selection

The final selection of remapping functions R1..4,t,p is primarily based upon the results of the previous tests. The result is a multiobjective optimization problem where the ideal state for different desired metrics may be maximized or minimized. To make all metrics comparable, we normalized each metric so that the optimal value is 0. We then considered this to be a simple weighted optimization problem where we seek functions that yield the lowest sum of all metrics recorded when testing for uniformity and the avalanche effect. Let F be a particular function in the group of potential functions G for remapping function Ri, for i ∈ R1..4,t,p:

$$\min \sum_{i}^{k} w_i\, g(F), \quad F \in G \tag{1}$$

All weights were set to 1 to avoid prioritizing one metric over another. Further prioritizing then can be done by hardware developers for a specific CPU design. For space reasons, we do not show the designs for all of R1..4,t,p since they share many similar characteristics. Instead, we show the chosen design for R1 in Figure 3 where stages 1, 3, and 5 are substitution layers using 4 ↦ 4 and 3 ↦ 3 S-boxes. For space reasons, not all types of S-boxes are shown. Under the design of R1, we show the logical mappings for S-boxes used by PRESENT and SPONGENT. P-boxes are n ↦ n in size with the pin mappings generated randomly by our remap function generator. C-S boxes are compression structures that map |m| bits to an output size of |n| bits where |m| > |n|. This design of R1 has a critical path length of 36 transistors, so it is capable of being computed within a single clock cycle.

[Fig. 3: R1 remapping function construction]

VI. SECURITY ANALYSIS

As introduced in §III, an attacker has complete working knowledge of all STBPU remapping functions, full control of execution flow, and is capable of executing branches to/from any address in the attacker's address space. The goal of the attack is to enable a malicious branch instruction collision that allows the attacker to mount one of the attacks presented in §II. STBPU makes collisions non-deterministic, forcing the attacker to either rely on a brute force approach to find collisions or to reverse-engineer the ST value. We assume the attacker utilizes recently proposed fast attack algorithms such as GEM [61] and PPP [59], designed to attack randomized caches.

TABLE III: Token Knowledge Effects on BPU Attacks
     Spec. V2   Spec. RSB   SASA   BTB RB   BTB EB   BranchScope
ψ
ϕ

An attacker possessing knowledge of their ST (ψ/ϕ) voids the security provided by the STBPU because they can deterministically generate outputs with any of the remapping

functions used by the STBPU. Table III shows which attacks benefit from knowledge of ST. Table IV consists of a few terms used in the analysis. Branch collisions are unavoidable due to the extreme compression required by R1..4,t,p. This weakness is alleviated with periodic rerandomization of ST for individual processes. Several important axioms to consider:

A1 Attackers do not know the numerical outputs of R1..4,t,p.
A2 Due to A1, all knowledge of the current state of the STBPU must come from detection of mispredictions and evictions.
A3 Attacker does not have inherent knowledge or control of ST of any process.

TABLE IV: Useful IDs and Descriptions (Name: Description)
A: Branch in attacker(A)'s address space
V: Branch in victim(V)'s address space
ψa/v: A/V R() 32-bit token
ϕa/v: A/V target encryption token
τQ: Target of arbitrary branch Q
EQ: Entry stored for arbitrary branch Q
Wstruct: Number of ways
Istruct: Number of sets (indices)
Tstruct: Entry tag bit entropy
Ostruct: Entry offset bit entropy
Ωstruct: Entry target bit entropy

A. STBPU Attacks

The attacks described in §II become considerably more complex to execute without knowledge of ST used during execution of the attacker or the victim. These attacks can be mitigated with targeted rerandomization of ST after a certain number of mispredictions or evictions in the BPU have occurred due to A2.

1) Target Injection Attacks: Recall that we encrypt the targets stored in the STBTB and STRSB through the following means: $E_A = \phi_a \oplus \tau_A$. With Spectre V2, the attacker supplies a malicious $\tau_A$ using branch A that collides with the victim's branch V causing V to speculate with $\tau_A$. With SpectreRSB, the attacker places a malicious return address $\tau_A$ on the stack that the victim speculates with. In both cases, the target the victim will use from the STBTB or STRSB is now $\tau_V = \phi_a \oplus \tau_A \oplus \phi_v$. If there is a Spectre gadget located in the victim's address space at address G, the attack is successful if $\tau_V = G$. Due to A3, the attacker does not have knowledge or control of $\phi_a$ or $\phi_v$; consequently, the only variable the attacker can change is the address of $\tau_A$ to make $\tau_V = G$. The probability that $\tau_A$ results in $\tau_V = G$ is $\frac{1}{\Omega_{STBTB}}$ or $\frac{1}{\Omega_{STRSB}}$. As such, the attacker must execute $\frac{\Omega_{STBTB}}{2}$ or $\frac{\Omega_{STRSB}}{2}$ different $\tau_A$ values to have a 50% chance of successfully executing their target injection attack. Each incorrect $\tau_A$ will result in the misprediction counter decrementing towards zero.

2) Reuse-based Attacks: Address mappings are randomized so that there is only a probability that an arbitrary A and V will collide in the STBPU. Even though A and V are mapped with R1..4,t,p, the probability that attacker branch A collides with victim branch V in the STBTB/STPHT is not bound by birthday complexity because V is a static, specific address. The probability of collision is $P(A \Rightarrow V) = \left(\frac{1}{I}\right)\left(\frac{1}{TO}\right)$. Note, we break up the probability that A and V are in the same set vs. the probability that A and V have matching tag and offsets because tag/offset comparisons are only done if A and V are in the same set. This adds uncertainty for reuse-based side channels where the attacker wishes to determine the direction of V since a lack of misprediction by A or V could mean that A and V do not collide, or that V was not taken. To increase the probability that an arbitrary A collides with a static V, the attacker can execute a set of branches $S_B = \{b_1, ..., b_n\}$ where n is large so that one branch in $S_B$ might collide with V. The probability that one of the branches in $S_B$ collides with V is $P(S_B \Rightarrow V) = \sum_{i=1}^{n} P(S_{B_i} \Rightarrow V)$. However, noise is added using this method because it is possible that branches in $S_B$ will collide with each other. The probability that two branches in $S_B$ collide can be approximated with birthday complexity because the branches in $S_B$ are arbitrary. In order to ensure that no branches in $S_B$ collide with any other branch in $S_B$, the attacker executes the following steps: 1) $b_{new}$ = new address in attacker's address space. 2) For every branch $b_i$ in $S_B$, execute $b_i$ and $b_{new}$. 3) If no misprediction occurs between $b_i$ and $b_{new}$, $S_B = S_B \cup \{b_{new}\}$. In order to achieve a 50% probability of collision between A and a branch in $S_B$, the size of $S_B$ must be $\frac{ITO}{2}$. The number of mispredictions M and evictions E generated whilst generating $S_B$ of size $n = \frac{ITO}{2}$ can be approximated as follows:

$$M \approx \sum_{i=0}^{n} \sum_{j=0}^{j=i} \frac{1}{\sqrt{\frac{\pi}{2} I}} \cdot \frac{1}{\sqrt{\frac{\pi}{2} TO}} = \frac{n(n+1)}{2\sqrt{\frac{\pi}{2} I} \cdot \sqrt{\frac{\pi}{2} TO}}, \qquad E \approx \frac{ITO}{2} - IW \tag{2}$$

Note the reuse-based side channel attacks on PHT do not generate evictions. The size of the STBTB is $IW$, which is significantly smaller than $\frac{ITO}{2}$, so entries in the STBTB will constantly be evicted as the attacker grows $S_B$.

Attacks such as BranchScope [18] and BlueThunder [25] are viable against processors using hybrid directional predictors largely due to the inclusion of a base directional predictor in these hybrid BPUs. Due to the complexity of TAGE tables and Perceptron weights, it is significantly easier to maliciously modify the base directional predictor than the complex TAGE/Perceptron structures. Since the remapping mechanisms used in our TAGE/Perceptron structures are different from the remapping functions used for the base directional predictor, little information is gained by an attacker observing mispredictions from both the base and complex directional components. Due to A1, an attacker will not know which TAGE bank or Perceptron weight set produced a prediction. The thresholds for rerandomization stemming from mispredictions from the directional predictor are based on the least complex attack on the directional predictor so as to also protect against more complex directional predictor attacks.

3) Same Address Space Attacks: Recently discovered same address space attacks [85] are classified as target injection attacks, but in this case both A and V are located inside the attacker's address space. As such, encrypting the target of A with $\phi_a$ provides no security because V will decrypt $\tau_A$ with $\phi_a$. However, due to Ri, there is only a probability

that A and V will collide; this probability is the same as in §VI-A2. Therefore, the number of mispredictions and evictions generated while performing a same address space attack are also approximated by Equation (2).

4) Eviction-based Attacks: The attacker cannot deterministically create STBTB eviction sets without knowing $\psi_a$ since address mappings change when $\psi_a$ is rerandomized. With $W_{stbtb}$ ways, detecting an eviction in an arbitrary set requires $W_{stbtb} + 1$ colliding branches (same index, different tag and/or offset). The attacker wants to fill STBTB sets so that if V is executed, it disturbs one of the attacker's primed sets. To increase the chances that V will enter a primed set, the attacker must prime as many sets as possible. Assuming the ideal case when the attacker does not have conflicts between their own branches, they need to cover $P \cdot I$ sets to achieve P probability of a successful attack. For example, the probability that A enters the same set as a static V is $\frac{1}{I}$, so to have a 50% chance of priming the set V enters, the attacker must prime $\frac{I}{2}$ sets. Naively, the probability of randomly guessing $W_{stbtb}$ branches to form a single set of branches $S_e$ that enters the same STBTB set is:

$$P(S_e) = \frac{1}{I^{W_{stbtb}-1}} \tag{3}$$

Since this probability is not favorable, the attacker could apply a fast algorithm, GEM [61], to construct every eviction set. The attacker uses GEM because bottom-up strategies like PPP become less efficient without a partitioned randomized structure [59]. We assume the ideal scenario for the attacker is when most of the branches tested follow a perfect uniformity. In this case, given a particular branch, the probability to have W branches belonging to the same set is directly related to the total number of test entries. For instance, there is a 50% probability that in a group of $\frac{IW}{2}$ branches at least W branches share the same index. Thus, in order to achieve P attack rate, the attacker needs to test at least $PIW$ branches as the initial set, i.e., the total attack lines L in GEM (L should be larger than 44 according to GEM). With the original setting in GEM, the attacker sets the group size $G = W + 1$ and starts to eliminate groups of branches. Although the total branch accesses will be approximately $2.3 \cdot W \cdot L$, the total eviction number will be less as the majority of the probes during each iteration will hit. Since the probability that each group will produce an eviction is approximately equal to $1 - \frac{1}{e}$, the evictions generated by testing will be negligible, at $(W + 1) \cdot (1 - \frac{1}{e}) \cdot n$, because the total number of rounds n for GEM to converge on the list of conflicting lines is relatively small. However, when first placing L branches, the attacker has to trigger the same amount of evictions. Summarizing the procedure to construct required eviction sets above, we can now approximate the number of evictions generated whilst building sets for P attack rate as follows:

$$E \approx PI \times \left(PIW + (W + 1) \times \left(1 - \frac{1}{e}\right) \times 3\right) \tag{4}$$

5) Thresholds for Baseline Model: STBPU has the same parameters as the baseline Intel Skylake+ BPU. The STBTB has 8 ways and 512 sets. The stored entries have a compressed 8-bit tag and a 5-bit offset. The STPHT has 1 way and $2^{14}$ sets. Using Equation 2, the number of mispredictions and evictions an attacker will likely generate to perform a successful STBTB RB side channel attack is $6.9 \times 10^8$ and $\approx 2^{21}$, respectively. Correspondingly, for an STPHT RB side channel, the number of triggered mispredictions is ≈838 000. For an STBTB EB side channel, the attacker needs to find $\frac{I}{2}$ eviction sets on average, which generates $\approx 5.3 \times 10^5$ evictions per Equation 4. For Spectre V2 and SpectreRSB, the number of mispredictions generated is $\approx 2^{31}$. To prevent attacks we use the lowest misprediction and eviction thresholds as the upper bounds for rerandomization of ST when evaluating the performance of STBPU.

B. Support for Future Attacks and Defenses

Retaining flexibility in STBPU is not only essential to keep up with evolving attacks in the security arms race but also important to improve performance with advanced defensive additions. Thus, we designate the role of ST management to system software (OS) while providing basic hardware support such as automatic event counters and ST re-randomization. The OS is responsible for setting re-randomization thresholds of each software entity, enabling it to adjust the level of security. For instance, if a process does not require BPU isolation, the OS can disable event counting. Should more optimal BPU attacks be discovered or weaknesses found in other components, the OS can reduce the thresholds, forcing more frequent rerandomizations to neutralize new threats. In the extreme case, the OS can set the threshold to 0 to force re-randomization after every BPU event. Moreover, we suggest combining STBPU with external hardware attack detectors [53]. This can further reduce STBPU cost by updating event counters only when unique attack patterns are detected.

In threat models where the OS is not trusted, such as in the case of SGX enclaves, the role of ST management is given to the enclave entrance routine. In this case, the code running in SGX mode is given permissions to access ST and threshold registers. Before entering the enclave and after exiting it, the ST register is zeroed out. This prevents a rogue OS from reading the enclave's ST as well as a malicious enclave from spying on the ST of other processes. At the same time the enclave can retain ST state between enclave activations, improving performance, and, if necessary, re-randomize the token.

VII. EVALUATION OF STBPU DESIGN

Realistically evaluating BPU design is a challenging task. First, sharing BPU resources creates various possibilities for branch conflicts affecting prediction accuracy. Moreover, in BPU designs with safety mechanisms (e.g., IBRS, IBPB), normal system activity may drastically affect performance. For instance, to avoid BPU state leakage between user processes and kernel, BPU resources need to be flushed upon context/mode switches. As a result, programs that cause frequent mode or context switches may experience worse performance compared to standard benchmark suites that are composed of computation bound applications. Thus, a good evaluation

environment needs to capture system-wide events and include real-world applications. A trace-based simulation is a logical choice for this. Meanwhile, complex performance side effects from branch mispredictions require the use of detailed cycle accurate simulators in order to get accurate performance (e.g. IPC) data. To address both aspects of simulation we evaluate STBPU using two separate simulation frameworks.

[Fig. 4: Comparison of overall STBPU accuracy against other safe BPU models. Y-axis: OAE prediction accuracy normalized by baseline. Series: µcode protection, µcode protection2, conservative, STBPU, each with its average. Workloads: SPEC CPU 2017, Apache2, MySQL, Chrome, and OBS Studio traces.]

[Fig. 5: Gem5 Single Workload Evaluation of STBPUs. Panels: reduction of direction prediction rate compared to unprotected designs; reduction of target prediction rate compared to unprotected designs; normalized IPC over unprotected designs. Series: ST_PerceptronBP, ST_TAGE_SC_L_64KB, ST_SKLCond, ST_TAGE_SC_L_8KB, each with its average.]

First we utilize Intel PT technology to collect large amounts of branch instruction traces captured from all applications within the same physical assemblage, including user applications that cause frequent mode switches and context switches and benchmarks, executed in real time. This is followed by passing these traces through an in-house BPU simulator modeled after the BPU found in Intel Skylake processors. The simulator then reports the BPU accuracy data. Second, in order to evaluate microarchitectural performance effects, we implemented the STBPU mechanisms inside gem5 [8] and conducted simulations in gem5 syscall-emulation (SE) mode using the DerivO3CPU model with configurations that mimic a modern Skylake. The detailed specification is shown in Table V. All gem5 simulations were performed by simulating 110 million instructions with a warm-up of 10 million instructions. We evaluate performance effects of STBPU on two simulators. We use a trace-based simulator to capture system wide effects and then validate our results using a detailed cycle accurate simulator. This approach provides good coverage for evaluating various performance effects.

A. Rerandomization Threshold Optimization

In §VI, we demonstrate the misprediction and eviction thresholds for ST rerandomization when various STBPU attacks have P attack success rate. For BranchScope, to have a 50% chance of success, the number of mispredictions generated is ≈838 000. For an STBTB EB side channel, the number of evictions generated is $\approx 5.3 \times 10^5$. These are the lowest numbers for mispredictions and evictions generated by the attacks discussed in this paper. Indeed, we would like to rerandomize ST well before the attacker has a reasonable probability of a successful attack. To do so, we utilize results from the previously discussed security analysis and derive the rerandomization thresholds as follows: We first denote the attack complexity C as the least number of evictions or mispredictions that the attack needs to trigger in order to succeed with 50% chance. Please note, we use 50% probability rather than 100% (full brute-force) since on average the attacker will succeed with half the number of attempts needed for the full exhaustive key search. Let the variable r be the attack difficulty factor, and Γ be the rerandomization threshold. As such, Γ = r · C. An attack has a 50% success rate when r = 1. For instance, if r = 0.1, the rerandomization thresholds for mispredictions and evictions are 83 000 and 53 000, while they are 41 500 and 26 500 when r = 0.05. We set r to 0.05 and derive rerandomization thresholds from this value since it offers strong security guarantees with low performance impact.

B. STBPU Performance Evaluation

1) Prediction Accuracy with real branch trace: Here, we evaluate the impact on BPU accuracy from STBPU mechanisms and compare it to existing naïve protections modeled

10 avg-reduction-ST_PerceptronBP avg-reduction-ST_TAGE_SC_L_64KB reduction-ST_PerceptronBP reduction-ST_TAGE_SC_L_64KB after microcode protections and based on flushing or partition- avg-reduction-ST_SKLCond avg-reduction-ST_TAGE_SC_L_8KB reduction-ST_SKLCond reduction-ST_TAGE_SC_L_8KB ing BPU resources. It avoids simulating the complex state of 0.05 0.05 0.06 0.06 0.06 0.07 0.06 0.05 0.06 0.07 0.06 0.05 0.06 0.05 microarchitectural components. Instead, it is designed to allow 0.04 0.038 0.03 a rapid testing of BPU models adopted with the STBPU design 0.02 0.019 0.016 0.013 against a variety of real-world scenarios. 0.01 Each instance of simulation collected from an 0.00 0.01 -0.03 -0.01 -0.01 comparing to unprotected designs i7-8550U machine captures traces from a live physical core Reduction of Direction Prediction Rate

lbm_xz nab_xz mcf_xz lbm_mcf leela_mcf lbm_namd cam4_mcf bwaves_wrfleela_namd bwaves_lbm bwavesoms and include any OS/library code executed including naturally bwaves_leela bwaves_cam4 bwaves_namd fotonik3d_mcf fotonik3d_lbm deepsjeng_xz exchange2_nab exchange2_mcf deepsjeng_lbm exchange2_lbmbwaves_povrayfotonik3d_leela exchange2_leela fotonik3d_namd bwaves_fotonik3d exchange2_namd bwaves_cactuBSSN bwaves_deepsjeng bwaves_exchange2 occurring context, mode switches and interrupts. This allows exchange2_fotonik3d avg-reduction-ST_PerceptronBP avg-reduction-ST_TAGE_SC_L_64KB reduction-ST_PerceptronBP reduction-ST_TAGE_SC_L_64KB a realistic simulation of complex cross-process BPU effects avg-reduction-ST_SKLCond avg-reduction-ST_TAGE_SC_L_8KB reduction-ST_SKLCond reduction-ST_TAGE_SC_L_8KB and assessing of how flushing or ST rerandomization affects 0.05 0.090.07 0.05 0.13 0.070.060.06 0.09 0.110.06 0.110.05 0.06 0.05 0.090.060.060.09 0.04 0.037 performance. For single task centralized scenarios, we collect 0.03

0.021 23 SPEC CPU 2017 traces. In addition, we capture traces from 0.02 0.017 0.01 user and server applications, including Apache2 workloads 0.004 0.00 under different prefork settings and Google Chrome traces of 0.01 -0.06 -0.02-0.04 -0.02-0.02 comparing to unprotected designs Reduction of Target Prediction Rate running single or multiple browser workloads, etc. lbm_xz nab_xz mcf_xz lbm_mcf leela_mcf lbm_namd cam4_mcf bwaves_wrfleela_namd bwaves_lbm bwavesoms bwaves_leelabwaves_cam4 bwaves_namd fotonik3d_mcf fotonik3d_lbm deepsjeng_xz exchange2_nab exchange2_mcf deepsjeng_lbm exchange2_lbmbwaves_povrayfotonik3d_leela Introduced in §II-B, our baseline BPU model combines exchange2_leela fotonik3d_namd bwaves_fotonik3d exchange2_namd bwaves_cactuBSSN bwaves_deepsjeng bwaves_exchange2 exchange2_fotonik3d recent reverse-engineering insights of Intel processors [16], avg-norm-ST_PerceptronBP avg-norm-ST_TAGE_SC_L_64KB norm-ST_PerceptronBP norm-ST_TAGE_SC_L_64KB avg-norm-ST_SKLCond avg-norm-ST_TAGE_SC_L_8KB norm-ST_SKLCond norm-ST_TAGE_SC_L_8KB

[18], [33], [36], [46], [85]. We applied the ST protections from 1.050 1.06 1.12 1.05 1.08 1.09 1.09

1.025 §IV-A to the baseline model for STBPU implementation. We 1.009 1.000 0.981 also created two models that mimic the baseline model with 0.975 0.98 0.950 0.951

Intel’s microcode-based protections, namely µcode protection 1 and 2, modeling IBPB+IBRS protection with and without STIBP. Please note that microcode-based protections cannot prevent branch collisions occurring within the same context. To prevent such collisions, more structural BPU changes are required. In particular, instead of storing compressed and truncated addresses in the BTB, the full 48-bit address must be stored. As a result, the number of entries the BTB is capable of storing must be reduced (assuming an unchanged hardware budget). We refer to such a model as conservative, which fully prevents any known collision-based BPU attack by flushing or partitioning. Please note that STBPU achieves the same security level, but does so by customizing data representations inside the BPU, achieving significantly better performance results.

Fig. 6: Gem5 Multi-workload (SMT) Evaluation of STBPU. (Y-axis: normalized IPC (H.means) over unprotected designs; x-axis: workload pairs such as lbm_xz and bwaves_namd.)

The result from simulating the above five models is shown in Figure 4, where we aggregate all the effective predictions into a single metric: overall accuracy effective (OAE), which counts a branch as correctly predicted only if all necessary predictions (target and direction) are correct; otherwise it is counted as mispredicted. Figure 4 shows the overall accuracy of the various BPU models on the SPEC2017 benchmarks and user applications. Combined, STBPU incurred an overall effective prediction accuracy penalty of less than 1.3%. For comparison, the microcode and the conservative BPU models suffer an overall accuracy loss of at least around 12%, with multiple individual cases of nearly 30% reduction. With this, we conclude that microcode protections using flushing or partitioning of the BPU are too heavy-handed [22], [74], [75] and significantly reduce BPU accuracy.

2) Cycle Accurate Evaluation using gem5: Our next evaluation focuses on the comprehensive impact of STBPU on an Out-of-Order (OoO) CPU in terms of cycle-accurate performance, protecting advanced branch predictors, and SMT performance. We tested three advanced BPU models: TAGE-SC-L 8KB, TAGE-SC-L 64KB [70], and PerceptronBP [29]. To demonstrate the consistency of accuracy between gem5 and our previous evaluation, we also ported and tested our baseline model in §VII-B1 as the SKLCond model. We compared the direction prediction accuracy of SKLCond in gem5 with our previous baseline model results on the same SPEC 2017 workloads. We observed a direction prediction difference below 5% on average, which validates our simulator’s consistency.

We treated the aforementioned four BPUs as baseline models and implemented four STBPU models. In single-process evaluations, we simulated each pair of STBPU models and their non-ST counterparts across 18 SPEC2017 workloads. Figure 5 illustrates the reduction in direction and target prediction rates and the normalized IPC between STBPU designs and their non-secure counterparts. We observe that all four STBPU designs achieve less than 2% reduction in average target prediction rate and less than 1.3% reduction in average direction prediction rate. The average IPC reduction of less than 4% demonstrates the high effectiveness of STBPU designs.

ISA    single load simulation: , SMT simulation: Alpha, both at 3.4GHz
BPU    BTB entries: 4096, 8-way, RAS size: 16
Core   8-issue, OoO, IQ/LQ/SQ entries: 64/32/32, ROB: 192, ITLB/DTLB: 64/64
Cache  L1-I/L1-D: 32KB/32KB both 8-way, L2: 256KB 4-way, 16-way LLC
TABLE V: Parameters of the gem5 simulated architecture

We used the same eight BPU models in our gem5 SMT simulations. Instead of running a single workload at a time, we mixed all compatible SPEC workloads in pairs and simulated them together with SMT enabled. To fairly evaluate the STBPU impact on overall throughput, we calculated the harmonic means (Hmeans) of the IPCs, as we value each workload equally [51].
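As a minimal sketch of the throughput metric described above, the harmonic mean of the per-workload IPCs in one SMT pair, and its normalization against an unprotected design, could be computed as follows. The function names and the sample IPC values are hypothetical and purely illustrative (they are not measured results), and dividing the two harmonic means is an assumed reading of "normalized IPC (H.means) over unprotected designs".

```python
from statistics import harmonic_mean


def hmean_ipc(ipcs):
    """Harmonic mean of the per-workload IPCs observed in one SMT run."""
    return harmonic_mean(ipcs)


def normalized_hmean_ipc(protected_ipcs, baseline_ipcs):
    """Hmean IPC of a protected design relative to the unprotected one.

    Assumption: normalization is the ratio of the two harmonic means.
    """
    return hmean_ipc(protected_ipcs) / hmean_ipc(baseline_ipcs)


# Hypothetical IPCs of two co-running workloads (illustrative values only).
baseline = [2.0, 1.0]     # unprotected BPU
protected = [1.9, 0.99]   # same pair with a protected BPU

print(f"baseline Hmean IPC:   {hmean_ipc(baseline):.4f}")
print(f"normalized Hmean IPC: {normalized_hmean_ipc(protected, baseline):.4f}")
```

Because the harmonic mean is dominated by the slower workload, a design that starves one thread of the pair cannot hide that loss behind a high IPC on the other thread.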

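For reference, the overall accuracy effective (OAE) metric reported in Figure 4 can be sketched in the same style. This is an illustration under stated assumptions rather than the simulator's code: the `BranchOutcome` record and its `needs_*` flags are hypothetical names, and the rule that a branch requires only the predictions relevant to it (for example, no direction prediction for an unconditional jump) is an assumed interpretation of "all necessary predictions".

```python
from dataclasses import dataclass


@dataclass
class BranchOutcome:
    needs_direction: bool  # e.g. conditional branches (assumed)
    needs_target: bool     # e.g. taken or indirect branches (assumed)
    direction_ok: bool     # direction prediction matched the outcome
    target_ok: bool        # target prediction matched the outcome


def effectively_correct(b):
    """A branch counts as correct only if every prediction it needs is correct."""
    if b.needs_direction and not b.direction_ok:
        return False
    if b.needs_target and not b.target_ok:
        return False
    return True


def oae(outcomes):
    """Fraction of branches whose necessary predictions were all correct."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("OAE is undefined for an empty trace")
    return sum(effectively_correct(b) for b in outcomes) / len(outcomes)


trace = [
    BranchOutcome(True, True, True, True),    # direction and target correct
    BranchOutcome(True, True, True, False),   # wrong target: mispredicted
    BranchOutcome(False, True, False, True),  # unconditional: only target matters
    BranchOutcome(True, False, False, True),  # wrong direction: mispredicted
]
print(oae(trace))  # 0.5
```

The second record shows why OAE is stricter than a direction-only accuracy: a correct direction with a wrong target is still counted as a misprediction.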
Figure 6 displays the overall IPC and accuracy impacts. In particular, we observed that the ST SKLCond model suffers the most when SMT scenarios introduce more ST rerandomizations, although it still achieves less than 2% throughput reduction. We believe this is because the ST SKLCond model does not have a separate threshold register, as the TAGE models do for TAGE-table mispredictions. This results in more direction mispredictions, as shown in the first chart of Figure 6, which further affect the overall performance. On the other hand, the advanced models retain their efficiency with minimal accuracy reductions and throughput drops.

3) ST Rerand. Overhead vs. Future Attacks: Protections need to be prepared for future challenges. In microarchitecture security, such an arms race involves advanced algorithms [59], [61] and new hardware vulnerabilities [65], [85], potentially improving attack rates by orders of magnitude. Although STBPU can maintain security by adjusting ST rerandomization frequencies (using smaller r settings), the performance cost of doing so is an interesting question. To explore this trade-off, we broaden the STBPU evaluation with a range of aggressive ST rerandomization thresholds, assuming future attacks increase their efficiency by 10 times, 100 times, and even more. To demonstrate an extreme instance, we select an advanced BPU sensitive to branch history and test it under SMT scenarios that are more prone to mispredictions and evictions. Figure 7 shows that the STBPU-protected TAGE-SC-L BPU retains performance above 95% until the rerandomization thresholds trigger an ST update every few hundred mispredictions or evictions, which practically shuts down any BPU training.

We argue that the C2 constraint (uniformity) of remapping prevents STBPU from interfering with the nature of BPUs and applications having high prediction rates. On the other hand, without uncovering both the attacker's and the victim’s STs, an attacker tends to pollute his own program with deliberately generated events like evictions, resulting in more frequent ST updates.

Fig. 7: Performance impacts with aggressive r thresholds of TAGE-SC-L 64KB in gem5 SMT mode, result averaged from 42 combinations of SPEC CPU 2017 workload pairs

VIII. CONCLUSIONS & ACCESSIBILITY

We presented the STBPU, a safe branch prediction design that defends against BPU side channel and speculative execution attacks. We performed a systematization of BPU-related attacks and provided a detailed security analysis of STBPU against the most recent advanced attacks. While retaining security, we demonstrate the STBPU’s high efficiency with both real-world traces and advanced BPU models. We plan to release our simulation tool, gem5 modifications, and more details on STBPU design, evaluations, and hardware estimation.

REFERENCES

[1] O. Acıiçmez, B. B. Brumley, and P. Grabher, “New results on instruction cache attacks,” in International Workshop on Cryptographic Hardware and Embedded Systems. Springer, 2010, pp. 110–124.
[2] O. Acıiçmez, Ç. K. Koç, and J.-P. Seifert, “On the power of simple branch prediction analysis,” in Proceedings of the 2nd ACM symposium on Information, computer and communications security. ACM, 2007, pp. 312–320.
[3] O. Acıiçmez, Ç. K. Koç, and J.-P. Seifert, “Predicting secret keys via branch prediction,” in Cryptographers’ Track at the RSA Conference. Springer, 2007, pp. 225–242.
[4] S. Aga and S. Narayanasamy, “Invisimem: Smart memory defenses for memory bus side channel,” in ACM SIGARCH Computer Architecture News, vol. 45, no. 2. ACM, 2017, pp. 94–106.
[5] M. Alam, H. A. Khan, M. Dey, N. Sinha, R. Callan, A. Zajic, and M. Prvulovic, “One&done: A single-decryption em-based attack on openssl’s constant-time blinded RSA,” in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 585–602.
[6] D. J. Bernstein, “Cache-timing attacks on aes,” 2005.
[7] S. Bhattacharya, C. Maurice, S. Bhasin, and D. Mukhopadhyay, “Branch prediction attack on blinded scalar multiplication,” IEEE Transactions on Computers, vol. 69, no. 5, pp. 633–648, 2019.
[8] N. Binkert, B. Beckmann, G. Black, S. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib Bin Altaf, N. Vaish, M. Hill, and D. Wood, “The gem5 simulator,” SIGARCH Computer Architecture News, vol. 39, pp. 1–7, 08 2011.
[9] A. Bogdanov, L. R. Knudsen, G. Leander, C. Paar, A. Poschmann, M. J. B. Robshaw, Y. Seurin, and C. Vikkelsoe, “Present: An ultra-lightweight block cipher,” in Cryptographic Hardware and Embedded Systems - CHES 2007, P. Paillier and I. Verbauwhede, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 450–466.
[10] A. Bogdanov, M. Knežević, G. Leander, D. Toz, K. Varıcı, and I. Verbauwhede, “spongent: A lightweight hash function,” in Cryptographic Hardware and Embedded Systems – CHES 2011, B. Preneel and T. Takagi, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 312–325.
[11] J. Borghoff, A. Canteaut, T. Güneysu, E. B. Kavun, M. Knezevic, L. R. Knudsen, G. Leander, V. Nikov, C. Paar, C. Rechberger et al., “Prince – a low-latency block cipher for pervasive computing applications,” in International Conference on the Theory and Application of Cryptology and Information Security. Springer, 2012, pp. 208–225.
[12] C. Canella, J. V. Bulck, M. Schwarz, M. Lipp, B. von Berg, P. Ortner, F. Piessens, D. Evtyushkin, and D. Gruss, “A systematic evaluation of transient execution attacks and defenses,” in 28th USENIX Security Symposium (USENIX Security 19). Santa Clara, CA: USENIX Association, Aug. 2019, pp. 249–266. [Online]. Available: https://www.usenix.org/conference/usenixsecurity19/presentation/canella
[13] G. Chen, S. Chen, Y. Xiao, Y. Zhang, Z. Lin, and T. H. Lai, “Sgxpectre attacks: Leaking enclave secrets via speculative execution,” arXiv preprint arXiv:1802.09085, 2018.
[14] B. Coppens, I. Verbauwhede, K. De Bosschere, and B. De Sutter, “Practical mitigations for timing-based side-channel attacks on modern x86 processors,” in Security and Privacy, 2009 30th IEEE Symposium on. IEEE, 2009, pp. 45–60.

[15] M. Evers, P.-Y. Chang, and Y. N. Patt, “Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches,” in ACM SIGARCH Computer Architecture News, vol. 24, no. 2. ACM, 1996, pp. 3–11.
[16] D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh, “Jump over aslr: Attacking branch predictors to bypass aslr,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 2016, pp. 1–13.
[17] D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh, “Understanding and mitigating covert channels through branch predictors,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 13, no. 1, p. 10, 2016.
[18] D. Evtyushkin, R. Riley, N. C. Abu-Ghazaleh, D. Ponomarev et al., “Branchscope: A new side-channel attack on directional branch predictor,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2018, pp. 693–707.
[19] Q. Ge, Y. Yarom, D. Cock, and G. Heiser, “A survey of microarchitectural timing attacks and countermeasures on contemporary hardware,” Journal of Cryptographic Engineering, vol. 8, no. 1, pp. 1–27, 2018.
[20] B. Grayson, J. Rupley, G. Z. Zuraski, E. Quinnell, D. A. Jiménez, T. Nakra, P. Kitchin, R. Hensley, E. Brekelbaum, V. Sinha et al., “Evolution of the samsung exynos cpu microarchitecture,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 40–51.
[21] M. Guarnieri, B. Köpf, J. F. Morales, J. Reineke, and A. Sánchez, “Spectector: Principled detection of speculative information flows,” in 2020 IEEE Symposium on Security and Privacy (SP). IEEE, 2020, pp. 1–19.
[22] Red Hat, “Controlling the performance impact of microcode and security patches for cve-2017-5754 cve-2017-5715 and cve-2017-5753 using red hat enterprise linux tunables,” Red Hat Customer Portal, https://access.redhat.com/articles/3311301/.
[23] N. Hua, E. Norige, S. Kumar, and B. Lynch, “Non-crypto hardware hash functions for high performance networking asics,” in 2011 ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems, Oct 2011, pp. 156–166.
[24] C. Hunger, M. Kazdagli, A. Rawat, A. Dimakis, S. Vishwanath, and M. Tiwari, “Understanding contention-based channels and using them for defense,” in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 2015, pp. 639–650.
[25] T. Huo, X. Meng, W. Wang, C. Hao, P. Zhao, J. Zhai, and M. Li, “Bluethunder: A 2-level directional predictor based side-channel attack against sgx,” IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 2020, no. 1, pp. 321–347, Nov. 2019. [Online]. Available: https://tches.iacr.org/index.php/TCHES/article/view/8401
[26] M. S. Inci, B. Gulmezoglu, G. Irazoqui, T. Eisenbarth, and B. Sunar, “Cache attacks enable bulk key recovery on the cloud,” in International Conference on Cryptographic Hardware and Embedded Systems. Springer, 2016, pp. 368–388.
[27] Intel, “Intel analysis of speculative execution side channels,” January 2018.
[28] Intel, “Intel 64 and ia-32 architectures optimization reference manual,” Intel Corporation, May, 2020.
[29] D. A. Jiménez and C. Lin, “Dynamic branch prediction with perceptrons,” in Proceedings of the 7th International Symposium on High-Performance Computer Architecture, ser. HPCA ’01. USA: IEEE Computer Society, 2001, p. 197.
[30] K. N. Khasawneh, E. M. Koruyeh, C. Song, D. Evtyushkin, D. Ponomarev, and N. B. Abu-Ghazaleh, “Safespec: Banishing the spectre of a meltdown with leakage-free speculation,” CoRR, vol. abs/1806.05179, 2018. [Online]. Available: http://arxiv.org/abs/1806.05179
[31] T. Kim, M. Peinado, and G. Mainar-Ruiz, “Stealthmem: System-level protection against cache-based side channel attacks in the cloud,” in USENIX Security symposium, 2012, pp. 189–204.
[32] V. Kiriansky and C. Waldspurger, “Speculative buffer overflows: Attacks and defenses,” arXiv preprint arXiv:1807.03757, 2018.
[33] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and Y. Yarom, “Spectre attacks: Exploiting speculative execution,” arXiv preprint arXiv:1801.01203, 2018.
[34] P. Kocher, J. Jaffe, and B. Jun, “Differential power analysis,” in Annual International Cryptology Conference. Springer, 1999, pp. 388–397.
[35] E. M. Koruyeh, K. Khasawneh, C. Song, and N. Abu-Ghazaleh, “Spectre returns! speculation attacks using the return stack buffer,” arXiv preprint arXiv:1807.07940, 2018.
[36] E. M. Koruyeh, K. N. Khasawneh, C. Song, and N. B. Abu-Ghazaleh, “Spectre returns! speculation attacks using the return stack buffer,” CoRR, vol. abs/1807.07940, 2018. [Online]. Available: http://arxiv.org/abs/1807.07940
[37] G. Leander and A. Poschmann, “On the classification of 4 bit s-boxes,” in Proceedings of the 1st International Workshop on Arithmetic of Finite Fields, ser. WAIFI ’07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 159–176. [Online]. Available: https://doi.org/10.1007/978-3-540-73074-3_13
[38] J. Lee, Y. Ishii, and D. Sunwoo, “Securing branch predictors with two-level encryption,” vol. 17, no. 3, Aug. 2020. [Online]. Available: https://doi.org/10.1145/3404189
[39] S. Lee, M.-W. Shih, P. Gera, T. Kim, H. Kim, and M. Peinado, “Inferring fine-grained control flow inside sgx enclaves with branch shadowing,” in 26th USENIX Security Symposium, USENIX Security, 2017, pp. 16–18.
[40] T. S. Lehman, A. D. Hilton, and B. C. Lee, “Poisonivy: Safe speculation for secure memory,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 2016, p. 38.
[41] P. Li, L. Zhao, R. Hou, L. Zhang, and D. Meng, “Conditional speculation: An effective approach to safeguard out-of-order execution against spectre attacks,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 264–276.
[42] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, “Meltdown,” arXiv preprint arXiv:1801.01207, 2018.
[43] F. Liu, Q. Ge, Y. Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee, “Catalyst: Defeating last-level cache side channel attacks in cloud computing,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 2016, pp. 406–418.
[44] F. Liu, Y. Yarom, Q. Ge, G. Heiser, and R. B. Lee, “Last-level cache side-channel attacks are practical,” in Security and Privacy (SP), 2015 IEEE Symposium on. IEEE, 2015, pp. 605–622.
[45] J. M, “Intel® digital random number generator (drng) software implementation guide,” Oct 2019. [Online]. Available: https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide
[46] G. Maisuradze and C. Rossow, “ret2spec: Speculative execution using return stack buffers,” CoRR, vol. abs/1807.10364, 2018. [Online]. Available: http://arxiv.org/abs/1807.10364
[47] G. Maisuradze and C. Rossow, “Speculose: Analyzing the security implications of speculative execution in cpus,” arXiv preprint arXiv:1801.04084, 2018.
[48] S. Mangard, “A simple power-analysis (spa) attack on implementations of the aes key expansion,” in International Conference on Information Security and Cryptology. Springer, 2002, pp. 343–358.
[49] C. Maurice, C. Neumann, O. Heen, and A. Francillon, “C5: cross-cores cache covert channel,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2015, pp. 46–64.
[50] T. S. Messerges, E. A. Dabbish, and R. H. Sloan, “Examining smart-card security under the threat of power analysis attacks,” IEEE Transactions on Computers, vol. 51, no. 5, pp. 541–552, 2002.
[51] P. Michaud, “Demystifying multicore throughput metrics,” IEEE Computer Architecture Letters, vol. 12, no. 2, pp. 63–66, 2012.
[52] P. Michaud, A. Seznec, and R. Uhlig, “Trading conflict and capacity aliasing in conditional branch predictors,” in ACM SIGARCH Computer Architecture News, vol. 25, no. 2. ACM, 1997, pp. 292–303.
[53] S. Mirbagher-Ajorpaz, G. Pokam, E. Mohammadian-Koruyeh, E. Garza, N. Abu-Ghazaleh, and D. A. Jiménez, “Perspectron: Detecting invariant footprints of microarchitectural attacks with perceptron,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 1124–1137.
[54] H. Naghibijouybari and N. Abu-Ghazaleh, “Covert channels on gpgpus,” Computer Architecture Letters, 2016.
[55] S. B. Ors, F. Gurkaynak, E. Oswald, and B. Preneel, “Power-analysis attack on an asic aes implementation,” in Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on, vol. 2. IEEE, 2004, pp. 546–552.
[56] D. A. Osvik, A. Shamir, and E. Tromer, “Cache attacks and countermeasures: the case of aes,” in Cryptographers’ Track at the RSA Conference. Springer, 2006, pp. 1–20.

[57] C. Percival, “Cache missing for fun and profit,” 2005.
[58] A. Prout, W. Arcand, D. Bestor, B. Bergeron, C. Byun, V. Gadepally, M. Houle, M. Hubbell, M. Jones, A. Klein et al., “Measuring the impact of spectre and meltdown,” arXiv preprint arXiv:1807.08703, 2018.
[59] A. Purnal, L. Giner, D. Gruss, and I. Verbauwhede, “Systematic analysis of randomization-based protected cache architectures,” in 42nd IEEE Symposium on Security and Privacy.
[60] M. K. Qureshi, “Ceaser: Mitigating conflict-based cache attacks via encrypted-address and remapping,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2018, pp. 775–787.
[61] M. K. Qureshi, “New attacks and defense for encrypted-address cache,” in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2019, pp. 360–371.
[62] M. Raab and A. Steger, “‘balls into bins’ - a simple and tight analysis,” in Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science, ser. RANDOM ’98. Berlin, Heidelberg: Springer-Verlag, 1998, pp. 159–170. [Online]. Available: http://dl.acm.org/citation.cfm?id=646975.711521
[63] H. Raj, R. Nathuji, A. Singh, and P. England, “Resource management for isolation enhanced cloud services,” in Proceedings of the 2009 ACM workshop on Cloud computing security. ACM, 2009, pp. 77–84.
[64] M. Ramsay, C. Feucht, and M. H. Lipasti, “Exploring efficient smt branch predictor design,” in Workshop on Complexity-Effective Design, in conjunction with ISCA, vol. 26, 2003.
[65] X. Ren, L. Moody, M. Taram, M. Jordan, D. M. Tullsen, and A. Venkat, “I see dead µops: Leaking secrets via intel/amd micro-op caches.”
[66] B. Saltaformaggio, D. Xu, and X. Zhang, “Busmonitor: A hypervisor-based solution for memory bus covert channels,” Proceedings of EuroSec, 2013.
[67] M. Schwarz, M. Schwarzl, M. Lipp, and D. Gruss, “Netspectre: Read arbitrary memory over network,” 2018.
[68] A. Seznec and P. Michaud, “A case for (partially) tagged geometric history length branch prediction,” Journal of Instruction-level Parallelism - JILP, vol. 8, 01 2006.
[69] A. Seznec, “A new case for the tage branch predictor,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: Association for Computing Machinery, 2011, p. 117–127. [Online]. Available: https://doi.org/10.1145/2155620.2155635
[70] A. Seznec, “Tage-sc-l branch predictors again,” 2016.
[71] N. A. Simakov, M. D. Innus, M. D. Jones, J. P. White, S. M. Gallo, R. L. DeLeon, and T. R. Furlani, “Effect of meltdown and spectre patches on the performance of hpc applications,” arXiv preprint arXiv:1801.04329, 2018.
[72] M. Taram, A. Venkat, and D. Tullsen, “Context-sensitive fencing: Securing speculative execution via microcode customization,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 395–410.
[73] C. Trippel, D. Lustig, and M. Martonosi, “Meltdownprime and spectreprime: Automatically-synthesized attacks exploiting invalidation-based coherence protocols,” arXiv preprint arXiv:1802.03802, 2018.
[74] L. Tung, “Linus torvalds: After big linux performance hit, spectre v2 patch needs curbs,” ZDNet, https://www.zdnet.com/article/linus-torvalds-after-big-linux-performance-hit-spectre-v2-patch-needs-curbs/.
[75] L. Tung, “Windows 10 will banish spectre slowdowns with google’s retpoline patch,” ZDNet, https://www.zdnet.com/article/windows-10-will-banish-spectre-slowdowns-with-googles-retpoline-patch/.
[76] V. Varadarajan, T. Ristenpart, and M. M. Swift, “Scheduler-based defenses against cross-vm side-channels,” in USENIX Security Symposium, 2014, pp. 687–702.
[77] I. Vougioukas, N. Nikoleris, A. Sandberg, S. Diestelhorst, B. M. Al-Hashimi, and G. V. Merrett, “Brb: Mitigating branch predictor side-channels,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 466–477.
[78] Z. Wang and R. B. Lee, “New cache designs for thwarting software cache-based side channel attacks,” in ACM SIGARCH Computer Architecture News, vol. 35, no. 2. ACM, 2007, pp. 494–505.
[79] Z. Wang and R. B. Lee, “New cache designs for thwarting software cache-based side channel attacks,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, ser. ISCA ’07. New York, NY, USA: ACM, 2007, pp. 494–505. [Online]. Available: http://doi.acm.org/10.1145/1250662.1250723
[80] Z. Wang and R. B. Lee, “A novel cache architecture with enhanced performance and security,” in Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2008, pp. 83–93.
[81] M. Werner, T. Unterluggauer, L. Giner, M. Schwarz, D. Gruss, and S. Mangard, “Scattercache: Thwarting cache attacks via cache set randomization,” in 28th USENIX Security Symposium (USENIX Security 19). Santa Clara, CA: USENIX Association, Aug. 2019, pp. 675–692. [Online]. Available: https://www.usenix.org/conference/usenixsecurity19/presentation/werner
[82] M. Yan, J. Choi, D. Skarlatos, A. Morrison, C. W. Fletcher, and J. Torrellas, “Invisispec: Making speculative execution invisible in the cache hierarchy,” in Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-51. Piscataway, NJ, USA: IEEE Press, 2018, pp. 428–441. [Online]. Available: https://doi.org/10.1109/MICRO.2018.00042
[83] T.-Y. Yeh and Y. N. Patt, “Two-level adaptive training branch prediction,” in Proceedings of the 24th annual international symposium on Microarchitecture. ACM, 1991, pp. 51–61.
[84] J. Yu, M. Yan, A. Khyzha, A. Morrison, J. Torrellas, and C. W. Fletcher, “Speculative taint tracking (stt): A comprehensive protection for speculatively accessed data,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: Association for Computing Machinery, 2019, p. 954–968. [Online]. Available: https://doi.org/10.1145/3352460.3358274
[85] T. Zhang, K. Koltermann, and D. Evtyushkin, “Exploring branch predictors for constructing transient execution trojans,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 667–682. [Online]. Available: https://doi.org/10.1145/3373376.3378526
[86] W. Zhang, Z. Bao, D. Lin, V. Rijmen, B. Yang, and I. Verbauwhede, “Rectangle: a bit-slice lightweight block cipher suitable for multiple platforms,” Science China Information Sciences, vol. 58, no. 12, pp. 1–15, Dec 2015. [Online]. Available: https://doi.org/10.1007/s11432-015-5459-7
[87] W. Zhang, Z. Bao, V. Rijmen, and M. Liu, “A new classification of 4-bit optimal s-boxes and its application to present, rectangle and spongent,” in Fast Software Encryption, G. Leander, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 2015, pp. 494–515.
