I See Dead µops: Leaking Secrets via Intel/AMD Micro-Op Caches
Xida Ren, University of Virginia ([email protected])
Logan Moody, University of Virginia ([email protected])
Mohammadkazem Taram, University of California, San Diego ([email protected])
Matthew Jordan, University of Virginia ([email protected])
Dean M. Tullsen, University of California, San Diego ([email protected])
Ashish Venkat, University of Virginia ([email protected])

Abstract—Modern Intel, AMD, and ARM processors translate complex instructions into simpler internal micro-ops that are then cached in a dedicated on-chip structure called the micro-op cache. This work presents an in-depth characterization study of the micro-op cache, reverse-engineering many undocumented features, and further describes attacks that exploit the micro-op cache as a timing channel to transmit secret information. In particular, this paper describes three attacks – (1) a same thread cross-domain attack that leaks secrets across the user-kernel boundary, (2) a cross-SMT thread attack that transmits secrets across two SMT threads via the micro-op cache, and (3) transient execution attacks that have the ability to leak an unauthorized secret accessed along a misspeculated path, even before the transient instruction is dispatched to execution, breaking several existing invisible speculation and fencing-based solutions that mitigate Spectre.

I. INTRODUCTION

Modern processors feature complex microarchitectural structures that are carefully tuned to maximize performance. However, these structures are often characterized by observable timing effects (e.g., a cache hit or a miss) that can be exploited to covertly transmit secret information, bypassing sandboxes and traditional privilege boundaries. Although numerous side-channel attacks [1]–[10] have been proposed in the literature, the recent influx of Spectre and related attacks [11]–[19] has further highlighted the extent of this threat, by showing how even transiently accessed secrets can be transmitted via microarchitectural channels.
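To make the notion of an observable timing effect concrete, the sketch below (our illustration, not code from the paper) shows the basic measurement primitive that cache timing channels build on, assuming an x86-64 GCC/Clang toolchain; the helper name time_region is hypothetical.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_lfence (GCC/Clang, x86-64) */

/* Time one execution of a code region with the timestamp counter.
 * The lfence/rdtscp pairing keeps the timed region from overlapping
 * with surrounding instructions in the out-of-order window. */
static inline uint64_t time_region(void (*region)(void)) {
    unsigned aux;
    _mm_lfence();
    uint64_t start = __rdtscp(&aux);
    region();                      /* code whose hit/miss status we probe */
    uint64_t end = __rdtscp(&aux);
    _mm_lfence();
    return end - start;            /* low latency: hit; high latency: miss */
}
```

A receiver repeatedly applies such a primitive and infers one bit per observation from whether the measured latency falls in the hit distribution or the miss distribution.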
This work exposes a new timing channel that manifests as a result of an integral performance enhancement in modern Intel/AMD processors – the micro-op cache.

The x86 ISA implements a variety of complex instructions that are internally broken down into RISC-like micro-ops at the front-end of the processor pipeline to facilitate a simpler backend. These micro-ops are then cached in a small dedicated on-chip buffer called the micro-op cache. This feature is both a performance and a power optimization that allows the instruction fetch engine to stream decoded micro-ops directly from the micro-op cache when a translation is available, turning off the rest of the decode pipeline until a micro-op cache miss occurs. In the event of a miss, the decode pipeline is re-activated to perform the internal CISC-to-RISC translation; the resulting translation penalty is a function of the complexity of the x86 instruction, making the micro-op cache a ripe candidate for implementing a timing channel.

However, in contrast to the rest of the on-chip caches, the micro-op cache has a very different organization and design; as a result, conventional cache side-channel techniques do not directly apply – we cite several examples. First, software is restricted from directly indexing into the micro-op cache, let alone accessing its contents, regardless of the privilege level. Second, unlike traditional data and instruction caches, the micro-op cache is designed to operate as a stream buffer, allowing the fetch engine to sequentially stream decoded micro-ops spanning multiple ways within the same cache set (see Section II for more details). Third, the number of micro-ops within each cache line can vary drastically based on various x86 nuances such as prefixes, opcodes, the size of immediate values, and the fusibility of adjacent micro-ops. Finally, the proprietary nature of the micro-op ISA has resulted in minimal documentation of important design details of the micro-op cache (such as its replacement policies); as a result, their security implications have been largely ignored.

In this paper, we present an in-depth characterization of the micro-op cache that not only allows us to confirm our understanding of the few features documented in Intel/AMD manuals, but further throws light on several undocumented features, such as its partitioning and replacement policies, in both single-threaded and multi-threaded settings.

By leveraging this knowledge of the behavior of the micro-op cache, we then propose a principled framework for automatically generating high-bandwidth micro-op cache-based timing channel exploits in three primary settings – (a) across code regions within the same thread, but operating at different privilege levels, (b) across different co-located threads running simultaneously on different SMT contexts (logical cores) within the same physical core, and (c) two transient execution attack variants that leverage the micro-op cache to leak secrets, bypassing several existing hardware- and software-based mitigations, including Intel's recommended LFENCE.

The micro-op cache as a side channel has several dangerous implications. First, it bypasses all techniques that mitigate caches as side channels. Second, these attacks are not detected by any existing attack or malware profile. Third, because the micro-op cache sits at the front of the pipeline, well before execution, certain defenses that mitigate Spectre and other transient execution attacks by restricting speculative cache updates still remain vulnerable to micro-op cache attacks.

[Fig. 1: x86 micro-op cache and decode pipeline. The BPU predicts the next-instruction address for the instruction fetch unit, which reads 16 bytes per cycle from the cache hierarchy into the predecoder (length decode); up to 6 macro-ops per cycle enter the macro-op queue, and up to 5 fused macro-ops per cycle reach the decoders and the MSROM. Micro-ops are delivered to the Instruction Decode Queue (IDQ) from the decoders, the MSROM (4 micro-ops per cycle), or the 32-set, 8-way micro-op cache (6 micro-ops per cycle).]

Most existing invisible speculation and fencing-based solutions focus on hiding the unintended vulnerable side-effects of speculative execution that occur at the back end of the processor pipeline, rather than inhibiting the source of speculation at the front-end. That makes them vulnerable to the attack we describe, which discloses speculatively accessed secrets through a front-end side channel, before a transient instruction has the opportunity to get dispatched for execution. This eludes a whole suite of existing defenses [20]–[32]. Furthermore, due to the relatively small size of the micro-op cache, our attack is significantly faster than existing Spectre variants that rely on priming and probing several cache sets to transmit secret information, and is considerably more stealthy, as it uses the micro-op cache as its sole disclosure primitive, introducing fewer data/instruction cache accesses, let alone misses.
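To give a flavor of what a micro-op cache disclosure primitive looks like, the sketch below (our illustration under stated assumptions, not the paper's exploit generator) emits a chain of jmp instructions so that executing the chain occupies one way per jump in a single micro-op cache set. The 32-set, 8-way geometry follows Fig. 1, while the choice of instruction-address bits [9:5] as the set index is purely an assumption made for illustration.

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define LINE 32   /* bytes of code mapped to one micro-op cache line */
#define SETS 32   /* sets (per Fig. 1)                               */
#define WAYS 8    /* ways (per Fig. 1)                               */

/* Emit WAYS jmp instructions, each SETS*LINE bytes apart so that,
 * under the assumed [9:5] indexing, every jump lands in the same set.
 * Returns the entry point; call it via ((void (*)(void))entry)(). */
static uint8_t *make_probe_region(int set) {
    uint8_t *buf = mmap(NULL, (WAYS + 1) * SETS * LINE,
                        PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;
    for (int w = 0; w < WAYS; w++) {
        uint8_t *line = buf + (w * SETS + set) * LINE;
        int32_t rel = SETS * LINE - 5;  /* jmp rel32 is 5 bytes long */
        line[0] = 0xE9;                 /* jmp to next same-set line */
        memcpy(line + 1, &rel, sizeof rel);
    }
    buf[(WAYS * SETS + set) * LINE] = 0xC3;  /* ret ends the chain */
    return buf + set * LINE;
}
```

Priming a set means executing enough such regions to fill all of its ways; re-timing one of them with a primitive like time_region above then tells the attacker whether same-set code has since streamed through the micro-op cache.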
In summary, we make the following major contributions.
• We present an in-depth characterization of the micro-op cache featured in Intel and AMD processors, reverse-engineering several undocumented features, including its replacement and partitioning policies.
• We propose mechanisms to automatically generate exploit code that leaks secrets by leveraging the micro-op cache as a timing channel.
• We describe and evaluate four attack variants that exploit the novel micro-op cache vulnerability – (a) a cross-domain same address-space attack, (b) a cross-SMT thread attack, and (c) two transient execution attack variants.
• We comment on the extensibility of existing cache side-channel and transient execution attack mitigations to address the vulnerability we expose, and further suggest potential attack detection and defense mechanisms.

II. BACKGROUND AND RELATED WORK

This section provides relevant background on the x86 front-end and its salient components, including the micro-op cache and other front-end features, in reference to the Intel Skylake and AMD Zen microarchitectures. We also briefly discuss related research on side-channel and transient execution attacks.

A. The x86 Decode Pipeline

Figure 1 shows the decoding process in an x86 processor. In the absence of any optimizations, the instruction fetch unit reads instruction bytes corresponding to a 16-byte code region from the L1 instruction cache into a small fetch buffer every cycle. The predecoder (length decoder) then extracts individual x86 instructions, called macro-ops, from this byte stream; predecoding incurs a penalty of three to six cycles when a length-changing prefix (LCP) is encountered, as the decoded instruction length is different from the default length. As the macro-ops are decoded and placed in the macro-op queue, particular pairs of consecutive macro-op entries may be fused together into a single macro-op to save decode bandwidth.

Multiple macro-ops are read from the macro-op queue every cycle and distributed to the decoders, which translate each macro-op into internal RISC (Reduced Instruction Set Computing)-like micro-ops. The Skylake microarchitecture features: (a) multiple 1:1 decoders that can translate simple macro-ops that only decompose into one micro-op, (b) a 1:4 decoder that can translate complex macro-ops that decompose into anywhere between one and four micro-ops, and (c) a microsequencing ROM (MSROM) that translates more complex microcoded instructions, where a single macro-op can translate into more than four micro-ops, potentially involving multiple branches and loops, taking up several decode cycles. The decode pipeline can provide a peak bandwidth of 5 micro-ops per cycle [33]. In contrast, AMD Zen features four 1:2 decoders and relegates to a microcode ROM when it encounters a complex instruction that translates into more than two micro-ops.

The translated micro-ops are then queued up in an Instruction Decode Queue (IDQ) for further processing by the rest of the pipeline. In a particular cycle, micro-ops can be delivered to the IDQ by the micro-op cache or, if no translation is available there, the more expensive full decode pipeline is turned on and micro-ops are delivered that way; the counter-based sketch at the end of this subsection shows one way to tell the two delivery paths apart.

In Intel architectures, the macro-op queue, IDQ, and the micro-op cache remain partitioned across different SMT threads running on the same physical core, while AMD allows the micro-op cache to be competitively shared between the SMT threads.
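As noted above, micro-ops reach the IDQ either from the micro-op cache or from the legacy decode pipeline. One way to observe which path served a code region is through hardware performance counters; the sketch below (our illustration) uses Linux perf_event with raw Skylake encodings for IDQ.DSB_UOPS (0x0879) and IDQ.MITE_UOPS (0x0479), where DSB is Intel's name for the micro-op cache and MITE for the legacy decode path. Both encodings are assumptions to be checked against the event list for one's own processor.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a self-monitoring counter for a raw PMU event (user mode only).
 * Error handling is omitted for brevity. */
static int open_raw_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_RAW;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void) {
    int dsb  = open_raw_counter(0x0879);  /* IDQ.DSB_UOPS  (assumed) */
    int mite = open_raw_counter(0x0479);  /* IDQ.MITE_UOPS (assumed) */

    ioctl(dsb, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(mite, PERF_EVENT_IOC_ENABLE, 0);

    /* ... execute the code region under study here ... */

    ioctl(dsb, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(mite, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t n_dsb = 0, n_mite = 0;
    read(dsb, &n_dsb, sizeof n_dsb);
    read(mite, &n_mite, sizeof n_mite);
    printf("uops from micro-op cache: %llu, from legacy decode: %llu\n",
           (unsigned long long)n_dsb, (unsigned long long)n_mite);
    return 0;
}
```

On a region that fits in the micro-op cache, the DSB count should dominate after a warm-up run; a MITE-dominated count indicates the region is being re-decoded by the legacy pipeline.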