Improving the Utilization of Micro-operation Caches in x86 Processors

Jagadish B. Kotra, John Kalamatianos
AMD Research, USA. Email: [email protected], [email protected]

Abstract—Most modern processors employ variable length, Complex Instruction Set Computing (CISC) instructions to reduce instruction fetch energy cost and bandwidth requirements. High throughput decoding of CISC instructions requires energy-hungry logic for instruction identification. Efficient CISC instruction execution motivated mapping them to fixed length micro-operations (also known as uops). To reduce costly decoder activity, commercial CISC processors employ a micro-operations cache (uop cache) that caches uop sequences, bypassing the decoder. The uop cache's benefits are: (1) shorter pipeline length for uops dispatched by the uop cache, (2) lower decoder energy consumption, and (3) earlier detection of mispredicted branches. In this paper, we observe that a uop cache can be heavily fragmented under certain uop cache entry construction rules. Based on this observation, we propose two complementary

hidden from the Instruction Set Architecture (ISA) to ensure backward compatibility. Such an ISA-level abstraction enables processor vendors to implement an x86 instruction differently based on their custom micro-architectures. However, this additional step of translating each variable length instruction into fixed length uops incurs high decode latency and consumes more power, thereby affecting the instruction dispatch bandwidth to the back-end [5, 34, 22].

To reduce decode latency and energy consumption of x86 decoders, researchers have proposed caching the decoded uops in a separate hardware structure called a uop cache [51, 36]. A uop cache stores recently decoded uops anticipating temporal reuse [17, 51].
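To make the fetch-path role of the uop cache concrete, the sketch below models a front-end that consults a uop cache before invoking the x86 decoder: a hit dispatches the cached uops and bypasses the decoder, while a miss pays the decode cost and fills the cache for later temporal reuse. This is a minimal illustrative model, not the design studied in this paper; the names (UopCache, decode_x86, fetch) and the LRU policy and sizes are assumptions made for the example.

import collections

class UopCache:
    # Toy uop cache keyed by fetch-block start address, LRU-managed (assumed policy).
    def __init__(self, max_entries=64):
        self.entries = collections.OrderedDict()  # fetch address -> list of uops
        self.max_entries = max_entries

    def lookup(self, fetch_addr):
        uops = self.entries.get(fetch_addr)
        if uops is not None:
            self.entries.move_to_end(fetch_addr)  # refresh LRU position on a hit
        return uops

    def fill(self, fetch_addr, uops):
        if len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)      # evict least recently used entry
        self.entries[fetch_addr] = uops

def decode_x86(fetch_addr):
    # Stand-in for the variable-length x86 decoder: emits fixed-length uops.
    return ["uop@%x+%d" % (fetch_addr, i) for i in range(4)]

def fetch(fetch_addr, uop_cache, stats):
    uops = uop_cache.lookup(fetch_addr)
    if uops is not None:
        stats["uop_cache_hits"] += 1   # decoder bypassed: shorter pipeline, less decode energy
    else:
        stats["decoder_uses"] += 1     # pay decode latency, then cache the uops
        uops = decode_x86(fetch_addr)
        uop_cache.fill(fetch_addr, uops)
    return uops

if __name__ == "__main__":
    stats = {"uop_cache_hits": 0, "decoder_uses": 0}
    cache = UopCache()
    for addr in [0x400, 0x440, 0x400, 0x400, 0x480, 0x440]:  # temporal reuse of fetch blocks
        fetch(addr, cache, stats)
    print(stats)  # {'uop_cache_hits': 3, 'decoder_uses': 3}

In this toy run, half of the fetches hit in the uop cache and skip decoding entirely, which is the effect the abstract summarizes as shorter pipeline length and lower decoder energy for uop-cache-dispatched uops.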