
Mesocode: Optimizations for Improving Fetch Bandwidth of Future Itanium® Processors

Marsha Eng1, Hong Wang1, Perry Wang1, Alex Ramirez2, Jim Fung1, John Shen1
1Microprocessor Research Labs, Santa Clara, CA 95054
2Computer Architecture Department, UPC, Barcelona, Spain

ABSTRACT

We introduce the concept of mesocode, a code format designed to improve the fetch bandwidth efficiency of future Itanium® processors. Mesocode introduces modest changes at the ISA level in an implementation-specific and legacy-free format that is unobtrusive to the original code. For the specific implementation studied in this paper, mesocode is used to express the frequently executed portions of the original code in a more compact and efficient manner, and the optimized mesocode is attached as an appendix to the original code. On a processor supporting mesocode, the mesocode can be executed at run time in place of its counterparts in the original code to enable more efficient use of machine resources. On a processor that does not support mesocode, however, the mesocode is simply never activated and the original code runs canonically, ensuring backward and forward compatibility. This mesocode design is based on streams, sequences of dynamic instructions between two consecutive taken branches. A machine supporting stream-based mesocode can predict, fetch, and execute these streams efficiently. Our studies show that for a set of SPEC2000 benchmarks, less than 10% of the streams cover over 90% of dynamic instructions, indicating that optimizing a small set of streams can potentially improve performance for a large portion of the program execution. For dynamic (a.k.a. out-of-order) processor implementations, front-end fetch bandwidth is critical to performance. Our experiments show that an average performance improvement of 32% can be achieved through mesocode optimization, which both improves the code density of the frequently executed portions of the code and enables more timely and effective instruction prefetches.

1. INTRODUCTION

This paper presents a case study of applying complexity-effective design principles to a particular challenge in implementing future out-of-order Itanium® processors, namely the complexity posed by the instruction encoding format in the Itanium® instruction set architecture (ISA).

In this work, we introduce the concept of mesocode, a code format exhibiting the benefits of both traditional macro-code and micro-code. Like macro-code (i.e. the instruction set architecture), the mesocode format is architecturally visible to post-pass tools. In addition, like micro-code, mesocode is implementation specific and legacy free. In this paper, we focus on a mesocode format designed specifically to improve the fetch bandwidth efficiency of future Itanium® processors; the mesocode instruction format is more compact than the canonical Itanium® ISA.

Our study bases mesocode on streams, which are sequences of dynamic instructions between two consecutive taken branches [24]. Our evaluation shows that about 10% of the unique streams cover over 90% of the dynamic instructions, making them good candidates for mesocode encoding. Optimizations made to a small number of streams therefore benefit the entire program.

The key contribution of this paper is to demonstrate that mesocode can improve fetch bandwidth by reducing the wasteful instructions brought into the pipeline. In addition to benefiting from effective prefetch of streams, the removal of NOPs from streams results in a significant gain, about 32% on an out-of-order machine. The improvement in code density stemming from NOP removal not only increases utilization but also reduces contention in the front-end pipeline for resources such as the instruction queue.
The Itanium® ISA, like traditional VLIW ISAs, suits wide in-order EPIC-styled implementation schemes. For example, multiple instructions are bundled together to form the basic fetch unit, in which a template field is explicitly encoded to provide ancillary information about each instruction inside the bundle. In addition, each instruction is also encoded with a field for the predicate register, due to the use of predication in the ISA. Should the compiler fail to find an independent instruction for a bundle, a NOP is usually used as filler to ensure the bundle conforms to one particular template encoding. Obviously, NOPs, templates, predicated false instructions, and predicate register fields can all contribute to waste of processor resources such as cache footprint, fetch bandwidth, and instruction queue entries. Consequently, this wastage can result in performance loss.

The rest of the paper is organized as follows. In Section 2, we highlight how current architectures facilitate a loss of fetch bandwidth. Section 3 introduces the details of mesocode. Section 4 describes the performance evaluation methodology, including the simulation infrastructure and workload setup. Section 5 presents evaluations of a range of mesocode optimizations and their performance impact. Section 6 reviews related work. Section 7 concludes the paper.

2. INSTRUCTION ENCODING IN THE ITANIUM® ARCHITECTURE

EPIC ISAs, including the Itanium® ISA, use a template-carrying bundle as the atomic unit of fetch and execution. The template makes it possible to decipher the types of instructions in the bundle well before the instructions are decoded. Individual instructions inside a bundle act more like micro-ops and will be referred to as such to avoid confusion. Stop bits are used to express parallelism (for instructions between stop bits) and data dependency (for instructions across stop bits). The Itanium® ISA also includes predication and static branch hints at the micro-op level, which, in conjunction with the stop bits and templates, can be used to express program behavior at a granularity beyond the traditional basic block.

The problem with forcing micro-ops into fixed issue templates is that NOPs are introduced into the code when no usable instructions can be found to fill out the rest of the template. These NOPs dilute the code density and degrade cache and pipeline utilization by taking up valuable space and pipeline resources that could be filled with useful instructions. The effective fetch bandwidth is reduced by the pollution from these wasteful instructions.
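To make the template constraint concrete, the following sketch packs a sequence of micro-op slot types into three-slot bundles under a fixed template set. It is a minimal illustration of ours, not the Itanium® compiler's bundling algorithm, and the template list is an abbreviated subset of the real one (the full ISA defines templates over M, I, F, B, and L+X slots):

    # Abbreviated, illustrative template subset; each template fixes the slot
    # type of all three positions in a bundle.
    TEMPLATES = ["MII", "MMI", "MFI", "MIB", "MMB", "BBB"]

    def pack_bundles(slot_types):
        """slot_types: one entry per micro-op, e.g. ['M', 'I', 'M', 'M']."""
        bundles, i = [], 0
        while i < len(slot_types):
            window = slot_types[i:i + 3]
            # Choose the template matching the longest prefix of pending ops.
            best, used = None, 0
            for tmpl in TEMPLATES:
                n = 0
                while n < len(window) and tmpl[n] == window[n]:
                    n += 1
                if n > used:
                    best, used = tmpl, n
            if best is None:
                raise ValueError("no template accepts slot type " + window[0])
            # Template slots the code cannot fill are padded with NOPs.
            bundles.append((best, window[:used] + ["NOP"] * (3 - used)))
            i += used
        return bundles

For example, pack_bundles(list("MIMM")) yields ('MII', ['M', 'I', 'NOP']) followed by ('MMI', ['M', 'M', 'NOP']): a third of the fetched slots become NOP padding purely because of the template restriction.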
Predication can have the same effect, in that instructions that are predicated false at runtime effectively become NOPs in the dynamic code stream, occupying resources and degrading IPC. Another problem with fixed issue templates is that branch targets are required to be bundle aligned. This can introduce cache line fragmentation when the cache line is bigger than a bundle: when a taken branch or a branch target is not aligned to the cache line, the rest of the cache line is wasted, which reduces effective usage of the fetch bandwidth. These problems can be solved with a flexible instruction encoding that avoids explicit encoding of NOPs and prevents misalignment to the cache line due to branches.

3. MESOCODE

We introduce the concept of mesocode as one means of recharacterizing the code stream in a more efficient manner that addresses the problems of code density, instruction misspeculation, and lack of information about aggregates of micro-ops. Mesocode can be viewed as an intermediate form of binary code representation between macrocode (e.g. the original Itanium® ISA) and the machine-specific internal code (instruction syllables or micro-ops). The idea is to introduce a modest enhancement to the macrocode ISA to express the underlying micro-ops in a legacy-free and less constrained format.
To achieve this in the Itanium® ISA, we use two previously unused templates to demarcate the start and end of mesocoded regions of microcode, which consist of instruction syllables or micro-ops that are not restricted by any other Itanium® templates. The microcoded regions are kept separate as appendices to the original code and are unobtrusive to the original program, as illustrated in Figure 1. In our implementation, the microcode redundantly expresses frequently executed portions of the original code, although these portions are potentially encoded very differently. As far as the original code is concerned, the mesocode is no different from, say, instructions for a co-processor.

Figure 1. Static Mesocode Layout. [Diagram: the original code with its dominant regions highlighted, and a separate mesocode section in which those regions are cache aligned and compressed.]

In our version of mesocode, the code regions are executed speculatively. At the start of the mesocode section is a table mapping the original addresses of the starting instructions to their new realigned instruction addresses in the mesocode section. A monitor at retirement speculatively redirects the instruction pointer to the realigned address on its next invocation. If any instruction in the mesocoded region causes an exception or executes incorrectly, program execution can always fall back to the original code. As a result, the mesocode-enhanced binary is guaranteed to be correct because it does not modify the original code; in the worst case, the original code is executed. Separating the mesocoded sections from the original code also ensures that the binary is backwards compatible.

In addition to the ISA modifications, mesocode requires modest hardware changes to be effective. Mesocoded instructions are the same as Itanium® ISA instruction syllables, so the portions of the core devoted to executing instructions, as well as the caches and memory subsystems, remain the same. The two templates marking the start and end of the microcode regions are used as "escapes" from the original code stream. When the decoder encounters the start template, it switches to a special mesocode decoder that can handle the micro-ops; when it encounters the end template, it switches back to the standard decode mechanism. The fetch mechanism is changed to recognize the new escape templates and fetch continuously until it reaches the end of the mesocoded region. Instruction issue for mesocode regions does not have to check for the template, because it is nonexistent in the mesocode encoding. Within the regions, microcode should be scheduled in such a way that instruction issue does not have to check for data dependencies and can simply issue the instructions. Since this issue mechanism is different from that for the original Itanium® ISA, the two essentially work in parallel; that is, mesocode and original code can coexist with effectively little or no impact on each other. Finally, a small amount of instruction monitor and redirection logic speculatively redirects the instruction pointer to the mesocode upon seeing trigger points from the original code being retired, and back to the original code when execution reaches the end of the mesocode.

There are many ways to find instructions for the mesocoded regions. Our implementation uses a trace-driven post-pass compiler to identify frequently executed streams, that is, contiguous instructions executed between two taken branches; the resulting mesocode is stream-based. Such an offline mechanism allows for easy morphing of the mesocoded regions: for example, it is possible to remove wasteful instructions and insert prefetching hints at specific locations in each region. A runtime mechanism, not studied here, could closely resemble that used to find traces for the trace cache [16] but use the existing memory subsystem, thereby reducing the hardware footprint relative to the trace cache.

Because mesocoded microcode is treated differently from the rest of the code stream, its encoding is effectively unconstrained and does not have to match the original code. Thus there is flexibility within the microcode to reschedule and predecode portions of the instructions, add hints such as prefetching directives, and realign the regions to better suit the instruction fetch mechanism. The different encoding means that microcode does not need the extra template bits used in the Itanium® ISA, which in turn means a potentially more compact encoding and more efficient use of bits. With additional information about the dynamic code stream and issue capabilities, we could further compress the mesocode by eliminating wasteful instructions and realigning the instructions.
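A minimal sketch of the decode-mode switch described above follows. The reserved template numbers, the decoder interfaces, and the (template, payload) bundle representation are hypothetical placeholders; the design above specifies only that two previously unused templates act as escapes:

    MESO_START, MESO_END = 0x1A, 0x1B   # hypothetical reserved template ids

    def decode(bundles, std_decoder, meso_decoder):
        """Yield micro-ops, switching decoders at the escape templates."""
        in_mesocode = False
        for template, payload in bundles:
            if not in_mesocode and template == MESO_START:
                in_mesocode = True        # enter template-free mesocode region
            elif in_mesocode and template == MESO_END:
                in_mesocode = False       # resume canonical Itanium decode
            elif in_mesocode:
                yield from meso_decoder(payload)   # no template to check
            else:
                yield from std_decoder(payload)

In hardware, the same mode bit would steer fetched bytes to one of the two parallel decode mechanisms, which is why mesocode and original code can coexist without affecting each other.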

3.1. Improving the Code Density

As already discussed, removing the template restrictions from the macrocode ISA when defining the mesocode means there are no requirements on how the micro-ops need to be ordered, except for producing correct output. Without the templates, it is possible to address the wasteful instructions that would have appeared in the original macrocode and microcode streams. For this paper we study three major classes of wasteful instructions that degrade code density. Misaligned instructions can be found in most architectures, while NOPs and predicated false instructions are of particular interest to EPIC architectures such as the Itanium® architecture.

3.1.1. Misaligned Instructions

Instructions may be misaligned in one of two ways. First, a taken branch may appear in the middle of a cache line, meaning that control flow jumps out of the cache line when it reaches the end of the basic block; the wasteful instructions in this case appear at the end of the cache line. Second, the target of a taken branch may appear in the middle of a cache line, so control flow ignores the instructions at the start of the cache line; the wasteful instructions appear at the start of the cache line in this case. On the Itanium® architecture, the instruction bundling rules define a bundle as a sequence of three instructions and require that a branch target be bundle aligned. However, if the cache line in a specific processor organization is larger than a bundle, a target bundle can fall in the middle of a cache line. Fetching this target incurs the misalignment problem, since the prior bundle in the same cache line will eventually be dropped, though only after potentially expensive occupancy of pipeline resources.

Given the free-form encoding allowed by mesocode, it is possible to align new microcode regions to the start of the cache line, eliminating the fragmentation due to branch targets. This simple method does not eliminate the fragmentation due to a taken branch at the end of the microcode region; however, by eliminating the fragmentation at the start of the microcode, the total amount of misalignment is reduced. The fragmentation at the end of the cache line is less performance critical, since program flow is expected to leave the line at that point.
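The cost of both kinds of misalignment can be made concrete with a little arithmetic. The sketch below counts the fetch slots lost in a single cache line holding two three-instruction bundles (the fetch configuration assumed in Section 3.1.3); the interface is our own illustrative abstraction:

    LINE_SLOTS = 6   # two 3-instruction bundles per cache line

    def wasted_slots(entry_slot=0, taken_branch_slot=None):
        """Fetch slots lost in one cache line.
        entry_slot: slot where control flow enters (nonzero for a mid-line
        branch target); taken_branch_slot: slot of a taken branch, if any."""
        head = entry_slot                      # slots ignored before entry
        tail = 0
        if taken_branch_slot is not None:      # slots dropped after the branch
            tail = LINE_SLOTS - 1 - taken_branch_slot
        return head + tail

    # Entering at slot 1 and leaving from a taken branch at slot 3 wastes
    # wasted_slots(1, 3) == 1 + 2 == 3 of the 6 slots fetched that cycle.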

3.1.2. NOPs and Predicated False Instructions

Instruction bundles and templates force instructions to be represented in a limited number of combinations to ensure they can be issued in a predefined order. This restriction was introduced into the Itanium® architecture for historical reasons related to the anticipated VLIW implementations of Itanium® processors. Some of the original assumptions, such as the template-guided instruction issue mechanism, are not applicable to dynamic implementation techniques for future Itanium® processors, such as the use of an out-of-order execution core. The restrictive instruction encoding does, however, hurt code density. In particular, NOPs have to be introduced into the instruction flow when no suitable instructions for a particular resource can be found to fill a bundle. Once fetched into the pipeline, these NOPs can occupy potentially performance-critical resources such as expansion queue slots, which could be better used by instructions that perform useful work.

On the Itanium® architecture, instruction execution is dynamically predicated. For an instruction on the correct path, if the predicate is false, the instruction does not result in state write-back; its effect is therefore equivalent to that of a NOP. As a result, predicated false instructions are also wasteful and performance degrading, since they occupy not only front-end resources such as expansion queues and issue slots but also resources at the back end of the pipeline such as execution units and the retirement detection unit, which could otherwise be used by predicated true instructions.

Using the mesocode free-form encoding for the microcode, there are no restrictions on how the microcode must be ordered, so it is straightforward to remove NOPs from the code stream. Removing predicated false instructions is more difficult, because it directly implies a form of predicate prediction: if the prediction is wrong, there must be a recovery mechanism in place to execute the original instruction sequence. As a result, for this study we only examine the effects of NOP and misalignment removal.

3.1.3. Code Density Profile

Figure 2 characterizes the fetch bandwidth inefficiency of Itanium® processors for the selected set of benchmarks. It depicts a dissection of all bundles fetched on the correct path on a machine that can fetch two bundles per cache line per cycle, such as the fetch engine implemented in [19]. Here we assume perfect branch prediction.

Starting from the bottom, the first two categories account for the percentages of NOPs and predicated false instructions, respectively. The next two categories represent misalignment: the third category accounts for fetches lost to misalignment at branch targets, corresponding to fragmentation at the beginning of a cache line, while the fourth category is due to taken branches in the middle of a cache line that leave the rest of the line unused, representing fragmentation at the end of the cache line. The topmost category represents the truly useful instructions that are fetched and eventually retired.

Figure 2. Breakdown of wasteful instructions over entire program execution. [Stacked bars, percent of total instructions per benchmark: gzip (217b), gcc (13b), mcf (14.7b), crafty (2.5b), parser (7.45b), gap (20b), average (50b total inst count); categories, bottom to top: NOP, predicated false, misalignment due to branch target, misalignment due to taken branch, not wasteful.]

On average, nearly 30% of all fetched instructions are wasteful, with NOPs accounting for the majority, misalignment second, and predicated false instructions third. Since branch misprediction has been factored out, this waste in fetch bandwidth is a direct manifestation of the code density problem. It imposes an ultimate limit on the achievable IPC regardless of the implementation specifics of any particular processor. Given a good method of identifying candidate regions to encode using mesocode, it should be possible to come close to removing all of the wasteful instructions from the dynamic code stream.

3.2. Stream-based Mesocode

A stream is a sequence of instructions between two retired taken branches [24]. Streams can be viewed as coarser-grained primitive units of program execution than basic blocks, making them natural candidates for mesocode encoding. Streams leverage the observations that many branches are biased and highly predictable, and that the strongly biased direction can be made the not-taken direction. During program execution, the branches within a stream will likely turn out to be not taken, and complete streams will likely be executed repeatedly. Streams can be constructed at compile time using profiling statistics.

Figure 3. Comparison of average basic block and stream sizes. [Bar chart, number of instructions per basic block and per stream for gzip, gcc, mcf, crafty, parser, gap, and average.]

Figure 3 illustrates the average basic block (instructions per basic block) and stream (instructions per stream) sizes for our benchmarks. On average, streams are about twice the size of basic blocks. As can be seen, Itanium® basic blocks are relatively large (on average 10 instructions per block) due to the use of predication in removing some of the branches. Operating on streams that average 20 instructions each is a real advantage in instruction fetching, i.e. the sequential control flow is only interrupted once every 20 instructions. This phenomenon arises primarily because the original binaries are produced by an advanced compiler [10] capable of sophisticated edge profiling and path profiling geared towards optimizing the frequently traversed paths in hot code. In addition, the application of predication helps remove certain branches from the code. This is the fundamental reason why optimized code, such as that produced for the Itanium® architecture, exhibits a dominant presence of streams.

For the purposes of this study, streams provide an easy way of identifying candidate regions of code for mesocode encoding. Streams are better candidates than basic blocks because their larger size covers more dynamic instructions and provides a larger window for optimizations. The resulting stream-based mesocode has the same characteristics expected of streams.

3.2.1. Stream Coverage and Locality

It is possible to characterize an entire program in terms of basic blocks; the same can be done using streams. A program execution can be broken down into a sequence of streams by segmenting instructions between dynamically taken branches. This trace of execution will contain recurring streams and patterns of streams corresponding to loops and phases of program execution.

Figure 4 illustrates the coverage of the overall dynamic instruction count by the dominant streams. In general, about 10% of the streams cover over 90% of the dynamic instructions, which reflects strong locality of streams and indicates the presence of hotspots in the program. Therefore, optimizing a few streams can improve overall performance. Mesocode primarily targets the 10% of streams with the highest dynamic coverage.

Figure 4. Cumulative stream coverage over entire program execution. [Per-benchmark curves of percent of total instructions versus running dynamic block sum: 164.gzip (217 billion insts), 176.gcc (13 billion), 181.mcf (14.7 billion), 186.crafty (2.5 billion), 197.parser (7.45 billion), 254.gap (20 billion).]
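The construction behind Figure 4 can be summarized in a short sketch that cuts a retired-instruction trace into streams and ranks them by dynamic instruction count. The (pc, taken) tuple format is our own abstraction of the simulator output, not a format defined by the paper:

    from collections import Counter

    def cut_streams(trace):
        """trace: iterable of (pc, is_retired_taken_branch) tuples.
        A stream ends at each retired taken branch, per [24]."""
        streams, current = Counter(), []
        for pc, taken in trace:
            current.append(pc)
            if taken:
                streams[tuple(current)] += 1
                current = []
        return streams

    def coverage_curve(streams):
        """Cumulative fraction of dynamic instructions covered when streams
        are taken hottest-first (the quantity plotted in Figure 4)."""
        total = sum(len(s) * n for s, n in streams.items())
        ranked = sorted(streams.items(),
                        key=lambda kv: len(kv[0]) * kv[1], reverse=True)
        covered, curve = 0, []
        for s, n in ranked:
            covered += len(s) * n
            curve.append(covered / total)
        return curve   # curve[k-1] is the coverage of the top k streams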

To shed light on the potential for code density improvement, Figure 5 characterizes the entire program execution by breaking down the types of dynamic instructions covered and not covered by the dominant streams. For each benchmark, the top five categories represent the percentage of dynamic instructions that are not covered by the dominant streams and would not benefit from stream optimization; on average, these represent less than 10% of all dynamic instructions. The bottom five categories correspond to the instructions that are covered by dominant streams and would benefit from stream-based optimization. On average, about 20% of the dynamic instructions can be removed by code density optimizations. The reductions correspond to the removal of NOPs and misaligned branch targets, which are the fourth and second categories from the bottom, respectively.

Figure 5. Top stream instruction coverage. [Stacked bars per benchmark; each instruction is classified as useful, NOP, predicated false, misaligned branch target, or misaligned taken branch, and as covered or not covered by the dominant streams.]

In targeting only the 10% of streams with the highest dynamic coverage, one would expect the static size of these streams to be relatively small when compared with the original code.

Table 1. Static code section sizes

Benchmark    Original Code Instructions    Mesocode Instructions    Dynamic Instruction Coverage
164.gzip     864,900,587                   8,712                    99.34%
176.gcc      18,819,624,464                123,258                  96.63%
181.mcf      563,742,820                   4,020                    97.89%
186.crafty   3,819,655,366                 35,100                   91.57%
197.parser   2,692,201,257                 21,318                   91.39%
254.gap      1,806,018,040                 12,738                   99.32%

Table 1 shows the static number of instructions covered by the top ten percent highest coverage streams, compared to the number of instructions in the original code. The mesocoded streams are cache aligned but have not been compressed or rescheduled; therefore, the number of instructions reported for the mesocode section includes both the instructions taken from the streams in the original code and the fragmentation at the end of the cache line due to taken branches. The mesocoded section is much smaller than the original section and represents a very small static code increase. Due to their high dynamic instruction coverage and very small static size, streams make good candidates for mesocode encoding.

Streams are encoded separately from each other in the mesocode section. This is necessary because the intent of mesocode is to express information about groups of instructions, not just the instructions themselves, and their optimizations are entirely local to the stream. This information about aggregates of instructions is further useful for extracting interesting information about program behavior.

3.3. Gshare Mesocode Stream Predictor

Mesocode enables one to treat regions of microcode as aggregate units of execution and abstract away from the individual instructions. Treating program execution as a series of mesocoded sections instead of individual instructions or bundles exposes new ways of manipulating program behavior at different levels of granularity. In particular, accurately predicting the next stream to be executed has several potential benefits. First, predicting the next stream is equivalent to predicting the control behavior of a group of instructions, which reduces the need for branch prediction; the effect is to predict multiple branches at once. Second, if predicated false instructions are removed from the streams, stream prediction also acts as predicate prediction for the removed instructions. Third, stream prediction predicts larger groups of instructions than branches, earlier in the execution stream, enabling more aggressive instruction prefetch. Since each stream is treated as an aggregate unit of instructions, it is possible to generate a prediction upon decoding the escape template at the start of the stream. A prediction made at the first instruction of the stream therefore predicts not the next instruction, as in the case of branch prediction, but one much farther away in the future. This gives ample time to initiate prefetching of the next stream, which can be triggered by a strategically placed "prepare to branch" instruction in the current stream.

Although a stream predictor can predict the behavior of some regions of code more efficiently, it does not necessarily supplant existing hardware structures such as the branch predictor. The stream predictor can also be used to predict transitions between original code and mesocode, so that the appropriate hardware structures in the fetch and decode stages can be bypassed. Table 2 illustrates the four types of control flow transitions, namely within the original binary, within mesocode (streams), and between streams and original code.

Table 2. Code transition scenarios

Transition Scenario              Predictor Coverage
Original code -> original code   Traditional branch predictor (branches)
Original code -> mesocode        Traditional branch predictor (mesocode entry)
Mesocode -> mesocode             Mesocode predictor
Mesocode -> original code        Mesocode predictor (mesocode exit)

The mesocode predictor implemented for this study uses an adaptation of the gshare [11] branch prediction algorithm because of its relative simplicity; it is possible to adapt any branch prediction algorithm for this purpose.

Figure 6. Gshare Predictor. [Diagram: the instruction PC indexes the target buffer to obtain a target PC; the PC XORed with the global history indexes the history table of two-bit counters, which decides whether the target is used.]

Figure 6 illustrates the structures used in gshare prediction. There are separate tables for storing the PC mappings and the histories. The instruction PC is used to index into the target buffer, which maps the given instruction PC to a target instruction PC. For a branch predictor, the target PC is the branch target; for the stream predictor, the target PC is the first instruction of the predicted stream. The instruction PC and a global history are then used to index the separate history table. The history table contains two-bit histories that capture the past behavior of the instruction and generate a prediction. For a branch predictor, the two-bit history predicts whether the branch is taken or not; for the stream predictor, it predicts whether the target PC is used or program execution should fall back to the original code.
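A software sketch of the predictor follows. The table sizes match the base configuration used in Section 5 (a 2048-entry history table and 32 target entries); the update policies, in particular what is shifted into the global history and how targets are replaced, are plausible simplifications of ours rather than details fixed by the paper:

    class GshareStreamPredictor:
        def __init__(self, history_bits=11, target_entries=32):
            self.mask = (1 << history_bits) - 1
            self.ghr = 0                               # global history register
            self.history = [1] * (1 << history_bits)   # 2-bit counters
            self.targets = {}                          # pc -> next stream start
            self.capacity = target_entries

        def predict(self, pc):
            """Predicted start PC of the next stream, or None to fall back
            to the original code and the ordinary branch predictor."""
            target = self.targets.get(pc)
            if target is None or self.history[(pc ^ self.ghr) & self.mask] < 2:
                return None
            return target

        def update(self, pc, actual_next_start):
            idx = (pc ^ self.ghr) & self.mask
            hit = self.targets.get(pc) == actual_next_start
            if hit:
                self.history[idx] = min(3, self.history[idx] + 1)
            else:
                self.history[idx] = max(0, self.history[idx] - 1)
                if len(self.targets) < self.capacity:  # simplified fill policy
                    self.targets[pc] = actual_next_start
            self.ghr = ((self.ghr << 1) | int(hit)) & self.mask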

4. PERFORMANCE SIMULATION METHODOLOGY

To evaluate the performance benefits and tradeoffs of mesocode, we carried out experiments using SMTSIM/IPFsim, a version of the SMTSIM simulator [17] adapted to work with the Intel Itanium®-based simulation environment [20]. This infrastructure is execution-driven and cycle-accurate. It supports a variety of experimental Itanium® processor models for research purposes, including in-order and out-of-order pipelines.

In our study, the baseline in-order model is a two-bundle wide, 12-stage Itanium® processor model similar to [19], but with a longer pipeline to account for higher core frequency. Compared to the baseline in-order model, the OOO model assumes two additional front-end pipe stages to account for the extra OOO design complexity, as well as a register rename stage and a scheduler stage, which replace the expansion queue stage in the baseline. More details on the model are given in Table 3; the pipeline organization is depicted in Figure 7.

Figure 7. Pipeline organization of a research Itanium® processor. [Diagram: stream predictor and ported branch predictor feed the instruction PC; a banked instruction cache supplies two bundles of total fetch bandwidth to decode, rename, and the expansion queue; integer and FP register files (128 each) plus predicate registers (64) feed the functional units and a banked data cache with 4 ld/st bandwidth.]

Table 3. Details of the modeled research Itanium® processor

Pipeline Structure   Out-of-order: 16-stage pipeline
Fetch Bandwidth      2 bundles
Branch Predictor     2K-entry gshare; 256-entry, 4-way associative BTB
Stream Predictor     2K-entry gshare; 32-entry, 4-way associative BTB
Expansion Queue      In-order: 8-bundle queue
Registers            128 integer registers, 128 FP registers, 64 predicate registers, 128 control registers
Execute Bandwidth    Out-of-order: 18-instruction schedule window
Cache Structure      L1 (separate I and D): 16K, 4-way, 8-way banked, 2-cycle latency. L2 (shared): 256K, 4-way, 8-way banked, 14-cycle latency. L3 (shared): 3072K, 12-way, 1-way banked, 30-cycle latency. Fill buffer (MSHR): 16 entries. All caches have 64-byte lines.
Memory Latency       230-cycle latency; TLB miss penalty 30 cycles

All binaries used in this work are compiled with the Intel Electron compiler for the Itanium® architecture [2, 10], which incorporates state-of-the-art instruction level parallelism optimization techniques. All benchmarks are compiled with maximum compiler optimizations enabled, including profile-driven feedback-guided optimization, software prefetching, software pipelining, control speculation, and data speculation.

The benchmarks include six integer benchmarks selected from the SPEC2000int suite; the benchmarks and simulation setup are summarized in Table 4. All benchmarks are simulated for 200 million retired instructions after fast-forwarding past initialization code (with cache warmup). In our initial simulation experiments, much longer runs of the benchmarks were performed; however, the longer-running simulations yielded only negligible performance differences.

Table 4. Workload setup

Benchmark    Input   Fast-forward Distance (30% of workload)   Total Instructions
164.gzip     Lite    18 billion insts                          217 billion
176.gcc      Lite    6.6 billion insts                         13 billion
181.mcf      Lite    8.8 billion insts                         14.7 billion
186.crafty   Lite    1.6 billion insts                         2.5 billion
197.parser   Lite    1 billion insts                           7.45 billion
254.gap      Lite    10 billion insts                          20 billion

5. PERFORMANCE EVALUATION

In this section, a detailed performance evaluation is presented to highlight the performance and tradeoffs of mesocode for out-of-order processor pipelines. For out-of-order processors, the front-end fetch bandwidth is critical to performance. By using mesocode, a performance improvement of 32% can be achieved due to both instruction cache prefetch and the reduction of resource contention resulting from NOP removal from the streams.

A variety of predictor configurations are evaluated, including the branch predictor alone, the stream predictor alone, and a hybrid of both. The baseline reference model is an out-of-order processor that uses a traditional gshare branch predictor alone, with a 4-way set associative, 256-entry BTB and a 2048-entry (12-bit global history) history table of 2-bit counters. For the base model of the stream predictor, the BTB is 4-way set associative with 32 entries, and the history table likewise has 2048 entries.

5.1. Out-of-order Performance

Unlike an in-order processor, out-of-order execution allows the processor to dynamically schedule instructions for execution; its performance is therefore much more sensitive to the availability of critical pipeline resources, including fetch bandwidth and core resources like the various instruction queues. In general, an out-of-order core demands that more instructions be fetched into the pipeline in order to overlap their execution with the concurrent servicing of long latency operations such as outstanding cache misses or multi-cycle functional executions. Any contention for these critical pipeline resources, and their effective misuse by wasteful instructions such as NOPs, can lead to severe starvation in an aggressive out-of-order execution engine. Thus, fetch bandwidth becomes a critical performance-limiting factor for the out-of-order processor. By contrast, an in-order processor by definition must execute its instructions sequentially, following the compiler's static scheduling: the pipeline stalls either upon encountering a stop bit inserted by the compiler or upon an instruction using the results of an outstanding cache miss. Thus, the performance bottleneck of the in-order processor is usually not the fetch bandwidth or contention for front-end resources like the expansion queues.

Figure 8. Performance of out-of-order execution with stream predictor. [Speedup over the baseline for crafty, gap, gcc, gzip, mcf, parser, and average; four bars per benchmark: no prefetch stream-only, no prefetch hybrid, instruction prefetch stream-only, instruction prefetch hybrid; y-axis from 0 to 2.5.]

The third and fourth bars in Figure 8 illustrate the same set of predictor configurations, but using "prepare to branch" instructions in streams to enable instruction prefetching. In addition, NOPs are removed from the streams, representing a further mesocode optimization. Note that instruction prefetching is only used for the stream code. Since more instructions can be fetched before the control flow changes, instruction prefetching is exceptionally helpful for the stream code.

The baseline is the out-of-order model described in Section 4, and all performance is normalized to that of the baseline out-of-order model. For the out-of-order processor with instruction prefetching and the hybrid predictor, we see a 32% performance improvement over the baseline, much better than the improvement achieved on the in-order processor. Even for the configuration with only the stream predictor and no branch predictor, but with instruction prefetching (third bar from the left), performance is 13% higher than the baseline. For crafty, the improvement is over 140%, which clearly indicates that this benchmark is limited by fetch bandwidth. In addition, improving fetch bandwidth by removing NOPs from the streams also greatly helps lessen contention for critical resources like the ROB and the instruction queues in the pipeline.

The performance degradation in the cases without prefetching indicates that despite the code density improvements from NOP removal and realignment, instruction cache penalties still take a toll on performance. Gap shows improvement even without prefetching because it has far fewer taken branches, with roughly seven fall-through branches for every taken branch. As a result, when branch prediction is taken away, gap still does not suffer much from taken branch penalties, and this is enough to compensate for any degradation due to instruction cache misses.

The high instruction cache penalty is likely due to conflict misses stemming from aliasing in the addresses used by the mesocode section, resulting in a high miss stall rate. Performance improves overall if prefetching and code density optimizations are combined, because prefetching can be initiated early enough to overcome the effects of the conflict misses, while the code density optimizations provide the pipeline with enough instructions to issue every cycle.
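The timeliness argument can be seen in a small sketch: because a predicted stream's start address and length are known at prediction time (well before its instructions are needed), every instruction cache line the stream touches can be prefetched immediately. The function names and prefetch interface are our own illustrative abstractions:

    LINE_BYTES = 64   # line size of the modeled caches (Table 3)

    def prefetch_predicted_stream(start_pc, stream_bytes, issue_prefetch):
        """Prefetch every I-cache line a predicted stream will occupy."""
        first = start_pc // LINE_BYTES
        last = (start_pc + stream_bytes - 1) // LINE_BYTES
        for line in range(first, last + 1):
            issue_prefetch(line * LINE_BYTES)   # overlap misses with execution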

In most cases, performance is higher with the hybrid predictor configuration than with the mesocode predictor alone. This is because, with the mesocode predictor alone, all branches in the non-mesocode code are effectively predicted not taken, and there are enough taken branches to cause substantial penalties. Only crafty and gzip show a very slight degradation when going to the hybrid predictor, which indicates that few such branches exist there. This data also suggests that it is better to keep the branch and stream predictors and their histories separate from each other, as the behaviors of their predictions are potentially quite different and have little to do with each other. They are accessed at different times: the stream predictor in the decode stage or later (after the escape template is encountered, indicating the demarcation of a stream), and the branch predictor in the fetch stage. The former is more latency tolerant, can be performed off the critical path, and could be implemented in more than one cycle, while the latter is more timing critical and is usually implemented in one cycle. Since both structures would be smaller than a traditional branch predictor, decoupling their behavior also makes their layout and placement more flexible; in particular, the stream predictor can be physically moved away from the critical path. In addition, given the much lower activity of the stream predictor compared to the normal branch predictor, a specific stream predictor design may also optimize for lower power. These benefits and tradeoffs are beyond the scope of this paper.

6. RELATED WORK

Extensive research has been conducted in the area of improving effective fetch bandwidth. Traditional hardware approaches such as the trace cache [16], rePLay [5], and block-level ISA [3] mechanisms employ specialized hardware to dynamically capture frequently executed instruction sequences, represent them in an internal format, and store them in microarchitectural structures such as trace caches or frame caches, from which most instructions are expected to be supplied.

Software approaches such as the software trace cache [15], Dynamo [1], and Spike [25] are geared towards achieving a similar benefit to the trace cache by identifying the frequently executed paths in the program, either dynamically or through profiling, and then reordering basic blocks to bias towards the frequent not-taken cases. This enables long sequences of instructions to be fetched without being interrupted by taken branches.

In perspective, mesocode is a mechanism between pure hardware approaches like the trace cache or rePLay and software approaches such as the software trace cache: it is a hybrid software and hardware approach. The key motivation behind mesocode is to make small changes to the hardware and software so as to make the overall front end more efficient. In particular, for an EPIC ISA like the Itanium® ISA, mesocode provides a means to unshackle the oftentimes inflexible and non-scalable instruction bundling restrictions for the frequently executed portion of the program. This relaxation enables post-pass software binary adaptation of legacy binaries built for the canonical Itanium® architecture to work more efficiently on future processor implementations, which may be much more aggressive than a canonical in-order implementation.

7. CONCLUSIONS

The key insight is that for an out-of-order processor, the front-end fetch bandwidth is a critical resource. By using mesocode, a performance improvement of 32% can be achieved due to both instruction cache prefetch and the reduction of resource contention resulting from NOP removal from the streams. Many benchmarks suffer substantial branch misprediction penalties in their original code if those branches are always predicted not taken. Furthermore, streams tend to alias in the instruction cache, which causes miss penalties on instruction fills. The combination of these two degrading factors is enough to void the gains from NOP removal alone; as a result, NOP removal must be used in conjunction with instruction prefetch.

In this work, we only examine the application of mesocode to uniprocessor, single-threaded applications. As multithreaded hardware techniques are adopted in mainstream designs [26, 27] and resource contention at the front end becomes even more crucial, it would be interesting to examine how mesocode can be applied there to achieve performance improvement.

8. REFERENCES

[1] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A Transparent Dynamic Optimization System. In Proc. ACM SIGPLAN '00 Conf. on Programming Language Design and Implementation, pp. 1-12, 2000.
[2] J. Bharadwaj et al. The Intel IA-64 Compiler Code Generator. IEEE Micro, Sept.-Oct. 2000, pp. 44-53.
[3] P.-Y. Chang, E. Hao, and Y. N. Patt. Target Prediction for Indirect Jumps. In Proc. 24th Int'l Symp. on Computer Architecture, pp. 274-283, June 1997.
[4] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y.-F. Lee, D. Lavery, and J. P. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In Proc. 28th Int'l Symp. on Computer Architecture, pp. 14-25, July 2001.
[5] B. Fahs, S. Bose, M. Crum, B. Slechta, F. Spadini, T. Tung, S. J. Patel, and S. S. Lumetta. Performance Characterization of a Hardware Framework for Dynamic Optimization. In Proc. 34th Int'l Symp. on Microarchitecture, December 2001.
[6] B. Fields, S. Rubin, and R. Bodik. Focusing Processor Policies via Critical-Path Prediction. In Proc. 28th Int'l Symp. on Computer Architecture, pp. 74-85, July 2001.
[7] E. Hao, P. Chang, M. Evers, and Y. N. Patt. Increasing the Instruction Fetch Rate via Block-Structured Instruction Set Architectures. In Proc. 29th Int'l Symp. on Microarchitecture, pp. 191-200, December 1996.
[8] D. Joseph and D. Grunwald. Prefetching Using Markov Predictors. In Proc. 24th Int'l Symp. on Computer Architecture, pp. 252-263, 1997.
[9] S. Jourdan, T. H. Hsing, J. Stark, and Y. N. Patt. The Effects of Mispredicted-Path Execution on Branch Prediction Structures. In Proc. 1996 Conf. on Parallel Architectures and Compilation Techniques, October 1996.
[10] R. Krishnaiyer, D. Kulkarni, D. Lavery, W. Li, C. Lim, J. Ng, and D. Sehr. An Advanced Optimizer for the IA-64 Architecture. IEEE Micro, Nov.-Dec. 2000.

[11] S. McFarling. Combining Branch Predictors. Technical Report TN-36, Digital Western Research Laboratory, June 1993.
[12] M. C. Merten, A. Trick, E. Nystrom, R. Barnes, and W. Hwu. A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots. In Proc. 27th Int'l Symp. on Computer Architecture, pp. 59-70, June 2000.
[13] K. Pettis and R. C. Hansen. Profile Guided Code Positioning. In Proc. ACM SIGPLAN 1990 Conf. on Programming Language Design and Implementation, pp. 16-27, June 1990.
[14] E. M. Nystrom, R. Barnes, M. Merten, and W. Hwu. Code Reordering and Speculation Support for Dynamic Optimization Systems. In Proc. Int'l Conf. on Parallel Architectures and Compilation Techniques, September 2001.
[15] A. Ramirez, J. Larriba-Pey, C. Navarro, J. Torrellas, and M. Valero. Software Trace Cache. In Proc. 1999 Int'l Conf. on Supercomputing, June 1999.
[16] E. Rotenberg, S. Bennett, and J. E. Smith. Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. In Proc. 29th Int'l Symp. on Microarchitecture, pp. 24-34, December 1996.
[17] D. M. Tullsen. Simulation and Modeling of a Simultaneous Multithreaded Processor. In 22nd Annual Computer Measurement Group Conference, December 1996.

[18] E. Tune, D. Liang, D. M. Tullsen, and B. Calder. Dynamic Prediction of Critical Path Instructions. In Proc. 7th Int'l Symp. on High Performance Computer Architecture, January 2001.
[19] H. Sharangpani and K. Arora. Itanium® Processor Microarchitecture. IEEE Micro, Sept.-Oct. 2000, pp. 24-43.
[20] R. Uhlig, R. Fishtein, O. Gershon, I. Hirsh, and H. Wang. SoftSDV: A Presilicon Software Development Environment for the IA-64 Architecture. Intel Technology Journal, Q4 1999.
[21] C. Zilles and G. Sohi. Execution-based Prediction Using Speculative Slices. In Proc. 28th Int'l Symp. on Computer Architecture, July 2001.
[22] Intel Corp. Intel IA-64 Architecture Software Developer's Manual.
[23] SPEC. SPEC CPU2000 Documentation. (www.spec.org/osg/cpu2000/docs/)
[24] A. Ramirez, J. Larriba-Pey, and M. Valero. A Stream Processor Front-end. In IEEE TCCA Newsletter, June 2000.
[25] A. Ramirez, L. A. Barroso, K. Gharachorloo, G. Lowney, R. Cohn, J. L. Larriba-Pey, and M. Valero. Code Layout Optimizations for Transaction Processing Workloads. In Proc. 28th Int'l Symp. on Computer Architecture, July 2001.
[26] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The Microarchitecture of the Pentium® 4 Processor. Intel Technology Journal, Q1 2001.
[27] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In 22nd International Symposium on Computer Architecture, June 1995.
