PC Processor Microarchitecture
A Concise Review of the Techniques Used in Modern PC Processors
by Keith Diefendorff

Having commandeered nearly all the performance-enhancing techniques used by their mainframe and supercomputer predecessors, the microprocessors in today's PCs employ a dizzying assemblage of microarchitectural features to achieve extraordinary levels of parallelism and speed. Enabled by astronomical transistor budgets, modern PC processors are superscalar, deeply pipelined, out of order, and they even execute instructions speculatively. In this article, we review the basic techniques used in these processors as well as the tricks they employ to circumvent the two most challenging performance obstacles: memory latency and branches.

Two Paths to Performance
The task normally assigned to chip architects is to design the highest-performance processor possible within a set of cost, power, and size constraints established by market requirements. Within these constraints, application performance is usually the best measure of success, although, sadly, the market often mistakes clock frequency for performance.

Two main avenues are open to designers trying to improve performance: making operations faster or executing more of them in parallel. Operations can be made faster in several ways. More advanced semiconductor processes make transistors switch faster and signals propagate faster. Using more transistors can reduce execution-unit latency (e.g., full vs. partial multiplier arrays). Aggressive design methods can minimize the levels of logic needed to implement a given function (e.g., custom vs. standard-cell design) or increase circuit speed (e.g., dynamic vs. static circuits).

For parallelism, today's PC processors rely on pipelining and superscalar techniques to exploit instruction-level parallelism (ILP). Pipelined processors overlap instructions in time on common execution resources. Superscalar processors overlap instructions in space on separate resources. Both techniques are used in combination.

Unfortunately, performance gains from parallelism often fail to meet expectations. Although a four-stage pipeline, for example, overlaps the execution of four instructions, as Figure 1 shows, it falls far short of a 4× performance boost. The problem is pipeline stalls. Stalls arise from data hazards (data dependencies), control hazards (changes in program flow), and structural hazards (hardware resource conflicts), all of which sap pipeline efficiency.

Lengthening the pipeline, or superpipelining, divides instruction execution into more stages, each with a shorter cycle time; it does not, in general, shorten the execution time of instructions. In fact, it may increase execution time because stages rarely divide evenly and the frequency is set by the longest stage. In addition, longer pipelines experience a higher percentage of stall cycles from hazards, thereby increasing the average cycles per instruction (CPI). Superscalar techniques suffer from similar inefficiencies.

The throughput gains from a longer pipeline, however, usually outweigh the CPI loss, so performance improves. But lengthening the pipeline has limits. As stages shrink, clock skew and latch overheads (setup and hold times) consume a larger fraction of the cycle, leaving less usable time for logic. The challenge is to make the pipeline short enough for good efficiency but not so short that ILP and frequency are left lying on the table, i.e., an underpipelined condition.
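To see why the four-stage pipeline of Figure 1 falls short of a 4× gain, it helps to attach rough numbers to the effect. The sketch below is only a back-of-the-envelope model (the 10-ns logic path, the per-stage latch overhead, and the per-stage stall contribution are assumed values, not measurements of any real processor), but it shows how stalls and clock overhead erode the ideal speedup and why ever-deeper pipelines hit diminishing returns.

```python
def pipeline_speedup(depth, logic_latency_ns=10.0, latch_overhead_ns=0.2,
                     stall_cpi_per_stage=0.05):
    """Throughput relative to an unpipelined unit (illustrative model only)."""
    cycle_ns = logic_latency_ns / depth + latch_overhead_ns   # clock skew and latch overhead
    cpi = 1.0 + stall_cpi_per_stage * depth                   # hazard stalls grow with depth
    return logic_latency_ns / (cpi * cycle_ns)

for depth in (1, 4, 8, 12):
    print(depth, round(pipeline_speedup(depth), 2))
# Roughly 0.9, 3.1, 4.9, 6.0: with these assumptions a 4-stage pipeline gains
# about 3x rather than 4x, and each further increase in depth buys less.
```

With these assumed numbers, doubling the pipeline depth from four to eight stages improves throughput by well under 2×, which is the efficiency loss the article describes.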
Today's PC processors use pipelines of 5 to 12 stages. When making this decision, designers must keep in mind that frequency is often more important in the market than performance.

Figure 1. Pipelines overlap the execution of instructions in time. Lengthening the pipeline increases the number of instructions executed in a given time period. Longer pipelines, however, suffer from a higher percentage of stalls (not shown).

Prophetic Hardware for Long Pipelines
Branch prediction and speculative execution are techniques used to reduce pipeline stalls on control hazards. In a pipelined processor, conditional branches are often encountered before the data that will determine branch direction is ready. Because instructions are fetched ahead of execution, correctly predicting unresolved branches allows the instruction fetcher to keep the instruction queue filled with instructions that have a high probability of being used.

Some processors take the next step, actually executing instructions speculatively past unresolved conditional branches. This technique avoids the control-hazard stall altogether when the branch goes in the predicted direction. On mispredictions, however, the pipeline must be flushed, instruction fetch redirected, and the pipeline refilled.
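A rough calculation shows what those flushes cost and why predictor accuracy matters so much. The one-in-five branch frequency and the 10-cycle flush penalty below are assumed, round numbers used only to illustrate the arithmetic.

```python
# Rough estimate of CPI added by branch mispredictions (assumed numbers).
branch_fraction = 1 / 5        # about one conditional branch per five instructions
flush_penalty = 10             # cycles to redirect fetch and refill the pipeline

for accuracy in (0.70, 0.85, 0.98):
    stall_cpi = branch_fraction * (1 - accuracy) * flush_penalty
    print(f"{accuracy:.0%} accuracy -> +{stall_cpi:.2f} CPI from mispredictions")
# 70% accuracy costs about 0.6 CPI on this model; 98% costs only 0.04 CPI,
# which is why long pipelines demand highly accurate dynamic predictors.
```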

Statistically, prediction and speculation dramatically reduce stalls. How dramatically depends on prediction accuracy.

Branch predictors range in sophistication from simple static predictors (compiler or heuristic driven), which achieve 65–85% accuracy, to complex dynamic predictors that can achieve 98% accuracy or more. Since one in five instructions is typically a conditional branch, high accuracy is essential, especially for machines with long pipelines and, therefore, with large mispredict penalties. As a result, most modern processors employ dynamic predictors.

The Past Predicts the Future
The simplest dynamic predictor is the branch history table (BHT), a small cache indexed by the address of the branch being predicted. Simple BHTs record one-bit histories of the direction each branch took the last time it executed. More sophisticated BHTs use two-bit histories, which add hysteresis to improve prediction accuracy on loop branches. Even more sophisticated schemes use two-level predictors with longer per-branch histories that index into pattern tables containing two-bit predictors (see MPR 3/27/95, p. 17).

A simplified version of the two-level predictor uses a single global-history register of recent branch directions to index into the BHT. The GShare enhancement (see MPR 11/17/97, p. 22) adds per-branch sensitivity by hashing a few bits of the branch address with the global-history register, as Figure 2 shows. The agrees-mode enhancement encodes the prediction as agreement or disagreement with a static prediction, thereby avoiding excessive mispredictions when multiple active branches map to the same BHT entry. In architectures with no static-prediction opcode bits, such as the x86, the static prediction must be based on branch heuristics (e.g., backward: predict taken).

Figure 2. The GShare algorithm with agrees-mode encoding is used by several PC processors to dynamically predict branches.

Some processors predict the target instruction stream as well as the direction. Target predictions are made with a branch target address cache (BTAC), which caches the address to which control was transferred the last time the branch was taken. BTACs are sometimes combined with the BHT into a branch target buffer (BTB). Instead of a BTAC, some processors use a branch target instruction cache (BTIC), which caches the first few instructions down the target path so the pipeline can be primed without an inline fetch cycle. Many processors also include a special-purpose return-address stack to predict the return addresses of subroutines.

Rearranging Instructions Boosts Throughput
Pipeline stalls arising from data and structural hazards can sometimes be avoided by judiciously rearranging instruction execution. Stalls on data hazards, for example, can be avoided by arranging instructions such that they do not depend on the results of preceding instructions that may still be in execution. The extent to which this is possible, without violating the program's data-flow graph, establishes an upper limit on the ILP the processor can exploit.

Although compilers can statically reschedule instructions, they are hampered by incomplete knowledge of run-time information. Load-use penalties, for example, are resistant to static rescheduling because their length is generally unpredictable at compile time. It is simply impossible to find enough independent instructions to cover the worst-case number of load-delay slots in every load.

Static rescheduling is also constrained by register namespace and by ambiguous dependencies between memory instructions. A large register namespace is required for good register allocation, for freedom in rearranging instructions, and for loop unrolling. Register limitations are especially severe in x86 processors, which have only eight general-purpose registers. In-order processors—which issue, execute, complete, and retire instructions in strict program order—must rely entirely on static rescheduling and can suffer a large number of pipeline stalls.

Therefore, most current PC processors implement dynamic instruction rescheduling to some degree. The simplest out-of-order processors issue instructions in order but allow them to execute and complete out of order. Processors of this type use register scoreboarding to interlock the pipeline, stalling instruction issue when an instruction's operands aren't ready. Such processors can achieve somewhat more parallelism than in-order processors by permitting instructions to execute in parallel through execution units with different or variable latencies.

Even simple out-of-order processors require complex hardware to reorder results before the corresponding instructions are retired (removed from the machine). Although strict result ordering is not needed from a data-flow perspective, it is required to maintain precise exceptions (the appearance of in-order execution following an interrupt) and to recover from mispredicted speculative execution.

The most common reordering method is the reorder buffer (ROB), which buffers results until they can be written to the register file in program order. Accessing operands from the reorder buffer, which is needed for reasonable performance, requires an associative lookup to locate the most recent version of the operand.
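To make the GShare and agrees-mode scheme of Figure 2 concrete, here is a minimal sketch. It is illustrative only: the table size, history length, hash, and backward-taken static heuristic are assumptions, not the parameters of any shipping predictor. The predictor keeps a global history register, hashes it with low-order branch-address bits to select a two-bit counter, and interprets that counter as agreement or disagreement with the static prediction.

```python
class GsharePredictor:
    """Toy GShare branch predictor with agrees-mode encoding (illustrative)."""

    def __init__(self, index_bits=12, history_bits=12):
        self.index_bits = index_bits
        self.history = 0                       # global branch-history register
        self.history_mask = (1 << history_bits) - 1
        self.table = [2] * (1 << index_bits)   # 2-bit counters, initialized to weakly agree

    def _index(self, pc):
        return ((pc >> 2) ^ self.history) & ((1 << self.index_bits) - 1)

    def _static_prediction(self, pc, target):
        return target < pc                     # heuristic: backward branch, predict taken

    def predict(self, pc, target):
        agree = self.table[self._index(pc)] >= 2
        static_taken = self._static_prediction(pc, target)
        return static_taken if agree else not static_taken

    def update(self, pc, target, taken):
        i = self._index(pc)
        agreed = (taken == self._static_prediction(pc, target))
        # saturating 2-bit counter moves toward agree or disagree
        self.table[i] = min(3, self.table[i] + 1) if agreed else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

Because each counter records agreement with the static hint rather than a raw taken/not-taken history, two branches that alias to the same entry but are both well predicted statically reinforce rather than fight each other, which is the point of agrees-mode encoding.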


Other types of reordering hardware are also used. History buffers, for example, record source-operand history so the processor can backtrack to a precise architectural state. Future files maintain the current state and the architectural state in separate register files, allowing the processor to be checkpointed back to a precise state.

Complexity Rears Its Ugly Head
More-aggressive out-of-order processors can also avoid stalling instruction dispatch on data hazards. This is accomplished by setting instructions aside in reservation stations, where they can wait for operands while dispatch proceeds. As operands become available, instructions are issued to execution units, possibly out of program order. (Note: although some architects reverse the use of the terms dispatch and issue, we find the convention used here is the most common.) Distributed reservation stations are dedicated to individual execution units, as Figure 3 shows, while centralized reservation stations serve multiple units.

Figure 3. A generic two-issue out-of-order processor with reorder buffer and distributed reservation stations to dynamically reschedule instructions.

Although out-of-order processing averts many stalls, it suffers from new ones created by false dependencies—artifacts of using register names out of order. Stalls on these dependencies—called write-after-read hazards (WAR or antidependencies) and write-after-write hazards (WAW or output dependencies)—can be avoided by register renaming, a process whereby each new instruction is assigned a unique destination register from a large pool of physical registers. The mapping of physical registers to architectural registers is maintained in the ROB, usually by some type of register alias table (RAT), allowing register operands to be accessed by their architectural name. Physical registers are released and returned to the pool as instructions are retired.

In out-of-order processors, the capacity of the reservation stations, reorder buffer, and rename tables limits the number of instructions that can be in flight. Processors that keep many instructions in flight generally exploit more parallelism and have higher execution-unit utilization; thus they achieve higher instruction throughput (instructions per clock or IPC) for a given amount of hardware.

The set of instructions that can be considered for issue to execution units each cycle is called the window. The complexity of the data-dependence analysis required to place instructions into execution is roughly proportional to the square of the maximum number of instructions in the window. (The window should not be confused with the superscalar dispatch width, which is the number of instructions that can be decoded and dispatched to reservation stations each cycle, or the issue width, which is the number of instructions that can be issued to execution units each cycle.)

Memory: the Grand Challenge
Although dynamic instruction scheduling provides some degree of latency tolerance, it is impractical to build reordering hardware deep enough to cover very many long-latency operations. There are several operations that tend to have long latency, such as multiply, divide, and most floating-point operations, but memory accesses present the greatest challenge to the CPU designer.

The latency of an access to PC memory can be 20–30× the CPU cycle time, even in the best case that the access hits an open page (or bank) in DRAM. Accesses that require opening a page, or closing one and opening another, can take much longer. Access time can be even longer if the processor must contend with another device for access to the memory; queuing theory suggests that the average service time of a memory request increases rapidly with memory utilization.

There are plenty of opportunities to experience memory delays; roughly every third instruction in a program is a memory operation. Clearly, steps must be taken to reduce access time, else even the fanciest instruction-level-parallel hardware will go for naught. Even with heroic efforts, however, memory access time remains the biggest performance obstacle.

With today's SDRAMs (synchronous DRAMs), it isn't terribly difficult to build high-bandwidth memory systems. Unfortunately, processor performance tends to be more sensitive to memory latency than to memory bandwidth, and low memory latency is far harder to achieve. The two primary weapons in the fight against memory latency are caching and reordering.

Caches are small, fast buffers managed by hardware to exploit locality of reference as a means of reducing memory latency. Caches can exploit both temporal locality (the propensity to access the same data repeatedly within a short time period) and spatial locality (the propensity to access memory addresses that are close together).

When the processor first accesses a memory location, the surrounding block of memory is brought into a cache line and tagged with the address of the block. Thereafter, each time the processor accesses memory, the cache tags are checked for a match. On a match (cache hit), data is supplied to the processor from the corresponding cache line. If no match is found (a cache miss), the data is brought in from memory, supplied to the processor, and deposited into the cache. To the extent that locality is present, there will be more hits than misses, and the average memory access time (Tavg) will be reduced.
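The tag check just described is easy to sketch in software. The model below is a toy direct-mapped cache (the 32-byte line and 1,024-line capacity are assumed for illustration) that splits an address into tag, index, and offset and allocates a line on a miss.

```python
class DirectMappedCache:
    """Toy direct-mapped cache model: tag/index/offset lookup (illustrative)."""

    LINE_BYTES = 32          # assumed line length
    NUM_LINES = 1024         # assumed capacity: 32 KB

    def __init__(self):
        self.tags = [None] * self.NUM_LINES
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        block = addr // self.LINE_BYTES     # strip the offset within the line
        index = block % self.NUM_LINES      # which cache line the block maps to
        tag = block // self.NUM_LINES       # remaining high-order address bits
        if self.tags[index] == tag:
            self.hits += 1                  # hit: data supplied from the cache line
            return True
        self.misses += 1                    # miss: fetch the block and deposit it
        self.tags[index] = tag
        return False

cache = DirectMappedCache()
for addr in range(0, 64 * 1024, 4):        # sequential 4-byte accesses
    cache.access(addr)
print(cache.hits, cache.misses)             # 14336 hits, 2048 misses
```

Running it over a sequential address stream shows spatial locality at work: each compulsory miss brings in a line that then serves the next seven word accesses.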

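Returning to the out-of-order machinery of Figure 3, the sketch below shows the essence of register renaming and data-flow issue. It is a toy model with assumed structure sizes (eight architectural and 32 physical registers) and it omits the reorder buffer, retirement, and the freeing of physical registers; its only purpose is to show how a fresh destination register removes WAR/WAW hazards and how an instruction issues from its reservation station as soon as its operands are ready, regardless of program order.

```python
from collections import deque

class RenamedCore:
    """Toy rename plus reservation-station model (illustrative sizes only)."""

    def __init__(self, arch_regs=8, phys_regs=32):
        self.free = deque(range(arch_regs, phys_regs))        # free physical registers
        self.rat = {f"r{i}": i for i in range(arch_regs)}     # arch name -> phys reg
        self.ready = set(self.rat.values())                   # phys regs holding valid data
        self.stations = []                                    # waiting instructions

    def dispatch(self, text, dst, src1, src2):
        srcs = (self.rat[src1], self.rat[src2])               # read current mappings
        pd = self.free.popleft()                              # fresh dest removes WAR/WAW
        self.rat[dst] = pd
        self.stations.append((text, pd, srcs))

    def issue_ready(self):
        """Issue every station whose operands are ready, out of program order."""
        issued, waiting = [], []
        for text, pd, srcs in self.stations:
            if all(s in self.ready for s in srcs):
                issued.append(text)
                self.ready.add(pd)                            # result becomes available
            else:
                waiting.append((text, pd, srcs))
        self.stations = waiting
        return issued

core = RenamedCore()
core.ready.discard(1)                               # pretend r1 awaits an earlier load miss
core.dispatch("add r3,r1,r2", "r3", "r1", "r2")     # needs r1, so it must wait
core.dispatch("sub r5,r2,r4", "r5", "r2", "r4")     # independent, can go ahead
core.dispatch("mul r3,r2,r4", "r3", "r2", "r4")     # WAW on r3 removed by renaming
print(core.issue_ready())    # ['sub r5,r2,r4', 'mul r3,r2,r4']  (out of order)
core.ready.add(1)                                   # the load data finally arrives
print(core.issue_ready())    # ['add r3,r1,r2']
```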

Most PC processors utilize separate instruction caches (I-caches) and data caches (D-caches). These caches are usually accessed and tagged with physical memory addresses. Processors translate logical addresses (program addresses) to physical addresses via lookup in a translation-lookaside buffer (TLB). A TLB caches recent translations of virtual-page numbers to physical-page numbers, as well as memory-protection attributes such as write protect. Usually there is an instruction TLB (I-TLB) and a data TLB (D-TLB); these are sometimes backed by a larger unified TLB. TLB misses typically invoke a microcode routine that searches for a translation in the OS-maintained virtual-memory mapping tables. If none is found, a page-fault interrupt transfers control to the OS to resolve the problem.

If You Want It Fast, Keep It Close
Caches suffer from three types of misses: compulsory misses, which occur on the initial access to a memory location; capacity misses, which occur when the cache can't hold all the data being accessed; and conflict misses, which occur when multiple memory blocks map to the same cache line. Ignoring the program's memory-access patterns—which can have a large effect on cache behavior—the physical characteristics of the cache determine its miss ratio. Important characteristics include size, associativity, line length, replacement policy, write policy, and allocation policy.

Compulsory misses can be reduced by increasing the line size to take more advantage of spatial locality. Increasing the line size, however, reduces the number of blocks in the cache, thereby increasing capacity misses and conflict misses. Line lengths of 32 and 64 bytes provide a good balance for small caches and are the most commonly used sizes.

Ideally, stalls from compulsory misses could be avoided if the compiler could boost loads so the memory access could get started ahead of the time the program will actually need the data. Unfortunately, a load cannot generally be boosted out of its basic block, because the block may not be executed and the load could fault. Furthermore, it is frequently impossible for a compiler to disambiguate memory addresses, forcing it to be overly conservative. As a result, loads cannot usually be boosted far enough to cover much memory latency.

An increasingly popular solution to the problem of compulsory misses is nonbinding prefetch instructions. These instructions are simply hints to the hardware, suggesting it should try to fetch a memory block into the cache. Because prefetch instructions don't modify machine state (registers) and are nonfaulting, they can be placed arbitrarily far ahead of a load, allowing time for a compulsory miss to be serviced, so the load sees a cache hit.

It Was Here Just a Moment Ago
Capacity misses are mainly a function of cache size. Large caches have lower miss ratios than small caches; as a general rule of thumb, the miss ratio improves proportionally to the square root of a cache-size increase: e.g., a 4× larger cache has roughly half the miss ratio. This rule suggests rapidly diminishing returns on cache size.

Moreover, access time increases with cache size, thanks to physics. This fact sets up a tradeoff between a small, fast cache and a larger, slower cache. To reduce thrashing, caches should be larger than the working set of the program. Because working sets are sometimes too large for caches small enough to have one- or two-cycle access times, many processors use a cache hierarchy. Two-level caches, for example, comprise a small, fast level-one cache (L1) backed by a larger, slower level-two cache (L2), as Figure 4 shows.

Early two-level caches consisted of on-chip L1s, with the external L2 connected to the system or frontside bus (FSB). FSBs are not ideal cache interfaces, however. Designed as shared multidrop buses for DRAM, I/O, and multiprocessor (MP) traffic, FSBs are usually slow. A 500-MHz processor requires an average of 2.5 CPU cycles to synchronize each memory request to a slow 100-MHz bus, adding to L2 access time. This slow speed also throttles the burst transfers used to fill cache lines. To minimize these effects, processors burst data critical word first and forward it immediately to the processor so the pipeline can be restarted posthaste.

Multilevel Caches Move on Chip
To speed L2 accesses, many processors have adopted dual-bus architectures, placing the L2 on a dedicated backside bus (BSB). Because a BSB connects exclusively to the cache, it can be optimized for SRAM transfers and can operate at the full CPU clock rate. Since SRAMs capable of operating at full CPU speeds are expensive, however, most PC processors operate the BSB at half the CPU clock rate. Still, the BSB makes a much faster L2 interface than an FSB. Some processors have taken the additional step of moving the L2-cache tags onto the processor die to speed hit/miss detection and to allow higher set-associativity.

With the advent of 0.25-micron processes, PC processor vendors began bringing the BSB and the L2 on chip. The alternative of increasing the size of the L1s is still favored by some designers, but the two-level approach will become more popular as on-chip cache size grows. The trend toward on-chip L2s will accelerate with 0.18-micron processes, and external L2s may disappear completely by the 0.13-micron generation.

Figure 4. Two-level cache hierarchies are designed to reduce the average memory access time (Tavg) seen by the processor:
Tavg = TaccL1 + (MissL1 × TaccL2) + (MissL1 × MissL2 × TaccDRAM)
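Plugging representative numbers into the Tavg equation above shows how the pieces interact. The parameters below are assumptions chosen to make the arithmetic visible, not measurements of any product.

```python
def t_avg(t_l1, miss_l1, t_l2, miss_l2, t_dram):
    """Average access time (in CPU cycles) for a two-level cache hierarchy."""
    return t_l1 + miss_l1 * t_l2 + miss_l1 * miss_l2 * t_dram

# Assumed, illustrative numbers: 1-cycle L1 with a 5% miss ratio, 60-cycle DRAM.
on_chip_l2  = t_avg(1, 0.05,  6, 0.30, 60)   # small but full-speed on-chip L2
external_l2 = t_avg(1, 0.05, 15, 0.25, 60)   # larger but half-speed external L2 on a BSB
print(on_chip_l2, external_l2)               # 2.2 vs. 2.5 cycles per access
```

Even though the larger external L2 misses less often, the extra latency it adds to every L2 access outweighs its better hit ratio in this example, which is the effect that favors small, full-speed on-chip L2s.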


Although on-chip L2s are typically smaller than external L2s, they can also be faster. On chip, the BSB can be very wide and operate at the full CPU clock rate. In addition, the L2 can have higher set-associativity, multiple banks, multiple ports, and other features that are impractical to build off chip with commodity SRAMs. These attributes can increase speed and hit ratios dramatically, offsetting the smaller size. On most PC applications, a full-speed 256K on-chip L2 outperforms a half-speed external 512K L2.

Associativity Avoids Conflicts
Conflict misses can be reduced by associativity. In a nonassociative or direct-mapped cache, each memory block maps to one, and only one, cache line. But because multiple blocks map to each cache line, accesses to different memory addresses can conflict. In a fully associative cache, on the other hand, any memory block can be stored in any cache line, eliminating conflicts. Fully associative caches, however, are expensive and slow, so they are usually approximated by n-way set-associative caches.

As a rule of thumb, a two-way set-associative cache has a miss rate similar to that of a direct-mapped cache twice the size. Miss-rate improvement, however, diminishes rapidly with increasing associativity. For all practical purposes, an eight-way set-associative cache is just as effective as a fully associative cache. A least-recently used (LRU) replacement algorithm is the one most often used to decide into which way a new line should be allocated (even though LRU is known to be suboptimal in many cases).

Cache performance is also affected by a cache's write policy. The simplest policy is write through, wherein every store writes data to main memory, updating the cache only if it hits. This policy, however, leads to slow writes and to excessive write traffic to memory. As a result, most PC processors use write-back caches (sometimes called copy-back or store-in caches). Write-back caches write to the cache, not to memory, on a hit. Thus, a cache line can collect multiple store hits, writing to memory only once when the line is replaced. Most write-back caches use a write-allocate policy, which allocates a new line in the cache on a write miss.

Who Has What?
In systems with caches, there is the nasty problem of cache coherency. If, for example, a processor has a memory block in its cache, and an I/O device writes to an address in that block, then the data in the processor's cache becomes stale. If the processor has modified the cache line, the data written by the I/O device will be overwritten and lost permanently when the cache line is eventually written back to memory.

Avoiding these situations with software is difficult and error prone. With multiple processors, the problem becomes even more complicated, and software solutions become intractable. Although not a problem for PCs today, multiprocessors will one day become attractive. Thus, to simplify I/O software and enable multiprocessing in the future, PC processors all enforce cache coherence via hardware.

The most popular scheme is the four-state coherence protocol called MESI (modified, exclusive, shared, invalid). In this scheme, MESI status bits are maintained with each cache line. The first time a processor writes to a shared line in its cache, it broadcasts a write-invalidate coherence transaction to other devices. Any device with a copy of the line in its cache invalidates its copy, which, if modified, requires writing the line to memory before allowing the processor to take ownership of it. As an optimization, some processors allow lines to be allocated into a fifth, owned state (MOESI) to improve the efficiency of accessing shared data in symmetric multiprocessor (SMP) systems.

In this coherence scheme, every processor or device with a cache snoops (watches) every memory transaction issued by every device in the coherence domain. If a snooped transaction hits on a line that is held exclusively, its status is changed to shared. If a snooped transaction hits a modified line, the offending transaction is held off until the dirty line can be written to memory. Alternatively, the processor with the hit can intervene to produce the modified data, allowing other processors and memory to snarf it off the bus.

Memory Instructions Are Different
Loads and stores have two distinct phases: address generation and memory access. Through the address-generation phase, processors treat loads and stores just like other instructions. To deal with the unique characteristics of memory, however, processors generally decouple the memory-access phase by issuing memory requests along with their resolved addresses to a queue or a buffer in the memory unit, as Figure 5 shows.

Figure 5. Some processors allow loads to bypass stores that are waiting on data. With a nonblocking cache, load hits can access the cache while a previous miss is waiting on data from memory.

One important function performed by the memory unit is load/store reordering. Store instructions are often issued before their store data is ready. If all memory transactions were forced to access memory in program order, subsequent loads would be blocked, unnecessarily stalling the pipeline. To alleviate the blockage, some memory units provide store reservation stations, where stores can wait on data while subsequent loads access memory. To ensure correct program operation, the memory unit must perform dynamic memory-address disambiguation, disallowing any load from passing a store to the same address.
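As an aside on the MESI protocol described above, the sketch below captures the state transitions as seen by one cache. It is a deliberately simplified, hypothetical model: bus arbitration, the actual write-invalidate transaction, and intervention are reduced to comments.

```python
# Toy MESI transitions for one cache line, from the viewpoint of one cache.
MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def on_local_access(state, is_write, others_have_copy):
    """This processor loads (is_write=False) or stores (is_write=True) the line."""
    if is_write:
        # A first store to a Shared or Invalid line broadcasts write-invalidate first.
        return MODIFIED
    if state == INVALID:                        # read miss: fetch the line
        return SHARED if others_have_copy else EXCLUSIVE
    return state                                # read hit: no state change

def on_snooped_access(state, other_is_write):
    """Another processor or I/O device touches an address in this line."""
    if state == MODIFIED:
        pass  # must write back (or intervene with) the dirty data before responding
    if other_is_write:
        return INVALID                          # someone else takes ownership
    return SHARED if state in (MODIFIED, EXCLUSIVE) else state

# Example: we hold the line Exclusive; another CPU reads it, then writes it.
state = EXCLUSIVE
state = on_snooped_access(state, other_is_write=False)   # becomes "S"
state = on_snooped_access(state, other_is_write=True)    # becomes "I"
print(state)
```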
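The load/store reordering and address disambiguation just described, along with the store-data forwarding discussed next, can be sketched as a small store queue sitting in front of the data cache. The queue structure and the exact-address match below are illustrative assumptions; real memory units disambiguate at finer granularity and handle partial overlaps.

```python
class MemoryUnit:
    """Toy store queue with address disambiguation and store-to-load forwarding."""

    def __init__(self):
        self.store_queue = []        # oldest store first: [address, data or None]

    def issue_store(self, addr, data=None):
        # data may be None: the store's address is known but its data isn't ready yet
        self.store_queue.append([addr, data])

    def provide_store_data(self, addr, data):
        for entry in self.store_queue:
            if entry[0] == addr and entry[1] is None:
                entry[1] = data
                break

    def issue_load(self, addr, cache_read):
        """Let the load bypass waiting stores unless one overlaps its address."""
        for st_addr, st_data in reversed(self.store_queue):   # youngest matching store wins
            if st_addr == addr:
                if st_data is None:
                    return "stall"       # same address, data not ready: the load must wait
                return st_data           # store-to-load forwarding: no cache access needed
        return cache_read(addr)          # no conflict: the load passes the waiting stores

mu = MemoryUnit()
mu.issue_store(0x100)                                            # store waits for its data
print(mu.issue_load(0x200, cache_read=lambda a: "from cache"))   # bypasses the store
print(mu.issue_load(0x100, cache_read=lambda a: "from cache"))   # 'stall' until data arrives
mu.provide_store_data(0x100, 42)
print(mu.issue_load(0x100, cache_read=lambda a: "from cache"))   # forwarded value 42
```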

As a further optimization, some processors support store-data forwarding, which allows store data that has arrived in the reservation stations to be forwarded directly to a subsequent load, thereby eliminating the load's cache access altogether. This feature is especially important on x86 processors, which, because of the dearth of registers, frequently store data to memory and read it back soon thereafter.

Another important memory-unit function is hit-under-miss processing. Without this feature, a cache miss will unnecessarily block subsequent independent loads, stalling the pipeline. Implementation of this feature requires a nonblocking or lock-up-free cache. Processors that allow multiple outstanding misses can also support miss-under-miss processing.

Although store instructions are less urgent than loads from a data-flow perspective, they can still cause stalls. In fact, stores are a large problem in out-of-order processors, because writes cannot be committed to memory before it is known that all previous instructions will retire without exception. As a result, store misses can constipate the ROB, creating a structural hazard when it fills.

The problem is exacerbated by architectures whose memory consistency model requires sequential consistency (all memory operations must be performed in program order, as observed by another processor). Thus, most processors implement some form of relaxed ordering. The x86 architecture specifies processor consistency, which relaxes load-store order but requires strict store ordering. PowerPC improves store throughput further with its weak consistency model, which also allows stores (to different addresses) to be performed out of order.

Even weak ordering, however, doesn't eliminate the ROB hazard entirely. Thus, processors often add other features to improve store throughput. Write buffers, for example, capture writes that miss the cache, allowing the store to be committed and released from the ROB before the write to memory actually occurs. A write-combining write buffer collects multiple writes to adjacent addresses so they can be transferred to memory as a single transaction, improving write bandwidth. This latter feature is important for byte writes to noncachable memory, like graphics frame buffers.

To DRAM Via the System Bus
Misses in the cache hierarchy are resolved by accessing main memory over the processor's system bus. For filling cache lines, an important bus characteristic is its peak bandwidth—the maximum rate at which data can be transferred across the bus. Today's PC system buses carry 64 bits (8 bytes) of data clocked at 100 MHz, yielding a peak bandwidth of 800 MBytes/s. Transmission-line problems with shared multidrop buses, however, preclude operation much above 133 MHz. Next-generation buses will use point-to-point signaling with source-synchronous clocking, allowing bus clocks to reach 400 MHz (a peak bandwidth of 3.2 GBytes/s on a 64-bit bus) or more.

Because of the long initial access time of DRAMs, the sustained bandwidth of a bus would be much less than its peak bandwidth if not for address pipelining. Address pipelining allows addresses to overlap data transfers so the memory controller can start the next access as soon as possible. This feature is normally implemented on a demultiplexed bus (one with separate address and data paths). All popular PC buses support address pipelining, although some support more outstanding transactions than others.

Some advanced buses also implement split transactions, which avoid locking up the bus and stalling the processor on accesses to long-latency I/O devices. Even more advanced buses implement out-of-order transactions. In these buses, request and reply transactions are distinct; requests are tagged, so out-of-order replies can be matched with the right request. This feature is essential for maintaining high throughput in multiprocessor systems with a variety of memory modules and I/O devices.

The Instruction Set Still Matters
Theoretically, processors based on reduced instruction set computer (RISC) architectures are easier to parallelize and easier to clock at high frequencies than those based on complex instruction set computer (CISC) architectures. Due to the enormous resources behind x86 processors (CISCs) and the astronomically high transistor budgets available to designers, however, many of CISC's disadvantages have been overcome. But not all flaws or omissions in an instruction-set architecture (ISA) are easy to overcome.

One important attribute is the number of registers; more registers can reduce memory traffic and can support more parallelism with less hardware complexity. Related to this is the operand format. Two-operand destructive formats (RD ⇐ RD op RS) are less register efficient than three-operand nondestructive formats (RD ⇐ RS1 op RS2).

Also important is the instruction encoding. Although complex, variable-length instructions are code-space efficient—and thus I-cache and instruction-bandwidth efficient—they are harder to decode than simple, fixed-length instructions. High-frequency x86 processors use enormous decoders and several pipeline stages just for instruction decode. RISC processors don't need these extra stages.

To minimize decode time in the pipeline, some processors employ a predecoded instruction cache. These processors partially decode the instruction stream on the way into the I-cache, storing predecoded information, typically 1 to 3 bits per instruction byte, in the cache along with the associated instructions. This technique is especially effective for locating the beginning of variable-length x86 instructions.

Other examples of ISA features that can affect performance include static branch prediction, nonbinding prefetch instructions, flat-register addressing (as opposed to the stack-based addressing of the x87 floating-point unit), and floating-point multiply-add.


This FMADD feature chains a multiply and an add together with the latency of a multiply, improving the performance of inner products and other important numerical algorithms. Fusing the multiply-add with a single rounding enables even greater speed.

A relatively new feature in ISAs is single-instruction, multiple-data (SIMD) processing. Its popularity is being driven by the increasing demand for digital-signal processing (DSP) and multimedia processing in PC applications. DSP and multimedia algorithms are rich in data-level parallelism. To take advantage of this parallelism, SIMD instructions operate on packed fixed-length vector operands, as Figure 6 shows, rather than on the single-element operands of conventional scalar instructions. Although instruction-level-parallel techniques can also exploit data-level parallelism, SIMD units exploit it to a higher degree and with less complexity.

Figure 6. SIMD instructions perform the same operation on all the elements of short vectors stored in registers.

Most SIMD ISA extensions support special features for DSP algorithms that are not found in traditional general-purpose processor ISAs. Saturation arithmetic, for example, clamps overflowed or underflowed elements at their maximum or minimum values—an important feature for processing packed short integers with limited dynamic range.

Process Technology: the Silver Bullet
Unfortunately, all of the techniques reviewed so far come at a cost. Mostly they increase complexity, which has negative effects on frequency, die size, and power. Complexity increases nonlinearly with pipeline length and with issue width, and it is exacerbated by the cascade of tricks required to minimize stalls. Complexity adds logic gates, slowing cycle time or adding pipeline stages. It adds transistors, lengthening signal paths and increasing die size and cost. It also increases design time, which, since processor performance progresses at about 60% per year, costs 4% in relative performance every month—a nontrivial amount.

IC process is different. Everything improves: performance, frequency, die size, and power. No known microarchitectural technique comes close to the massive improvements made by a single process generation.

One process generation—defined as a 30% linear shrink of feature sizes—halves die size (or doubles the transistor budget for the same die size). A smaller die size dramatically lowers manufacturing cost because it both increases the gross die per wafer and improves yield (see MPR 8/2/93, p. 12). Yield improves exponentially with decreasing die size because statistically fewer die are lost to defects. Defect density is usually somewhat higher at the introduction of a new process, but it improves quickly with experience and volume.

One process generation also improves the intrinsic speed (CV/I) of transistors by 30–50%. As if these gains weren't enough, each generation is typically accompanied by about a 25% reduction in voltage, which, since power is a quadratic function of voltage (P = CV²f), cuts power consumption in half. The only critical parameter that doesn't naturally improve with process shrinks is interconnect delay (RC delay). Manufacturers are combating this recalcitrant term by lowering capacitance (C) with low-dielectric-constant (low-k) insulators, and by lowering resistance (R) with thicker metal layers or by moving from aluminum to copper metallization (see MPR 8/4/97, p. 14).

Currently, most PC microprocessors are built on a 0.22- to 0.25-micron process with ≈0.18-micron gate lengths (Lgate), five layers of aluminum interconnect, and operating voltages from 1.8 to 2.5 V. Logic densities are roughly 60,000 transistors/mm², and SRAM cells are about 10 µm². Next-generation 0.18-micron processes—which will begin volume production during 2H99 at most companies—will have an Lgate of ≈0.14 microns, six layers of aluminum or copper interconnect, operating voltages of 1.5 V or less, logic densities of 120,000 transistors/mm², and SRAM cells smaller than 5 µm² (see MPR 9/14/98, p. 1; MPR 1/25/99, p. 22).

The package is also a factor in processor performance, cost, and size (see MPR 9/13/93, p. 12). Chips are either wire bonded, which is the cheaper method, or flip-chip mounted with solder bumps, which is the electrically superior method, onto a package substrate. Plastic substrates are the least expensive but cannot handle as much power as ceramic substrates. Pin-grid-array (PGA) packages are used where socketability is required, but surface-mount ball-grid arrays (BGAs) are smaller (important for notebooks), cheaper, and electrically superior. For all package types, package costs and test costs are a function of the number of pins (or balls).

Organic BGA packages appear to be the way of the future. They offer a low-k dielectric substrate and a copper lead frame for superior electrical characteristics. They also have low cost and low thermal resistance, since a heat sink can be directly attached to the silicon die.

Disagreement Over Microarchitecture Abounds
Aside from a state-of-the-art semiconductor process, which is the minimum ante to play in the PC processor business, there is considerable disagreement over which is the best collection of microarchitectural features for a PC processor. Some designers, for example, prefer complex wide-issue out-of-order microarchitectures; others believe that simple, fast, in-order pipelines with large caches are better. This and other differences of opinion are evident in current and upcoming PC processors.

In the next installment of this article, we look at how specific PC processors use the techniques reviewed here.
