
The Alpha 21264 owes its high performance to high clock speed, many forms of out-of-order and speculative execution, and a high-bandwidth memory system.

R.E. Kessler
Compaq Computer Corporation

Alpha microprocessors have been performance leaders since their introduction in 1992. The first-generation 21064 and the later 21164 [1,2] raised expectations for the newest generation—performance leadership was again a goal of the 21264 design team. Benchmark scores of 30+ SPECint95 and 58+ SPECfp95 offer convincing evidence thus far that the 21264 achieves this goal and will continue to set a high performance standard.

A unique combination of high clock speeds and advanced microarchitectural techniques, including many forms of out-of-order and speculative execution, provides exceptional core computational performance in the 21264. The processor also features a high-bandwidth memory system that can quickly deliver data values to the execution core, providing robust performance for a wide range of applications, including those without locality. These performance levels are attained while maintaining an installed application base: all Alpha generations are upward-compatible. Database, real-time visual computing, data mining, medical imaging, scientific/technical, and many other applications can utilize the outstanding performance available with the 21264.

Architecture highlights
The 21264 is a superscalar microprocessor that can fetch and execute up to four instructions per cycle. It also features out-of-order execution [3,4]: instructions execute as soon as possible and in parallel with other nondependent work, which results in faster execution because critical-path computations start and complete quickly.

The processor also employs speculative execution to maximize performance. It speculatively fetches and executes instructions even though it may not know immediately whether the instructions will be on the final execution path. This is particularly useful, for instance, when the 21264 predicts branch directions and speculatively executes down the predicted path.

Sophisticated branch prediction, coupled with speculative and dynamic execution, extracts instruction parallelism from applications. With more functional units and these dynamic execution techniques, the processor is 50% to 200% faster than its 21164 predecessor for many applications, even though both generations can fetch at most four instructions per cycle [5].

The 21264's memory system also enables high performance levels. On-chip and off-chip caches provide for very low latency data access. Additionally, the 21264 can service many parallel memory references to all caches in the hierarchy, as well as to the off-chip memory system. This permits very high bandwidth data access [6]. For example, the processor can sustain more than 1.3 Gbytes/sec on the Stream benchmark [7].

The microprocessor runs at 500 to 600 MHz, implemented with 15 million transistors in a 2.2-V, 0.35-micron CMOS process with six metal layers.


The 3.1-cm² processor comes in a 587-pin PGA package. It can execute up to 2.4 billion instructions per second. Figure 1 shows a photo of the 21264, highlighting major sections. Figure 2 is a high-level overview of the 21264 pipeline, which has seven stages, similar to the earlier in-order 21164. One notable addition is the map stage that renames registers to expose instruction parallelism—this addition is fundamental to the 21264's out-of-order techniques.

Figure 1. Alpha 21264 microprocessor die photo. BIU stands for bus interface unit.

Instruction pipeline—Fetch
The instruction pipeline begins with the fetch stage, which delivers four instructions to the out-of-order execution engine each cycle. The processor speculatively fetches through line, branch, or jump predictions. Since the predictions are usually accurate, this instruction fetch implementation typically supplies a continuous stream of good-path instructions to keep the functional units busy with useful work.

Two architectural techniques increase fetch efficiency: line and way prediction, and branch prediction. A 64-Kbyte, two-way set-associative instruction cache offers much-improved level-one hit rates compared to the 8-Kbyte, direct-mapped instruction cache in the Alpha 21164.

Line and way prediction
The processor implements a line and way prediction technique that combines the advantages of set-associative behavior and fetch-bubble elimination with the fast access time of a direct-mapped cache. Figure 3 shows the technique's main features. Each four-instruction fetch block includes a line and way prediction. This prediction indicates where to fetch the next block of four instructions, including which way—that is, which of the two choices allowed by the two-way set-associative cache.

The processor reads out the next instructions using the prediction (via the wraparound path in Figure 3) while, in parallel, it completes the validity check for the previous instructions. Note that the address paths needing extra logic levels—instruction decode, branch prediction, and cache tag comparison—are outside the critical fetch loop. The processor loads the line and way predictors on an instruction cache fill and dynamically retrains them when they are in error.
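To make the mechanism concrete, the following C sketch models a line and way predictor in software. The table size, structure names, and exact training policy are illustrative assumptions rather than the 21264's implementation; the 2-bit hysteresis counter anticipates the overtraining discussion that continues below.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative model: each four-instruction fetch block carries a
     * guess of the next fetch line and the I-cache way to read. */
    #define FETCH_BLOCKS 2048            /* assumed table size */

    struct line_way_pred {
        uint32_t next_line;              /* predicted next fetch-block address */
        uint8_t  next_way;               /* predicted way of the two-way cache */
        uint8_t  hysteresis;             /* 2-bit counter gating retraining */
    };

    static struct line_way_pred pred[FETCH_BLOCKS];

    /* Fetch path: the prediction is used immediately; decode, branch
     * prediction, and tag compare validate it one cycle later. */
    static void predict_next(uint32_t block, uint32_t *line, uint8_t *way)
    {
        *line = pred[block].next_line;
        *way  = pred[block].next_way;
    }

    /* Called after the late validity check.  A single mispredict does
     * not retrain; the entry must be wrong repeatedly. */
    static void train(uint32_t block, bool correct,
                      uint32_t actual_line, uint8_t actual_way)
    {
        struct line_way_pred *p = &pred[block];
        if (correct) {
            if (p->hysteresis > 0)
                p->hysteresis--;
        } else if (++p->hysteresis >= 2) {
            p->next_line  = actual_line;
            p->next_way   = actual_way;
            p->hysteresis = 0;
        }
    }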

Figure 2. Stages of the Alpha 21264 instruction pipeline: fetch (0), slot (1), rename (2), issue (3), register read (4), execute (5), and memory (6).



Most mispredictions cost a single cycle. The line and way predictors are correct 85% to 100% of the time for most applications, so training is infrequent. As an additional precaution, a 2-bit hysteresis counter associated with each fetch block eliminates overtraining—training occurs only when the current prediction has been in error multiple times. Line and way prediction is an important speed enhancement since the mispredict cost is low and line/way mispredictions are rare.

Figure 3. Alpha 21264 instruction fetch. The line and way prediction (wraparound path on the right side) provides a fast instruction fetch path that avoids common fetch stalls when the predictions are correct.

Beyond the speed benefits of direct cache access, line and way prediction has other benefits. For example, frequently encountered predictable branches, such as loop terminators, avoid the mis-fetch penalty often associated with a taken branch. The processor also trains the line predictor with the address of jumps and subroutine calls that use direct register addressing. Code using dynamically linked library routines will thus benefit after the line predictor is trained with the target. This is important since the pipeline delays required to calculate the indirect (subroutine) jump address are eight cycles or more.

An instruction cache miss forces the instruction fetch engine to check the level-two (L2) cache or system memory for the necessary instructions. The fetch engine prefetches up to four 64-byte (or 16-instruction) cache lines to tolerate the additional latency. The result is very high bandwidth instruction fetch, even when the instructions are not found in the instruction cache. For instance, the processor can saturate the available L2 cache bandwidth with instruction prefetches.

Branch prediction
Branch prediction is more important to the 21264's efficiency than to previous microprocessors for several reasons. First, the seven-cycle mispredict cost is slightly higher than in previous generations. Second, the instruction execution engine is faster than in previous generations. Finally, successful branch prediction can utilize the processor's speculative execution capabilities. Good branch prediction avoids the costs of mispredicts and capitalizes on the most opportunities to find parallelism. The 21164 could accept 20 in-flight instructions at most, but the 21264 can accept 80, offering many more parallelism opportunities.

The 21264 implements a sophisticated tournament branch prediction scheme. The scheme dynamically chooses between two types of branch predictors—one using local history, and one using global history—to predict the direction of a given branch [8]. The result is a tournament branch predictor with better prediction accuracy than larger tables of either individual method, with a 90% to 100% success rate on most simulated applications/benchmarks. Together, local and global correlation techniques minimize branch mispredicts; the processor adapts to dynamically choose the best method for each branch.

Figure 4, in detailing the structure of the tournament branch predictor, shows the local-history prediction path—through a two-level structure—on the left. The first level holds 10 bits of branch pattern history for up to 1,024 branches. This 10-bit pattern picks from one of 1,024 prediction counters. The global predictor is a 4,096-entry table of 2-bit saturating counters indexed by the path, or global, history of the last 12 branches. The choice prediction, or chooser, is also a 4,096-entry table of 2-bit prediction counters indexed by the path history. The "Local and global branch predictors" box describes these techniques in more detail.

Figure 4. Block diagram of the 21264 tournament branch predictor. The local-history prediction path (a 1,024-entry table of 10-bit histories feeding a 1,024-entry table of 3-bit counters) is on the left; the global-history prediction path and the chooser (choice prediction), each a 4,096-entry table of 2-bit counters indexed by path history, are on the right.

The processor inserts the true branch direction in the local-history table once branches issue and retire. It also trains the correct predictions by updating the referenced local, global, and choice counters at that time. The processor maintains path history with a silo of 12 branch predictions. This silo is speculatively updated before a branch retires and is backed up on a mispredict.
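The table organization lends itself to a compact software model. This C sketch uses the sizes given above and in Figure 4; the indexing, the counter packing, and updating everything in one place at retire are simplifying assumptions of the sketch, not a statement of the exact hardware.

    #include <stdbool.h>
    #include <stdint.h>

    static uint16_t local_hist[1024];    /* 10-bit pattern per branch */
    static uint8_t  local_ctr[1024];     /* 3-bit saturating counters */
    static uint8_t  global_ctr[4096];    /* 2-bit counters, path-indexed */
    static uint8_t  choice_ctr[4096];    /* 2-bit counters: local vs global */
    static uint16_t path_hist;           /* last 12 branch outcomes */

    static void sat_update(uint8_t *c, bool up, uint8_t max)
    {
        if (up && *c < max) (*c)++;
        if (!up && *c > 0)  (*c)--;
    }

    /* Tournament prediction: the chooser picks which predictor to trust. */
    static bool predict(uint32_t pc)
    {
        uint16_t lh = local_hist[pc % 1024];
        bool local  = local_ctr[lh] >= 4;          /* MSB of 3-bit counter */
        bool global = global_ctr[path_hist] >= 2;  /* MSB of 2-bit counter */
        return (choice_ctr[path_hist] >= 2) ? global : local;
    }

    /* Training with the true direction (the 21264 does this at retire). */
    static void update(uint32_t pc, bool taken)
    {
        uint16_t lh = local_hist[pc % 1024];
        bool local  = local_ctr[lh] >= 4;
        bool global = global_ctr[path_hist] >= 2;

        /* Train the chooser only when the two predictors disagree. */
        if (local != global)
            sat_update(&choice_ctr[path_hist], global == taken, 3);

        sat_update(&local_ctr[lh], taken, 7);
        sat_update(&global_ctr[path_hist], taken, 3);

        local_hist[pc % 1024] = (uint16_t)(((lh << 1) | taken) & 0x3ff);
        path_hist = (uint16_t)(((path_hist << 1) | taken) & 0xfff);
    }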


Local and global branch predictors
The Alpha 21264 branch predictor uses local history and global history to predict future branch directions since branches exhibit both local correlation and global correlation. The property of local correlation implies branch direction prediction on the basis of the branch's past behavior. Local-history predictors are typically tables, indexed by the program counter (branch instruction address), that contain history information about many different branches; different table entries correspond to different branch instructions. Different local-history formats can exploit different forms of local correlation. In some simple (single-level) predictors, the local-history table entries are saturating prediction counters (incremented on taken branches and decremented on not-taken branches), and the prediction is the counter's uppermost bit. In these predictors, a branch will be predicted taken if the branch is often taken.

The 21264's two-level local prediction exploits pattern-based prediction; see Figure 4 in the main text. Each table entry in the first-level local-history table is a 10-bit pattern indicating the direction of the selected branch's last 10 executions. The local-branch prediction is a prediction counter bit from a second table (the local-prediction table) indexed by the local-history pattern. In this more sophisticated predictor, branches will be predicted taken when branches with the same local-history pattern are often taken.

Figure A1 shows two simple example branches that can be predicted with the 21264's local predictor. The second branch on the left (b == 0) is always not taken. It will have a consistent local-history pattern of all zeroes in the local-history table. After several invocations of this branch, the prediction counter at index zero in the local-prediction table will train, and the branch will be predicted correctly.

The first branch on the left (a % 2 == 0) alternates between taken and not taken on every invocation. It could not be predicted correctly with simple single-level local-history predictors. The alternation leads to two local-history patterns in the first-level table: 0101010101 and 1010101010. After several invocations of this branch, the two prediction counters at these indices in the local-prediction table will train—one taken and the other not taken. Any repeating pattern of 10 invocations of the same branch can be predicted this way.

With global correlation, a branch can be predicted based on the past behavior of all previous branches, rather than just the past behavior of the single branch. The 21264 exploits global correlation by tracking the path, or global, history of all branches. The path history is a 12-bit pattern indicating the taken/not-taken direction of the last 12 executed branches (in fetch order). The 21264's global-history predictor is a table of saturating counters indexed by the path history.

Figure A2 shows an example that could use global correlation. If both branches (a == 0) and (b == 0) are taken, we can easily predict that a and b are equal and, therefore, that the (a == b) branch will be taken. This means that if the path history is xxxxxxxxxx11 (ones in the last two bits), the branch should be predicted taken. With enough executions, the 21264's global prediction counters at the xxxxxxxxxx11 indices will train to predict taken, and the branch will be predicted correctly. Though this example requires only a path history depth of 2, more complicated path-history patterns can be found with the path history depth of 12.

Since different branches can be predicted better with either local or global correlation techniques, the 21264 branch predictor implements both local and global predictors. The chooser (choice prediction) selects either the local or the global prediction as the final prediction [1]. The chooser is a table of prediction counters, indexed by path history, that dynamically selects either local or global predictions for each branch invocation. The processor trains choice prediction counters to prefer the correct prediction whenever the local and global predictions differ. The chooser may select differently for each invocation of the same branch instruction.

Figure A2 exemplifies the chooser's usefulness. The chooser may select the global prediction for the (a == b) branch whenever the branches (a == 0) and (b == 0) are taken, and it may select the local prediction for the (a == b) branch in other cases.

Figure A. Branch prediction in the Alpha 21264 occurs by means of local-history prediction (1) and global-history prediction (2). "TNTN" indicates that during the last four executions of the branch, the branch was taken twice and not taken twice. Similarly, NNNN indicates that the branch has not been taken in the last four executions—a pattern of 0000.

Reference
1. S. McFarling, Combining Branch Predictors, Tech. Note TN-36, Compaq Computer Corp. Western Research Laboratory, Palo Alto, Calif., June 1993; http://www.research.digital.com/wrl/techreports/abstracts/TN-36.html.



Out-of-order execution
The 21264 offers out-of-order efficiencies with higher clock speeds than competing designs, yet this speed does not restrict the microprocessor's dynamic execution capabilities. The out-of-order execution logic receives four fetched instructions every cycle, renames/remaps the registers to avoid unnecessary register dependencies, and queues the instructions until operands or functional units become available. It dynamically issues up to six instructions every cycle—four integer instructions and two floating-point instructions. It also provides an in-order execution model to the programmer via in-order instruction retire.

Register renaming
Register renaming exposes application instruction parallelism since it eliminates unnecessary dependencies and allows speculative execution. Register renaming assigns a unique storage location to each write-reference to a register. The 21264 speculatively allocates a register to each instruction with a register result. The register only becomes part of the user-visible (architectural) register state when the instruction retires/commits. This lets the instruction speculatively issue and deposit its result into the register file before the instruction retires. Register renaming also eliminates write-after-write and write-after-read register dependencies, but preserves all the read-after-write register dependencies that are necessary for correct computation.

The left side of Figure 5 depicts the map, or register rename, stage in more detail. The processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any). Thus, register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register. All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers.

Figure 5. Block diagram of the 21264's map (register rename) and queue stages. The map stage renames programmer-visible register numbers to internal register numbers. The queue stage stores instructions until they are ready to issue. These structures are duplicated for integer and floating-point execution.

Beyond the 31 integer and 31 floating-point user-visible (non-speculative) registers, an additional 41 integer and 41 floating-point registers are available to hold speculative results prior to instruction retirement. The register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs.
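The following C sketch models the map stage's bookkeeping. The hardware performs the source lookup as a CAM; the sketch uses the equivalent table form (an architectural-to-internal map plus a free list), and the names, the init routine, and the stall-by-assert behavior are assumptions for illustration.

    #include <assert.h>
    #include <stdint.h>

    #define ARCH_REGS 32                 /* Alpha integer registers R0-R31 */
    #define PHYS_REGS 80                 /* 80-entry internal register file */

    static uint8_t map[ARCH_REGS];       /* architectural -> internal */
    static uint8_t free_list[PHYS_REGS];
    static int     free_count;

    struct renamed {
        uint8_t src1, src2;              /* internal registers of the sources */
        uint8_t dst;                     /* freshly allocated internal register */
        uint8_t stale;                   /* prior mapping, freed at retire */
    };

    static void rename_init(void)
    {
        for (int i = 0; i < ARCH_REGS; i++)
            map[i] = (uint8_t)i;         /* arch register i starts in internal i */
        for (int i = ARCH_REGS; i < PHYS_REGS; i++)
            free_list[free_count++] = (uint8_t)i;
    }

    /* Rename one instruction: read the current map for the sources,
     * then point the destination at a new internal register. */
    static struct renamed rename(uint8_t s1, uint8_t s2, uint8_t d)
    {
        struct renamed r;
        r.src1 = map[s1];
        r.src2 = map[s2];
        assert(free_count > 0);          /* real hardware stalls instead */
        r.dst   = free_list[--free_count];
        r.stale = map[d];                /* remembered by the retire logic */
        map[d]  = r.dst;
        return r;
    }

    /* At retire the stale register can no longer be referenced, so it
     * returns to the free pool (a squash would free r.dst instead). */
    static void retire_rename(const struct renamed *r)
    {
        free_list[free_count++] = r->stale;
    }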


The Alpha conditional-move instructions must be handled specially by the map stage. These operations conditionally move one of two source registers into a destination register. This makes conditional move the only instruction in the Alpha architecture that requires three register sources—the two sources plus the old value of the destination register (in case the move is not performed). The 21264 splits each conditional move instruction into two and maps them separately. These two new instructions only have two register sources. The first instruction places the move value into an internal register together with a 65th bit indicating the move's ultimate success or failure. The second instruction reads the first result (including the 65th bit) together with the old destination register value and produces the final destination register result.

Out-of-order issue queues
The issue queue logic maintains two lists of pending instructions in separate integer and floating-point queues. Each cycle, the queue logic selects from these instructions, as their operands become available, using register scoreboards based on the internal register numbers. These scoreboards maintain the status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions. When functional-unit or load-data results become available, the scoreboard unit notifies all instructions in the queue that require the register value. These dependent instructions can issue as soon as the bypassed result becomes available from the functional unit or load. The 20-entry integer queue can issue four instructions, and the 15-entry floating-point queue can issue two instructions per cycle. The issue queue is depicted on the right in Figure 5.

The integer queue statically assigns, or slots, each instruction to one of two arbiters before the instructions enter the queue. Each arbiter handles two of the four integer pipes and dynamically issues the oldest two queued instructions each cycle. The integer queue slots instructions based on instruction fetch position to equalize the utilization of the integer execution engine's two halves. An instruction cannot switch arbiters after the static assignment upon entry into the queue. The interactions of these arbitration decisions with the structure of the execution pipes are described later.

The integer and floating-point queues issue instructions speculatively. Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle. Since older instructions receive priority over newer instructions, speculative issues do not slow down older, less speculative issues. The queues are collapsable—an entry becomes immediately available once the instruction issues or is squashed due to misspeculation. New instructions can enter the integer queue when there are four or more available queue slots, and they can enter the floating-point queue when there are enough available slots [9].
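A software sketch of the oldest-first, scoreboarded selection just described follows. The linear search, the explicit age field, and the structure names are clarity-driven assumptions (real hardware arbitrates in parallel), and the upper/lower pipe slotting is omitted.

    #include <stdbool.h>
    #include <stdint.h>

    #define QUEUE_SLOTS 20               /* integer queue size */
    #define PHYS_REGS   80

    static bool reg_ready[PHYS_REGS];    /* scoreboard: value available? */

    struct qentry {
        bool     valid;
        uint8_t  src1, src2;             /* internal register sources */
        uint32_t age;                    /* fetch order: smaller is older */
    };

    static struct qentry queue[QUEUE_SLOTS];

    /* Issue up to `width` of the oldest operand-ready instructions this
     * cycle.  Issued entries collapse out of the queue immediately. */
    static int issue_cycle(int width, struct qentry *issued)
    {
        int n = 0;
        while (n < width) {
            int pick = -1;
            for (int i = 0; i < QUEUE_SLOTS; i++) {
                struct qentry *e = &queue[i];
                if (e->valid && reg_ready[e->src1] && reg_ready[e->src2] &&
                    (pick < 0 || e->age < queue[pick].age))
                    pick = i;            /* oldest ready instruction wins */
            }
            if (pick < 0)
                break;                   /* nothing ready: stop selecting */
            issued[n++] = queue[pick];
            queue[pick].valid = false;   /* slot is free for new entries */
        }
        return n;
    }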



Instruction retire and exception handling
Although instructions issue out of order, instructions are fetched and retired in order. The in-order retire mechanism maintains the illusion of in-order execution to the programmer even though the instructions actually execute out of order. The retire mechanism assigns each mapped instruction a slot in a circular in-flight window (in fetch order). After an instruction starts executing, it can retire whenever all previous instructions have retired and it is guaranteed to generate no exceptions. Retiring an instruction makes it nonspeculative—guaranteeing that the instruction's effects will be visible to the programmer. The 21264 implements a precise exception model using in-order retiring: the programmer does not see the effects of a younger instruction if an older instruction has an exception.

The retire mechanism also tracks the internal register usage for all in-flight instructions. Each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction. This (stale) register can be freed for other use after the instruction retires. After retiring, the old destination register value cannot possibly be needed—all older instructions must have issued and read their source registers, and no newer instruction can use the old destination register value.

An exception causes all younger instructions in the in-flight window to be squashed. These instructions are removed from all queues in the system. The register map is backed up to the state before the last squashed instruction using the saved map state. The map state for each in-flight instruction is maintained, so it is easily restored. The registers allocated by the squashed instructions become immediately available.

The retire mechanism has a large, 80-instruction in-flight window. This means that up to 80 instructions can be in partial states of completion at any time, allowing for significant execution concurrency and latency hiding. (This is particularly true since the memory system can track an additional 32 in-flight loads and 32 in-flight stores.) Table 1 shows the minimum latency, in number of cycles, from issue until retire eligibility for different instruction classes. The retire mechanism can retire at most 11 instructions in a single cycle, and it can sustain a rate of 8 per cycle over short periods.

Table 1. Sample 21264 retire latencies.

    Instruction class             Retire latency (cycles)
    Integer                       4
    Memory                        7
    Floating-point                8
    Branch/jump to subroutine     7

Execution engine
Figure 6 depicts the six execution pipelines. Each pipeline is physically placed above or below its corresponding register file. The 21264 splits the integer register file into two clusters that contain duplicates of the 80-entry register file. Two pipes access a single register file to form a cluster, and the two clusters combine to support four-way integer instruction execution. This clustering makes the design simpler and faster, although it costs an extra cycle of latency to broadcast results from an integer cluster to the other cluster. The upper pipelines from the two integer clusters in Figure 6 are managed by the same issue queue arbiter, as are the two lower pipelines. The integer queue statically slots instructions to either the upper or lower pipeline arbiters. It then dynamically selects which cluster, left or right, executes each instruction.

The performance costs of the register clustering and issue queue arbitration simplifications are small—a few percent or less compared to an idealized unclustered implementation in most applications. There are multiple reasons for the minimal performance effect. First, for many operations (such as loads and stores) the static issue-queue assignment is not a restriction since they can only execute in either the upper or lower pipelines. Second, critical-path computations tend to execute on the same cluster. The issue queue prefers older instructions, so more-critical instructions incur fewer cross-cluster delays—an instruction can usually issue first on the same cluster that produces the result. As a result, this integer pipeline architecture provides much of the implementation simplicity, lower risk, and higher speed of a two-issue machine with the performance benefits of four-way integer issue.

Figure 6 also shows the floating-point execution pipes' configuration. A single cluster has the two floating-point execution pipes, with a single 72-entry register file.

Figure 6. The four integer execution pipes (upper and lower for each of a left and right cluster) and the two floating-point pipes in the 21264, together with the functional units in each. (MVI: motion video instructions; PLZ: integer population count and leading/trailing zero count unit; SQRT: square-root functional unit.)

The 21264 includes new functional units not present in prior Alpha microprocessors: the Alpha motion-video instructions (MVI, used to speed many forms of image processing), a fully pipelined integer multiply unit, an integer population count and leading/trailing zero count unit (PLZ), a floating-point square-root functional unit, and instructions to move register values directly between floating-point and integer registers. The processor also provides more complete hardware support for the IEEE floating-point standard, including precise exceptions, NaN and infinity processing, and support for flushing denormal results to zero. Table 2 shows sample instruction latencies (issue of producer to issue of consumer). These latencies are achieved through result bypassing.
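Returning to the clustering tradeoff described above, a toy C model captures why preferring the producer's cluster hides most of the broadcast delay: a consumer sees its producer's result one cycle later on the opposite cluster. The structure and numbers here are purely illustrative and are not the 21264's actual arbitration logic.

    /* Toy model of the cross-cluster broadcast delay. */
    struct result {
        int ready_cycle;                 /* cycle the value is bypassable */
        int cluster;                     /* 0 or 1: where it was produced */
    };

    /* Earliest cycle a consumer can issue, given its producer. */
    static int earliest_issue(const struct result *src, int consumer_cluster)
    {
        int when = src->ready_cycle;
        if (src->cluster != consumer_cluster)
            when += 1;                   /* one extra cycle between clusters */
        return when;
    }

    /* Greedy choice: issue on the producer's cluster when a pipe is
     * free there, mirroring "issue where the result is produced". */
    static int choose_cluster(const struct result *src, int free0, int free1)
    {
        if (src->cluster == 0 && free0) return 0;
        if (src->cluster == 1 && free1) return 1;
        return free0 ? 0 : 1;
    }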


Table 2. Sample 21264 instruction latencies (s-p means single-precision; d-p means double-precision).

    Instruction class                                  Latency (cycles)
    Simple integer operations                          1
    Motion-video instructions/integer population
      count and leading/trailing zero count (MVI/PLZ)  3
    Integer multiply                                   7
    Integer load                                       3
    Floating-point load                                4
    Floating-point add                                 4
    Floating-point multiply                            4
    Floating-point divide                              12 s-p, 15 d-p
    Floating-point square root                         15 s-p, 30 d-p

Internal memory system
The internal memory system supports many in-flight memory references and out-of-order operations. It can service up to two memory references from the integer execution pipes every cycle; these two memory references are out-of-order issues. The memory system simultaneously tracks up to 32 in-flight loads, 32 in-flight stores, and 8 in-flight (instruction or data) cache misses. It also has a 64-Kbyte, two-way set-associative data cache. This cache has much lower miss rates than the 8-Kbyte, direct-mapped cache in the earlier 21164. The end result is a high-bandwidth, low-latency memory system.

Data path
The 21264 supports any combination of two loads or stores per cycle without conflict. The data cache is double-pumped to implement the necessary two ports: the data cache is referenced twice each cycle, once in each of the two clock phases. In effect, the data cache operates at twice the frequency of the processor clock—an important feature of the 21264's memory system.

Figure 7 depicts the memory system's internal data paths. The two 64-bit data buses are the heart of the internal memory system. Each load receives data via these buses from the data cache, the speculative store data buffer, or an external (system or L2) fill. Stores first transfer their data across the data buses into the speculative store buffer. Store data remains in the speculative store buffer until the stores retire. Once they retire, the data is written (dumped) into the data cache on idle cache cycles. Each dump can write 128 bits into the cache since two stores can merge into one dump. Dumps use the double-pumped data cache to implement a read-modify-write sequence. Read-modify-write is required on stores to update the stored SECDED ECC that allows correction of single-bit errors.

Figure 7. The 21264's internal memory system data paths.

Stores can forward their data to subsequent loads while they reside in the speculative store data buffer. Load instructions compare their age and address against these pending stores. On a match, the appropriate store data is put on the data bus rather than the data from the data cache. In effect, the speculative store data buffer performs a memory-renaming function. From the perspective of younger loads, it appears the stores write into the data cache immediately. However, squashed stores are removed from the speculative store data buffer before they affect the final cache state.

Figure 7 also shows how data is brought into and out of the internal memory system. Fill data arrives on the data buses. Pending loads sample the data to write into the register file while, in parallel, the caches (instruction or data) also fill using the same bus data. The data cache is write-back, so fills also use its double-pumped capability: the previous cache contents are read out in the same cycle that fill data is written in. The bus interface unit captures this victim data and later writes it back.
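In software terms, the forwarding search above looks like the following C sketch: a load picks the youngest store that is older than itself to the same address. The entry format, size, and names are assumptions, and the hardware performs this comparison in parallel with CAMs rather than a loop.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define STQ_ENTRIES 32

    struct store {
        bool     valid;
        uint64_t addr;                   /* aligned quadword address */
        uint64_t data;
        uint32_t age;                    /* fetch order: smaller is older */
    };

    static struct store stq[STQ_ENTRIES];

    /* Returns true and the bypassed data if an older store to the same
     * address is pending; otherwise the load reads the data cache. */
    static bool forward(uint64_t addr, uint32_t load_age, uint64_t *data)
    {
        const struct store *best = NULL;
        for (int i = 0; i < STQ_ENTRIES; i++) {
            const struct store *s = &stq[i];
            if (s->valid && s->addr == addr && s->age < load_age &&
                (best == NULL || s->age > best->age))
                best = s;                /* youngest qualifying store wins */
        }
        if (best == NULL)
            return false;
        *data = best->data;
        return true;
    }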



Address and control structure
The internal memory system maintains a 32-entry load queue (LDQ) and a 32-entry store queue (STQ) that manage the references while they are in flight. The LDQ (STQ) positions loads (stores) in the queue in fetch order, although they enter the queue when they issue, out of order. Loads exit the LDQ in fetch order after the loads retire and the load data has been returned. Stores exit the STQ in fetch order after they retire and dump into the data cache.

New issues check their address and age against older references. Dual-ported address CAMs resolve the read-after-read, read-after-write, write-after-read, and write-after-write hazards inherent in a fully out-of-order memory system. For instance, to detect memory read-after-write hazards, the LDQ must compare addresses when a store issues: whenever an older store issues after a younger load to the same memory address, the LDQ must squash the load—the load got the wrong data. The 21264's response to this hazard is described in more detail in a later section. The STQ CAM logic controls the speculative store data buffer; it enables the bypass of speculative store data to loads when a younger load issues after an older store.

Figure 8 shows an example scheduling of the memory system, including both data buses, the data cache tags, and both ports to the data cache array itself. It includes nine load issues (L1–L9), pieces of nine stores (S1–S9), and pieces of three fills (F1–F3), each fill requiring four data bus cycles to transfer the required 64 bytes. Loads reference both the data cache tags and array in their pipeline stage 6, and then the load result uses the data bus one cycle later in their stage 7. For example, loads L6 and L7 are in their stage 6 (from Figure 2) in cycle 13, and in their stage 7 in cycle 14.

Figure 8. A 21264 memory system pipeline diagram showing major data path elements—data cache tags, data cache array, and data buses—for an example usage containing loads, stores, and fills from the L2 cache. (FX: 64-byte fill X; LX: load X; SX: store X.)

Stores are similar to loads except that they do not use the cache array until they retire. For example, S5 uses the tags in cycle 1 but does not write into the data cache until cycle 16. The fill patterns in the figure correspond to fills from the fastest possible L2 cache. The data that crosses the data buses in cycle 0 gets skewed to cycle 2 and written into the cache. This skew is for performance reasons only—the fill data could have been written in cycle 1 instead, but pipeline conflicts would allow fewer load and store issues in that case. Fills only need to update the cache tags once per block. This leaves the data cache tags more idle than the array or the data bus.

Cycles 7, 10, and 16 in the example each show two stores that merge into a single cache dump. Note that the dump in cycle 7 (S1 and S2 writing into the data cache array) could not have occurred in cycles 1–6 because the cache array is busy with fills or load issues in each of those cycles. Cache dumps can happen when only stores issue, such as in cycle 10. Loads L4 and L5 at cycle 5 are a special case: these two loads only use the cache tags. They decouple the cache tag lookup from the data array and data bus use, taking advantage of the idle data cache tags during fills. This decoupling is particularly useful for loads that miss in the data cache, since in that case the cache array lookup and data bus transfer of the load issue slot are avoided. Decoupled loads maximize the available bandwidth for cache misses since they eliminate unnecessary cache accesses.

The internal memory system also contains an eight-entry miss address file (MAF). Each entry tracks an outstanding miss (fill) to a 64-byte cache block. Multiple load and store misses to the same cache block can merge and be satisfied by a single MAF entry. Each MAF entry can be either an L2 cache or system fill. Both loads and stores can create a MAF entry immediately after they check the data cache tags (on initial issue). This MAF entry is then forwarded for further processing by the bus interface unit—before the load or store is retired.
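The MAF's merging behavior is easy to sketch. In the C fragment below, a miss either joins an existing entry for the same 64-byte block or allocates a new one; the entry format is an assumption, and the bookkeeping that records which loads and stores each fill must satisfy is omitted.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAF_ENTRIES 8
    #define BLOCK_SHIFT 6                /* 64-byte cache blocks */

    struct maf_entry {
        bool     valid;
        uint64_t block;                  /* address >> BLOCK_SHIFT */
    };

    static struct maf_entry maf[MAF_ENTRIES];

    /* On a tag-check miss: merge with an outstanding fill to the same
     * block, else allocate; returns false (retry later) when full. */
    static bool maf_request(uint64_t addr)
    {
        uint64_t block = addr >> BLOCK_SHIFT;
        int free_slot = -1;
        for (int i = 0; i < MAF_ENTRIES; i++) {
            if (maf[i].valid && maf[i].block == block)
                return true;             /* merged: one fill serves both */
            if (!maf[i].valid)
                free_slot = i;
        }
        if (free_slot < 0)
            return false;                /* all eight fills outstanding */
        maf[free_slot].valid = true;
        maf[free_slot].block = block;
        return true;
    }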


Table 3. The 21264 cache prefetch and management instructions.

    Instruction                  Description
    Normal prefetch              The 21264 fetches the 64-byte block into the L1 data and L2 caches.
    Prefetch with modify intent  The same as the normal prefetch, except that the block is loaded
                                 into the cache in a writeable state.
    Prefetch and evict next      The same as the normal prefetch, except that the block will be
                                 evicted from the L1 data cache on the next access to the same data
                                 cache set.
    Write-hint 64                The 21264 obtains write access to the 64-byte block without reading
                                 the old contents of the block.
    Evict                        The cache block is evicted from the caches.
System pin bus L2 cache port L2 cache Address out Tag tag RAMs Bus interface unit System The 21264 bus interface unit (BIU) inter- chipset Address in Alpha Address faces the internal memory system and the exter- (DRAM 21264 nal (off-chip) L2 cache and the rest of the and I/O) System data Data L2 cache 64 bits 128 bits system. It receives MAF references from the data RAMs internal memory system and responds with fill data from either the L2 cache on a hit, or the Figure 9. The 21264 external interface. system on a miss. It forwards victim data from



Bus interface unit
The 21264 bus interface unit (BIU) connects the internal memory system with the external (off-chip) L2 cache and the rest of the system. It receives MAF references from the internal memory system and responds with fill data from either the L2 cache on a hit, or the system on a miss. It forwards victim data from the data cache to the L2 and from the L2 to the system using an eight-entry victim file. It manages the data cache contents. Finally, it receives cache probes from the system, performs the necessary cache coherence actions, and responds to the system. The 21264 implements a write-invalidate cache coherence protocol to support shared-memory multiprocessing.

Figure 9 shows the 21264's external interface. The interface to the L2 (on the right) is separate from the interface to the system (on the left). All interconnects on and off the 21264 are high-speed point-to-point channels. They use clock-forwarding technology to maximize the available bandwidth and minimize pin counts.

Figure 9. The 21264 external interface.

The L2 cache provides a fast backup store for the primary caches. This cache is direct-mapped, shared by both instructions and data, and can range from 1 to 16 Mbytes. The BIU can support a wide range of SRAM part variants for different L2 sizes, speeds, and latencies, including late-write synchronous, PC-style, and dual-data parts for very high speed operation. The peak L2 transfer rate is 16 bytes every 1.5 CPU cycles—a bandwidth of 6.4 Gbytes/sec at a 400-MHz transfer rate. The minimum L2 cache latency is 12 cycles using an SRAM part with a latency of six cycles—nine more than the data cache latency of three.

Figure 9 also shows that the 21264 has split address-out and address-in buses in the system pin bus. This provides bandwidth for new address requests (out from the processor) and system probes (into the processor), and allows for simple, small-scale multiprocessor system designs. The 21264 system interface's low pin counts and high bandwidth let a high-performance system (of four or more processors) broadcast probes without using a large number of pins. The BIU stores pending system probes in an eight-entry probe queue before responding to the probes, in order. It responds to probes very quickly to support a system with minimum latency, and it minimizes the address bus bandwidth required in common probe response cases.

The 21264 provides a rich set of possible coherence actions; it can scale to larger system implementations, including directory-based systems [4]. It supports all five of the standard MOESI (modified-owned-exclusive-shared-invalid) cache states.

The BIU supports a wide range of system data bus speeds. The peak bandwidth of the system data interface is 8 bytes of data per 1.5 CPU cycles—or 3.2 Gbytes/sec at a 400-MHz transfer rate. The load latency (issue of load to issue of consumer) can be as low as 160 ns with a 60-ns DRAM access time. The total of eight in-flight MAFs and eight in-flight victims provides many parallel memory operations to schedule for high SRAM and DRAM efficiency. This translates into high memory system performance, even with cache misses. For example, the 21264 has sustained in excess of 1.3 Gbytes/sec of (user-visible) memory bandwidth on the Stream benchmark [7].

Dynamic execution examples
The 21264 architecture is very dynamic. In this article I have discussed a number of its dynamic techniques, including the line predictor, branch predictor, and issue queue scheduling. Two more examples in this section further illustrate the 21264's dynamic adaptability.

Store/load memory ordering
The 21264 memory system supports the full capabilities of the out-of-order execution core, yet maintains an in-order architectural memory model. This is a challenge when multiple loads and stores reference the same address. The register rename logic cannot automatically handle these read-after-write memory dependencies as it does register dependencies because it does not have the memory address until the instruction issues. Instead, the memory system dynamically detects the problem case after the instructions issue (and the addresses are available). The 21264 then adapts to avoid the costs of load misspeculation: it remembers the first misspeculation and avoids the problem in subsequent executions by delaying the load.

Figure 10 shows how the 21264 resolves a memory read-after-write hazard. The source instructions are on the far left—a store followed by a load to the same address. On the first execution of these instructions, the 21264 attempts to issue the load as early as possible—before the older store—to minimize load latency. The load receives the wrong data since it issues before the store in this case, so the 21264 hazard detection logic squashes the load (and all subsequent instructions). After this type of load misspeculation, the 21264 trains to avoid it on subsequent executions by setting a bit in a load wait table.

Figure 10. An example of the 21264 memory load-after-store hazard adaptation. Assuming R10 = R11, the sequence STQ R0, 0(R10) followed by LDQ R1, 0(R11) gets the wrong data on the first execution, when the load issues early; on subsequent executions the marked (delayed) load gets the store data.

Figure 10 also shows what happens on subsequent executions of the same code. At fetch time the wait table bit corresponding to the load is set. The issue queue then forces the issue point of the marked load to be delayed until all prior stores have issued, thereby avoiding this store/load order violation and also allowing the speculative store buffer to bypass the correct data to the load. The wait table is periodically cleared to avoid unnecessary waits.

This example store/load order case shows how the memory system produces a result that is the same as an in-order memory system while capturing the performance advantages of out-of-order execution. Unmarked loads issue as early as possible, and before as many stores as possible, while only the necessary marked loads are delayed.
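The wait-table mechanism is small enough to sketch directly in C. Only the train/delay/periodic-clear behavior below follows the text; the table size, the program-counter hashing, and the clearing interval are assumptions.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define WAIT_ENTRIES 1024            /* assumed table size */

    static bool wait_bit[WAIT_ENTRIES];

    /* Set when a load issued before an older store to the same address
     * and was squashed for getting the wrong data. */
    static void train_on_violation(uint32_t load_pc)
    {
        wait_bit[load_pc % WAIT_ENTRIES] = true;
    }

    /* Consulted at issue time: a marked load waits until every older
     * store has issued, so the store buffer can bypass correct data. */
    static bool may_issue(uint32_t load_pc, bool all_older_stores_issued)
    {
        if (wait_bit[load_pc % WAIT_ENTRIES])
            return all_older_stores_issued;
        return true;                     /* unmarked loads issue early */
    }

    /* Periodic clearing keeps stale bits from delaying loads forever. */
    static void periodic_clear(void)
    {
        memset(wait_bit, 0, sizeof wait_bit);
    }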


Load hit/miss prediction
There are minispeculations within the 21264's speculative execution engine. To achieve the minimum three-cycle integer load hit latency, the processor must speculatively issue the consumers of the integer load data before knowing whether the load hit or missed in the on-chip data cache. This early issue allows the consumers to receive bypassed data from a load at the earliest possible time. Note in Figure 2 that the data cache stage is three cycles after the queue, or issue, stage, so the load's cache lookup must happen in parallel with the consumers' issue. Furthermore, it takes another cycle after the cache lookup to get the hit/miss indication to the issue queue. This means that consumers of the results produced by the consumers of the load data (the beneficiaries) can also speculatively issue—even though the load may have actually missed.

The processor could rely on the general mechanisms available in the speculative execution engine to abort the integer load data's speculatively executed consumers; however, that requires restarting the entire instruction pipeline. Given that load misses can be frequent in some applications, this technique would be too expensive. Instead, the processor handles this with a minirestart. When consumers speculatively issue three cycles after a load that misses, two integer issue cycles (on all four integer pipes) are squashed. All integer instructions that issued during those two cycles are pulled back into the issue queue to be reissued later. This forces the processor to reissue both the consumers and the beneficiaries. If the load hits, the instruction schedule shown on the top of Figure 11 will be executed. If the load misses, however, the original issues of the unrelated instructions L3–L4 and U4–U6 must be reexecuted in cycles 5 and 6. The schedule thus is delayed two cycles from that depicted.

Figure 11. Integer load hit/miss prediction example. This figure depicts the execution of a workload when the selected load (P) is predicted to hit (a) and predicted to miss (b) on the four integer pipes. The cross-hatched and screened sections show the instructions that are either squashed and reexecuted from the issue queue, or delayed due to operand availability or the reexecution of other instructions. (P: producing load; C: consumer; BX: beneficiary of load; LX: unrelated instruction, lower pipes; UX: unrelated instruction, upper pipes.)

While this two-cycle window is less costly than fully restarting the processor pipeline, it still can be expensive for applications with many integer load misses. Consequently, the 21264 predicts when loads will miss and does not speculatively issue the consumers of the load data in that case. The bottom half of Figure 11 shows the example instruction schedule for this prediction. The effective load latency is five cycles rather than the minimum three for an integer load hit that is (incorrectly) predicted to miss. But more unrelated instructions are allowed to issue in the slots not taken by the consumer and the beneficiaries.

The load hit/miss predictor is the most-significant bit of a 4-bit counter that tracks the hit/miss behavior of recent loads. The saturating counter decrements by two on cycles when there is a load miss; otherwise, it increments by one when there is a hit. This hit/miss predictor minimizes latencies in applications that often hit, and it avoids the costs of overspeculation for applications that often miss.

The 21264 treats floating-point loads differently than integer loads for load hit/miss prediction. The floating-point load latency is four cycles, with no single-cycle operations, so there is enough time to resolve the exact instruction that used the load result.
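The integer hit/miss counter reduces to a few lines of C. The behavior (increment by one on a hit, decrement by two on a miss, saturate at both ends, predict from the most-significant bit) is as described above; the function packaging and the optimistic initial value are illustrative choices.

    #include <stdbool.h>

    static int hitmiss = 15;             /* 4-bit counter, start predicting hit */

    /* Speculatively issue consumers only when a hit is predicted. */
    static bool predict_hit(void)
    {
        return hitmiss >= 8;             /* most-significant bit of 4 bits */
    }

    /* Update with the actual outcome: hits creep up, misses drop fast. */
    static void observe(bool hit)
    {
        if (hit) {
            if (hitmiss < 15)
                hitmiss++;
        } else {
            hitmiss -= 2;
            if (hitmiss < 0)
                hitmiss = 0;
        }
    }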



Compaq has been shipping the 21264 to customers since the last quarter of 1998. Future versions of the 21264, taking advantage of technology advances for lower cost and higher speed, will extend the Alpha's performance leadership well into the new millennium. The next-generation 21364 and 21464 Alphas are currently being designed. They will carry the Alpha line even further into the future.

Acknowledgments
The 21264 is the fruition of many individuals, including M. Albers, R. Allmon, M. Arneborn, D. Asher, R. Badeau, D. Bailey, S. Bakke, A. Barber, S. Bell, B. Benschneider, M. Bhaiwala, D. Bhavsar, L. Biro, S. Britton, D. Brown, M. Callander, C. Chang, J. Clouser, R. Davies, D. Dever, N. Dohm, R. Dupcak, J. Emer, N. Fairbanks, B. Fields, M. Gowan, R. Gries, J. Hagan, C. Hanks, R. Hokinson, C. Houghton, J. Huggins, D. Jackson, D. Katz, J. Kowaleski, J. Krause, J. Kumpf, G. Lowney, M. Matson, P. McKernan, S. Meier, J. Mylius, K. Menzel, D. Morgan, T. Morse, L. Noack, N. O'Neill, S. Park, P. Patsis, M. Petronino, J. Pickholtz, M. Quinn, C. Ramey, D. Ramey, E. Rasmussen, N. Raughley, M. Reilly, S. Root, E. Samberg, S. Samudrala, D. Sarrazin, S. Sayadi, D. Siegrist, Y. Seok, T. Sperber, R. Stamm, J. St Laurent, J. Sun, R. Tan, S. Taylor, S. Thierauf, G. Vernes, V. von Kaenel, D. Webb, J. Wiedemeier, K. Wilcox, and T. Zou.

References
1. D. Dobberpuhl et al., "A 200-MHz 64-Bit Dual-Issue CMOS Microprocessor," IEEE J. Solid-State Circuits, Vol. 27, No. 11, Nov. 1992, pp. 1555–1567.
2. J. Edmondson et al., "Superscalar Instruction Execution in the 21164 Alpha Microprocessor," IEEE Micro, Vol. 15, No. 2, Apr. 1995, pp. 33–43.
3. B. Gieseke et al., "A 600-MHz Superscalar RISC Microprocessor with Out-of-Order Execution," IEEE Int'l Solid-State Circuits Conf. Dig. Tech. Papers, IEEE Press, Piscataway, N.J., Feb. 1997, pp. 176–177.
4. D. Leibholz and R. Razdan, "The Alpha 21264: A 500-MHz Out-of-Order Execution Microprocessor," Proc. IEEE Compcon 97, IEEE Computer Soc. Press, Los Alamitos, Calif., 1997, pp. 28–36.
5. R.E. Kessler, E.J. McLellan, and D.A. Webb, "The Alpha 21264 Microprocessor Architecture," Proc. 1998 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, IEEE Computer Soc. Press, Oct. 1998, pp. 90–95.
6. M. Matson et al., "Circuit Implementation of a 600-MHz Superscalar RISC Microprocessor," Proc. 1998 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, IEEE Computer Soc. Press, Oct. 1998, pp. 104–110.
7. J.D. McCalpin, "STREAM: Sustainable Memory Bandwidth in High-Performance Computers," Univ. of Virginia, Dept. of Computer Science, Charlottesville, Va.; http://www.cs.virginia.edu/stream/.
8. S. McFarling, Combining Branch Predictors, Tech. Note TN-36, Compaq Computer Corp. Western Research Laboratory, Palo Alto, Calif., June 1993; http://www.research.digital.com/wrl/techreports/abstracts/TN-36.html.
9. T. Fischer and D. Leibholz, "Design Tradeoffs in Stall-Control Circuits for 600-MHz Instruction Queues," Proc. IEEE Int'l Solid-State Circuits Conf. Dig. Tech. Papers, IEEE Press, Feb. 1998, pp. 398–399.

Richard E. Kessler is a consulting engineer in the Alpha Development Group of Compaq Computer Corp. in Shrewsbury, Massachusetts. He is an architect of the Alpha 21264 and 21364 microprocessors. His interests include microprocessor and computer system architecture. He has an MS and a PhD in computer sciences from the University of Wisconsin, Madison, and a BS in electrical and computer engineering from the University of Iowa. He is a member of the ACM and the IEEE.

Contact Kessler about this article at Compaq Computer Corp., 334 South St., Shrewsbury, MA 01545; [email protected].
