
The Itanium processor employs the EPIC design style to exploit instruction-level parallelism. Its hardware and software work in concert to deliver higher performance through a simpler, more efficient design.

Harsh Sharangpani
Ken Arora

The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms.

The processor employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information and efficiently deliver this information to the hardware. It addresses several fundamental performance bottlenecks in modern processors, such as memory latency, memory address disambiguation, and control flow dependencies.

EPIC constructs provide powerful architectural semantics and enable the software to make global optimizations across a large scheduling scope, thereby exposing available instruction-level parallelism (ILP) to the hardware. The hardware takes advantage of this enhanced ILP, providing abundant execution resources. Additionally, it focuses on dynamic runtime optimizations to enable the compiled code schedule to flow through at high throughput. This strategy increases the synergy between hardware and software and leads to higher overall performance.

The processor provides a six-wide and 10-stage deep pipeline, running at 800 MHz on a 0.18-micron process. This combines both abundant resources to exploit ILP and high frequency for minimizing the latency of each instruction. The resources consist of four integer units, four multimedia units, two load/store units, three branch units, two extended-precision floating-point units, and two additional single-precision floating-point units (FPUs). The hardware employs dynamic prefetch, branch prediction, nonblocking caches, and a register scoreboard to optimize for compilation time nondeterminism. Three levels of on-package cache minimize overall memory latency. This includes a 4-Mbyte level-3 (L3) cache, accessed at core speed, providing over 12 Gbytes/s of data bandwidth.

The system bus provides glueless multiprocessor support for up to four-processor systems and can be used as an effective building block for very large systems. The advanced FPU delivers over 3 Gflops of numeric capability (6 Gflops for single precision). The balanced core and memory subsystems provide high performance for a wide range of applications, ranging from commercial workloads to high-performance technical computing.

[Figure 1. Conceptual view of EPIC hardware. GR: general register file; FR: floating-point register file. Compiler-programmed features (explicit parallelism and instruction templates; register stack and rotation; predication; data and control speculation; branch and memory hints) map onto hardware features: fetch (prefetch engine, hierarchy of branch predictors, decoupling buffer), issue (fast, simple 6-issue), register handling (128 GR, 128 FR, register remap, stack engine), parallel resources (4 integer and 4 MMX units, 2 + 2 FMACs, 2 load/store units, 3 branch units), memory subsystem (three levels of cache: L1, L2, L3; fast, 32-entry ALAT), and control (bypasses and dependencies, speculation deferral management).]

In contrast to traditional processors, the machine's core is characterized by hardware support for the key ISA constructs that embody the EPIC design style.1,2 This includes support for speculation, predication, explicit parallelism, register stacking and rotation, branch hints, and memory hints. In this article we describe the hardware support for these novel constructs, assuming a basic level of familiarity with the IA-64 architecture (see the "IA-64 Architecture Overview" article in this issue).

EPIC hardware
The Itanium processor introduces a number of unique microarchitectural features to support the EPIC design style.2 These features focus on the following areas:

• supplying plentiful fast, parallel, and pipelined execution resources, exposed directly to the software;
• supporting the bookkeeping and control for new EPIC constructs such as predication and speculation; and
• providing dynamic support to handle events that are unpredictable at compilation time so that the compiled code flows through the pipeline at high throughput.

Figure 1 presents a conceptual view of the EPIC hardware. It illustrates how the various EPIC instruction set features map onto the micropipelines in the hardware.

The core of the machine is the wide execution engine, designed to provide the computational bandwidth needed by ILP-rich EPIC code that abounds in speculative and predicated operations.

The execution control is augmented with a bookkeeping structure called the advanced load address table (ALAT) to support data speculation and, with hardware, to manage the deferral of exceptions on speculative execution. The hardware control for speculation is quite simple: adding an extra bit to the data path supports deferred exception tokens. The controls for both the register scoreboard and the bypass network are enhanced to accommodate predicated execution.

Operands are fed into this wide execution core from the 128-entry integer and floating-point register files. The register file addressing undergoes register remapping, in support of register stacking and rotation.


The register management hardware is enhanced with a control engine called the register stack engine that is responsible for saving and restoring registers that overflow or underflow the register stack. An instruction dispersal network feeds the execution pipeline. This network uses explicit parallelism and instruction templates to efficiently issue fetched instructions onto the correct instruction ports, both eliminating complex dependency detection logic and streamlining the instruction routing network. A decoupled fetch engine exploits advanced prefetch and branch hints to ensure that the fetched instructions come from the correct path and that they arrive early enough to avoid cache miss penalties. Finally, memory locality hints are employed by the cache subsystem to improve the cache allocation and replacement policies, resulting in a better use of the three levels of on-package cache and associated memory bandwidth.

EPIC features allow software to more effectively communicate high-level semantic information to the hardware, thereby eliminating redundant or inefficient hardware and leading to a more effective design. Notably absent from this machine are complex hardware structures seen in dynamically scheduled contemporary processors. Reservation stations, reorder buffers, and memory ordering buffers are all replaced by simpler hardware for speculation. Register alias tables used for register renaming are replaced with the simpler and semantically richer register-remapping hardware. Expensive register dependency-detection logic is eliminated via the explicit parallelism directives that are precomputed by the software.

Using EPIC constructs, the compiler optimizes the code schedule across a very large scope. This scope of optimization far exceeds the limited hardware window of a few hundred instructions seen on contemporary dynamically scheduled processors. The result is an EPIC machine in which the close collaboration of hardware and software enables high performance with a greater degree of overall efficiency.

Overview of the EPIC core
The engineering team designed the EPIC core of the Itanium processor to be a parallel, deep, and dynamic pipeline that enables ILP-rich compiled code to flow through at high throughput. At the highest level, three important directions characterize the core pipeline:

• wide EPIC hardware delivering a new level of parallelism (six instructions/clock),
• deep pipelining (10 stages) enabling high frequency of operation, and
• dynamic hardware for runtime optimization and handling of compilation time indeterminacies.

New level of parallel execution
The processor provides hardware for these execution units: four integer ALUs, four multimedia ALUs, two extended-precision floating-point units, two additional single-precision floating-point units, two load/store units, and three branch units. The machine can fetch, issue, execute, and retire six instructions each clock cycle. Given the powerful semantics of the IA-64 instructions, this expands to many more operations being executed each cycle. The "Machine resources per port" sidebar enumerates the full processor execution resources.

[Figure 2. Two examples illustrating supported parallelism. SP: single precision; DP: double precision. An MFI/MFI bundle pair provides 12 parallel ops/clock for scientific computing (loading 4 DP (8 SP) operands via 2 postincrementing ldf-pair ops, 4 DP (8 SP flops) floating-point ops, and 2 ALU ops) and 20 parallel ops/clock for digital content creation. An MII/MBB pair provides 8 parallel ops/clock for enterprise and Internet applications (2 loads with postincrement, 2 ALU ops, 2 more ALU ops, and 2 branch instructions).]

Figure 2 illustrates two examples demonstrating the level of parallel operation supported for various workloads. For enterprise and commercial codes, the MII/MBB template combination in a bundle pair provides six instructions or eight parallel operations per

clock (two load/store, two general-purpose ALU operations, two postincrement ALU operations, and two branch instructions). Alternatively, an MIB/MIB pair allows the same mix of operations, but with one branch hint and one branch operation instead of two branch operations.

For scientific code, the use of the MFI template in each bundle enables 12 parallel operations per clock (loading four double-precision operands to the registers and executing four double-precision floating-point, two integer ALU, and two postincrement ALU operations). For digital content creation codes that use single-precision floating point, the SIMD (single instruction, multiple data) features in the machine effectively enable up to 20 parallel operations per clock (loading eight single-precision operands, executing eight single-precision floating-point, two integer ALU, and two postincrementing ALU operations).

Deep pipelining (10 stages)
The cycle time and the core pipeline are balanced and optimized for sequential execution in integer scalar codes by minimizing the latency of the most frequent operations, thus reducing dead time in the overall computation. The high frequency (800 MHz) and careful pipelining enable independent operations to flow through this pipeline at high throughput, thus also optimizing vector numeric and multimedia computation. The cycle time accommodates the following key critical paths:

• single-cycle ALU, globally bypassed across four ALUs and two loads;
• two cycles of latency for load data returned from a dual-ported, 16-Kbyte level-1 (L1) cache; and
• scoreboard and dependency control to stall the machine on an unresolved register dependency.

The feature set complies with the high-frequency target and the degree of pipelining; this includes aggressive branch prediction and three levels of on-package cache. The pipeline design employs robust, scalable circuit design techniques. We consciously attempted to manage interconnect lengths and pipeline away secondary paths.

[Figure 3. Itanium processor core pipeline. Ten stages: IPG, FET, ROT (front end: prefetch/fetch of six instructions/clock, hierarchy of branch predictors, decoupling buffer); EXP, REN (instruction delivery: dispersal of six instructions onto nine issue ports, register remapping, register save engine); WLD, REG (operand delivery: register file read and bypass, register scoreboard, predicated dependencies); EXE, DET, WRB (execution core: four single-cycle ALUs, two load/stores, advanced load control, predicate delivery and branch, NaT/exceptions/retirement).]

Figure 3 illustrates the 10-stage core pipeline. The bold line in the middle of the core pipeline indicates a point of decoupling in the pipeline. The pipeline accommodates the decoupling buffer in the ROT (instruction rotation) stage, dedicated register-remapping hardware in the REN (register rename) stage, and pipelined access of the large register file across the WLD (word line decode) and REG (register read) stages. The DET (exception detection) stage accommodates delayed branch execution as well as memory exception management and speculation support.

Dynamic hardware for runtime optimization
While the processor relies on the compiler to optimize the code schedule based upon the deterministic latencies, it provides special support to dynamically optimize for several compilation time indeterminacies. These dynamic features ensure that the compiled code flows through the pipeline at high throughput. To tolerate additional latency on data cache misses, the data caches are nonblocking, a register scoreboard enforces dependencies, and the machine stalls only on encountering the use of unavailable data.

We focused on reducing sensitivity to branch and fetch latency. The machine employs hardware and software techniques beyond those used in conventional processors and provides aggressive instruction prefetch and advanced branch prediction through a hierarchy of branch prediction structures. A decoupling buffer allows the front end to speculatively fetch ahead, further hiding instruction cache latency and branch prediction latency.


[Figure 4. Itanium processor block diagram. Front end: L1 instruction cache and ITLB, fetch/prefetch engine, branch prediction, an eight-bundle decoupling buffer, and IA-32 decode and control. Nine issue ports (B, B, B, M, M, I, I, F, F) feed, through register stack engine/remapping, the 128 integer and 128 floating-point registers plus branch and predicate registers. Execution: integer and MM units, dual-port L1 data cache and ALAT, floating-point units with SIMD FMAC, scoreboard, predicate NaTs, and exceptions. A 96-Kbyte L2 cache, a 4-Mbyte L3 cache, and the bus controller complete the memory hierarchy.]

Block diagram
Figure 4 provides the block diagram of the Itanium processor. Figure 5 provides a die plot of the silicon database. A few top-level metal layers have been stripped off to create a suitable view.

[Figure 5. Die plot of the silicon database: the core processor die alongside 4 × 1-Mbyte blocks of L3 cache.]

Details of the core pipeline
The following sections describe details of the core processor microarchitecture.

Decoupled software-directed front end
Given the high execution rate of the processor (six instructions per clock), an aggressive front end is needed to keep the machine effectively fed, especially in the presence of disruptions due to branches and cache misses. The machine's front end is decoupled from the back end.

Acting in conjunction with sophisticated branch prediction and correction hardware, the machine speculatively fetches instructions from a moderate-size, pipelined instruction cache into a decoupling buffer. A hierarchy of branch predictors, aided by branch hints, provides up to four progressively improving instruction pointer resteers. Software-initiated prefetch probes for future misses in the instruction cache and then prefetches such target code from the level-2 (L2) cache into a streaming buffer, and eventually into the instruction cache. Figure 6 illustrates the front-end microarchitecture.

[Figure 6. Processor front end. Pipeline stages IPG, FET, ROT, EXP. The instruction cache, ITLB, and streaming buffers feed a decoupling buffer on the way to dispersal; the fetch pointer is steered by an IP corrector, the target address registers, an adaptive multiway two-level predictor, a branch target address cache, a return stack buffer, a loop exit corrector, and branch address calculations 1 and 2.]

Speculative fetches. The 16-Kbyte, four-way set-associative instruction cache is fully pipelined and can deliver 32 bytes of code (two instruction bundles or six instructions) every clock. The cache is supported by a single-cycle, 64-entry instruction translation look-aside buffer (TLB) that is fully associative and backed up by an on-chip hardware page walker.

The fetched code is fed into a decoupling buffer that can hold eight bundles of code. As a result of this buffer, the machine's front end can continue to fetch instructions into the buffer even when the back end stalls. Conversely, the buffer can continue to feed the back end even when the front end is disrupted by fetch bubbles due to branches or instruction cache misses.

Hierarchy of branch predictors. The processor employs a hierarchy of branch prediction structures to deliver high-accuracy and low-penalty predictions across a wide spectrum of workloads. Note that if a branch misprediction led to a full pipeline flush, there would be nine cycles of pipeline bubbles before the pipeline is full again; this would mean a heavy performance loss. Hence, we've placed significant emphasis on boosting the overall branch prediction rate as well as reducing the branch prediction and correction latency.

The branch prediction hardware is assisted by branch hint directives provided by the compiler (in the form of explicit branch predict, or BRP, instructions, as well as hint specifiers on branch instructions). The directives provide branch target addresses, static hints on branch direction, and indications of when to use dynamic prediction. These directives are programmed into the branch prediction structures and used in conjunction with dynamic prediction schemes.


The machine provides up to four progressive predictions and corrections to the fetch pointer, greatly reducing the likelihood of a full-pipeline flush due to a mispredicted branch.

• Resteer 1: Single-cycle predictor. A special set of four branch prediction registers (called target address registers, or TARs) provides single-cycle turnaround on certain branches (for example, loop branches in numeric code), operating under tight compiler control. The compiler programs these registers using BRP hints, distinguishing these hints with a special "importance" bit designator indicating that these directives must get allocated into this small structure. When the instruction pointer of the candidate branch hits in these registers, the branch is predicted taken, and these registers provide the target address for the resteer. On such taken branches no bubbles appear in the execution schedule due to branching.

• Resteer 2: Adaptive multiway and return predictors. For scalar codes, the processor employs a dynamic, adaptive, two-level prediction scheme3,4 to achieve well over 90% prediction rates on branch direction (a small C model of the scheme follows this list). The branch prediction table (BPT) contains 512 entries (128 sets × 4 ways). Each entry, selected by the branch address, tracks the four most recent occurrences of that branch. This 4-bit value then indexes into one of 128 pattern tables (one per set). The 16 entries in each pattern table use a 2-bit, saturating, up-down counter to predict branch direction.

The branch prediction table structure is additionally enhanced for multiway branches with a 64-entry, multiway branch prediction table (MBPT) that employs a similar algorithm but keeps three history registers per bundle entry. A find-first-taken selection provides the first taken branch indication for the multiway branch bundle. Multiway branches are expected to be common in EPIC code, where multiple basic blocks are expected to collapse after use of speculation and predication.

Target addresses for this branch resteer are provided by a 64-entry target address cache (TAC). This structure is updated by branch hints (using BRP and move-BR hint instructions) and is also managed dynamically. Having the compiler program the structure with the upcoming footprint of the program is an advantage and enables a small, 64-entry structure to be effective even on large commercial workloads, saving die area and implementation complexity. The BPT and MBPT cause a front-end resteer only if the target address for the resteer is present in the target address cache. In the case of misses in the BPT and MBPT, a hit in the target address cache also provides a branch direction prediction of taken.

A return stack buffer (RSB) provides predictions for return instructions. This buffer contains eight entries and stores return addresses along with corresponding register stack frame information.

• Resteers 3 and 4: Branch address calculation and correction. Once the branch instructions themselves are available (ROT stage), it's possible to apply a correction to predictions made earlier. The BAC1 stage applies a correction for the exit condition on modulo-scheduled loops through a special "perfect-loop-exit-predictor" structure that keeps track of the loop count extracted during the loop initialization code. Thus, loop exits should never see a branch misprediction in the back end. Additionally, in case of misses in the earlier prediction structures, BAC1 extracts static prediction information and addresses from branch instructions in the rightmost slot of a bundle and uses these to provide a correction. Since most templates will place a branch in the rightmost slot, BAC1 should handle most branches. BAC2 applies a more general correction for branches located in any slot.
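The resteer-2 scheme reduces to a compact algorithm. The following C model is a hedged sketch, not the production logic: tagging and way selection are collapsed to fixed indices and the training loop is invented for illustration; only the structure sizes come from the text.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the two-level scheme: a 4-bit per-branch history selects
       one of 16 two-bit saturating counters in a per-set pattern table. */
    enum { SETS = 128, WAYS = 4, PATTERNS = 16 };

    static uint8_t history[SETS][WAYS];      /* four most recent outcomes */
    static uint8_t pattern[SETS][PATTERNS];  /* 2-bit up-down counters */

    static int predict_taken(int set, int way) {
        return pattern[set][history[set][way]] >= 2;
    }

    static void train(int set, int way, int taken) {
        uint8_t *c = &pattern[set][history[set][way]];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        history[set][way] = (uint8_t)(((history[set][way] << 1) | taken) & 0xF);
    }

    int main(void) {
        /* A branch taken on three of every four executions (a short loop):
           after warm-up, the pattern table predicts it essentially perfectly. */
        int correct = 0;
        for (int i = 0; i < 4000; i++) {
            int actual = (i % 4) != 3;
            correct += (predict_taken(5, 0) == actual);
            train(5, 0, actual);
        }
        printf("correct: %d / 4000\n", correct);
        return 0;
    }

The per-history counters are what let the predictor learn periodic patterns such as loop exits, which a single counter per branch cannot capture.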
Software-initiated prefetch. Another key element of the front end is its software-initiated instruction prefetch. Prefetch is triggered by prefetch hints (encoded in BRP instructions as well as in actual branch instructions) as they pass through the ROT stage. Instructions are prefetched from the L2 cache into an instruction-streaming buffer (ISB) containing eight 32-byte entries. Support exists to prefetch either a short burst of 64 bytes of code (typically, a basic block residing in up to four bundles) or a long sequential instruction stream.

Short burst prefetch is initiated by a BRP instruction hoisted well above the actual branch. For longer code streams, the sequential streaming ("many") hint from the branch instruction triggers a continuous stream of additional prefetch requests until a taken branch is encountered. The instruction cache filters prefetch requests: the cache tags and the TLB have been enhanced with an additional port to check whether an address will lead to a miss, and such requests are sent to the L2 cache.

The compiler can improve overall fetch performance by aggressive issue and hoisting of BRP instructions, and by issuing sequential prefetch hints on the branch instruction when branching to long sequential codes. To fully hide the latency of returns from the L2 cache, BRP instructions that initiate prefetch should be hoisted 12 fetch cycles ahead of the branch. Hoisting by five cycles breaks even with no prefetch at all, and every hoisted cycle above five has the potential of shaving one fetch bubble. Although this kind of hoisting of BRP instructions is a tall order, it does provide a mechanism for the compiler to eliminate instruction fetch bubbles.
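The hoisting guidance above is simple arithmetic and can be made concrete with a small model. The 7-bubble no-prefetch baseline below is inferred from the quoted break-even (5 cycles) and full-hiding (12 cycles) figures; it is an illustrative assumption, not a published number.

    #include <stdio.h>

    /* Back-of-envelope model of BRP hoisting benefit. */
    static int fetch_bubbles(int hoist_cycles) {
        const int no_prefetch_bubbles = 12 - 5;   /* inferred baseline: 7 */
        if (hoist_cycles < 5)
            return no_prefetch_bubbles;           /* prefetch arrives too late */
        int b = 12 - hoist_cycles;                /* one bubble shaved per cycle */
        return b > 0 ? b : 0;                     /* fully hidden at 12 cycles */
    }

    int main(void) {
        for (int h = 0; h <= 14; h += 2)
            printf("hoist %2d cycles -> %d bubble(s)\n", h, fetch_bubbles(h));
        return 0;
    }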

Machine resources per port

Tables A-C describe the vocabulary of operations supported on the different issue ports in the Itanium processor. The issue ports feed into memory (M), integer (I), floating-point (F), and branch (B) execution data paths.

Table A. Memory and integer execution resources.

    Instruction class                                   Ports            Latency (clocks)
    ALU (add, shift-add, logical, addp4, compare)       M0, M1, I0, I1   1
    Sign/zero extend, move long                         I0, I1           1
    Fixed extract/deposit, Tbit, TNaT                   I0               1
    Multimedia ALU (add/avg/etc.)                       M0, M1, I0, I1   2
    MM shift, avg, mix, pack                            I0, I1           2
    Move to/from branch/predicates/ARs,
      packed multiply, pop count                        I0               2
    Load/store/prefetch/setf/break.m/
      cache control/memory fence                        M0, M1           2+
    Memory management/system/getf                       M0               2+

Table B. Floating-point execution resources.

    Instruction class                                   Ports            Latency (clocks)
    FMAC, SIMD FMAC                                     F0, F1           5
    Fixed multiply                                      F0, F1           7
    Fclrf                                               F0, F1           1
    Fchk                                                F0, F1           1
    Fcompare                                            F0               2
    Floating-point logicals, class, min/max,
      pack, select                                      F1               5

Table C. Branch execution resources.

    Instruction class                                   Ports
    Conditional or unconditional branch                 B0, B1, B2
    Call/return/indirect                                B0, B1, B2
    Loop-type branch, BSW, cover                        B2
    RFI                                                 B2
    BRP (branch hint)                                   B0, B1, B2

Efficient instruction and operand delivery
After instructions are fetched in the front end, they move into the middle pipeline, which disperses instructions, implements the architectural renaming of registers, and delivers operands to the wide parallel hardware. The hardware resources in the back end of the machine are organized around nine issue ports. The instruction and operand delivery hardware maps the six incoming instructions onto the nine issue ports and remaps the virtual register identifiers specified in the source code onto the physical registers used to access the register file. It then provides the source data to the execution core. The dispersal and renaming hardware exploits high-level semantic information provided by the IA-64 software, efficiently enabling greater ILP and reduced instruction path length.

Explicit parallelism directives. The instruction dispersal mechanism disperses instructions presented by the decoupling buffer to the processor's issue ports. The processor has a total of nine issue ports, capable of issuing up to two


memory instructions (ports M0 and M1), two integer (ports I0 and I1), two floating-point (ports F0 and F1), and three branch instructions (ports B0, B1, and B2) per clock. The processor's 17 execution units are fed through the M, I, F, and B groups of issue ports.

The decoupling buffer feeds the dispersal in a bundle-granular fashion (up to two bundles, or six instructions), with a fresh bundle being presented each time one is consumed. Dispersal from the two bundles is instruction granular: the processor disperses as many instructions as can be issued (up to six) in left-to-right order. The dispersal algorithm is fast and simple, with instructions being dispersed to the first available issue port, subject to two constraints: detection of instruction independence and detection of resource oversubscription.

• Independence. The processor must ensure that all instructions issued in parallel are either independent or contain only allowed dependencies (such as a compare instruction feeding a dependent conditional branch). This question is easily dealt with by using the stop-bits feature of the IA-64 ISA to explicitly communicate parallel instruction semantics. Instructions between consecutive stop bits are deemed independent, so the instruction independence detection hardware is trivial. This contrasts with traditional RISC processors, which must perform O(n²) comparisons (typically dozens) between source and destination register specifiers to determine independence.

• Oversubscription. The processor must also guarantee that there are sufficient execution resources to process all the instructions that will be issued in parallel. Handling this oversubscription is facilitated by the IA-64 ISA's instruction bundle templates (a sketch of the resulting check appears below). Each instruction bundle not only specifies three instructions but also contains a 4-bit template field indicating the type of each instruction: memory (M), integer (I), branch (B), and so on. By examining the template fields from the two bundles (a total of only 8 bits), the dispersal logic can quickly determine the number of memory, integer, floating-point, and branch instructions incoming every clock. This is a hardware simplification resulting from the IA-64 instruction set architecture: unlike conventional instruction set architectures, the instruction encoding itself doesn't need to be examined to determine the type of each operation. This feature removes decoders that would otherwise be required to examine many bits of the encoded instruction to determine the instruction's type and associated issue port.

A second key advantage of the template-based dispersal strategy is that certain instruction types can occur only in specific locations within a bundle. As a result, the dispersal interconnection network can be significantly optimized; the routing required from dispersal to issue ports is roughly only half of that required for a fully connected crossbar.

Table 1. Instruction bundles capable of full-bandwidth dispersal.

    First bundle*      Second bundle
    MIH                MLI, MFI, MIB, MBB, or MFB
    MFI or MLI         MLI, MFI, MIB, MBB, BBB, or MFB
    MII                MBB, BBB, or MFB
    MMI                BBB
    MFH                MII, MLI, MFI, MIB, MBB, or MFB

    * B slots support branches and branch hints; H designates a branch hint operation in the B slot.

Table 1 illustrates the effectiveness of the dispersal strategy by enumerating the instruction bundles that may be issued at full bandwidth. As the table shows, a rich mix of instructions can be issued to the machine at high throughput (six per clock). The combination of stop bits and bundle templates, as specified in the IA-64 instruction set, allows the compiler to indicate independence and instruction-type information directly and effectively to the dispersal hardware. As a result, the hardware is greatly simplified, allowing an efficient implementation of instruction dispersal to a wide execution core.
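As promised in the oversubscription item, here is a hedged C sketch of the template-driven check. The port limits and template mix follow this article; representing templates by name rather than by their real 4-bit encodings, and omitting stop bits, split issue, and the long-immediate slot, are simplifications for the sketch.

    #include <stdio.h>

    /* Slot types and a few representative IA-64 bundle templates. */
    typedef enum { M, I, F, B } Slot;
    typedef struct { Slot slot[3]; } Template;

    static const Template MII = {{M, I, I}};
    static const Template MMI = {{M, M, I}};
    static const Template MFI = {{M, F, I}};
    static const Template MBB = {{M, B, B}};
    static const Template BBB = {{B, B, B}};

    /* Decide full-bandwidth issue for a bundle pair by counting slot types,
       looking only at the two template fields (8 bits total), never at the
       instruction encodings themselves. */
    static int full_bandwidth(const Template *a, const Template *b) {
        int need[4] = {0, 0, 0, 0};
        for (int s = 0; s < 3; s++) {
            need[a->slot[s]]++;
            need[b->slot[s]]++;
        }
        /* Port limits: 2 memory, 2 integer, 2 floating-point, 3 branch. */
        return need[M] <= 2 && need[I] <= 2 && need[F] <= 2 && need[B] <= 3;
    }

    int main(void) {
        printf("MII+MBB: %d\n", full_bandwidth(&MII, &MBB)); /* 1: in Table 1 */
        printf("MFI+MFI: %d\n", full_bandwidth(&MFI, &MFI)); /* 1: scientific code */
        printf("MMI+BBB: %d\n", full_bandwidth(&MMI, &BBB)); /* 1: in Table 1 */
        printf("MII+MII: %d\n", full_bandwidth(&MII, &MII)); /* 0: needs 4 I ports */
        return 0;
    }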

Efficient register remapping. After dispersal, the next step in preparing incoming instructions for execution is implementing the register stacking and rotation functions.

Register stacking is an IA-64 technique that significantly reduces function call and return overhead. It ensures that all procedural input and output parameters are in specific register locations, without requiring the compiler to perform register-register or memory-register moves. On procedure calls, a fresh register frame is simply stacked on top of the existing frames in the large register file, without the need for an explicit save of the caller's registers. This enables low-overhead procedure calls, providing a significant performance benefit on codes that are heavy in calls and returns, such as those in object-oriented languages.

Register rotation is an IA-64 technique that enables very low overhead, software-pipelined loops. It broadens the applicability of compiler-driven software pipelining to a wide variety of integer codes. Rotation provides a form of register renaming that allows every iteration of a software-pipelined loop to have a fresh copy of the loop variables. This is accomplished by accessing the registers through an indirection based on the iteration count.

Both stacking and rotation require the hardware to remap the register names. This remapping translates the incoming virtual register specifiers onto outgoing physical register specifiers, which are then used to perform the actual lookup of the various register files. Stacking can be thought of as simply adding an offset to the virtual register specifier. In a similar fashion, rotation can be viewed as an offset-modulo add. The remapping function supports both stacking and rotation for the integer register specifiers, but only rotation for the floating-point and predicate register specifiers.

The Itanium processor efficiently supports the register remapping for both register stacking and rotation with a set of adders and multiplexers contained in the pipeline's REN stage. The stacking logic requires only one 7-bit adder for each specifier, and the rotation logic requires either one (predicate or floating-point) or two (integer) additional 7-bit adders. The extra adder on the integer side is needed due to the interaction of stacking with rotation. Therefore, for full six-syllable execution, a total of ninety-eight 7-bit adders and 42 multiplexers implement the combination of integer, floating-point, and predicate remapping for all incoming source and destination registers. The total area taken by this function is less than 0.25 square mm.

The register-stacking model also requires special handling when software allocates more virtual registers than are currently physically available in the register file. A special state machine, the register stack engine (RSE), handles this case, termed stack overflow. This engine observes all stacked register allocation and deallocation requests. When an overflow is detected on a procedure call, the engine silently takes control of the pipeline, spilling registers to a backing store in memory until sufficient physical registers are available. In a similar manner, the engine handles the converse situation, termed stack underflow, when registers need to be restored from the backing store in memory. While these registers are being spilled or filled, the engine simply stalls instructions waiting on the registers; no pipeline flushes are needed to implement the register spill/restore operations.

Register stacking and rotation combine to provide significant performance benefits for a variety of applications, at the modest cost of a number of small adders, an additional pipeline stage, and control logic for a programmer-invisible register stack engine.
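The offset-add and offset-modulo-add remapping described above can be sketched in a few lines of C. The state layout and constants below are illustrative assumptions (apart from the 128-entry file with 32 static registers); RSE spill traffic and the full stacking/rotation interaction are not modeled.

    #include <stdio.h>

    /* Sketch of REN-stage remapping: stacking as an offset add,
       rotation as an offset-modulo add. */
    typedef struct {
        int frame_base;  /* physical offset of the current stack frame */
        int rrb;         /* register rename base for the rotating region */
        int size_rot;    /* size of the rotating region */
    } Remap;

    static int remap_gr(const Remap *m, int vreg) {
        if (vreg < 32)
            return vreg;                          /* r0-r31 are static */
        int off = vreg - 32;
        if (off < m->size_rot)                    /* rotation: offset-modulo add */
            off = (off + m->rrb) % m->size_rot;
        return 32 + (m->frame_base + off) % 96;   /* stacking: add frame offset */
    }

    int main(void) {
        Remap m = { .frame_base = 10, .rrb = 0, .size_rot = 8 };
        printf("before rotation: r33 -> %d\n", remap_gr(&m, 33));
        m.rrb = 7;  /* one rotation by -1, modulo 8 */
        /* After rotating, r33 names the physical register that r32 named
           before, giving the new iteration a fresh copy of the variable. */
        printf("after rotation:  r33 -> %d\n", remap_gr(&m, 33));
        return 0;
    }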
Large, multiported register files. The processor provides an abundance of registers and execution resources. The 128-entry integer register file supports eight read ports and six write ports. Note that four ALU operations require eight read ports and four write ports from the register file, while pending load data returns need two additional write ports (two returns per cycle). The read and write ports can adequately support two memory and two integer instructions every clock.

The IA-64 instruction set includes a feature known as postincrement. Here, the address register of a memory operation can be incremented as a side effect of the operation. This is supported by simply using two of the four ALU write ports. (These two ALUs and write ports would otherwise have been idle when memory operations are issued off their ports.)

The floating-point register file also consists of 128 registers, supports double extended-precision arithmetic, and can sustain two memory ports in parallel with two multiply-accumulate units. This combination of resources requires eight read and four write ports. The register write ports are separated into even and odd banks, allowing each memory return to update a pair of floating-point registers.


The other large register file is the predicate register file. This register file has several unique characteristics: each entry is 1 bit, it has many read and write ports (15 reads/11 writes), and it supports a "broadside" read or write of the entire register file. As a result, it has a distinct implementation, as described in the "Implementing predication elegantly" section.

High ILP execution core
The execution core is the heart of the EPIC implementation. It supports data-speculative and control-speculative execution, as well as predicated execution and the traditional functions of hazard detection and branch execution. Furthermore, the processor's execution core provides these capabilities in the context of the wide execution width and powerful instruction semantics that characterize the EPIC design philosophy.

Stall-based scoreboard control strategy. As mentioned earlier, the frequency target of the Itanium processor was governed by several key timing paths, such as the ALU plus bypass and the two-cycle data cache. All of the control paths within the core pipeline fit within the given cycle time; detecting and dealing with data hazards was one such key control path.

To achieve high performance, we adopted a nonblocking cache with a scoreboard-based stall-on-use strategy. This is particularly valuable in the context of speculation, in which certain load operations may be aggressively boosted to avoid cache miss latencies, and the resulting data may potentially not be consumed. For such cases, it is key that 1) the pipeline not be interrupted because of a cache miss, and 2) the pipeline only be interrupted if and when the unavailable data is needed.

Thus, to achieve high performance, the strategy for dealing with detected data hazards is based on stalls: the pipeline stalls only when unavailable data is needed, and stalls only as long as the data is unavailable. This strategy allows the entire processor pipeline to remain filled, and the in-flight dependent instructions to be immediately ready to continue as soon as the required data is available. This contrasts with other high-frequency designs, which are based on flushing and require that the pipeline be emptied when a hazard is detected, resulting in reduced performance. On the Itanium processor, innovative techniques reap the performance benefits of a stall-based strategy and yet enable high-frequency operation on this wide machine.

The scoreboard control is also enhanced to support predication. Since most operations within the IA-64 instruction set architecture can be predicated, either the producer or the consumer of a given piece of data may be nullified by having a false predicate. Figure 7 illustrates an example of such a case. Note that if either the producer or consumer operation is nullified via predication, there are no hazards. The processor scoreboard therefore considers both the producer and consumer predicates, in addition to the normal operand availability, when evaluating whether a hazard exists. This hazard evaluation occurs in the REG (register read) pipeline stage.

    Clock 1:  cmp.eq r1, r2 → p1, p3     // compute predicates p1, p3
              cmp.eq r3, r4 → p2, p4 ;;  // compute predicates p2, p4
    Clock 2:  (p1) ld4 [r3] → r4 ;;      // load nullified if p1 = false (producer nullification)
    Clock N:  (p4) add r4, r1 → r5       // add nullified if p4 = false (consumer nullification)

A hazard exists only if (a) p1 = p4 = true, and (b) the r4 result is not available when the add collects its source data.

Figure 7. Predicated producer-consumer dependencies.
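The hazard rule of Figure 7 is captured by a single predicate-aware expression. A minimal C sketch, assuming one producer-consumer pair (the real scoreboard evaluates all six REG-stage instructions against every in-flight producer):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int dest; bool pred; bool data_ready; } Producer;
    typedef struct { int src;  bool pred; } Consumer;

    /* Stall only if producer and consumer are both enabled, their
       registers match, and the data is still outstanding. */
    static bool must_stall(const Producer *p, const Consumer *c) {
        return p->pred && c->pred && p->dest == c->src && !p->data_ready;
    }

    int main(void) {
        Producer ld  = { .dest = 4, .pred = true, .data_ready = false };
        Consumer add = { .src  = 4, .pred = false };
        printf("nullified consumer -> stall? %d\n", must_stall(&ld, &add)); /* 0 */
        add.pred = true;
        printf("both enabled       -> stall? %d\n", must_stall(&ld, &add)); /* 1 */
        return 0;
    }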
Given the high frequency of the processor pipeline, there's not sufficient time to both compute the existence of a hazard and effect a global pipeline stall in a single clock cycle. Hence, we use a unique deferred-stall strategy. This approach allows any dependent consumer instructions to proceed from the REG into the EXE (execute) pipeline stage, where they are then stalled; hence the term deferred stall.

However, the instructions in the EXE stage no longer have read-port access to the register file to obtain new operand data. Therefore, to ensure that the instructions in the EXE stage procure the correct data, the latches at the start of the EXE stage (which contain the source data values) continuously snoop all returning

data values, intercepting any data that the instruction requires. The logic used to perform this data interception is identical to the register bypass network used to collect operands for instructions in the REG stage. By noting that instructions observing a deferred stall in the REG stage don't require the use of the bypass network, the EXE stage instructions can usurp the bypass network for the deferred stall. By reusing existing register bypass hardware, the deferred-stall strategy is implemented in an area-efficient manner. This allows the processor to combine the benefits of high frequency with stall-based pipeline control, thereby precluding the penalty of pipeline flushes due to replays on register hazards.

IA-32 compatibility

Another key feature of the Itanium processor is its full support of the IA-32 instruction set in hardware (see Figure A). This includes support for running a mix of IA-32 applications and IA-64 applications on an IA-64 operating system, as well as IA-32 applications on an IA-32 operating system, in both uniprocessor and multiprocessor configurations. The IA-32 engine makes use of the EPIC machine's registers, caches, and execution resources. To deliver high performance on legacy binaries, the IA-32 engine dynamically schedules instructions.1,2 The IA-64 Seamless Architecture is defined to enable running IA-32 system functions in native IA-64 mode, thus delivering native performance levels on the system functionality.

[Figure A. IA-32 compatibility microarchitecture. A shared instruction cache and TLB feed IA-32 instruction fetch and decode, an IA-32 dynamic scheduler, the shared IA-64 execution core, and IA-32 retirement and exceptions.]

References
1. R. Colwell and R. Steck, "A 0.6µm BiCMOS Processor with Dynamic Execution," Proc. Int'l Solid-State Circuits Conf., IEEE Press, Piscataway, N.J., 1995, pp. 176-177.
2. D. Papworth, "Tuning the Pentium Pro Microarchitecture," IEEE Micro, Mar./Apr. 1996, pp. 8-15.

Execution resources. The processor provides an abundance of execution resources to exploit ILP. The integer execution core includes two memory and two integer ports, with all four ports capable of executing arithmetic, shift-and-add, logical, compare, and most integer SIMD multimedia operations. The memory ports can also perform load and store operations, including loads and stores with postincrement functionality. The integer ports add the ability to perform the less-common integer instructions, such as test bit, look for zero byte, and variable shift. Additional uncommon instructions are implemented on only the first integer port.

See the earlier "Machine resources per port" sidebar for a full enumeration of the per-port capabilities and associated instruction latencies on the processor. In general, we designed the method used to map instructions onto each port to maximize overall performance, balancing instruction frequency against the area and timing impact of additional execution resources.

Implementing predication elegantly. Predication is another key feature of the IA-64 architecture, allowing higher performance by eliminating branches and their associated misprediction penalties.5 However, predication affects several key aspects of the pipeline design. Predication turns a control dependency (branching on the condition) into a data dependency (execution and forwarding of data dependent upon the value of the predicate). If spurious stalls and pipeline disruptions get introduced during predicated execution, the benefit of branch misprediction elimination will be squandered. Care was taken to ensure that predicates are implemented transparently in the pipeline.

The basic strategy for predicated execution is to allow all instructions to read the register file and get issued to the hardware regardless of their predicate value. Predicates are used to configure the data-forwarding network, detect the presence of hazards, control pipeline advances, and conditionally nullify the execution and retirement of issued operations. Predicates also feed the branching hardware.

The predicate register file is a highly multiported structure. It is accessed in parallel with the general registers in the REG stage. Since predicates themselves are generated in the execution core (from compare instructions, for example) and may be in flight when they're needed, they must be forwarded quickly to the specific hardware that consumes them.


Note that predication affects the hazard detection logic by nullifying either data producer or consumer instructions. Consumer nullification is performed after reading the predicate register file (PRF) for the predicate sources of the six instructions in the REG pipeline stage. Producer nullification is performed after reading the predicate register file for the predicate sources of the six instructions in the EXE stage. Finally, three conditional branches can be executed in the DET pipeline stage; this requires reading three additional predicate sources. Thus, a total of 15 read ports are needed to access the predicate register file.

From a write-port perspective, 11 predicates can be written every clock: eight from four parallel integer compares, two from a floating-point compare, and one via the stage predicate write feature of loop branches. These read and write ports are in addition to a broadside read and write capability that allows a single instruction to read or write the entire 64-entry predicate register file into or from a single 64-bit integer register. The predicate register file is implemented as a single 64-bit latch, with 15 simple 64:1 multiplexers used as the read ports. Similarly, the 11 write ports are efficiently implemented, each being a 6:64 decoder, with an AND-OR structure used to update the actual predicate register file latch. Broadside reads and writes are easily implemented by reading or writing the contents of the entire 64-bit latch.

In-flight predicates must be forwarded quickly after generation to the point of consumption. The costly bypass logic that would have been needed for this is eliminated by taking advantage of the fact that all predicate-writing instructions have deterministic latency. Instead, a speculative predicate register file (SPRF) is used and updated as soon as predicate data is computed. The source predicate of any dependent instruction is then read directly from this register file, obviating the need for bypass logic. A separate architectural predicate register file (APRF) is only updated when a predicate-writing instruction retires and is only then allowed to update the architectural state.

In case of an exception or pipeline flush, the SPRF is copied from the APRF in the shadow of the flush latency, undoing the effect of any misspeculative predicate writes. The combination of the latch-based implementation and the two-file strategy allows an area-efficient and timing-efficient implementation of the highly ported predicate registers.

[Figure 8. Predicated bypass control. In a conventional bypass, the source RegID is compared (=?) with the forwarded destination RegID to steer the source multiplexer between register file data ("true" source data) and bypass-forwarded data. To support predication, the match is additionally ANDed with the forwarded instruction's predicate.]

Figure 8 shows one of the six EXE-stage predicates that allow or nullify data forwarding in the data-forwarding network. The other five predicates are handled identically. Predication control of the bypass network is implemented very efficiently by ANDing the predicate value with the destination-valid signal present in conventional bypass logic networks. Instructions with false predicates are treated as merely not writing to their destination register. Thus, the impact of predication on the operand-forwarding network is fairly minimal.
Optimized speculation support in hardware. With minimal hardware impact, the Itanium processor enables software to hide the latency of load instructions and their dependent uses by boosting them out of their home basic block. This is termed speculation. To perform effective speculation, two key issues must be addressed. First, any exceptions that are detected must be deferrable until the operation's home basic block is encountered; this is termed control speculation. Second, all stores between the boosted load and its home location must be checked for address overlap. If there is an overlap, the latest store should forward the correct data; this is termed data speculation. The Itanium processor provides effective support for both forms of speculation.

In case of control speculation, normal exception checks are performed for a control-speculative load instruction. In the common case, no exception is encountered, and therefore no special handling is required. On a detected exception, the hardware examines the exception type, software-managed architectural control registers, and page attributes to determine whether the exception should be handled immediately (such as for a TLB miss) or deferred for future handling.

For a deferral, a special deferred exception token called the NaT (Not a Thing) bit is retained for each integer register, and a special floating-point value, called NaTVal and encoded in the NaN space, is set for floating-point registers. This token indicates that a deferred exception was detected. The deferred exception token is then propagated into result registers when any of the source registers indicates such a token. The exception is reported when either a speculation check or a nonspeculative use (such as a store instruction) consumes a register that is flagged with the deferred exception token. In this way, NaT generation leverages traditional exception logic simply, and NaT propagation uses straightforward data path logic.

The existence of NaT bits and NaTVals also affects the register spill-and-fill logic. For explicit software-driven register spills and fills, special move instructions (store.spill and load.fill) are supported that don't take exceptions when encountering NaT'ed data. For floating-point data, the entire data is simply moved to and from memory. For integer data, the extra NaT bit is written into a special register (called UNaT, or user NaT) on spills and is read back on the load.fill instruction. The UNaT register can also be written to memory if more than 64 registers need to be spilled. In the case of implicit spills and fills generated by the register save engine, the engine collects the NaT bits into another special register (called RNaT, or register NaT), which is then spilled (or filled) once for every 64 register save engine stores (or loads).

For data speculation, the software issues an advanced load instruction. When the hardware encounters an advanced load, it places the address, size, and destination register of the load into the ALAT structure. The ALAT then observes all subsequent explicit store instructions, checking for overlaps of the valid advanced load addresses present in the ALAT. In the common case, there's no match, the ALAT state is unchanged, and the advanced load result is used normally. In the case of an overlap, all address-matching advanced loads in the ALAT are invalidated.

After the last undisambiguated store prior to the load's home basic block, an instruction can query the ALAT and find that the advanced load was matched by an intervening store address. In this situation recovery is needed. When only the load and no dependent instructions were boosted, a load-check (ld.c) instruction is used, and the load instruction is reissued down the pipeline, this time retrieving the updated memory data. As an important performance feature, the ld.c instruction can be issued in parallel with instructions dependent on the load result data. By allowing this optimization, the critical load uses can be issued immediately, allowing the ld.c to effectively be a zero-cycle operation. When the advanced load and its dependent uses were boosted, an advanced check-load (chk.a) instruction traps to a user-specified handler for special fix-up code that reissues the load instruction and the operations dependent on the load. Thus, support for data speculation was added to the pipeline in a straightforward manner, only needing management of a small ALAT in hardware.
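The NaT mechanism is essentially a poison bit threaded through the data path. The sketch below models deferral, propagation, and reporting in C; the structures and the page-fault flag are illustrative assumptions, not the hardware interface.

    #include <stdbool.h>
    #include <stdio.h>

    /* One extra NaT bit per integer register: set on a deferred fault,
       propagated through uses, reported only at a check or nonspeculative
       consumer in the home basic block. */
    typedef struct { long value; bool nat; } GR;

    static GR speculative_load(bool page_fault) {
        if (page_fault)
            return (GR){ .value = 0, .nat = true };   /* defer: poison result */
        return (GR){ .value = 123, .nat = false };
    }

    static GR add(GR a, GR b) {
        /* NaT propagates if any source carries the token. */
        return (GR){ .value = a.value + b.value, .nat = a.nat || b.nat };
    }

    static void chk_s(GR r) {
        /* Speculation check in the home basic block: only now is the
           deferred exception reported. */
        if (r.nat)
            printf("deferred exception: run recovery code\n");
        else
            printf("speculation succeeded: %ld\n", r.value);
    }

    int main(void) {
        GR x = speculative_load(true);             /* boosted load faults */
        GR y = add(x, (GR){ .value = 1, .nat = false });
        chk_s(y);                                  /* reported at home block */
        return 0;
    }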


Floating-point feature set

The FPU in the processor is quite advanced. The native 82-bit hardware provides efficient support for multiple numeric programming models, including support for single, double, extended, and mixed-mode-precision computations. The wide-range 17-bit exponent enables efficient support for extended-precision library functions as well as fast emulation of quad-precision computations. The large 128-entry register file provides adequate register resources. The FPU execution hardware is based on the floating-point multiply-add (FMAC) primitive, which is an effective building block for scientific computation.1 The machine provides execution hardware for four double-precision or eight single-precision flops per clock. This abundant computation bandwidth is balanced with adequate operand bandwidth from the registers and memory subsystem. With judicious use of data prefetch instructions, as well as cache locality and allocation management hints, software can effectively arrange the computation for sustained high utilization of the parallel hardware.

FMAC units
The FPU supports two fully pipelined, 82-bit FMAC units that can execute single, double, or extended-precision floating-point operations. This delivers a peak of 4 double-precision flops/clock, or 3.2 Gflops at 800 MHz. The FMAC units execute FMA, FMS, FNMA, FCVTFX, and FCVTXF operations. When bypassed to one another, the latency of the FMAC arithmetic operations is five clock cycles.

The processor also provides support for executing two SIMD floating-point instructions in parallel. Since each instruction issues two single-precision FMAC operations (or four single-precision flops), the peak execution bandwidth is 8 single-precision flops/clock, or 6.4 Gflops at 800 MHz. Two supplemental single-precision FMAC units support this computation. (Since the read of an 82-bit register actually yields two single-precision SIMD operands, the second operand in each case is peeled off and sent to the supplemental SIMD units for execution.) The high computational rate on single precision is especially suitable for digital content creation workloads.

The divide operation is done in software and can take advantage of the twin fully pipelined FMAC units. Software-pipelined divide operations can yield high throughput on the division and square-root operations common in 3D geometry codes.

The machine also provides one hardware pipe for execution of FCMPs and other operations (such as FMERGE, FPACK, FSWAP, FLogicals, reciprocal, and reciprocal square root). Latency of the FCMP operations is two clock cycles; latency of the other floating-point operations is five clock cycles.

Operand bandwidth
Care has been taken to ensure that the high computational bandwidth is matched with operand feed bandwidth (see Figure B). The 128-entry floating-point register file has eight read and four write ports. Every cycle, the eight read ports can feed two extended-precision FMACs (each with three operands) as well as two floating-point stores to memory. The four write ports can accommodate two extended-precision results from the two FMAC units and the results from two load instructions each clock. To increase the effective write bandwidth into the FPU from memory, we divided the floating-point registers into odd and even banks. This enables the two physical write ports dedicated to load returns to write four values per clock to the register file (two to each bank), using two ldf-pair instructions. The ldf-pair instructions must obey the restriction that the pair of consecutive memory operands being loaded sends one operand to an even register and the other to an odd register for proper use of the banks.

[Figure B. FMAC units deliver 8 flops/clock. The 128-entry, 82-bit floating-point register file (even and odd banks) supplies 6 × 82 bits per clock to the FMACs and supports 2 stores/clock; the L2 cache returns 2 × 82 bits per clock (two ldf-pair, 4 double-precision ops/clock), and the 4-Mbyte L3 cache sustains 2 double-precision ops/clock.]

The earliest cache level to feed the FPU is the unified L2 cache (96 Kbytes). Two ldf-pair instructions can load four double-precision values from the L2 cache into the registers. The latency of loads from this cache to the FPU is nine clock cycles. For data beyond the L2 cache, the bandwidth to the L3 cache is two double-precision operations/clock (one 64-byte line every four clock cycles).

Obviously, to achieve the peak rating of four double-precision floating-point operations per clock cycle, one needs to feed the FMACs with six operands per clock. The L2 memory can feed a peak of four operands per clock. The remaining two need to come from the register file. Hence, with the right amount of data reuse, and with appropriate cache management strategies aimed at ensuring that the L2 cache is well primed to feed the FPU, many workloads can deliver sustained performance at near the peak floating-point operation rating. For data without locality, use of the NT2 and NTA hints enables the data to appear to virtually stream into the FPU through the next level of memory.

FPU and integer core coupling
The floating-point pipeline is coupled to the integer pipeline. Register file read occurs in the REG stage, with seven stages of execution extending beyond the REG stage, followed by floating-point write back. Safe instruction recognition (SIR) hardware enables delivery of precise exceptions on numeric computation. In the FP1 (or EXE) stage, an early examination of operands is performed to determine the possibility of numeric exceptions on the instructions being issued. If the instructions are unsafe (have the potential for raising exceptions), a special form of hardware microreplay is incurred. This mechanism enables instructions in the floating-point and integer pipelines to flow freely in all situations in which no exceptions are possible.

The FPU is coupled to the integer data path via paths between the integer and floating-point register files. These transfers (setf, getf) are issued on the memory ports and made to look like memory operations (since they need register ports on both the integer and floating-point registers). While setf can be issued on either the M0 or M1 port, getf can only be issued on the M0 port. Transfer latency from the FPU to the integer registers (getf) is two clocks. The latency for the reverse transfer (setf) is nine clocks, since this operation appears like a load from the L2 cache.

We enhanced the FPU to support integer multiply inside the FMAC hardware. Under software control, operands are transferred from the integer registers to the FPU using setf. After multiplication is complete, the result is transferred to the integer registers using getf. This sequence takes a total of 18 clocks (nine for setf, seven for fmul to write the registers, and two for getf). The FPU can execute two integer multiply-add (XMA) operations in parallel. This is very useful in cryptographic applications. The presence of twin XMA pipelines at 800 MHz allows over 1,000 decryptions per second on a 1,024-bit RSA using private keys (server-side encryption/decryption).

FPU controls
The FPU controls for operating precision and rounding are derived from the floating-point status register (FPSR). This register also contains the numeric execution status of each operation. The FPSR also supports speculation in floating-point computation. Specifically, the register contains four parallel fields, or tracks, for both controls and flags to support three parallel speculative streams in addition to the primary stream.

Special attention has been placed on delivering high performance for speculative streams. The FPU provides high throughput in cases where the FCLRF instruction is used to clear status from speculative tracks before forking off a fresh speculative chain; no stalls are incurred on such changes. In addition, the FCHKF instruction (which checks for exceptions on speculative chains on a given track) is also supported efficiently. Interlocks on this instruction are track-granular, so that no interlock stalls are incurred if floating-point instructions in the pipeline are only targeting the other tracks. However, changes to the control bits in the FPSR (made via the FSETC instruction or the MOV GR→FPSR instruction) have a latency of seven clock cycles.

FPU summary
The FPU feature set is balanced to deliver high performance across a broad range of computational workloads. This is achieved through the combination of abundant execution resources, ample operand bandwidth, and a rich programming environment.

References
1. B. Olsson et al., "RISC System/6000 Floating-Point Unit," IBM RISC System/6000 Technology, IBM Corp., 1990, pp. 34-42.
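The sidebar's headline rates follow directly from the unit counts and clock frequency; the snippet below simply reproduces that arithmetic from the published figures.

    #include <stdio.h>

    int main(void) {
        const double mhz = 800.0;
        double dp = 2 /* FMAC units */ * 2 /* multiply + add */;
        double sp = 2 /* SIMD FMAC ops/clock */ * 4 /* SP flops each */;
        printf("peak double precision: %.1f Gflops\n", dp * mhz / 1000.0); /* 3.2 */
        printf("peak single precision: %.1f Gflops\n", sp * mhz / 1000.0); /* 6.4 */
        /* Integer multiply routed through the FPU: setf + fmul + getf. */
        printf("integer multiply via FPU: %d clocks\n", 9 + 7 + 2);        /* 18 */
        return 0;
    }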
in a straightforward manner, only needing management of a small ALAT in hardware.

As shown in Figure 9, the ALAT is implemented as a 32-entry, two-way set-associative structure. The array is looked up based on the advanced load's destination register ID, and each entry contains an advanced load's physical address, a special octet mask, and a valid bit. The physical address is used to compare against subsequent stores, with the octet mask bits used to track which bytes have actually been advance loaded; these are used in cases of partial overlap or where the load and store are of different sizes. In case of a match, the corresponding valid bit is cleared. The later check instruction then simply queries the ALAT to examine whether a valid entry still exists for the register.

[Figure 9. ALAT organization: 16 sets, two ways, indexed by RegID[3:0]; each entry holds a RegID tag, the physical address of the advanced load, and a valid bit.]
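A sketch of the resulting data-speculation idiom (hypothetical registers and recovery labels), with a control-speculation counterpart for comparison:

    ld8.a     r6 = [r8]           // advanced load: allocates an ALAT entry for r6
    // ... the load was hoisted above this store by the compiler ...
    st8       [r9] = r7           // if [r9] overlaps [r8], the entry's valid bit is cleared
    chk.a.clr r6, recover         // no valid entry left? branch to recovery code

    ld8.s     r10 = [r11] ;;      // control speculation: faults defer via r10's NaT bit
    chk.s     r10, recover2       // NaT set? branch to recovery code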
The combination of a simple ALAT for data speculation, in conjunction with NaT bits and small changes to the exception logic for control speculation, eliminates the two fundamental barriers that software has traditionally encountered when boosting instructions. By adding this modest hardware
support for speculation, the processor allows the software to take advantage of the compiler's large scheduling window to hide memory latency, without the need for complex dynamic scheduling hardware.

Parallel zero-latency delay-executed branching. Achieving the highest levels of performance requires a robust control flow mechanism. The processor's branch-handling strategy is based on three key directions. First, branch semantics providing more program details are needed to allow the software to convey complex control flow information to the hardware. Second, aggressive use of speculation and predication will progressively lead to an emptying out of basic blocks, leaving clusters of branches. Finally, since the data flow from compare to dependent branch is often very tight, special care needs to be taken to enable high performance for this important case. The processor optimizes across all three of these fronts.

The processor efficiently implements the powerful branch vocabulary of the IA-64 instruction set architecture. The hardware takes advantage of the new semantics for improved branch handling. For example, the loop count (LC) register indicates the number of iterations in a For-type loop, and the epilogue count (EC) register indicates the number of epilogue stages in a software-pipelined loop.

By using the loop count information, high performance can be achieved by software pipelining all loops. Moreover, the implementation avoids pipeline flushes for the first and last loop iterations, since the actual number of iterations is effectively communicated to the hardware.
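For illustration, a minimal software-pipelined loop skeleton (hypothetical registers; this sketch assumes a prior alloc enabled the rotating registers and a three-stage schedule using stage predicates p16 through p18):

    mov       ar.lc = 99          // loop count: iterations - 1 (here, 100 iterations)
    mov       ar.ec = 3 ;;        // epilogue count: stages to drain after the last iteration
loop:
    (p16) ld4   r32 = [r4], 4     // stage 1: load; the value rotates to r33 next iteration
    (p17) add   r34 = r33, r9     // stage 2: compute; the result rotates to r35
    (p18) st4   [r5] = r35, 4     // stage 3: store
    br.ctop.sptk.few  loop ;;     // decrements LC, rotates registers and stage predicates

Because LC and EC drive the br.ctop sequencing and the stage predicates directly, the prologue and epilogue iterations run without the flushes a conventional loop-closing branch would incur.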
By examining the epilogue count register information, the processor automatically generates correct stage predicates for the epilogue iterations of the software-pipelined loop. This step leverages the predicate-remapping hardware along with the branch prediction information from the loop count register-based branch predictor.

Unlike conventional processors, the Itanium processor can execute up to three parallel branches per clock. This is implemented by examining the three controlling conditions (either predicates or the loop count/epilogue count values) for the three parallel branches, and performing a priority encode to determine the earliest taken branch. All side effects of later instructions are automatically squashed within the branch execution itself, preventing any architectural state update from branches in the shadow of a taken branch. Given that the powerful branch prediction in the front end contains tailored support for multiway branch prediction, minimal pipeline disruptions can be expected due to this parallel branch execution.

Finally, the processor optimizes for the common case of a very short distance between the branch and the instruction that generates the branch condition. The IA-64 instruction set architecture allows a conditional branch to be issued concurrently with the integer compare that generates its condition code; no stop bit is needed. To accommodate this important performance optimization, the processor pipelines the compare-branch sequence. The compare instruction is performed in the pipeline's EXE stage, with the results being known by the end of the EXE clock. To accommodate the delivery of this condition to the branch hardware, the processor executes all branches in the DET stage. (Note that the presence of the DET stage isn't an overhead needed solely for branching. This stage is also used for exception collection and prioritization, and for the second clock of execution for integer-SIMD operations.) Thus, any branch issued in parallel with the compare that generates its condition will be evaluated in the DET stage, using the predicate results created in the previous (EXE) stage. In this manner, the processor can easily handle the case of compare and dependent branches issued in parallel.

As a result of branch execution in the DET stage, in the rare case of a full pipeline flush due to a branch misprediction, the processor will incur a branch misprediction penalty of nine pipeline bubbles. Note that we expect this to occur rarely, given the aggressive multitier branch prediction strategy in the front end. Most branches should be predicted correctly using one of the four progressive resteers in the front end.

The combination of enhanced branch semantics, three-wide parallel branch execution, and zero-cycle compare-to-branch latency allows the processor to achieve high performance on control-flow-dominated codes, in addition to its high performance on more computation-oriented, data-flow-dominated workloads.
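As a small example of the zero-cycle compare-to-branch case (hypothetical registers and target label), the compare and its dependent branch can occupy the same issue group, with no stop bit between them:

    cmp.eq  p1, p2 = r5, r6       // predicate computed in the EXE stage
    (p1) br.cond.sptk  target     // branch evaluated one stage later, in DET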

Table 2. Implementation of cache hints.

Hint          Semantics                  L1 response         L2 response                      L3 response
NTA           Nontemporal (all levels)   Don't allocate      Allocate, mark as next replace   Don't allocate
NT2           Nontemporal (2 levels)     Don't allocate      Allocate, mark as next replace   Normal allocation
NT1           Nontemporal (1 level)      Don't allocate      Normal allocation                Normal allocation
T1 (default)  Temporal                   Normal allocation   Normal allocation                Normal allocation
Bias          Intent to modify           Normal allocation   Allocate into exclusive state    Allocate into exclusive state

Memory subsystem
In addition to the high-performance core, the Itanium processor provides a robust cache and memory subsystem, which accommodates a variety of workloads and exploits the memory hints of the IA-64 ISA.

Three levels of on-package cache
The processor provides three levels of on-package cache for scalable performance across a variety of workloads. At the first level, instruction and data caches are split, each 16 Kbytes in size, four-way set-associative, and with a 32-byte line size. The dual-ported data cache has a load latency of two cycles, is write-through, and is physically addressed and tagged. The L1 caches are effective on moderate-size workloads and act as a first-level filter for capturing the immediate locality of large workloads.

The second cache level is 96 Kbytes in size, is six-way set-associative, and uses a 64-byte line size. The cache can handle two requests per clock via banking. This cache is also the level at which ordering requirements and semaphore operations are implemented. The L2 cache uses a four-state MESI (modified, exclusive, shared, and invalid) protocol for multiprocessor coherence. The cache is unified, allowing it to service both instruction and data side requests from the L1 caches. This approach allows optimal cache use for both instruction-heavy (server) and data-heavy (numeric) workloads. Since floating-point workloads often have large data working sets and are used with compiler optimizations such as data blocking, the L2 cache is the first point of service for floating-point loads. Also, because floating-point performance requires high bandwidth to the register file, the L2 cache can provide four double-precision operands per clock to the floating-point register file, using two parallel floating-point load-pair instructions.
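The load-pair idiom referred to above might look as follows (a sketch; the registers are hypothetical, and ldfpd requires an even/odd destination pair):

    ldfpd  f32, f33 = [r8], 16    // two double-precision operands; pointer advances 16 bytes
    ldfpd  f34, f35 = [r9], 16 ;; // second pair on the other memory port in the same clock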
The third level of on-package cache is 4 Mbytes in size, uses a 64-byte line size, and is four-way set-associative. It communicates with the processor at core frequency (800 MHz) using a 128-bit bus. This cache serves the large workloads of server- and transaction-processing applications, and minimizes the cache traffic on the frontside system bus. The L3 cache also implements a MESI protocol for multiprocessor coherence.

A two-level hierarchy of TLBs handles virtual address translations for data accesses. The hierarchy consists of a 32-entry first-level and a 96-entry second-level TLB, backed by a hardware page walker.

Optimal cache management
To enable optimal use of the cache hierarchy, the IA-64 instruction set architecture defines a set of memory locality hints used for better managing the memory capacity at specific hierarchy levels. These hints indicate the temporal locality of each access at each level of the hierarchy. The processor uses them to determine allocation and replacement strategies for each cache level. Additionally, the IA-64 architecture allows a bias hint, indicating that the software intends to modify the data of a given cache line. The bias hint brings a line into the cache with ownership, thereby optimizing the MESI protocol latency.

Table 2 lists the hint bits and their mapping to cache behavior. If data is hinted to be nontemporal for a particular cache level, that data is simply not allocated to the cache. (On the L2 cache, to simplify the control logic, the processor implements this algorithm approximately: the data can be allocated to the cache, but the least recently used, or LRU, bits are modified to mark the line as the next target for replacement.)
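In code, these hints appear as completers on loads, stores, and prefetches; a sketch with hypothetical registers:

    ld8.nt1    r6 = [r9]          // load marked nontemporal at L1: no L1 allocation
    ld8.bias   r7 = [r10]         // bias hint: fetch the line into exclusive state
    st8.nta    [r11] = r12        // store marked nontemporal at all levels
    lfetch.nt2 [r13]              // prefetch marked nontemporal at two levels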


Note that the nearest cache level to feed the floating-point unit is the L2 cache. Hence, for floating-point loads, the behavior is modified to reflect this shift (an NT1 hint on a floating-point access is treated like an NT2 hint on an integer access, and so on).

Allowing the software to explicitly provide high-level semantics of the data usage pattern enables more efficient use of the on-chip memory structures, ultimately leading to higher performance for any given cache size and access bandwidth.

System bus
The processor uses a multidrop, shared system bus to provide four-way glueless multiprocessor system support. No additional bridges are needed for building up to a four-way system. Systems with eight or more processors are designed through clusters of these nodes using high-speed interconnects. Note that multidrop buses are a cost-effective way to build high-performance four-way systems for commercial transaction processing and e-business workloads. These workloads often have highly shared writeable data and demand high throughput and low latency on transfers of modified data between caches of multiple processors.

In a four-processor system, the transaction-based bus protocol allows up to 56 pending bus transactions (including 32 read transactions) on the bus at any given time. An advanced MESI coherence protocol helps in reducing bus invalidation transactions and in providing faster access to writeable data. The cache-to-cache transfer latency is further improved by an enhanced "defer mechanism," which permits efficient out-of-order data transfers and out-of-order transaction completion on the bus. A deferred transaction on the bus can be completed without reusing the address bus. This reduces data return latency for deferred transactions and efficiently uses the address bus. This feature is critical for scalability beyond four-processor systems.

The 64-bit system bus uses a source-synchronous data transfer to achieve 266 Mtransfers/s, which enables a bandwidth of 2.1 Gbytes/s. The combination of these features makes the Itanium processor a scalable building block for large multiprocessor systems.

The Itanium processor is the first IA-64 processor and is designed to meet the demanding needs of a broad range of enterprise and scientific workloads. Through its use of EPIC technology, the processor fundamentally shifts the balance of responsibilities between software and hardware. The software performs global scheduling across the entire compilation scope, exposing ILP to the hardware. The hardware provides abundant execution resources, manages the bookkeeping for EPIC constructs, and focuses on dynamic fetch and control flow optimizations to keep the compiled code flowing through the pipeline at high throughput. The tighter coupling and increased synergy between hardware and software enable higher performance with a simpler and more efficient design.

Additionally, the Itanium processor delivers significant value propositions beyond just performance. These include support for 64 bits of addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and multiprocessor platforms. MICRO

References
1. L. Gwennap, "Merced Shows Innovative Design," Microprocessor Report, MicroDesign Resources, Sunnyvale, Calif., Oct. 6, 1999, pp. 1, 6-10.
2. M.S. Schlansker and B.R. Rau, "EPIC: Explicitly Parallel Instruction Computing," Computer, Feb. 2000, pp. 37-45.
3. T.Y. Yeh and Y.N. Patt, "Two-Level Adaptive Training Branch Prediction," Proc. 24th Ann. Int'l Symp. Microarchitecture, ACM Press, New York, Nov. 1991, pp. 51-61.

4. L. Gwennap, "New Algorithm Improves Branch Prediction," Microprocessor Report, Mar. 27, 1995, pp. 17-21.
5. B.R. Rau et al., "The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions, and Trade-Offs," Computer, Jan. 1989, pp. 12-35.

Harsh Sharangpani was Intel's principal microarchitect on the joint Intel-HP IA-64 EPIC ISA definition. He managed the microarchitecture definition and validation of the EPIC core of the Itanium processor. He has also worked on the 80386 and i486 processors, and was the numerics architect of the Pentium processor. Sharangpani received an MSEE from the University of Southern California, Los Angeles, and a BSEE from the Indian Institute of Technology, Bombay. He holds 20 patents in the field of microprocessors.

Ken Arora was the microarchitect of the execution and pipeline control of the EPIC core of the Itanium processor. He participated in the joint Intel/HP IA-64 EPIC ISA definition and helped develop the initial IA-64 simulation environment. Earlier, he was a designer and architect on the i486 and Pentium processors. Arora received a BS degree in computer science and a master's degree in electrical engineering from Rice University. He holds 12 patents in the field of microprocessor architecture and design.

Direct questions about this article to Harsh Sharangpani, Intel Corporation, Mail Stop SC 12-402, 2200 Mission College Blvd., Santa Clara, CA 95052; [email protected].

