Itanium 2 Processor Microarchitecture

The Itanium 2 processor extends the processing power of the Itanium processor family with a capable and balanced microarchitecture. Executing up to six instructions at a time, it provides both performance and binary compatibility for Itanium-based applications and operating systems.

Cameron McNairy, Intel
Don Soltis, Hewlett-Packard

On 8 July 2002, Intel introduced the Itanium 2 processor, the Itanium architecture's second implementation. This event was a milestone in the cooperation between Intel and Hewlett-Packard to establish the Itanium architecture as a key workstation, server, and supercomputer building block. The Itanium 2 processor may appear similar to the Itanium processor, yet it represents significant advances in performance and scalability. (Sharangpani and Arora give an overview of the Itanium processor [1].) These advances result from improvements in frequency, pipeline depth, pipeline control, branch prediction, cache design, and system interface. The microarchitecture design enables the processor to effectively address a wide variety of computation needs.

Table 1 lists the processor's main features. We obtained the Spec FP2000 and Spec CPU2000 results from http://www.spec.org on 20 February 2002. We obtained the other benchmarks from http://developer.intel.com/products/processors/server/itanium2/index.htm. This site contains relevant information about the measurement circumstances.

Table 1. Features of the Itanium 2 processor.

  Design
    Frequency: 1 GHz
    Pipe stages: 8, in-order
    Issue/retire: 6 instructions
    Execution units: 2 integer, 4 memory, 3 branch, 2 floating-point
    Process: 180 nm
    Transistors: 40 million (core); 180 million (L3 cache)
    Die size: 421 mm²

  Caches
    L1 instruction: 16 Kbytes; 1-cycle latency; parity protected
    L1 data: 16 Kbytes; 1-cycle latency; parity protected
    L2: 256 Kbytes; 5-, 7-, or 9+-cycle latency; parity or ECC* protected
    L3: 3 Mbytes; 12+-cycle latency; ECC protected

  Benchmark results
    Spec CPU2000 score: 810
    Spec FP2000 score: 1,431
    TPC-C (32-way): 433,107 transactions per minute
    Stream: 3,700 Mbytes/s
    Linpack 10K**: 13.94 Gflops

  * ECC: error-correcting code
  ** Performed with four processors

Microarchitecture overview
Many aspects of the Itanium 2 processor microarchitecture result from opportunities and requirements associated with Intel's Itanium architecture (formerly called the IA-64 architecture) [2]. The architecture goes beyond simply defining 64-bit operations and register widths; it defines flexible memory management schemes and several tools that compilers can use to realize performance. It enables parallel instruction execution without resorting to complex out-of-order designs by explicitly indicating which instructions can issue in parallel without data hazards. To that end, three instructions are statically grouped into 16-byte bundles. Multiple instruction bundles can execute in parallel, or explicit stops can break parallel execution to avoid data hazards. Each bundle encodes a template that indicates which type of execution resource the instructions require: integer (I), memory (M), floating point (F), branch (B), and long extended (LX). Thus, memory, floating-point, and branch operations that can execute in parallel comprise a bundle with an MFB template.

The Itanium 2 processor designers took advantage of explicit parallelism to design an in-order, six-instruction-issue, parallel-execution pipeline. The relatively simple pipeline allowed the design team to focus resources on the memory subsystem's performance and to exploit many of the architecture's performance opportunities. Figure 1 shows the core pipeline and the relationship of some microarchitecture structures to the pipeline.
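To make the bundle format concrete, here is a minimal sketch of how a 16-byte bundle breaks down, assuming the Itanium encoding of a 5-bit template field plus three 41-bit instruction slots (5 + 3 x 41 = 128 bits); the template-number-to-slot-type table and the example bundle contents are illustrative assumptions, not the architected encodings.

```python
# Minimal sketch of the 16-byte bundle layout: a 5-bit template field
# plus three 41-bit instruction slots. The template-to-slot-type table
# uses illustrative numbering, not the architected encodings.
TEMPLATES = {
    0: ("M", "I", "I"),
    1: ("M", "M", "I"),
    2: ("M", "M", "F"),   # two memory ops plus one floating-point op
    3: ("M", "F", "B"),   # the MFB example from the text
}

SLOT_MASK = (1 << 41) - 1

def decode_bundle(bundle: int):
    """Split a 128-bit bundle into its template number and three slots."""
    template = bundle & 0x1F                                  # bits 0-4
    slots = [(bundle >> (5 + 41 * i)) & SLOT_MASK for i in range(3)]
    return template, TEMPLATES.get(template), slots

# A made-up bundle with template 2 (MMF here) and arbitrary slot bits.
bundle = (0x123 << 87) | (0x456 << 46) | (0x789 << 5) | 2
template, slot_types, slots = decode_bundle(bundle)
print(template, slot_types, [hex(s) for s in slots])
```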

These structures include the instruction buffer, which decouples the front end, where instruction fetch and branch prediction occur, from the back end, where instructions are dispersed and executed. The back-end pipeline renames virtual registers to physical registers, accesses the register files, executes the operation, checks for exceptions, and commits the results.

Instruction fetch
The front-end structures fetch instructions for later use by the back end. The front end chooses an instruction pointer (IP) from the next linear IP, branch prediction resteer pointers, or branch misprediction and instruction exception resteer pointers. The front end then presents the IP to the instruction cache and translation look-aside buffer (TLB). These structures are tightly coupled, allowing the processor to determine which cache way, if any, was a hit, and to deliver the cache contents in the next cycle using an innovation called prevalidated tags. This is the same idea presented in other Itanium 2 processor descriptions [3] in the context of the first-level data (L1D) cache, but here we discuss it in the context of the instruction cache.

Prevalidated-tag cache design
Traditional physically addressed cache designs require a TLB access to translate a virtual address to a physical address. The cache's hit detection logic then compares the physical address with the tags stored in each cache way. The serialized translation and comparison typically lead to multicycle cache designs. In a prevalidated-tag cache design, the cache tags do not store a physical address; they store an association to the TLB entry that holds the appropriate virtual-address translation. In the Itanium 2 processor, when the front end presents a virtual address to the TLB, the cache's hit detection logic directly compares the identifier of the entry that matches the virtual address, called the match line, with a one-hot vector stored in the cache tags. The vector indicates which TLB entry holds the translation associated with the contents of that cache way. This allows a fast determination of which cache way of a set, if any, is a hit. The hit result feeds into the way select logic to drive the cache contents to the consumer.

The removal of the physical address from the hit detection critical path is significant. It provides an opportunity for a single-cycle cache, but it requires the TLB to be tightly coupled with the cache tags. Another implication is that a miss in the TLB also results in a cache miss, because no match lines will be driven. Moreover, the number of TLB entries determines the number of bits held in each way's tag and might limit the coupled TLB's size. Figure 2 shows how prevalidated tags tied to a 32-entry TLB determine a hit.

[Figure 2. Prevalidated cache tags tied to the TLB determine a hit. The presented virtual address matches TLB entry 3. The TLB drives a match line indicating the match to the hit comparator, which reads and compares each way's tag against this match line. The tag in way 2 matches the match line, so way 2 is reported as a hit.]
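A behavioral sketch of the hit check in Figure 2 may help; it assumes the figure's 32-entry TLB and a four-way set, and models the match line and one-hot tags as integers rather than the hardware's wired comparison.

```python
# Behavioral sketch of Figure 2's hit check (32-entry TLB, 4-way set).
# Each way's tag holds a one-hot vector naming the TLB entry that
# translates the line's address, so hit detection is a single AND with
# the TLB's match line instead of a physical-address compare.

NUM_TLB_ENTRIES = 32

def tlb_match_line(tlb, vpage):
    """One-hot vector with a bit set for the TLB entry matching vpage."""
    line = 0
    for i, entry_vpage in enumerate(tlb):
        if entry_vpage == vpage:
            line |= 1 << i
    return line          # all zeros on a TLB miss: a guaranteed cache miss

def way_hit(set_tags, match_line):
    """Return the hitting way, comparing stored one-hot tags to the line."""
    for way, tag_vector in enumerate(set_tags):
        if tag_vector & match_line:
            return way
    return None          # cache miss

tlb = [0x4000 + i for i in range(NUM_TLB_ENTRIES)]  # fake virtual pages
set_tags = [0, 1 << 2, 0, 1 << 7]   # way 1 was filled under TLB entry 2
print(way_hit(set_tags, tlb_match_line(tlb, 0x4002)))   # -> 1
```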


[Figure 1. Itanium 2 processor pipeline. Front end: IPG (instruction pointer generation and fetch) and ROT (instruction rotation), feeding an eight-bundle (24-instruction) instruction buffer. Back end: EXP (instruction template decode, expand, and disperse), REN (rename, for register stack and rotating registers, and decode), REG (register file read), EXE (ALU execution), DET (exception detection), and WRB (write back), with floating-point pipe stages FP1-FP4. ALAT: advanced-load address table; TLB: translation look-aside buffer.]

L1I cache complex
The L1I cache complex comprises the first-level instruction TLB (L1I TLB), the second-level instruction TLB (L2I TLB), and the first-level instruction cache (L1I). The L1I TLB and the L1I cache are arranged as required for a prevalidated-tag design. The four-way set-associative L1I cache is 16 Kbytes in size, relatively small because of latency and area design constraints but still optimal. An instruction prefetch engine enhances the cache's effective size. The dual-ported tags and TLB resolve demand and prefetch requests without conflict. The page offset of the virtual-address bits selects a set from the tag array and the data array for demand accesses. The upper bits of the virtual address determine which way, if any, holds the requested instructions. The tag and TLB lookup results determine an L1I hit or miss, as described earlier.

The 64-byte L1I cache line holds four instruction bundles. The L1I can sustain a stream of one 32-byte read per cycle to provide two bundles per cycle to the back-end pipeline. The fetched bundles go directly to the dispersal logic or into an instruction buffer for later consumption. If the instruction buffer is full, the front-end pipeline stalls.

The L1I TLB directly supports only a 4-Kbyte page size. The L1I TLB indirectly supports larger page sizes by allocating additional entries as each 4-Kbyte segment of the larger page is referenced. An L1I TLB miss implies a miss in the L1I cache and can initiate L2I TLB and second-level (L2) cache accesses, as well as a fill of page information to the L1I TLB.

The L2I TLB is a 128-entry, fully associative structure with a single port. Each entry can represent all page sizes defined in the architecture, from 4 Kbytes to 4 Gbytes. Up to 64 entries can be pinned as translation registers to ensure that hot pages are always available. In the event of an L2I TLB miss, the L2I TLB requests the hardware page walker (HPW) to fetch a translation from the virtual hashed page table. If a translation is available, the HPW inserts it into the L2I TLB. If a translation is not available or the HPW aborts, an exception occurs and the operating system assumes control to establish a mapping for the reference.

Instruction-streaming buffer
The instruction-streaming buffer (ISB) augments the instruction cache. The ISB holds eight L1I cache lines of instructions returned from the L2 or higher cache levels. It also stores virtual addresses that are scanned by the ISB hit detection logic for each IP presented to the L1I cache. An ISB hit has the same one-cycle latency as a normal L1I cache hit. Instructions typically spend little time in the ISB because the L1I cache can usually support reads and fills in the same cycle. The ISB enables branch prediction, instruction demand accesses, and instruction prefetch accesses to occur without conflict.

Instruction prefetching
Software can engage the instruction prefetch engine to reduce the instruction cache miss count and the associated penalty. The architecture defines hint instructions that provide the hardware early information about a future branch. In the Itanium 2 processor, these instructions direct the instruction prefetch engine to prefetch one or many L2 cache lines. The virtual address of the desired instructions allocates into the eight-entry prefetch virtual-address buffer. Addresses from this buffer access the L1I TLB and L1I cache tags through the prefetch port, keeping prefetch requests from interfering with critical instruction accesses. If the instructions already exist in the L1I cache, the address is removed from the address buffer. If the instructions are missing, the prefetch engine sends a prefetch request to the L2 cache.

The prefetch engine also supports a special prefetch hint on branch instructions to initiate a streaming prefetch. For these hints, the prefetch engine continues to fetch along a linear path, up to four L2 cache lines ahead of demand accesses. Software hints can explicitly stop the current streaming prefetch or engage a new streaming prefetch. The prefetch engine automatically stops prefetching down a path if a mispredicted branch resteers the front end. The prefetch engine avoids cache pollution through software hints, branch-prediction-based cancellation, self-throttling mechanisms, and an L1I cache line replacement algorithm that biases unreferenced instructions for replacement.

Branch prediction
The Itanium 2 processor's branch prediction performance relies on a two-level prediction algorithm and two levels of branch history storage. The first level of branch prediction storage is tightly coupled to the L1I cache. This coupling allows a branch's taken/not-taken history and a predicted target to be delivered with every L1I demand access in one cycle. The branch prediction logic uses the history to access a pattern history table and determine a branch's final taken/not-taken prediction, or trigger, according to the Yeh-Patt algorithm [4]. The L2 branch cache saves the histories and triggers of branches evicted from the L1I so that they are available when the branch is revisited, providing the second storage level.
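The following sketch shows the shape of the cited two-level scheme for a single branch: its taken/not-taken history selects a 2-bit saturating counter in a pattern history table. The history length, table size, and initialization are illustrative choices, not the Itanium 2's.

```python
# Sketch of a two-level adaptive (Yeh-Patt) predictor: a per-branch
# history indexes a pattern history table (PHT) of 2-bit saturating
# counters. Sizes here are illustrative, not the hardware's.

HIST_BITS = 4
pht = [2] * (1 << HIST_BITS)        # 2-bit counters, init weakly taken

def predict(history):
    return pht[history] >= 2        # counter value 2 or 3: predict taken

def update(history, taken):
    """Train the counter, then shift the outcome into the history."""
    c = pht[history]
    pht[history] = min(c + 1, 3) if taken else max(c - 1, 0)
    return ((history << 1) | taken) & ((1 << HIST_BITS) - 1)

# A loop branch taken three times then not taken, repeatedly: each
# history pattern trains its own counter, so accuracy climbs quickly.
history, correct = 0, 0
outcomes = [1, 1, 1, 0] * 50
for t in outcomes:
    correct += predict(history) == bool(t)
    history = update(history, t)
print(f"{correct}/{len(outcomes)} correct")
```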


The one-cycle latency provides a zero-penalty resteer for correctly predicted IP-relative branches. The prediction information consists of the prediction history and trigger for every branch instruction, up to three per bundle, and a portion of the predicted target's virtual address for every bundle pair. Because the bundles share the target, and the stored portion may not be sufficient to represent the entire span required by the branch, there might be times when the front end is resteered to an incorrect address. The branch prediction logic tracks this situation and provides a corrected IP-relative target one cycle later.

Return stack buffer and indirect branches
All predictions for return branches come from an eight-entry return stack buffer (RSB). A branch call pushes both the caller's IP and its current function state onto the RSB. A return branch pops off this information. The RSB predictions resteer the front end two cycles after the cache lookup that contains the return branch.

The branch prediction logic predicts indirect branch targets on the basis of the current value in the referenced branch register, three cycles after the cache lookup that contains the indirect branch.

Branch resolution
All branch predictions are validated in the back-end pipeline. The branch prediction logic allows in-flight branch prediction to determine future branch prediction behavior; however, nonspeculative prediction state is maintained and restored in the case of a misprediction. Table 2 lists the possible branch prediction penalties and their causes.

Table 2. Possible branch prediction penalties and their causes. A correctly predicted taken branch incurs no penalty.

  Penalty (cycles)   Cause
  1                  Correctly predicted taken IP-relative branch with incorrect target, and return branch
  2                  Nonreturn indirect branch
  6                  Incorrect taken/not-taken prediction or incorrect indirect target

L2 branch cache
The size and organization of the branch prediction information suggest that branch prediction accuracy suffers when the instruction stream revisits a branch that has lost its prediction history because of an eviction. To mitigate the potential loss of branch histories, the L2 branch cache (L2B) stores the triggers and histories of branches evicted from the first-level storage. The L2B is a 24,000-entry backing store that does not use tags; instead, it uses three address-based hashing functions and voting to determine the correct initialization of prediction histories and triggers for L1I fills. Limiting the L2B to prediction history and trigger information but not targets provides a highly effective and compact design. A branch target can be recalculated, in most cases, before an L1I fill occurs and with little penalty. It is possible that the L2B does not contain any information for the line being filled to the L1I. In that case, the trigger and history bits are initialized according to the branch completers provided in the branch instruction.
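A minimal sketch of the tagless, voting-based lookup just described, with assumed hash functions and a payload reduced to a single trigger bit; the real L2B stores histories and triggers and uses hardware hashing.

```python
# Sketch of the tagless L2 branch cache (L2B): three different
# address-based hashes each pick a slot, and the stored values vote.
# Hash functions and the one-bit payload are illustrative assumptions.

L2B_SIZE = 24000

def hashes(addr):
    return (addr % L2B_SIZE,
            (addr * 2654435761 >> 7) % L2B_SIZE,    # arbitrary mixers
            (addr ^ (addr >> 13)) % L2B_SIZE)

l2b = [0] * L2B_SIZE          # simplified: one trigger bit per slot

def l2b_store(addr, trigger):
    for h in hashes(addr):
        l2b[h] = trigger      # write the same value to all three slots

def l2b_lookup(addr):
    """Majority vote; aliasing by other branches can corrupt one slot."""
    return sum(l2b[h] for h in hashes(addr)) >= 2

l2b_store(0x1234, 1)
l2b[hashes(0x1234)[0]] = 0    # simulate destructive aliasing in one slot
print(l2b_lookup(0x1234))     # -> True: the other two slots outvote it
```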
Instruction buffer
The instruction buffer receives instructions from the L1I or L2 caches and lets the front end fetch instructions ahead of the back-end pipeline's consumption of instructions. This eight-bundle buffer and bundle rotator can present a wide combination of two instruction bundles to the back-end dispersal logic. Thus, no matter how many instructions the back end consumes in a cycle, two bundles of instructions are available. The dispersal logic indicates that zero, one, or two bundles were consumed so that the instruction buffer can free the appropriate entries. If the pipeline is flushed or the instruction buffer is empty, a bundle can bypass the instruction buffer completely.

Instruction dispersal
Figure 3 shows the design of the Itanium 2 processor front end and dispersal logic. The processor can issue and execute two instruction bundles, or six instructions, at a time. These instructions issue to one of 11 issue ports:

• two integer,
• four memory,
• two floating-point, and
• three branch.

[Figure 3. Itanium 2 processor front-end and dispersal-logic design. The L1I TLB, L1I tags, and L1I array, backed by the L2 cache and instruction-streaming buffer, feed the instruction buffer; the prefetch virtual-address buffer (PVAB), pattern history table, and L2 histories support prefetch and prediction; dispersal routes instructions to the 11 issue ports (four M, two I, two F, and three B).]

These ports allocate instructions to several execution units. Two integer units execute integer operations such as shift and extract; ALU operations such as add, and, and compare; and multimedia ALU operations. Four memory units execute memory operations such as load, store, semaphore, and prefetch, in addition to the ALU and multimedia instructions that the integer units can execute. The four memory units are slightly asymmetric: two are dedicated to integer loads and two to stores. Compared with a two-memory-port implementation, the four memory ports provide a threefold increase in dual-issue template combinations and many other performance improvement opportunities [5].

The processor's dispersal logic looks at two bundles of instructions every cycle and assigns as many instructions as possible to execution resources. There are multiple resources for each template type (I, M, F, B, and LX), and the dispersal logic typically assigns the first I instruction to the first I resource, the second I instruction to the second I resource, and so on until it exhausts the resources or an explicit stop bit breaks up an issue group. If instructions in the two bundles considered require more resources than are available, the issue group stops at the oversubscription point, and the remaining instructions wait for dispersal in the next cycle. An issue group's composition is determined at dispersal and remains constant through the in-order execution pipeline.

The dispersal logic dynamically maps instructions to the most appropriate resource. This is important in cases of limited or asymmetric execution resources. For example, the dispersal logic assigns a load instruction to the first load-capable M port (M0 or M1) and a store to the first store-capable M port (M2 or M3), even if the store precedes the load in the issue group.
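A simplified model of these dispersal rules, ignoring templates, stop bits, and the F and B ports: loads take the first free load-capable M port, stores take the first free store-capable M port, and ALU operations spill from the I ports onto free M ports. The port names match the article; the rest is illustrative.

```python
# Sketch of the dispersal rules described above (simplified: one issue
# group of up to six instructions, integer/memory ports only).

def disperse(group):
    free = {"M0": True, "M1": True, "M2": True, "M3": True,
            "I0": True, "I1": True}
    assignment = []
    for op in group:
        if op == "load":
            ports = ["M0", "M1"]                    # load-capable ports
        elif op == "store":
            ports = ["M2", "M3"]                    # store-capable ports
        else:                                       # "alu": I, then spill to M
            ports = ["I0", "I1", "M0", "M1", "M2", "M3"]
        port = next((p for p in ports if free[p]), None)
        if port is None:            # oversubscribed: split the group here
            break
        free[port] = False
        assignment.append((op, port))
    return assignment

# Two MII-like bundles' worth of work: one load plus five ALU ops.
print(disperse(["load", "alu", "alu", "alu", "alu", "alu"]))
# -> load on M0; ALU ops fill I0 and I1, then spill to M1, M2, M3
```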


In addition, the dispersal logic ignores this asymmetry for floating-point loads so that they issue to any M resource. Dynamic resource mapping also lets instructions typically assigned to I resources issue on M resources. If the template assigns an ALU or multimedia operation to an I resource, but all I resources have been exhausted, the dispersal logic dynamically reassigns the operation to an available M resource. Thus, the processor can often issue a pair of MII bundles despite having only two I resources. These capabilities remove from the code generator the burden of ordering and padding instructions to ensure that they issue to correct resources.

Register stack engine and renaming
The Itanium 2 processor implements 128 integer registers, a register stack engine (RSE), and register renaming. These features work together to give software the perception of unlimited registers. The RSE maintains the register set currently visible to the application by saving registers to and restoring registers from a backing store. Software can allocate registers as needed, and the RSE can stall the pipeline to write a dirty register's value to memory (the backing store) to make room for the newly allocated registers. A branch return instruction can also engage the RSE if the registers required by the return frame are not available and must be loaded from the backing store. The RSE significantly reduces the number of memory operations required by software for function and system calls. The larger register file and the RSE implementation keep the amount of time applications spend saving and restoring registers low [6].

The register-renaming logic manages registers across calls and returns and enables efficient software-pipelined code. The logic works with branch instructions to provide a new set of registers to a software-pipelined loop through register rotation. Compilers can software-pipeline many floating-point and a growing number of integer loops, resulting in significant code and execution efficiencies. In addition, the static nature of renaming, from RSE engagement onward, means that this capability is far simpler than the register renaming performed by out-of-order implementations.

Scoreboard and hazard detection
The Itanium 2 processor's scoreboard mechanism enables high performance in the face of L1D misses. Once physical register identifiers are available from the register-renaming logic, the hazard detection logic compares them with the register scoreboard, which lists registers associated with earlier L1D misses. If an instruction source or destination register matches a scoreboarded register, the issue group stalls at the execute (EXE) stage until the register value becomes available. This feature facilitates nonblocking-cache designs, in which multiple cache misses can be outstanding and yet the processor can continue to execute until the instruction stream references a register in the scoreboard.

A similar mechanism exists for other long-execution-latency operations such as floating-point and multimedia operations. For these, latency is fixed at four and two cycles, respectively. The hazard detection logic tracks the operation type and destination register and compares each instruction source and destination register with these in-flight operations. Like a scoreboard match, a long-latency-operation match stalls the entire issue group.
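A minimal sketch of the scoreboard check described above, assuming a simple (destination, sources) encoding for instructions: registers owed data by outstanding L1D misses are recorded, and an issue group stalls only when it actually references one of them.

```python
# Sketch of the scoreboard: registers with L1D misses in flight are
# recorded; an issue group stalls at EXE only when an instruction
# references one. Instructions are modeled as (dest, sources) tuples.

scoreboard = set()                     # registers awaiting miss data

def load_miss(dest_reg):
    scoreboard.add(dest_reg)           # data will arrive cycles later

def fill_returns(dest_reg):
    scoreboard.discard(dest_reg)       # miss data arrived

def group_must_stall(issue_group):
    """True if any source or destination register is scoreboarded."""
    return any(reg in scoreboard
               for dest, srcs in issue_group
               for reg in (dest, *srcs))

load_miss("r8")                        # ld r8 = [r9] missed the L1D
print(group_must_stall([("r10", ("r11", "r12"))]))  # False: keeps going
print(group_must_stall([("r13", ("r8", "r12"))]))   # True: consumer stalls
fill_returns("r8")
print(group_must_stall([("r13", ("r8", "r12"))]))   # False after the fill
```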
Integer execution and bypass
The six execution units supporting integer, multimedia, and ALU operations are fully bypassed; that is, as soon as an execution unit calculates a result, the result becomes available for use by another instruction on any other execution unit. A producer-and-consumer dependency matrix, considering latencies and instruction types, controls the bypass network. Twelve read ports and eight write ports on the integer register file and 20 bypass choices support highly parallel execution [7]. Six of the eight write ports are for calculation results, and the other two provide write paths for load returns from the L1D cache. All ALU and integer operations complete in one cycle.
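The effect of full bypassing can be sketched as a readiness check against each producer's fixed latency (one cycle for ALU and integer operations, two for multimedia, four for floating point, per the article); the table-driven form below is an assumption, not the hardware's dependency matrix.

```python
# Sketch of a full-bypass readiness check: a consumer can issue once its
# producer's fixed latency has elapsed, taking the result from the
# bypass network rather than waiting for register-file write back.

LATENCY = {"alu": 1, "integer": 1, "multimedia": 2, "fp": 4}

in_flight = {}          # dest register -> cycle its result is available

def issue(op_type, dest, cycle):
    in_flight[dest] = cycle + LATENCY[op_type]

def can_issue(srcs, cycle):
    """True if every source is ready (bypassed or already written back)."""
    return all(in_flight.get(r, 0) <= cycle for r in srcs)

issue("alu", "r7", cycle=0)       # add r7 = r1, r2
issue("fp", "f6", cycle=0)        # fma f6 = f1, f2, f3
print(can_issue(["r7"], cycle=1)) # True: one-cycle ALU result bypassed
print(can_issue(["f6"], cycle=1)) # False: FP result not ready until cycle 4
print(can_issue(["f6"], cycle=4)) # True
```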

Floating-point execution
Each of the two floating-point execution units can execute a fused multiply-add or a miscellaneous floating-point operation. Latency is fixed at four cycles for all floating-point calculations. The units are fully pipelined and bypassed. Eight read and six write ports access the 128 floating-point registers. Six of the read ports supply operands for calculation; the remaining two read ports are for floating-point store operations. Two of the write ports are for calculation results; the other four provide write paths for floating-point load returns from the L2 cache. The four M resources and the two F resources combined allow two MMF bundles to execute every cycle. This provides the memory and computational bandwidth required for technical computing [5].

Pipeline control
The Itanium 2 processor pipeline is fully interlocked, such that a stall in the exception detect (DET) stage propagates to the instruction expand (EXP) stage and suspends instruction advancement. A stall caused by one instruction in the issue group stalls the entire issue group and never causes the core pipeline to flush and replay. The DET-stage stall is the last opportunity for an instruction to halt execution before the pipeline control logic commits it to architectural state. The pipeline control logic also synchronizes the core pipeline and the L1D pipeline at the DET stage. The control logic allows these loosely coupled pipelines to lose synchronization so that the L1I and L2 caches can insert noncore requests into the memory pipeline with minimal impact on core instruction execution. Table 3 lists the stages and causes of potential stalls.

Table 3. Potential pipeline stalls.

  Stage              Cause of stall
  Rename             RSE activity required
  Execute            Scoreboard and hazard detection logic
  Exception detect   L2D TLB miss; L2 cache resources unavailable for memory
                     operations; floating-point and integer pipeline coordination
                     to avoid possible floating-point traps; or L1D and integer
                     pipeline coordination
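A toy model of the interlocked core pipeline may clarify the stall behavior: a DET-stage stall freezes DET and every earlier stage, a bubble enters WRB, and nothing is flushed or replayed. Stage names follow Figure 1; the single stall input and the dropped incoming group are simplifications.

```python
# Sketch of the interlocked back-end pipeline: a stall freezes the
# stalled stage and everything behind it; the stage ahead of the stall
# receives a bubble, and no group is flushed or replayed.

STAGES = ["EXP", "REN", "REG", "EXE", "DET", "WRB"]

def step(pipe, incoming, stall_stage=None):
    """Advance issue groups one cycle, honoring an older group's stall."""
    frozen = STAGES.index(stall_stage) if stall_stage else -1
    new_pipe = {}
    for i, stage in enumerate(STAGES):
        if i <= frozen:
            new_pipe[stage] = pipe[stage]          # hold in place
        elif i == frozen + 1 and stall_stage:
            new_pipe[stage] = None                 # bubble behind the stall
        elif i == 0:
            new_pipe[stage] = incoming             # new group from front end
        else:
            new_pipe[stage] = pipe[STAGES[i - 1]]  # normal in-order advance
    return new_pipe

pipe = dict(zip(STAGES, ["g5", "g4", "g3", "g2", "g1", "g0"]))
print(step(pipe, incoming="g6", stall_stage="DET"))
# g1 holds in DET (g2-g5 hold behind it); WRB gets a bubble as g0
# retires; g6 waits in the instruction buffer (dropped here).
```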

Memory subsystem
The relatively simple nature of the in-order core pipeline allowed the Itanium 2 processor designers to focus on the memory subsystem. The processor implements a full complement of region identifiers and protection keys, along with 64 bits of virtual address and 50 bits of physical address, to provide 1,024 Tbytes of addressability. The memory subsystem is a low-latency, high-bandwidth design partitioned and organized to handle integer, floating-point, and enterprise workloads [5]. Figure 4 shows a simplified diagram of the memory subsystem and system interface, including some data and control paths and data integrity features.

[Figure 4. Itanium 2 processor's memory subsystem and system interface. Load ports M0 and M1 access the L1D TLB, L1D tags, and 16-Kbyte L1D array; store ports M2 and M3 access the L1D store tags and store buffer. The 128-entry L2D TLB, L2 tags, 256-Kbyte L2 data array, fill buffer, L3 tags, and 3-Mbyte L3 data array connect through queues to the pipeline control and system interface. Data paths carry ECC; address and control paths carry parity.]

Advanced-load address table
The advanced-load address table (ALAT) provides the ability to transform multicycle load accesses into zero-cycle accesses through dynamic memory disambiguation. The pipeline and scoreboard designs encourage scheduling loads as far ahead of use as possible to avoid core pipeline stalls. Unknown data dependencies normally prevent a compiler from scheduling a load much earlier than its use. The ALAT enables a load to advance beyond an unknown data dependency by resolving that dependency dynamically.


An advanced load allocates an entry in the ALAT, a four-ported, 32-entry, fully associative structure that records the register identifiers and physical addresses of advanced loads. A later store to the same address invalidates all overlapping ALAT entries. Later, when an instruction requires the load's result, the ALAT indicates whether the load is still valid. If so, a use of the load data is allowed in the same cycle as the check, without penalty. If a valid entry is not found, the load is automatically reissued and the use is replayed. The Itanium architecture allows scheduling of a use in the same issue group as the check; hence, from the code scheduler's perspective, an ALAT hit has zero latency.
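A behavioral sketch of the ALAT protocol just described; the mnemonics in the comments (ld.a, chk.a, ld.c) are the architecture's advanced-load and check instructions, but the dictionary model ignores the real structure's 32-entry capacity, four ports, and replacement behavior.

```python
# Sketch of the ALAT protocol: an advanced load (ld.a) records its
# target register and physical address; a later store to an overlapping
# address kills the entry; the check (chk.a / ld.c) succeeds only if
# the entry survived.

alat = {}                                # reg -> (addr, size)

def advanced_load(reg, addr, size=8):
    alat[reg] = (addr, size)

def store(addr, size=8):
    """Invalidate every ALAT entry the store overlaps."""
    for reg, (a, s) in list(alat.items()):
        if addr < a + s and a < addr + size:
            del alat[reg]

def check(reg):
    """True: use the preloaded value. False: reissue load, replay use."""
    return reg in alat

advanced_load("r4", addr=0x1000)         # hoisted above an unknown store
store(0x2000)                            # disjoint store: entry survives
print(check("r4"))                       # -> True, zero-latency use
store(0x1004)                            # overlapping store
print(check("r4"))                       # -> False, load is reissued
```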
L1D cache
The data TLBs and L1D cache are similar in design to the instruction TLBs and caches; they share key attributes such as size, latency, arrangement, and tight integration with the tags and the first-level TLB. The principle of prevalidated tags enables a one-cycle L1D cache. This feature is essential for a wide in-order machine to achieve high performance in many integer workloads. If the latency were two cycles, the compiler would need to schedule at least five, and often more, instructions to cover the latency. The Itanium 2 processor's single-cycle latency requires only an explicit stop between a load and its use, thus easing the burden on the code generator to extract instruction-level parallelism.

The L1D is a multiported, 16-Kbyte, four-way set-associative, physically addressed cache with a 64-byte line protected by parity. Instructions access the L1D in program order; hence, it is an in-order cache. However, the scoreboard logic allows the L1D and other cache levels to be nonblocking. The L1D provides two dedicated load ports and two dedicated store ports. These ports are fixed, but the dispersal logic rearranges loads and stores within an issue group to ensure they reach the appropriate memory resource. The two load requests can hit and return data from the L1D in parallel without conflict. Rotators between the data array and the register file allow integer loads to access any unaligned data reference within an 8-byte datum, and they also support big- or little-endian accesses.

The prevalidated tags and first-level TLB serve only integer loads. Stores access the second-level data (L2D) TLB and use a traditional tagging mechanism. This increases their latency, but store latency is not a performance issue, in part because store-load forwarding is provided in the store data path. The L1D enforces a write-through, no-write-allocate policy, such that it passes all stores to the L2 cache and store misses do not allocate into the L1D. If a store hits in the L1D, the data moves to a store buffer until the data array becomes available to update the L1D. These store buffers can merge store data from other stores and forward their contents to later loads.

Integer load and data prefetch misses allocate into the L1D, according to temporal hints and available resources. Up to eight L1D lines can have fill requests outstanding, but the total number of permitted L1D misses is limited only by the scoreboard and the other cache levels. If the L2 cannot accept a request, it applies back pressure and the core pipeline stalls. Before an L1D load miss or store request is dispatched to the L2, it accesses the L2D TLB. The TLB access behavior for loads differs from that of the instruction cache: the L1D and L2D TLBs are accessed in parallel for loads, regardless of an L1D hit or miss. This reduces both L1D and L2 latency. Consequently, the 128-entry, fully associative L2D TLB is fully four-ported to allow the complete issue of every possible combination of four memory operations.

The L1D is highly integrated into the integer data path and the L2 tags. All integer loads must go through the L1D to return data to the register file and core bypass network. The L1D pipeline processes all memory accesses and requests that need access to the L2 tags or the integer register file. Accordingly, several types of requests arbitrate for access to the L1D. Some of these requests have higher priority than core requests, and if there are conflicts, the core memory request stalls the core and reissues to the L1D when resources are available.
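The store-buffer merging and forwarding described above can be sketched at byte granularity; the dictionary model is an illustrative stand-in for the real merge and forward hardware.

```python
# Sketch of L1D store-buffer behavior: stores wait in a buffer until
# the data array is free, merge with other buffered stores, and forward
# their bytes to later loads that overlap them.

store_buffer = {}                       # byte address -> byte value

def buffered_store(addr, data: bytes):
    for i, b in enumerate(data):
        store_buffer[addr + i] = b      # later stores merge over earlier ones

def load(addr, size, memory):
    """Forward bytes from the store buffer; fall back to the cache/memory."""
    return bytes(store_buffer.get(addr + i, memory.get(addr + i, 0))
                 for i in range(size))

memory = {0x100 + i: 0xFF for i in range(8)}
buffered_store(0x100, b"\x11\x22")      # store not yet in the data array
buffered_store(0x101, b"\x33")          # merges over the earlier store
print(load(0x100, 4, memory).hex())     # -> 1133ffff (forwarded + memory)
```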

L2 cache
The second-level (L2) cache is a unified, 256-Kbyte, eight-way set-associative cache with a 128-byte line size. The L2 tags are true four-ported, with tag and ownership state protected by parity. The tags, accessed as part of the L1D pipeline, provide an early L2 hit or miss indication. The L2 enforces write-back and write-allocate policies. The L2's integer access latency is five, seven, nine, or more cycles. Floating-point accesses require an additional cycle for converting to the floating-point register format.

The L2 cache is nonblocking and out of order. All memory operations that access the L2 (L1D misses and all stores) check the L2 tags and are allocated to a 32-entry queuing structure called the L2 OzQ. All stores require one of the 24 L2 data queue entries to hold the store until the L2 data array is updated. L1I instruction misses also go to the L2 but are stored in the instruction fetch FIFO (IFF) queue. Requests in the L2 OzQ and the IFF queue arbitrate for access to the data array or the L3 cache and system interface. This arbitration depends on the type of IFF request; instruction demand requests issue before data requests, and data requests issue before instruction prefetch requests. Up to four L2 data operations and one request to the L3 and system interface can issue every cycle.

The L2 OzQ maintains all architectural ordering between memory operations while allowing unordered accesses to complete out of order. This makes specifying a single L2 latency difficult but helps ensure that older memory operations do not impede the progress of younger ones. In many cases, incoming requests bypass allocation to the L2 OzQ and access the data array immediately. This provides the five-cycle latency mentioned earlier. Sometimes the request can bypass the OzQ, but an L2 resource conflict forces the request to have a seven-cycle latency. The minimum latency for a request that issues from the L2 OzQ is nine cycles. Resource conflicts, ordering requirements, or higher-priority operations can extend a request's latency beyond nine cycles.

The L2 data array has 16 banks; each bank is 16 bytes wide and ECC-protected. The array allows multiple simultaneous accesses, provided each access is to a different bank. Floating-point loads can bypass or issue from the L2 OzQ, access the L2 data array, complete four requests at a time, and fully utilize the L2's four data paths to the floating-point units and register file. The L2 does not have direct data paths to the integer units and register file; integer loads deliver data via the L1D, which has two data paths to the integer units and register file. Stores can bypass or issue from the L2 OzQ and access the L2 data array four at a time, provided they access different banks.

The fill path width from the L2 to the L1D and the L1I is 32 bytes, requiring two cycles to transfer a 64-byte L1I or L1D line. The fill bandwidth from the L3 or system interface to the L2 is also 32 bytes per cycle. Four 32-byte quantities accumulate in the L2 fill buffers for either the L3 or system interface, allowing the interleaving of system interface and L3 data returns. The 128-byte cache line is written into the L2 in one cycle, updating both tag and data arrays.
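A small decision function captures how the quoted L2 latencies arise under the stated rules; the three boolean inputs are an abstraction of the OzQ bypass, resource-conflict, and ordering conditions.

```python
# Sketch of how the quoted L2 latencies arise: a request that bypasses
# the OzQ straight to the data array sees 5 cycles; a bypass delayed by
# an L2 resource conflict sees 7; anything issuing from the OzQ sees at
# least 9. The extra floating-point format-conversion cycle is a flag.

def l2_latency(bypass_ok, resource_conflict, issues_from_ozq, fp=False):
    if bypass_ok and not resource_conflict and not issues_from_ozq:
        cycles = 5                      # immediate data-array access
    elif bypass_ok and not issues_from_ozq:
        cycles = 7                      # bypass attempted, conflict delayed it
    else:
        cycles = 9                      # minimum when issuing from the OzQ
    return cycles + (1 if fp else 0)    # FP register format conversion

print(l2_latency(True, False, False))          # -> 5
print(l2_latency(True, True, False))           # -> 7
print(l2_latency(False, False, True, fp=True)) # -> 10 (9 + conversion cycle)
```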
L3 cache and system interface
All cacheable requests that the L2 cannot satisfy arrive at the L3 and the system interface. The L2 can make partial line requests of the system interface for uncacheable and write-coalescing accesses, but all cacheable requests are 128-byte accesses. The L2 can make one request to the L3 and the system interface per cycle. Requests enter one of the 16 bus request queue (BRQ) entries maintained by the system interface control logic. The BRQ can then send each request to the L3 to determine whether the L3 can satisfy the request. To lower the L3 access latency, an L2 request can bypass the BRQ allocation and query the L3 immediately. If the request is an L3 miss, it is scheduled to access the system interface. When the system interface responds with the data, the line is written to the L2 and the L3 in accordance with its temporal locality hints and access type.

L3 cache. The third-level cache is a unified, 3-Mbyte, 12-way set-associative cache with a 128-byte line size. Its access latency can be as low as 12 cycles, largely because the entire cache is on chip. All L3 accesses return an entire 128-byte line; the L3 doesn't support partial line accesses. The single-ported L3 tag array has ECC protection and is pipelined to allow a new tag access every cycle. The L3 data array is also single-ported and has ECC protection, but it requires four cycles to transfer a full data line to the L2 cache or the system interface. Requests that fill the L3 require four cycles to transfer data, using a separate data path so that L3 reads and writes can be pipelined for maximum bandwidth.


The L3 is nonblocking and has an eight-entry queue to support multiple outstanding requests. This queue orders requests and prioritizes them among tag reads or writes and data reads or writes to achieve the highest performance.

System interface. The system interface operates at 200 MHz and includes multiple subbuses for various functions, such as address/request, snoop, response, data, and defer. All buses, except the snoop bus, are protected against errors by parity or ECC. The data bus is 128 bits wide and operates source-synchronously at 400 million data transfers, or 6.4 Gbytes, per second. The system interface seamlessly supports up to four Itanium 2 processors.

The system interface control logic contains an in-order queue (IOQ) and an out-of-order queue (OOQ), which track all transactions pending completion on the system interface. The IOQ tracks a request's in-order phases and is identical on all processors and the node controller. The OOQ holds only deferred processor requests. The IOQ can hold eight requests, and the OOQ can hold 18 requests. The system interface logic also contains two 128-byte coalescing buffers to support write-coalescing stores. The buffers can coalesce store requests at byte granularity, and they strive to generate full line writes for best performance. Writes of 1 to 8 bytes, 16 bytes, or 32 bytes are possible when holes exist in the coalescing buffers.

The similarities between the system interfaces of the Itanium 2 and Itanium processors allowed several implementations to leverage their Itanium-based solutions for use with the Itanium 2 processor. However, large, multinode system designs required additional support for high performance and reliability. As a result, the processor's system interface defines a few new transactions. The read current transaction lets the node controller obtain a current copy of data in a processor, while allowing the processor to maintain ownership of the line. The cache line replacement transaction informs a multinode snoop directory that an L3 clean eviction occurred, to remove unnecessary snoop traffic. The cleanse cache transaction pushes a modified cache line out to system memory. This allows higher-performance processor checkpointing in high-availability systems without forcing the processor to give up ownership of the line.

Soon after the Itanium 2 processor's introduction, major computer system providers either announced or introduced single- and dual-processor workstations, four- to 128-processor servers, and a 3,300-processor supercomputer, all using the Itanium 2 processor. The operating systems available for these systems include HP-UX, Linux, and Windows .NET, and will eventually include OpenVMS. These systems and operating systems target a diverse set of computing problems and use the processor effectively for workstation, server, and supercomputer workloads. The Itanium 2 processor fits well in such varied environments because of its balanced design from instruction fetch to system interface and its flexible underlying architecture. The design team capitalized on the performance opportunities available in the Itanium architecture to produce a high-performance, in-order implementation and provide computer system developers a powerful and versatile building block. MICRO

References
1. H. Sharangpani and K. Arora, "Itanium Processor Microarchitecture," IEEE Micro, vol. 20, no. 5, Sept.-Oct. 2000, pp. 24-43.
2. J. Huck et al., "Introducing the IA-64 Architecture," IEEE Micro, vol. 20, no. 5, Sept.-Oct. 2000, pp. 12-23.
3. D. Bradley, P. Mahoney, and B. Stackhouse, "The 16KB Single-Cycle Read Access Cache on a Next Generation 64b Itanium Microprocessor," Proc. 2002 IEEE Int'l Solid-State Circuits Conf. (ISSCC 02), IEEE Press, 2002, pp. 110-111.
4. T.-Y. Yeh and Y.N. Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," Proc. 19th Int'l Symp. Computer Architecture (ISCA 92), ACM Press, 1992, pp. 124-134.
5. T. Lyon et al., "Data Cache Design Considerations for the Itanium 2 Processor," Proc. 2002 IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors (ICCD 02), IEEE Press, 2002, pp. 356-362.
6. J. McCormick and A. Knies, "A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel Itanium 2 Processor," 2002; http://www.hotchips.org/archive/index.html.
7. E.S. Fetzer and J.T. Orton, "A Fully-Bypassed 6-Issue Integer Datapath and Register File on an Itanium Microprocessor," Proc. 2002 IEEE Int'l Solid-State Circuits Conf. (ISSCC 02), IEEE Press, 2002, pp. 420-478.

Cameron McNairy is an Itanium 2 processor microarchitect at Intel. His research interests include high-performance technical computing and large-system design issues. McNairy has a BSEE and an MSEE from Brigham Young University. He is a member of the IEEE.

Don Soltis is an Itanium 2 processor microarchitect at Hewlett-Packard. His research interests include microprocessor cache design and microprocessor verification. Soltis has a BSEE and an MSEE from Colorado State University.

Direct questions or comments about this article to Cameron McNairy, 3400 E. Harmony Road, MS 55, Fort Collins, CO 80526; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://computer.org/publications/dlib.

