Itanium 2 Processor Microarchitecture

ITANIUM 2 PROCESSOR MICROARCHITECTURE THE ITANIUM 2 PROCESSOR EXTENDS THE PROCESSING POWER OF THE ITANIUM PROCESSOR FAMILY WITH A CAPABLE AND BALANCED MICROARCHITECTURE. EXECUTING UP TO SIX INSTRUCTIONS AT A TIME, IT PROVIDES BOTH PERFORMANCE AND BINARY COMPATIBILITY FOR ITANIUM- BASED APPLICATIONS AND OPERATING SYSTEMS. On 8 July 2002, Intel introduced and requirements associated with Intel’s Itani- the Itanium 2 processor—the Itanium archi- um architecture (formerly called the IA-64 tecture’s second implementation. This event architecture).2 The architecture goes beyond was a milestone in the cooperation between simply defining 64-bit operations and register Intel and Hewlett-Packard to establish the Ita- widths; it defines flexible memory management nium architecture as a key workstation, serv- schemes and several tools that compilers can er, and supercomputer building block. The use to realize performance. It enables parallel Itanium 2 processor may appear similar to the instruction execution without resorting to Itanium processor, yet it represents significant complex out-of-order pipeline designs by advances in performance and scalability. (Sha- explicitly indicating which instructions can rangpani and Arora give an overview of the issue in parallel without data hazards. To that Cameron McNairy Itanium processor.1) These advances result end, three instructions are statically grouped from improvements in frequency, pipeline into 16-byte bundles. Multiple instruction Intel depth, pipeline control, branch prediction, bundles can execute in parallel, or explicit stops cache design, and system interface. The can break parallel execution to avoid data haz- microarchitecture design enables the proces- ards. Each bundle encodes a template that indi- Don Soltis sor to effectively address a wide variety of com- cates which type of execution resource the putation needs. instructions require: integer (I), memory (M), Hewlett-Packard Table 1 lists the processor’s main features. We floating point (F), branch (B), and long extend- obtained the Spec FP2000 and Spec CPU2000 ed (LX). Thus, memory, floating-point, and benchmark results from http://www.spec.org branch operations that can execute in parallel on 20 February 2002. We obtained the other comprise a bundle with an MFB template. benchmarks from http://developer.intel.com/ The Itanium 2 processor designers took products/server/processors/server/itanium2/ advantage of explicit parallelism to design index.htm. This site contains relevant infor- an in-order, six-instruction-issue, parallel- mation about the measurement circumstances. execution pipeline. The relatively simple pipeline allowed the design team to focus Microarchitecture overview resources on the memory subsystem’s perfor- Many aspects of the Itanium 2 processor mance and to exploit many of the architecture’s microarchitecture result from opportunities performance opportunities. Figure 1 shows the 44 Published by the IEEE Computer Society 0272-1732/03/$17.00 2003 IEEE core pipeline and the relationship of some Table 1. Features of the Itanium 2 processor. microarchitecture structures to the pipeline. These structures include the instruction buffer, Design which decouples the front end, where instruc- Frequency 1 GHz tion fetch and branch prediction occur, from Pipe stages 8 in-order the back end, where instructions are dispersed Issue/retire 6 instructions and executed. The back-end pipeline renames Execution units 2 integer, 4 memory, 3 branch, 2 floating-point virtual registers to physical registers, accesses Silicon the register files, executes the operation, checks Technology 180 nm for exceptions, and commits the results. Core 40 million transistors L3 cache 180 million transistors Instruction fetch Size 421 mm2 The front-end structures fetch instructions Caches for later use by the back end. The front end L1 instruction Size 16 Kbytes chooses an instruction pointer (IP) from the Latency 1 cycle next linear IP, branch prediction resteer point- Protection Parity ers, or branch misprediction and instruction L1 data Size 16 Kbytes exception resteer pointers. The front end then Latency 1 cycle presents the IP to the instruction cache and Protection Parity translation look-aside buffer (TLB). These L2 Size 256 Kbytes structures are tightly coupled, allowing the Latency 5, 7, or 9+ cycles processor to determine which cache way, if Protection Parity or ECC* any, was a hit, and to deliver the cache con- L3 Size 3 Mbytes tents in the next cycle using an innovation Latency 12+ cycles called prevalidated tags. This is the same idea Protection ECC presented in other Itanium 2 processor Benchmark results descriptions3 in the context of the first-level Spec CPU2000 score 810 data (L1D) cache, but here we discuss it in the Spec FP2000 score 1,431 context of the instruction cache. TPCC (32-way) 433,107 transactions per minute Stream 3,700 Gbytes/s Prevalidated-tag cache design Linpack 10K** 13.94 Gflops Traditional physically addressed cache designs * ECC: error-correcting code require a TLB access to translate a virtual address ** Performed with four processors to a physical address. The cache’s hit detection logic then compares the physical address with the tags stored in each cache way. The serialized The removal of the physical address from translation and comparison typically lead to the hit detection critical path is significant. It multicycle cache designs. In a prevalidated-tag provides an opportunity for a single-cycle cache design, the cache tags do not store a phys- cache, but requires the TLB to be tightly cou- ical address; they store an association to the TLB pled with the cache tags. Another implication entry that holds the appropriate virtual-address is that a miss in the TLB also results in a cache translation. In the Itanium 2 processor, when miss, because no match lines will be driven. the front end presents a virtual address to the Moreover, the number of TLB entries deter- TLB, the cache’s detection logic directly com- mines the number of bits held in each way’s pares the identifier of the entry that matches the tag and might limit the coupled TLB’s size. virtual address, called the match line, with a Figure 2 shows how prevalidated tags tied to one-hot vector stored in the cache tags. The vec- a 32-entry TLB determine a hit. tor indicates which TLB entry holds the translation associated with the contents of that cache L1I cache complex way. This allows a fast determination of which The L1I cache complex comprises the first- cache way of a set, if any, is a hit. The hit result level instruction TLB (L1I TLB), the second- feeds into the way select logic to drive the cache level instruction TLB (L2I TLB), and the contents to the consumer. first-level instruction cache (L1I). The L1I MARCH–APRIL 2003 45 ITANIUM 2 PROCESSOR Branch prediction Pipeline IP- Next stages relative address L1l Instruction- L1l IA-32 prediction instruction streaming TLB engine cache buffer IPG IPG L2l TLB Front end IP-relative address Instruction buffer: and return stack buffer 8 bundles (24 instructions) ROT Pattern Instruction decode and dispersal history MM M M I I F F B B B EXP Register Integer FP stack engine renamer renamer REN Scoreboard and Integer FP hazard detection register file register file REG Back L2D ALAT L1D Integer end FP1 Branch EXE TLB 32 entries cache ALU (6) Pipeline control Integer multimedia L2 (6) tags FP2 DET Hardware page FP3 walker WRB L3 cache Floating point (2) L2 and cache system FP4 interface ALAT Advanced-load address table REG Register file read TLB Translation look-aside buffer EXE ALU execution IPG Instruction pointer generation and fetch DET Exception detection ROT Instruction rotation WRB Write back EXP Instruction template decode, expand, and disperse FPx Floating-point pipe stage REN Rename (for register stack and rotating registers) and decode Figure 1. Itanium 2 processor pipeline. TLB and the L1I cache are arranged as The 64-byte L1I cache line holds four required for a prevalidated-tag design. The instruction bundles. The L1I can sustain a four-way set-associative L1I cache is 16 Kbytes stream of one 32-byte read per cycle to pro- in size, relatively small because of latency and vide two bundles per cycle to the back-end area design constraints but still optimal. An pipeline. The fetched bundles go directly to instruction prefetch engine enhances the the dispersal logic or into an instruction buffer cache’s effective size. The dual-ported tags and for later consumption. If the instruction TLB resolve demand and prefetch requests buffer is full, the front-end pipeline stalls. without conflict. The page offset of the virtu- The L1I TLB directly supports only a 4- al-address bits selects a set from the tag array Kbyte page size. The L1I TLB indirectly sup- and the data array for demand accesses. The ports larger page sizes by allocating additional upper bits of the virtual address determine entries as each 4-Kbyte segment of the larger which, if any, way holds the requested instruc- page is referenced. An L1I TLB miss implies a tions. The tag and TLB lookup results deter- miss in the L1I cache and can initiate L2I TLB mine a L1I hit or miss, as described earlier. and second-level (L2) cache accesses, as well as 46 IEEE MICRO a transfer of page information to the L1I TLB. TLB Cache The L2I TLB is a 128-entry, fully associative 0 structure with a single port. Each entry can Virtual address 1 0 0 0 1 0 represent all page sizes defined in the architec- Virtual address 2 1 0 0 0 ture from 4 Kbytes to 4 Gbytes. Up to 64 1 entries can be pinned as translation registers Virtual address 3 0 1 0 0 0 to ensure that hot pages are always available. Virtual address 4 0 0 0 0 In the event of an L2I TLB miss, the L2I TLB 0 Hit comparator requests the hardware page walker (HPW) to 0 fetch a translation from the virtual hashed page Virtual address 32 0 0 1 0 table.

Load more