Itanium Processor Microarchitecture
The Itanium processor employs the EPIC design style to exploit instruction-level parallelism. Its hardware and software work in concert to deliver higher performance through a simpler, more efficient design.

Harsh Sharangpani and Ken Arora, Intel

The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms.

The processor employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information and efficiently deliver this information to the hardware. It addresses several fundamental performance bottlenecks in modern computers, such as memory latency, memory address disambiguation, and control flow dependencies.

EPIC constructs provide powerful architectural semantics and enable the software to make global optimizations across a large scheduling scope, thereby exposing available instruction-level parallelism (ILP) to the hardware. The hardware takes advantage of this enhanced ILP, providing abundant execution resources. Additionally, it focuses on dynamic runtime optimizations to enable the compiled code schedule to flow through at high throughput. This strategy increases the synergy between hardware and software and leads to higher overall performance.

The processor provides a six-wide and 10-stage deep pipeline, running at 800 MHz on a 0.18-micron process. This combines both abundant resources to exploit ILP and high frequency for minimizing the latency of each instruction. The resources consist of four integer units, four multimedia units, two load/store units, three branch units, two extended-precision floating-point units, and two additional single-precision floating-point units (FPUs). The hardware employs dynamic prefetch, branch prediction, nonblocking caches, and a register scoreboard to optimize for compilation time nondeterminism. Three levels of on-package cache minimize overall memory latency. This includes a 4-Mbyte level-3 (L3) cache, accessed at core speed, providing over 12 Gbytes/s of data bandwidth.

The system bus provides glueless multiprocessor support for up to four-processor systems and can be used as an effective building block for very large systems. The advanced FPU delivers over 3 Gflops of numeric capability (6 Gflops for single precision). The balanced core and memory subsystems provide high performance for a wide range of applications, ranging from commercial workloads to high-performance technical computing.

[Figure 1. Conceptual view of EPIC hardware. Compiler-programmed features (explicit parallelism and templates; register stack and rotation; predication; data and control speculation; memory and branch hints) map onto hardware features: a fetch engine with instruction cache and branch predictors; a fast, simple 6-issue dispersal stage; register handling with 128 GRs, 128 FRs, register remap, and a register stack engine; parallel resources of 4 integer and 4 MMX units, 2 + 2 FMACs, 2 load/store units, and 3 branch units with bypasses and dependency control; speculation deferral management with a 32-entry ALAT; and a memory subsystem with three levels of cache (L1, L2, L3). GR: general register file; FR: floating-point register file.]
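The floating-point rates quoted above follow directly from the unit mix and the clock. The arithmetic below is a minimal sketch, assuming each FMAC completes one fused multiply-add (two flops) per clock and that the single-precision path sustains twice that operation rate, as the 8-SP-flops-per-clock example in Figure 2 suggests:

```python
# Quick consistency check on the quoted peak floating-point rates, using
# only figures from the text: an 800 MHz clock and two extended-precision
# FMAC units. The "2 flops per FMAC" (fused multiply-add) and the doubled
# single-precision rate (8 SP flops/clock, as in the Figure 2 example) are
# our reading, not stated explicitly at this point in the article.
CLOCK_GHZ = 0.8

dp_flops_per_clock = 2 * 2                   # 2 FMAC units x (multiply + add)
sp_flops_per_clock = dp_flops_per_clock * 2  # paired single-precision operation

print(f"peak double precision: {dp_flops_per_clock * CLOCK_GHZ:.1f} Gflops")  # 3.2
print(f"peak single precision: {sp_flops_per_clock * CLOCK_GHZ:.1f} Gflops")  # 6.4
# Consistent with "over 3 Gflops of numeric capability (6 Gflops for
# single precision)".
```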
In contrast to traditional processors, the machine's core is characterized by hardware support for the key ISA constructs that embody the EPIC design style.1,2 This includes support for speculation, predication, explicit parallelism, register stacking and rotation, branch hints, and memory hints. In this article we describe the hardware support for these novel constructs, assuming a basic level of familiarity with the IA-64 architecture (see the "IA-64 Architecture Overview" article in this issue).

EPIC hardware

The Itanium processor introduces a number of unique microarchitectural features to support the EPIC design style.2 These features focus on the following areas:

• supplying plentiful fast, parallel, and pipelined execution resources, exposed directly to the software;
• supporting the bookkeeping and control for new EPIC constructs such as predication and speculation; and
• providing dynamic support to handle events that are unpredictable at compilation time, so that the compiled code flows through the pipeline at high throughput.

Figure 1 presents a conceptual view of the EPIC hardware. It illustrates how the various EPIC instruction set features map onto the micropipelines in the hardware.

The core of the machine is the wide execution engine, designed to provide the computational bandwidth needed by ILP-rich EPIC code that abounds in speculative and predicated operations.

The execution control is augmented with a bookkeeping structure called the advanced load address table (ALAT) to support data speculation and, with hardware, to manage the deferral of exceptions on speculative execution. The hardware control for speculation is quite simple: adding an extra bit to the data path supports deferred exception tokens. The controls for both the register scoreboard and bypass network are enhanced to accommodate predicated execution.

Operands are fed into this wide execution core from the 128-entry integer and floating-point register files. The register file addressing undergoes register remapping, in support of register stacking and rotation. The register management hardware is enhanced with a control engine called the register stack engine that is responsible for saving and restoring registers that overflow or underflow the register stack.

An instruction dispersal network feeds the execution pipeline. This network uses explicit parallelism and instruction templates to efficiently issue fetched instructions onto the correct instruction ports, both eliminating complex dependency detection logic and streamlining the instruction routing network.
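The template-driven dispersal just described reduces instruction routing to a pattern match from syllable types to issue ports, with no dependency analysis in the critical path. The sketch below is illustrative only; the per-clock limits of two memory, two integer, two floating-point, and three branch syllables are an assumption loosely based on the unit counts quoted in this article, not the processor's documented dispersal rules:

```python
# Illustrative sketch of template-driven dispersal. Each bundle's template
# names its three syllable types (for example "MFI"), so steering
# instructions to ports is a pattern match rather than dependency analysis:
# the compiler has already encoded which instructions may issue together.
# The per-clock limits below (2 M, 2 I, 2 F, 3 B) are an assumption for
# illustration, not the processor's exact dispersal rules.
from collections import Counter

ISSUE_LIMITS = {"M": 2, "I": 2, "F": 2, "B": 3}

def disperse(bundle_pair):
    """Return syllable counts per port type for a two-bundle issue group,
    or None if the combination exceeds the assumed per-clock limits."""
    counts = Counter("".join(bundle_pair))
    if all(counts[t] <= ISSUE_LIMITS.get(t, 0) for t in counts):
        return dict(counts)
    return None

print(disperse(("MFI", "MFI")))   # scientific example from Figure 2: issues in one clock
print(disperse(("MII", "MBB")))   # enterprise example from Figure 2: issues in one clock
print(disperse(("MMI", "MMI")))   # four M syllables exceed the assumed limit -> None
```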
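The ALAT-based data-speculation bookkeeping introduced above can be sketched behaviorally: an advanced load (ld.a in IA-64) allocates an entry for its target register and the address it loaded from, later stores invalidate any overlapping entries, and the subsequent check (chk.a or ld.c) either confirms the speculation or triggers recovery. The model below is a minimal behavioral sketch under simplified assumptions about entry format and replacement, not the processor's actual organization:

```python
# Illustrative behavioral model of the ALAT used for data speculation.
# The entry format, indexing, and replacement here are simplifying
# assumptions; only the protocol (advanced load -> store invalidation ->
# check) follows the description in the text.
class ALAT:
    def __init__(self, entries=32):           # Figure 1 lists a 32-entry ALAT
        self.capacity = entries
        self.table = {}                        # target register -> (address, size)

    def advanced_load(self, reg, addr, size):
        """ld.a: record the address a load was speculatively hoisted from."""
        if reg not in self.table and len(self.table) >= self.capacity:
            self.table.pop(next(iter(self.table)))   # naive replacement policy
        self.table[reg] = (addr, size)

    def store(self, addr, size):
        """Stores snoop the table and invalidate overlapping entries."""
        for reg, (a, s) in list(self.table.items()):
            if addr < a + s and a < addr + size:      # address ranges overlap
                del self.table[reg]

    def check(self, reg):
        """chk.a / ld.c: True if the speculation held; False means recovery."""
        return self.table.pop(reg, None) is not None

alat = ALAT()
alat.advanced_load("r8", addr=0x1000, size=8)   # load hoisted above a store
alat.store(addr=0x1000, size=8)                 # the store turns out to conflict
assert alat.check("r8") is False                # check fails -> redo the load
```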
A decoupled fetch engine exploits advanced prefetch and branch hints to ensure that the fetched instructions will come from the correct path and that they will arrive early enough to avoid cache miss penalties. Finally, memory locality hints are employed by the cache subsystem to improve the cache allocation and replacement policies, resulting in a better use of the three levels of on-package cache and all associated memory bandwidth.

EPIC features allow software to more effectively communicate high-level semantic information to the hardware, thereby eliminating redundant or inefficient hardware and leading to a more effective design. Notably absent from this machine are complex hardware structures seen in dynamically scheduled contemporary processors. Reservation stations, reorder buffers, and memory ordering buffers are all replaced by simpler hardware for speculation and by semantically richer register-remapping hardware. Expensive register dependency-detection logic is eliminated via the explicit parallelism directives that are precomputed by the software.

Using EPIC constructs, the compiler optimizes the code schedule across a very large scope. This scope of optimization far exceeds the limited hardware window of a few hundred instructions seen on contemporary dynamically scheduled processors. The result is an EPIC machine in which the close collaboration of hardware and software enables high performance with a greater degree of overall efficiency.

Overview of the EPIC core

The engineering team designed the EPIC core of the Itanium processor to be a parallel, deep, and dynamic pipeline that enables ILP-rich compiled code to flow through at high throughput. At the highest level, three important directions characterize the core pipeline:

• wide EPIC hardware delivering a new level of parallelism (six instructions/clock),
• deep pipelining (10 stages) enabling high frequency of operation, and
• dynamic hardware for runtime optimization and handling of compilation time indeterminacies.

New level of parallel execution

The processor provides hardware for these execution units: four integer ALUs, four multimedia ALUs, two extended-precision floating-point units, two additional single-precision floating-point units, two load/store units, and three branch units. The machine can fetch, issue, execute, and retire six instructions each clock cycle. Given the powerful semantics of the IA-64 instructions, this expands to many more operations being executed each cycle. The "Machine resources per port" sidebar on p. 31 enumerates the full processor execution resources.

[Figure 2. Two examples illustrating supported parallelism. An MFI/MFI bundle pair issues six instructions providing 12 parallel ops/clock for scientific computing (20 parallel ops/clock for digital content creation): loads of 4 DP (8 SP) operands via 2 ldf-pair plus 2 postincrement ALU ops, 4 DP flops (8 SP flops), and 2 ALU ops. An MII/MBB bundle pair issues six instructions providing 8 parallel ops/clock for enterprise and Internet applications: 2 loads with 2 postincrement ALU ops, 2 ALU ops, and 2 branch instructions. SP: single precision; DP: double precision.]

Figure 2 illustrates two examples demonstrating the level of parallel operation supported for various workloads. For enterprise and commercial codes,