
Montecito: A Dual-Core, Dual-Thread Itanium Processor

Intel's Montecito is the first Itanium processor to feature duplicate, dual-thread cores and cache hierarchies on a single die. It features a landmark 1.72 billion transistors and server-focused technologies, and it requires only 100 watts of power.

Cameron McNairy
Rohit Bhatia
Intel

Intel's Itanium 2 processor series has regularly delivered additional performance through increased frequency and cache, as evidenced by the 6-Mbyte and 9-Mbyte versions.1 Montecito is the next offering in the Itanium processor family and represents many firsts for both Intel and the computing industry. Its 1.7 billion transistors extend the Itanium 2 core with an enhanced form of temporal multithreading and a substantially improved cache hierarchy. In addition to these landmarks, designers have incorporated technologies and enhancements that target reliability and manageability, power efficiency, and performance through the exploitation of both instruction- and thread-level parallelism. The result is a single 21.5-mm × 27.7-mm die2 that can execute four independent contexts on two cores with nearly 27 Mbytes of cache, at over 1.8 GHz, yet consumes only 100 W of power.

Beyond Itanium 2
Figure 1 is a block diagram of the Itanium 2 processor.3 The front end, with two levels of branch prediction, two translation look-aside buffers (TLBs), and a zero-cycle branch predictor, feeds two bundles (three instructions each) into the instruction buffer every cycle. This eight-entry queue decouples the front end from the back end and delivers up to two bundles of any alignment to the remaining six pipeline stages. The dispersal logic determines issue groups and allocates up to six instructions to nearly every combination of the 11 available functional units (two integer, four memory, two floating point, and three branch). The renaming logic maps virtual registers specified by the instruction into physical registers, which access the actual register file (12 integer and eight floating-point read ports) in the next stage. Instructions then perform their operation or issue requests to the cache hierarchy. The full bypass network allows nearly immediate access to previous instruction results, while the retirement logic writes final results into the register files (10 integer and 10 floating-point write ports).

Figure 1. Block diagram of Intel's Itanium 2 processor. B, I, M, and F: branch, integer, memory, and floating-point functional units; ALAT: advanced load address table; TLB: translation look-aside buffer.

Figure 2 is a block diagram of Montecito, which aims to preserve application and operating system investments while providing greater opportunity for code generators to continue their steady performance push. This opportunity is important because, even three years after the Itanium 2's debut, compilers continue to be a source of significant performance improvement. Unfortunately, compiler-optimization compatibility, which lets processors run each other's code optimally, limits the freedom to explore aggressive ways of increasing cycle-for-cycle performance and overall frequency.
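As a rough illustration of the dispersal constraints described earlier in this section, the following C sketch greedily packs a stream of instruction types into issue groups of up to six instructions, subject to the per-group functional-unit counts given in the text (two integer, four memory, two floating point, three branch). The instruction mix, the greedy policy, and all names are assumptions for illustration only; real dispersal also obeys template and bundle rules that this sketch omits.

```c
#include <stdio.h>

/* Hypothetical functional-unit classes; real dispersal rules are far richer. */
enum unit { INT_U, MEM_U, FP_U, BR_U, NUM_UNITS };

/* Assumed per-issue-group limits: 2 integer, 4 memory, 2 FP, 3 branch,
   and at most 6 instructions total (two bundles of three). */
static const int limit[NUM_UNITS] = { 2, 4, 2, 3 };

int main(void)
{
    /* An arbitrary instruction stream used only for illustration. */
    enum unit stream[] = { MEM_U, INT_U, MEM_U, FP_U, INT_U, BR_U,
                           MEM_U, MEM_U, INT_U, FP_U, FP_U, BR_U };
    int n = (int)(sizeof stream / sizeof stream[0]);

    int used[NUM_UNITS] = { 0 }, in_group = 0, groups = 0;

    for (int i = 0; i < n; i++) {
        enum unit u = stream[i];
        /* Close the current group if it is full or this unit class is exhausted. */
        if (in_group == 6 || used[u] == limit[u]) {
            groups++;
            in_group = 0;
            for (int k = 0; k < NUM_UNITS; k++) used[k] = 0;
        }
        used[u]++;
        in_group++;
    }
    if (in_group > 0) groups++;

    printf("%d instructions dispersed into %d issue groups (best-case IPC %.2f)\n",
           n, groups, (double)n / groups);
    return 0;
}
```

For this artificial stream the sketch reports a best-case IPC of six; any mix that oversubscribes one unit class forces an extra group and lowers the achievable static IPC.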
Single-thread performance improvements
Internal evaluations of Itanium 2 code indicate that static instructions per cycle (IPC) hover around three and often reach six for a wide array of workloads. Dynamic IPC decreases from the static highs for nearly all workloads. The IPC reduction is primarily due to inefficient cache-hierarchy accesses and, to a small degree, functional-unit asymmetries and inefficiencies in branching and speculation. Montecito targeted these performance weaknesses, optimizing nearly every core block and piece of control logic to improve some performance aspect.

Asymmetry, branching, and speculation
To address the port-asymmetry problem, Montecito adds a second integer shifter, yielding a performance improvement of nearly 100 percent for important cryptographic codes. To address branching inefficiency, Montecito's front end removes the bottlenecks surrounding single-cycle branches, which are prevalent in integer and enterprise workloads. Finally, Montecito decreases the time to reach recovery code when control or data speculation fails, thereby lowering the cost of speculation and enabling the code to use speculation more effectively.4

Cache hierarchy
Montecito supports three levels of on-chip cache. Each core contains a complete cache hierarchy, with nearly 13.3 Mbytes per core, for a total of nearly 27 Mbytes of processor cache.

Level 1 caches. The L1 caches (L1I and L1D) are four-way set associative, and each holds 16 Kbytes of instructions or data. Like the rest of the pipeline, these caches are in order, but they are also nonblocking, which enables high request concurrency. Access to L1I and L1D is through prevalidated tags and occurs in a single cycle. L1D is write-through, dual-ported, and banked to support two integer loads and two stores each cycle. The L1I has dual-ported tags and a single data port to support simultaneous demand and prefetch accesses. The performance levels of Montecito's L1D and L1I caches are similar to those in the Itanium 2, but Montecito's L1I and L1D have additional data protection.
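A back-of-the-envelope view of the L1 organization just described: a 16-Kbyte, four-way, set-associative cache. The text does not give the L1 line size, so the 64-byte value below is an assumption; the derived set count and address-bit split are only a sketch of the arithmetic, not a statement about Montecito's implementation.

```c
#include <stdio.h>

int main(void)
{
    /* L1I/L1D parameters from the text; the line size is an assumed value. */
    const unsigned cache_bytes = 16 * 1024;  /* 16 Kbytes            */
    const unsigned ways        = 4;          /* four-way associative */
    const unsigned line_bytes  = 64;         /* assumption           */

    unsigned sets = cache_bytes / (ways * line_bytes);

    /* Address-bit split for a simple index/offset lookup scheme. */
    unsigned offset_bits = 0, index_bits = 0;
    for (unsigned b = line_bytes; b > 1; b >>= 1) offset_bits++;
    for (unsigned s = sets;       s > 1; s >>= 1) index_bits++;

    printf("sets = %u, offset bits = %u, index bits = %u\n",
           sets, offset_bits, index_bits);
    return 0;
}
```

Under these assumptions the cache has 64 sets, with 6 offset bits and 6 index bits per access; the prevalidated tags mentioned in the text are what let the lookup complete in a single cycle.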
Level 2 caches. The real differences from the Itanium 2's cache hierarchy start at the L2 caches. The Itanium 2's L2 shares data and instructions, while Montecito has dedicated instruction (L2I) and data (L2D) caches. This separation of instruction and data caches makes it possible to have dedicated access paths to the caches and thus eliminates contention and capacity pressures at the L2 caches. For enterprise applications, Montecito's dedicated L2 caches can offer up to a 7-percent performance increase.

The L2I holds 1 Mbyte, is eight-way set associative, and has a 128-byte line size, yet it has the same seven-cycle instruction-access latency as the smaller, unified Itanium 2 cache. The tag and data arrays are single ported, but the control logic supports out-of-order and pipelined accesses, which enable a high utilization rate.

Figure 2. Block diagram of Intel's Montecito. The dual cores and threads realize performance unattainable in the Itanium 2 processor. Montecito also addresses Itanium 2 port asymmetries and inefficiencies in branching, speculation, and cache hierarchy.

Montecito's L2D has the same structure and organization as the Itanium 2's shared 256-Kbyte L2 cache but with several microarchitectural improvements to increase throughput. The L2D hit latency remains at five cycles for integer and six cycles for floating-point accesses. The tag array is true four-ported (four fully independent accesses in the same cycle), and the data array is pseudo-four-ported with 16-byte banks.

Montecito optimizes several aspects of the L2D. In the Itanium 2, any access to a cache line beyond the first access that misses the L2 probes the L2 tags periodically until the tags detect a hit. The repeated tag queries consume bandwidth from the core and increase the L2 miss latency. Montecito instead suspends such secondary misses until the L2D fill occurs; at that point, the fill immediately satisfies the suspended requests. This approach greatly reduces bandwidth contention and final latency (a simplified sketch of this coalescing follows the Level 3 cache description). The L2D, like the Itanium 2's L2, is out of order, pipelined, and tracks 32 requests (L2D hits or L2D misses not yet passed to the L3 cache) in addition to 16 misses and their associated victims. The difference is that Montecito allocates the 32 queue entries more efficiently, which provides a higher concurrency level than with the Itanium 2.

Level 3 cache. Montecito's L3 cache remains unified, as in previous Itanium processors, but is now 12 Mbytes. Even so, it maintains the same 14-cycle integer-access latency typical of the L3s in the 6- and 9-Mbyte Itanium 2 family. Montecito's L3 uses an asynchronous interface with the data array to achieve this latency.
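The secondary-miss suspension described above for the L2D behaves, in spirit, like a miss-coalescing queue: a request to a line that already has an outstanding miss is parked on that miss entry and completed when the fill returns, rather than re-probing the tags. The sketch below models this with a small miss-status table. It reuses the 16-outstanding-miss figure from the text, but the per-entry suspension limit, the 128-byte line granularity, and all names are illustrative assumptions, not Montecito's actual queue organization.

```c
#include <stdio.h>
#include <string.h>

#define MAX_MISSES    16   /* outstanding line misses (from the text)          */
#define MAX_SUSPENDED  4   /* suspended secondary requests per entry (assumed) */

struct miss_entry {
    int valid;
    unsigned long line_addr;                 /* cache-line address           */
    int n_waiting;                           /* suspended secondary requests */
    unsigned long waiting_req[MAX_SUSPENDED];
};

static struct miss_entry mshr[MAX_MISSES];

/* Handle a request that missed the tags. Returns 1 if it allocated a new
 * miss, 0 if it was suspended behind an existing one, -1 if the queue is full. */
static int l2d_miss(unsigned long addr, unsigned long req_id)
{
    unsigned long line = addr >> 7;          /* 128-byte line granularity (assumed) */

    for (int i = 0; i < MAX_MISSES; i++) {
        if (mshr[i].valid && mshr[i].line_addr == line) {
            if (mshr[i].n_waiting < MAX_SUSPENDED)
                mshr[i].waiting_req[mshr[i].n_waiting++] = req_id;
            return 0;                        /* secondary miss: suspended, no tag re-probe */
        }
    }
    for (int i = 0; i < MAX_MISSES; i++) {
        if (!mshr[i].valid) {
            mshr[i].valid = 1;
            mshr[i].line_addr = line;
            mshr[i].n_waiting = 0;
            return 1;                        /* primary miss: sent onward */
        }
    }
    return -1;
}

/* The fill for a line returns: complete the primary and all suspended requests. */
static void l2d_fill(unsigned long addr)
{
    unsigned long line = addr >> 7;
    for (int i = 0; i < MAX_MISSES; i++) {
        if (mshr[i].valid && mshr[i].line_addr == line) {
            printf("fill of line 0x%lx satisfies %d suspended request(s)\n",
                   line << 7, mshr[i].n_waiting);
            memset(&mshr[i], 0, sizeof mshr[i]);
            return;
        }
    }
}

int main(void)
{
    l2d_miss(0x1000, 1);   /* primary miss                       */
    l2d_miss(0x1010, 2);   /* same 128-byte line: suspended      */
    l2d_miss(0x1040, 3);   /* same line again: suspended         */
    l2d_fill(0x1000);      /* fill wakes both suspended requests */
    return 0;
}
```

The point of the sketch is the contrast with the Itanium 2 behavior described above: suspended requests consume no tag bandwidth while they wait, and the single fill completes all of them at once.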
Figure 3. How two threads share a core. Control logic monitors the workload's behavior and dynamically adjusts the time quantum for a thread. If the control logic determines that a thread is not making progress, it suspends that thread and gives execution resources to the other thread. This partially offsets the cost of long-latency operations, such as memory accesses. The time to execute a thread switch, shown as the white rectangles at the side of each box, consumes a portion of the idle time.
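Figure 3 describes coarse-grained, switch-on-event multithreading: a thread holds the core until its quantum expires or it stops making forward progress, for example on a long memory stall. The following toy model illustrates that policy; the quantum, switch cost, and stall pattern are assumptions chosen for the example and do not reflect Montecito's actual control logic.

```c
#include <stdio.h>

#define QUANTUM      100   /* cycles a thread may hold the core (assumed) */
#define SWITCH_COST    5   /* pipeline drain/refill cost (assumed)        */
#define STALL_PERIOD  37   /* thread "stalls" every N instructions (assumed) */

int main(void)
{
    long cycles = 0, retired[2] = { 0, 0 };
    int active = 0;                 /* thread A = 0, thread B = 1 */

    while (retired[0] + retired[1] < 2000) {
        int ran = 0;
        /* Run the active thread until its quantum expires or it stalls. */
        while (ran < QUANTUM) {
            retired[active]++;
            ran++;
            if (retired[active] % STALL_PERIOD == 0)
                break;              /* long-latency event: yield the core */
        }
        cycles += ran + SWITCH_COST;
        active ^= 1;                /* hand the core to the other thread  */
    }

    printf("thread A retired %ld, thread B retired %ld in %ld cycles\n",
           retired[0], retired[1], cycles);
    return 0;
}
```

Even this crude model shows the trade-off the caption describes: switching hides some of each thread's stall time behind the other thread's execution, but every switch pays the drain/refill cost shown as the white rectangles in Figure 3.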