
MONTECITO: A DUAL-CORE, DUAL-THREAD ITANIUM PROCESSOR

INTEL'S MONTECITO IS THE FIRST ITANIUM PROCESSOR TO FEATURE DUPLICATE, DUAL-THREAD CORES AND CACHE HIERARCHIES ON A SINGLE DIE. IT FEATURES A LANDMARK 1.72 BILLION TRANSISTORS AND RELIABILITY- AND POWER-FOCUSED TECHNOLOGIES, AND IT REQUIRES ONLY 100 WATTS OF POWER.

Cameron McNairy
Rohit Bhatia
Intel

Intel's Itanium 2 processor series has regularly delivered additional performance through increased frequency and cache, as evidenced by the 6-Mbyte and 9-Mbyte versions.1 Montecito is the next offering in the Itanium processor family and represents many firsts for both Intel and the computing industry. Its 1.7 billion transistors extend the Itanium 2 core with an enhanced form of temporal multithreading and a substantially improved cache hierarchy. In addition to these landmarks, designers have incorporated technologies and enhancements that target reliability and manageability, power efficiency, and performance through the exploitation of both instruction- and thread-level parallelism. The result is a single 21.5-mm × 27.7-mm die2 that can execute four independent contexts on two cores with nearly 27 Mbytes of cache, at over 1.8 GHz, yet consumes only 100W of power.

Beyond Itanium 2
Figure 1 is a block diagram of the Itanium 2 processor.3 The front end—with two levels of branch prediction, two translation look-aside buffers (TLBs), and a zero-cycle branch predictor—feeds two bundles (three instructions each) into the instruction buffer every cycle. This eight-entry queue decouples the front end from the back and delivers up to two bundles of any alignment to the remaining six stages. The dispersal logic determines issue groups and allocates up to six instructions to nearly every combination of the 11 available functional units (two integer, four memory, two floating point, and three branch). The renaming logic maps virtual registers specified by the instruction into physical registers, which access the actual register file (12 integer and eight floating-point read ports) in the next stage. Instructions then perform their operation or issue requests to the cache hierarchy. The full bypass network allows nearly immediate access to previous instruction results while the retirement logic writes final results into the register files (10 integer and 10 floating-point write ports).

Figure 2 is a block diagram of Montecito, which aims to preserve application and operating system investments while providing greater opportunity for code generators to continue their steady performance push. This opportunity is important, because even three years after the Itanium 2's debut, compilers continue to be a source of significant performance improvement.

Unfortunately, optimization compatibility—which lets processors run each other's code optimally—limits the freedom to explore aggressive ways of increasing cycle-for-cycle performance and overall frequency.

Single-thread performance improvements
Internal evaluations of Itanium 2 code indicate that static instructions per cycle (IPC) hover around three and often reach six for a wide array of workloads. Dynamic IPC decreases from the static highs for nearly all workloads. The IPC reduction is primarily due to inefficient cache-hierarchy accesses and, to a small degree, functional-unit asymmetries and inefficiencies in branching and speculation. Montecito targeted these performance weaknesses, optimizing nearly every core block and piece of control logic to improve some performance aspect.

Figure 1. Block diagram of Intel's Itanium 2 processor. B, I, M, and F: branch, integer, memory, and floating-point functional units; ALAT: advanced load address table; TLB: translation look-aside buffer.

Asymmetry, branching, and speculation
To address the port-asymmetry problem, Montecito adds a second integer shifter, yielding a performance improvement of nearly 100 percent for important cryptographic codes. To address branching inefficiency, Montecito's front end removes the bottlenecks surrounding single-cycle branches, which are prevalent in integer and enterprise workloads. Finally, Montecito decreases the time to reach recovery code when control or data speculation fails, thereby lowering the cost of speculation and enabling the code to use speculation more effectively.4

Cache hierarchy
Montecito supports three levels of on-chip cache. Each core contains a complete cache hierarchy, with nearly 13.3 Mbytes per core, for a total of nearly 27 Mbytes of processor cache.

Level 1 caches. The L1 caches (L1I and L1D) are four-way set associative, and each holds 16 Kbytes of instructions or data. Like the rest of the pipeline, these caches are in order, but they are also nonblocking, which enables high request concurrency. Access to L1I and L1D is through prevalidated tags and occurs in a single cycle. L1D is write-through, dual-ported, and banked to support two integer loads and two stores each cycle. The L1I has dual-ported tags and a single data port to support simultaneous demand and prefetch accesses. The performance levels of Montecito's L1D and L1I caches are similar to those in the Itanium 2, but Montecito's L1I and L1D have additional data protection.

Level 2 caches. The real differences from the Itanium 2's cache hierarchy start at the L2 caches. The Itanium 2's L2 shares data and instructions, while Montecito has dedicated instruction (L2I) and data (L2D) caches. This separation of instruction and data caches makes it possible to have dedicated access paths to the caches and thus eliminates contention and capacity pressures at the L2 caches. For enterprise applications, Montecito's dedicated L2 caches can offer up to a 7-percent performance increase.



Figure 2. Block diagram of Intel’s Montecito. The dual cores and threads realize performance unattainable in the Itanium 2 processor. Montecito also addresses Itanium 2 port asymmetries and inefficiencies in branching, speculation, and cache hierarchy.

The L2I holds 1 Mbyte; is eight-way set associative; and has a 128-byte line size—yet it has the same seven-cycle instruction-access latency as the smaller Itanium 2 unified cache. The tag and data arrays are single ported, but the control logic supports out-of-order and pipelined accesses, which enable a high utilization rate.

Montecito's L2D has the same structure and organization as the Itanium 2's shared 256-Kbyte L2 cache but with several microarchitectural improvements to increase throughput. The L2D hit latency remains at five cycles for integer and six cycles for floating-point accesses. The tag array is true four-ported—four fully independent accesses in the same cycle—and the data array is pseudo-four-ported with 16-byte banks.

Montecito optimizes several aspects of the L2D. In the Itanium 2, any accesses to the same cache line beyond the first access that misses L2 will access the L2 tags periodically until the tags detect a hit. The repeated tag queries consume bandwidth from the core and increase the L2 miss latency. Montecito suspends such secondary misses until the L2D fill occurs. At that point, the fill immediately satisfies the suspended request. This approach greatly reduces bandwidth contention and final latency. The L2D, like the Itanium 2's L2, is out of order, pipelined, and tracks 32 requests (L2D hits or L2D misses not yet passed to the L3 cache) in addition to 16 misses and their associated victims. The difference is that Montecito allocates the 32 queue entries more efficiently, which provides a higher concurrency level than with the Itanium 2.
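To make the suspended-secondary-miss behavior concrete, here is a generic sketch of the bookkeeping idea—a small miss-tracking table that parks later requests to an already-missing line and satisfies them when the fill returns. The sizes, names, and C structure are illustrative assumptions, not Montecito's actual L2D control logic.

```c
/* Generic sketch of suspending secondary misses: the first miss to a line
 * allocates a tracking entry; later requests to the same line are parked on
 * that entry instead of re-probing the tags, and are satisfied when the fill
 * arrives.  Sizes and names are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

#define MISS_ENTRIES  16          /* the text: 16 misses tracked */
#define MAX_SUSPENDED  8          /* illustrative limit */

typedef struct {
    bool     valid;
    uint64_t line_addr;
    int      n_suspended;
    int      suspended_ids[MAX_SUSPENDED];   /* requests waiting on this fill */
} miss_entry;

static miss_entry miss_table[MISS_ENTRIES];

/* A request misses the cache: attach it to an outstanding miss if one
 * exists (secondary miss), otherwise allocate a new entry (primary miss). */
static bool record_miss(uint64_t line_addr, int req_id)
{
    for (int i = 0; i < MISS_ENTRIES; i++) {
        miss_entry *e = &miss_table[i];
        if (e->valid && e->line_addr == line_addr) {
            if (e->n_suspended < MAX_SUSPENDED)
                e->suspended_ids[e->n_suspended++] = req_id;
            /* if the list is full, real hardware would stall the request */
            return true;          /* suspended: no repeated tag queries */
        }
    }
    for (int i = 0; i < MISS_ENTRIES; i++) {
        if (!miss_table[i].valid) {
            miss_table[i] = (miss_entry){ .valid = true, .line_addr = line_addr };
            return false;         /* primary miss: start the fill */
        }
    }
    return false;                 /* table full: hardware would stall */
}

/* The fill for a line returns: immediately satisfy every suspended request. */
static void fill_arrived(uint64_t line_addr, void (*satisfy)(int req_id))
{
    for (int i = 0; i < MISS_ENTRIES; i++) {
        miss_entry *e = &miss_table[i];
        if (e->valid && e->line_addr == line_addr) {
            for (int j = 0; j < e->n_suspended; j++)
                satisfy(e->suspended_ids[j]);
            e->valid = false;
            e->n_suspended = 0;
        }
    }
}
```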

Level 3 cache. Montecito's L3 cache remains unified as in previous Itanium processors, but is now 12 Mbytes. Even so, it maintains the same 14-cycle integer-access latency typical of the L3s in the 6- and 9-Mbyte Itanium 2 family. Montecito's L3 uses an asynchronous interface with the data array to achieve this low latency; there is no clock, only a read or write valid indication.5 The read signal is coincident with index and way values that initiate L3 data-array accesses. Four cycles later, the entire 128-byte line is available and latched. The array then delivers this data in four cycles to either the L2D or L2I in critical-byte order. The asynchronous design not only helps lower access time, but also significantly reduces power over the previous synchronous designs.

Montecito's L3 receives requests from both the L2I and L2D but gives priority to the L2I request in the rare case of a conflict. Conflicts are rare because Montecito moves the arbitration point from the Itanium 2's L1-L2 to L2-L3, which greatly reduces conflicts because of L2I and L2D's high hit rates.

Dual threads
Itanium 2 processors have led the industry in all three categories of performance: integer, floating-point, and throughput computing. However, the gap between processor speed and memory-system speed continues to increase with each semiconductor generation, and the additional memory latency typical of large systems only exacerbates the problem. For many integer and floating-point applications, instruction and data prefetching in combination with Montecito's fast and large cache hierarchy effectively bridge the processor-memory chasm. However, commercial transaction-processing codes, characterized by large memory footprints, realize only limited benefits from these techniques. In fact, although the L2I reduces instruction-stream stalls, it increases the data stall component as a percentage of execution time.6

As Figure 3 shows, Montecito addresses the problem of ever-growing memory latency by offering two hardware-managed threads that reduce stall time and provide additional performance. The dual-thread implementation duplicates all architectural and some microarchitectural state to create two logical processors that share all the parallel execution resources and the memory hierarchy. Resource sharing is a blend of temporal multithreading (TMT) for the core and simultaneous multithreading (SMT) for the memory hierarchy. The cache hierarchy simultaneously shares resources, such as queue entries and cache lines, across the two threads, which gives each thread equal access to the resources. Montecito's designers chose not to pursue an SMT approach at the core level for several reasons. The most important is that internal studies show an average three and a maximum six IPC for integer and enterprise workloads, and an average five IPC for technical workloads before cache misses cause a degradation. Thus, the core utilization level is not the problem; rather it is decreasing the impact of memory accesses on IPC. Given the underlying in-order execution pipeline, the small potential return for supporting SMT at the core would not justify its cost. The improvements in the cache hierarchy, though intended to benefit instruction-level parallelism, accommodate the increased demand that dual threads create.

Figure 3. How two threads share a core. Control logic monitors the workload's behavior and dynamically adjusts the time quantum for a thread. If the control logic determines that a thread is not making progress, it suspends that thread and gives execution resources to the other thread. This partially offsets the cost of long-latency operations, such as memory accesses. The time to execute a thread switch—white rectangles at the side of each box—consumes a portion of the idle time.


Other shared structures, such as branch-prediction structures and the advanced load address table (ALAT), require special consideration to support the two threads. The branch-prediction structures use thread tagging so that one thread's predictions do not taint the other's prediction history. Each thread contains a complete set of return stack structures as well as an ALAT, since simple tagging would not be sufficient for these key performance structures.

Duplicating the complete architectural state represents the most significant change to the core and requires a core area (excluding the L3 cache) less than 2 percent larger than that for a single-thread implementation. Moreover, this growth, though mainly in the already large register files, does not affect the register file and bypass network's critical paths because Montecito includes a memory cell in which two storage cells share bit lines.7

Thread switch control
In-order core execution, coupled with the exclusive-resource-ownership attribute of temporal multithreading, lets Montecito incorporate two threads without affecting legacy software optimizations and with minimal changes to pipeline control logic. The competitively shared resources of the memory subsystem (caches and TLBs) required additional effort to ensure correctness and to provide special optimizations that avoid negative thread interactions. Part of this additional effort centered on thread-switch control.

The control logic for thread switching leverages the existing pipeline-flush mechanism to transfer control from one thread to the other. A thread switch is asynchronous to the core pipeline and is attached to the issue group of instructions in the exception-detection (DET) pipeline stage.3 The switch becomes the highest priority event and causes a pipeline flush without allowing the issue group to raise faults or commit results. The thread-switch control logic asserts this pipeline flush for seven cycles for a total switch penalty of 15 cycles—seven more than the typical eight-cycle pipeline flush.

Five events can lead to a thread switch, many of which affect a thread's urgency—an indication of its ability to use core resources effectively:

• L3 cache miss/data return. L3 misses and data returns can trigger thread switches, subject to thread urgency comparison (described later). An L3 miss in the foreground thread is likely to cause that thread to stall and hence initiate a switch. Similarly, a data return to the L3 for the background thread is likely to resolve data dependences and is an early indication of execution readiness.
• Timeout. Thread-quantum counters ensure fairness in a thread's access to the pipeline execution resources. If the thread quantum expires when the thread was effectively using the core, thread-switch control logic switches it to the background and sets its urgency to a high value to indicate its execution readiness.
• ALAT invalidation. In certain circumstances, utilization and performance increase when a thread yields pipeline resources to the background thread while waiting for some system events. For example, spin-lock codes can allocate an address into the ALAT and then yield execution to the background thread. When an external access invalidates the ALAT entry, the action causes a thread switch back to the yielding thread. This event lets the threads share core-execution resources efficiently while decreasing the total time the software spends in critical sections.
• Switch hint. The Itanium architecture provides the hint@pause instruction, which can trigger a thread switch to yield execution to the background thread. The software can then indicate when the current thread does not need core resources (a usage sketch follows this list).
• Low-power mode. When the active thread has transitioned to a quiesced low-power mode, the action triggers a thread switch to the background thread so that it can continue execution. If both threads are in a quiesced low-power state, an interrupt targeting the background thread awakens the thread and triggers a thread switch, while an interrupt targeting the foreground thread just awakens the foreground thread.
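As an illustration of the ALAT-invalidation and switch-hint events, the following sketch shows a spin lock that yields the core while it waits. The hint_pause() wrapper is hypothetical—it stands in for whatever intrinsic or inline assembly a particular Itanium toolchain uses to emit hint@pause—and the lock itself is generic C11, not code from the Montecito team.

```c
#include <stdatomic.h>

/* Hypothetical wrapper: on an Itanium toolchain this would emit the
 * hint@pause instruction (for example, via inline assembly).  It is a
 * stub here so the sketch compiles anywhere. */
static inline void hint_pause(void) { }

typedef struct { atomic_int locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* While the lock is held (likely by software running on the other
     * logical processor), tell the core this thread has no useful work,
     * so the thread-switch logic can give the pipeline to the other thread. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire)) {
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            hint_pause();        /* yield execution to the background thread */
    }
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```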

Given that reducing the negative impact of memory latency is the primary motivation for having two threads, the most common switch event is the L3 cache miss/data return. Other events, such as the timeout, provide fairness, while the ALAT invalidation and switch hint events provide paths for the software to influence thread switches.

Switching based on thread urgency
Each thread has an urgency that can take on values from 0 to 7. A value of 0 denotes that a thread has no useful work to perform. A value of 7 signifies that a thread was actively making forward progress before it was forced to the background. The nominal urgency value of 5 indicates that a thread is actively progressing.

Figure 4 shows a typical urgency-based switch scenario. Whenever the software experiences a cache hierarchy miss or a system interface data return (L3 miss/return event), the thread-switch control logic compares the urgency values of the two threads. If the urgency of the foreground thread is lower than that of the background thread, the control logic will initiate a thread switch. Every L3 miss decrements the urgency by 1, eventually saturating at 0. Similarly, every data return from the system interface increments the urgency by 1 as long as the urgency is below 5.

Figure 4. Urgency and thread switches on the Montecito processor.

The thread-switch control logic sets the urgency to 7 for a thread that switches because of a timeout event. An external interrupt directed at the background thread will set the urgency for that thread to 6, which increases the probability of a thread switch and provides a reasonable response time for interrupt servicing.
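The urgency rules above can be summarized in a few lines of C. This minimal sketch encodes only the stated behavior—saturating decrements on L3 misses, increments on data returns up to the nominal value of 5, the timeout and interrupt special cases, and the switch comparison; the names and structure are illustrative, not the actual thread-switch control logic.

```c
#include <stdbool.h>

enum { URGENCY_MIN = 0, URGENCY_NOMINAL = 5, URGENCY_INTERRUPT = 6, URGENCY_MAX = 7 };

typedef struct {
    int urgency;   /* 0 = no useful work, 5 = nominal, 7 = forced out while progressing */
} hw_thread;

/* Every L3 miss lowers the thread's urgency, saturating at 0. */
void on_l3_miss(hw_thread *t)            { if (t->urgency > URGENCY_MIN) t->urgency--; }

/* Every data return raises urgency, but only up to the nominal value of 5. */
void on_data_return(hw_thread *t)        { if (t->urgency < URGENCY_NOMINAL) t->urgency++; }

/* A timeout switch marks the outgoing thread as ready to run again (urgency 7);
 * an external interrupt aimed at the background thread sets its urgency to 6. */
void on_timeout_switch(hw_thread *t)     { t->urgency = URGENCY_MAX; }
void on_background_interrupt(hw_thread *t) { t->urgency = URGENCY_INTERRUPT; }

/* On an L3 miss/data-return event, switch only if the foreground thread's
 * urgency has fallen below the background thread's. */
bool should_switch(const hw_thread *fg, const hw_thread *bg)
{
    return fg->urgency < bg->urgency;
}
```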

Dual cores and the arbiter
As we described earlier, Montecito is a single die with duplicate dual-threaded cores and caches. The two cores attach to the system interface through the arbiter, which provides a low-latency path for each core to initiate and respond to system events.

Because the two cores share a system interface, the bandwidth demands on the shared topology effectively double. Consequently, the system interface bandwidth increases from 6.4 gigabytes per second (GBps) with up to five electrical loads to 10.66 GBps with up to three electrical loads. Montecito systems can use the higher-bandwidth three-load design while maintaining the computational density of the five-load configuration. The increased bandwidth easily absorbs the additional pressure that the dual threads put on the system interface.

Figure 5 is a block diagram of the arbiter, which organizes and optimizes each core's requests to the system interface, ensures fairness and forward progress, and collects responses from each core to provide a unified system interface response. The arbiter maintains each core's unique identity to the system interface and operates at a fixed ratio to the system interface frequency. An asynchronous interface between the arbiter and each core lets the core and cache frequency vary as needed to support Foxton technology, which is key to Montecito's power-management strategy. As the figure shows, the arbiter consists of a set of address queues, data queues, and synchronizers, as well as logic for core and system interface arbitration. (For simplicity, the figure omits the error-correction code (ECC) encoders/decoders and parity generators.)


Figure 5. Montecito's arbiter and its queues.

The arbitration logic ensures that each core gets an equal share of bandwidth during peak demand and low-latency access during low demand. The core initiates one of three types of accesses, which the arbiter allocates to the following queues and buffers:

• Request queue. This is the primary address queue that supports most request types. Each core has four request queues.
• Write address queue. This queue holds addresses only and handles explicit writebacks and partial line writes. Each core has two write address queues.
• Clean castout queue. This queue holds the address for the clean castout (directory and snoop filter update) transactions. The arbiter holds pending transactions until it issues them on the system interface. Each core has four clean castout queues.
• Write data buffer. This buffer holds outbound data and has a one-to-one correspondence with addresses in the write address queue. Each core has four write data buffers, with the additional two buffers holding implicit writeback data.

Other structures in the arbiter track and initiate communication from the system interface to the cores. The Snoop queue issues snoop requests to the cores as needed and coalesces the snoop response from each core into a unified snoop response for the socket.

The arbiter delivers all data returns directly to the appropriate core using a unique identifier provided with the initial request. It delivers broadcast transactions, such as interrupts and TLB purges, to both cores in the same way that delivery would occur if each core were connected directly to the system interface.

Fairness and arbitration
The arbiter interleaves core requests on a one-to-one basis when both cores have transactions to issue. When only one core has requests, it can issue its requests without the other core having to issue a transaction. Because read latency is the greatest concern, read requests are typically the highest priority, followed by writes, and finally clean castouts. Each core tracks the occupancy of the arbiter's queues using a credit system for flow control. As requests complete, the arbiter informs the appropriate core of the type and number of deallocated queue entries. The cores use this information to determine which, if any, transaction to issue to the arbiter.
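The credit-based flow control lends itself to a compact software analogy. The sketch below assumes the queue depths given above (four request, two write-address, four clean-castout, and four write-data entries per core); the function names and interface are illustrative assumptions, not the hardware's actual protocol.

```c
#include <stdbool.h>

enum queue_type { REQUEST, WRITE_ADDR, CLEAN_CASTOUT, WRITE_DATA, QUEUE_TYPES };

typedef struct {
    int credits[QUEUE_TYPES];   /* free arbiter entries of each type, per core */
} core_flow_control;

static void core_init(core_flow_control *fc)
{
    fc->credits[REQUEST]       = 4;
    fc->credits[WRITE_ADDR]    = 2;
    fc->credits[CLEAN_CASTOUT] = 4;
    fc->credits[WRITE_DATA]    = 4;   /* plus two more buffers hold implicit writeback data */
}

/* Core side: issue a transaction only if the arbiter has room for it. */
static bool core_try_issue(core_flow_control *fc, enum queue_type q)
{
    if (fc->credits[q] == 0)
        return false;                 /* hold the transaction in the core */
    fc->credits[q]--;                 /* entry now occupied in the arbiter */
    return true;
}

/* Arbiter side: as requests complete, it reports the type and number of
 * deallocated entries, and the core returns those credits to its pool. */
static void core_credit_return(core_flow_control *fc, enum queue_type q, int n)
{
    fc->credits[q] += n;
}
```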

Synchronization
Given each core's variable frequency and the system interface's fixed frequency, the arbiter is the perfect place to synchronize the required communication channels. Montecito allocates all communication between the core and the arbiter to an entry in an array, or synchronizer, in that core's clock domain. The arbiter then reads from the array's entries in its own clock domain. These synchronizers can send data directly to the core without queuing when a synchronizer is empty.

With this design, the core can run much faster or much slower (within limits) than the arbiter depending on the system's and cores' power constraints. Communication from the arbiter to the core occurs in a similar, but reversed, manner using the synchronizers. A few signals do not need queuing, so simple synchronization for clock-domain crossing suffices.

Power efficiency
If the Montecito design team had taken the Itanium 2 power requirement of 130W as a baseline, implementing Montecito using a 90-nm process would have required power delivery and dissipation of nearly 300W.2 And this excludes the additional power required by the increased resource use resulting from improvements targeting instruction- and thread-level parallelism.

Montecito, however, requires only 100W. Such a power constraint typically comes with severe frequency sacrifices, but as we described earlier, Montecito can execute enterprise applications using all the available cache, thread, and core features at over 1.8 GHz. Foxton technology is one of the key innovations that make this performance possible.

Dynamic voltage and frequency management
Foxton technology's goal and methods are significantly different from those of power-management technologies in other processors. Perhaps the greatest difference is that Foxton technology attempts to tap unused power by dynamically adjusting processor voltage and frequency to ensure the highest frequency8 within temperature and power constraints. The 6-Mbyte Itanium 2, for example, consumes 130W for worst-case code1 and has power limits that keep it from using a higher voltage, and thus a higher frequency or performance point. But low-power applications, such as enterprise or integer workloads, often consume only 107W. Consequently, 23W of power—and hence performance—remain untapped.

When Foxton technology detects lower power consumption, it increases the voltage. The ensuing frequency increase for enterprise applications is 10 percent over the base frequency. This frequency change is nearly instantaneous and exposes the entire power and thermal envelope to every application and protects against power and temperature overages. Foxton technology also enables optimal support of demand-based switching (DBS) so that the operating system or server management software can lower the overall socket power requirement when saving power is important.

Voltage and frequency control loops. Figure 6 shows how Foxton technology can provide a near cubic reduction in power when needed. An on-chip ammeter feeds into an on-chip microcontroller that uses digital signal processing to determine power consumption. The microcontroller runs a real-time scheduler to support multiple tasks—calibrating the ammeter, sampling temperature, performing power calculations, and determining the new voltage level. The microcontroller will also identify when the available control (voltage or frequency) cannot contain an over-power or over-temperature condition and indicates these situations to Montecito's platform for mitigation. If catastrophic failures occur in the cooling system, the microcontroller will safely shut down the processor.

Figure 6. How Foxton technology controls voltage and frequency in Montecito. Foxton technology uses two control loops, one for voltage and one for frequency. Components in the gray area are part of Montecito.

When the microcontroller detects a need, it will change the voltage it requests of the variable voltage supply in 12.5 mV increments. The voltage control loop responds within 100 ms.
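The power-control loop can be pictured with a short C model: sample power and temperature, step the requested supply voltage in 12.5-mV increments, and flag conditions that voltage control alone cannot contain. The sensor and supply hooks are stubs and the thresholds are invented for illustration; this is a toy model of the policy described above, not Foxton's firmware (the catastrophic cooling-failure shutdown path is omitted).

```c
#include <stdbool.h>
#include <stdio.h>

#define VOLTAGE_STEP_MV 12.5

/* Stub sensor/actuator hooks: in the real design these would read the
 * on-chip ammeter and temperature sensors and program the variable
 * voltage supply; here they are placeholders so the sketch compiles. */
static double read_power_watts(void)        { return 95.0; }
static double read_temperature_c(void)      { return 70.0; }
static void   request_voltage_mv(double mv) { printf("request %.1f mV\n", mv); }
static void   signal_platform_overage(void) { puts("overage: ask platform to mitigate"); }

/* One iteration of the control policy: raise the voltage (and therefore
 * frequency) while there is power and thermal headroom; lower it when the
 * budget is exceeded; tell the platform when voltage control alone cannot
 * contain the condition. */
static double control_step(double v_mv, double v_min_mv, double v_max_mv,
                           double power_limit_w, double temp_limit_c)
{
    bool over = read_power_watts() > power_limit_w ||
                read_temperature_c() > temp_limit_c;

    if (!over && v_mv + VOLTAGE_STEP_MV <= v_max_mv)
        v_mv += VOLTAGE_STEP_MV;          /* unused headroom: tap it */
    else if (over && v_mv - VOLTAGE_STEP_MV >= v_min_mv)
        v_mv -= VOLTAGE_STEP_MV;          /* over budget: back off */
    else if (over)
        signal_platform_overage();        /* already at the floor */

    request_voltage_mv(v_mv);
    return v_mv;
}

int main(void)
{
    double v_mv = 1000.0;                 /* illustrative starting voltage */
    for (int i = 0; i < 5; i++)
        v_mv = control_step(v_mv, 900.0, 1200.0, 100.0, 85.0);
    return 0;
}
```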


Voltage-frequency coordination. Because voltage directly impacts the transistors' ability to operate correctly at a given frequency, Montecito monitors the voltage throughout the chip with 24 voltage sensors.9 The sensors respond to local voltage changes from current-induced droops and to global voltage changes that Foxton technology has imposed. The sensors select an appropriate frequency for the given voltage and coordinate efforts to determine the appropriate frequency: If one sensor detects a lower voltage, all sensors decrease frequency. Likewise, the last sensor to measure a higher voltage enables all detectors to increase frequency.

A digital frequency divider (DFD) provides the selected frequency, in nearly 20-MHz increments and within a single cycle, without requiring the on-chip phase-locked loops (PLLs) to resynchronize.10

Thus, Foxton technology can specify a voltage, monitor the voltage at the cores, and have the clock network respond nearly instantaneously. The fine increments and response times in both the clock and the voltage planes mean that the application can use the entire power envelope. This contrasts with alternatives in which the frequency and voltage have specific pairs and often require PLL resynchronization at transitions, leaving parts of the power envelope untapped.

Demand-based switching
At times, it makes sense to sacrifice performance for lower power, such as when processor utilization is low. For these cases, Montecito provides DBS, which lets an operating system that is monitoring power specify a lower power envelope and realize power savings without adversely affecting performance. The operating system can invoke DBS by selecting from a set of advanced configuration and power interface (ACPI) performance states.

Because Montecito varies both voltage and frequency, systems can realize a significant power savings through this option with a small performance impact; typically a 3 percent power change has a 1 percent frequency change.
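That roughly 3-to-1 ratio between power and frequency change is consistent with the near cubic relationship mentioned earlier. As a back-of-the-envelope check—an illustration using the standard dynamic-power approximation, not a formula from the Montecito designers:

$$P \approx C\,V^{2}f, \qquad f \propto V \;\Rightarrow\; P \propto f^{3} \;\Rightarrow\; \frac{\Delta P}{P} \approx 3\,\frac{\Delta f}{f},$$

so changing frequency by about 1 percent changes power by roughly 3 percent.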
Additionally, Montecito lets original equipment manufacturers (OEMs) establish a maximum power level less than 100W, which means that Montecito will automatically adjust its voltage and frequency to fit comfortably in a 70W blade or a 100W high-performance server.

Other sources of power efficiency
Foxton technology and DBS are not the only innovations that enable Montecito's power efficiency. Many improvements throughout the core and cache increase power efficiency as well. Clock gating, however, a design strategy traditionally used to reduce average power, found limited application in Montecito. The idea behind clock gating is to suspend clocks to logic when that logic is idle. However, because Montecito starts with a highly utilized pipeline and adds improvements to boost performance, few areas of logic remain idle for an extended period. Moreover, turning the suspended clocks back on comes with a latency that either negatively affects performance or forces speculative clock enabling, which in turn lessens clock gating's benefit.

For these reasons, Montecito applies clock gating to blatantly useless signal transitions, but avoids it when clock gating would negatively affect overall performance. However, in the L3 cache, the Montecito designers took clock gating to the extreme by completely removing the clock for data-array accesses. The asynchronous interface to the L3 data array drastically lowers the power required to complete a read or write operation. A synchronous design, with a clock transitioning every cycle over a large area, is estimated to consume about 10W more than the asynchronous design. Montecito designers also replaced many power-hungry circuits with less aggressive and more power-conscious ones. All these power savings result in increased frequency and hence performance.

More "Ilities"
Itanium processors are already known for their outstanding reliability, availability, and serviceability (RAS), but Montecito expands the Itanium 2 RAS by providing protection for every major processor memory array, from the register file7 to the TLBs and caches.11 Because Montecito parity protects even non-data-critical resources, when a pair of Montecito processors operate in lockstep, Montecito can indicate detected errors externally and avoid fatal lockstep divergences.

Error reporting
For Montecito, the key to providing reliability and availability is a partnership between the processor—with its array protection, error logging, and precision—and the software stack—which will correct, contain, and report the failure. Montecito provides robust error detection and correction for critical data. For software-based correction and data extraction, it provides hooks to support low-level array access and a robust set of information about the event so that the software stack can mitigate the errors.

Data protection
Montecito's variety of structures and arrays results in an approach tailored to the structure and its purpose. Some arrays use ECC; others use parity. Some error responses are immediate; some are deferred. Some errors are machine correctable; others require mitigation from the software stack.

Data criticality and structure size determined the method of structure protection. Montecito replaced Itanium 2's parity with ECC for key structures, for example, while adding parity to many structures. The criticality and size also enabled the use of Pellston technology,11 which maintains availability in the face of hard errors developed in the L3 cache while providing a mechanism for Montecito to scrub detected errors. Montecito also has parity protection for the register files (both integer and floating point),7 temporary system interface data, and L2I data and tag bits. Parity protects some noncritical elements as well, such as branch-prediction structures and some cache-replacement structures. Many designs ignore these errors, but systems that use socket-level lockstep to provide higher reliability and data protection require reporting of such errors externally. If these systems know that an internally detected error caused a loss of lockstep, they can recover and so become available more often. System software can configure large arrays such as the L3 tag and data to correct errors in line, which does not induce a loss of lockstep.

Virtualization support
Customers continue to demand a better return on their computational technology investment, particularly ease of management. Ways to achieve that return include consolidated workloads, resource management, and optimized processor utilization.

Montecito's virtualization technology, a mixture of hardware and firmware, provides a conduit for efficient virtual machine monitor (VMM) implementations. VMMs address management requirements by allowing multiple operating system instances to utilize the underlying hardware at the same time without any knowledge at the operating-system level that the processor is shared.

Virtualization support in Montecito includes a fast mechanism to switch between guest and host modes, a virtual address protection scheme to isolate the guests' address space from that of the VMM host, and privileged resource-access interceptions with a new virtualization fault to abstract the processor from the OS. The huge amount of processing capability a single Montecito provides means that virtual machine support will allow multiple virtual partitions to coexist and share this power.

These features, when combined with key firmware support, let VMM software abstract the processor from the operating system such that a single processor can run multiple operating systems while each operating system thinks it has complete control of the processor.

Intel designed the Montecito processor to address enterprise, technical, and integer workloads. Important attributes for success in these areas include reliability, power efficiency, and of course performance. While Foxton technology advances Montecito as a power-efficient design, the core, cache hierarchy, dual-thread, and dual-core changes provide Montecito with top-end performance.

Overall, supporting dual threads in Montecito cost relatively little, and we expect performance to increase 8 to 30 percent over other Itanium processors, depending on the system and workload. Figure 7 shows relative performance for integer, technical, and enterprise workloads. The integer and technical workloads essentially scale with frequency for the previous Itanium processors, but Montecito's core changes should allow it to perform better than raw frequency would provide. Montecito's enterprise performance stands out over previous Itanium processors, in large part because of Montecito's dual threads and dual cores. Montecito, with its high performance, low power, and many capabilities, is well prepared and suited for the server market. MICRO

Figure 7. Estimated performance of Montecito relative to previous Itanium 2 processors.

Acknowledgments


We thank the Montecito design, verification, and performance-evaluation teams for their contributions in successfully delivering the first dual-thread, dual-core Itanium processor. The work we describe in this article would not have been possible without the contribution of each individual on those teams.

References
1. S. Rusu et al., "Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache," IEEE Micro, vol. 24, no. 2, Mar.-Apr. 2004, pp. 10-16.
2. S. Naffziger et al., "The Implementation of a 2-core, Multi-threaded Itanium Family Processor," Int'l Solid State Circuits Conf. Digest of Technical Papers, IEEE Press, Feb. 2005, pp. 182-183.
3. C. McNairy and D. Soltis, "Itanium 2 Processor Microarchitecture," IEEE Micro, vol. 23, no. 2, Mar.-Apr. 2003, pp. 44-55.
4. M. Mock et al., "An Empirical Study of Data Speculation Use on the Intel Itanium 2 Processor," to be published in Proc. Workshop Interaction Between Compilers and Computer Architecture, IEEE CS Press, 2005.
5. J. Wuu et al., "The Asynchronous 24 MB On-Chip Level 3 Cache for a Dual-core Itanium Architecture Processor," Int'l Solid State Circuits Conf. Digest of Technical Papers, IEEE Press, Feb. 2004, pp. 488-487.
6. L. Spracklen et al., "Effective Instruction Prefetching in Chip Multiprocessors for Modern Commercial Applications," Proc. High-Performance Computer Architecture, IEEE CS Press, 2005, pp. 225-236.
7. E. Fetzer et al., "The Multi-threaded, Parity Protected, 128 Word Register Files on a Dual-core Itanium Architecture Processor," Int'l Solid State Circuits Conf. Digest of Technical Papers, IEEE Press, 2005, pp. 382-383.
8. C. Poirier et al., "Power and Temperature Control on a 90 nm Itanium Architecture Processor," Int'l Solid State Circuits Conf. Digest of Technical Papers, IEEE Press, 2005, pp. 304-305.
9. E. Fetzer et al., "Clock Distribution on a Dual-core, Multi-threaded Itanium Architecture Processor," Int'l Solid State Circuits Conf. Digest of Technical Papers, IEEE Press, 2005, pp. 292-293.
10. T. Fischer et al., "A 90 nm Variable Frequency Clock System for a Power-Managed Itanium Architecture Processor," Int'l Solid State Circuits Conf. Digest of Technical Papers, IEEE Press, 2005, pp. 294-295.
11. C. McNairy and J. Mayfield, "Montecito Error Protection and Mitigation," to be published in Proc. Workshop High-Performance Computing Reliability Issues, IEEE CS Press, 2005.

Cameron McNairy is an architect for the Montecito program at Intel. He was also a microarchitect for the Itanium 2 processor, contributing to its design and final validation. In future endeavors, McNairy plans to focus on performance; reliability, availability, and serviceability; and system interface issues in Itanium processors. He received a BSEE and an MSEE from Brigham Young University and is a member of the IEEE.

Rohit Bhatia is an Itanium processor microarchitect at Intel. His research interests include Itanium processor microarchitecture and functional verification. Bhatia has an MSEE from the University of Minnesota and an MS in engineering management from Portland State University.

Direct questions and comments about this article to Cameron McNairy, Intel, 3400 East Harmony Rd., HRM1-1, Fort Collins, CO 80528; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
