NIAGARA: A 32-WAY MULTITHREADED SPARC PROCESSOR

THE NIAGARA PROCESSOR IMPLEMENTS A -RICH ARCHITECTURE

DESIGNED TO PROVIDE A HIGH-PERFORMANCE SOLUTION FOR COMMERCIAL

SERVER APPLICATIONS. THE HARDWARE SUPPORTS 32 THREADS WITH A

MEMORY SUBSYSTEM CONSISTING OF AN ON-BOARD CROSSBAR, LEVEL-2

CACHE, AND MEMORY CONTROLLERS FOR A HIGHLY INTEGRATED DESIGN

THAT EXPLOITS THE THREAD-LEVEL PARALLELISM INHERENT TO SERVER

APPLICATIONS, WHILE TARGETING LOW LEVELS OF POWER CONSUMPTION.

Over the past two decades, micro- application performance by improving processor designers have focused on improv- throughput, the total amount of work done ing the performance of a single thread in a across multiple threads of execution. This is desktop processing environment by increas- especially effective in commercial server appli- ing frequencies and exploiting instruction cations such as databases1 and Web services,2 level parallelism (ILP) using techniques such which tend to have workloads with large as multiple instruction issue, out-of-order amounts of thread level parallelism (TLP). issue, and aggressive branch prediction. The In this article, we present the Niagara Poonacha Kongetira emphasis on single-thread performance has processor’s architecture. This is an entirely shown diminishing returns because of the lim- new implementation of the Sparc V9 archi- Kathirgamar Aingaran itations in terms of latency to main memory tectural specification, which exploits large and the inherently low ILP of applications. amounts of on-chip parallelism to provide Kunle Olukotun This has led to an explosion in microproces- high throughput. Niagara supports 32 hard- sor design complexity and made power dissi- ware threads by combining ideas from chip pation a major concern. multiprocessors3 and fine-grained multi- For these reasons, Sun Microsystems’ Nia- threading.4 Other studies5 have also indicated gara processor takes a radically different the significant performance gains possible approach to microprocessor design. Instead using this approach on multithreaded work- of focusing on the performance of single or loads. The parallel execution of many threads dual threads, Sun optimized Niagara for mul- effectively hides memory latency. However, tithreaded performance in a commercial serv- having 32 threads places a heavy demand on er environment. This approach increases the memory system to support high band-

0272-1732/05/$20.00 ” 2005 IEEE Published by the IEEE computer Society 21

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. HOT CHIPS 16

Table 1. Commercial server applications.

Instruction-level Thread-level Working Data Benchmark Application category parallelism parallelism set sharing Web99 Web server Low High Large Low JBB Java application server Low High Large Medium TPC-C Transaction processing Low High Large High SAP-2T Enterprise resource planning Medium High Medium Medium SAP-3T Enterprise resource planning Low High Large High TPC-H Decision support system High High Large Medium

width. To provide this bandwidth, a crossbar a server running these applications is the sus- interconnects scheme routes memory refer- tained throughput of client requests. ences to a banked on-chip level-2 cache that Furthermore, the deployment of servers all threads share. Four independent on-chip commonly takes place in high compute den- memory controllers provide in excess of 20 sity installations such as data centers, where Gbytes/s of bandwidth to memory. supplying power and dissipating server- Exploiting TLP also lets us improve per- generated heat are very significant factors in formance significantly without pushing the the center’s cost of operating. Experience at envelope on CPU clock frequency. This and Google shows a representative power density the sharing of CPU pipelines among multi- requirement of 400 to 700 W/sq. foot for ple threads enable an area- and power- racked server clusters.2 This far exceeds the typ- efficient design. Designers expect Niagara to ical power densities of 70 to 150 W/foot2 sup- dissipate about 60 W of power, making it ported by commercial data centers. It is very attractive for high compute density envi- possible to reduce power consumption by sim- ronments. In data centers, for example, power ply running the ILP processors in server clus- supply and air conditioning costs have ters at lower clock frequencies, but the become very significant. Data center racks proportional loss in performance makes this often cannot hold a complete complement of less desirable. This situation motivates the servers because this would exceed the rack’s requirement for commercial servers to improve power supply envelope. performance per watt. These requirements We designed Niagara to run the Solaris have not been efficiently met using machines operating system, and existing Solaris applica- optimized for single-thread performance. tions will run on Niagara systems without Commercial server applications tend to have modification. To application software, a Nia- low ILP because they have large working sets gara processor will appear as 32 discrete proces- and poor locality of reference on memory sors with the OS layer abstracting away the access; both contribute to high cache-miss hardware sharing. Many multithreaded appli- rates. In addition, data-dependent branches cations currently running on symmetric mul- are difficult to predict, so the processor must tiprocessor (SMP) systems should realize discard work done on the wrong path. Load- performance improvements. This is consistent load dependencies are also present, and are not with observations from previous multithread- detectable in hardware at issue time, resulting ed-processor development at Sun6,7 and from in discarded work. The combination of low Niagara chips and systems, which are under- available ILP and high cache-miss rates caus- going bring-up testing in the laboratory. es memory access time to limit performance. Recently, the movement of many retail and Therefore, the performance advantage of using business processes to the Web has triggered a complex ILP processor over a single-issue the increasing use of commercial server appli- processor is not significant, while the ILP cations (Table 1). These server applications processor incurs the costs of high power and exhibit large degrees of client request-level par- complexity, as Figure 1 shows. allelism, which servers using multiple threads However, server applications tend to have can exploit. The key performance metric for large amounts of TLP. Therefore, shared-

22 IEEE MICRO

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. memory machines with discrete single-thread- Single ed processors and coherent interconnect have issue CMC M C M tended to perform well because they exploit ILP CMC M C M TLP. However, the use of an SMP composed of multiple processors designed to exploit ILP TLP CM is neither power efficient nor cost-efficient. A (on shared Time saved CM more efficient approach is to build a machine single issue pipeline) using simple cores aggregated on a single die, CM with a shared on-chip cache and high band- width to large off-chip memory, thereby Memory latency Compute latency aggregating an SMP server on a chip. This has the added benefit of low-latency communi- Figure 1. Behavior of processors optimized for TLP and ILP on cation between the cores for efficient data commercial server workloads. In comparison to the single- sharing in commercial server applications. issue machine, the ILP processor mainly reduces compute time, so memory access time dominates application perfor- Niagara overview mance. In the TLP case, multiple threads share a single-issue The Niagara approach to increasing pipeline, and overlapped execution of these threads results in throughput on commercial server applications higher performance for a multithreaded application. involves a dramatic increase in the number of threads supported on the processor and a memory subsystem scaled for higher band- of bandwidth. A two-entry queue is available widths. Niagara supports 32 threads of exe- for each source-destination pair, and it can cution in hardware. The architecture queue up to 96 transactions each way in the organizes four threads into a thread group; the crossbar. The crossbar also provides a port for group shares a processing pipeline, referred to communication with the I/O subsystem. as the Sparc pipe. Niagara uses eight such Arbitration for destination ports uses a sim- thread groups, resulting in 32 threads on the ple age-based priority scheme that ensures fair CPU. Each SPARC pipe contains level-1 scheduling across all requestors. The crossbar caches for instructions and data. The hard- is also the point of memory ordering for the ware hides memory and pipeline stalls on a machine. given thread by scheduling the other threads The memory interface is four channels of in the group onto the SPARC pipe with a zero dual-data rate 2 (DDR2) DRAM, supporting cycle switch penalty. Figure 1 schematically a maximum bandwidth in excess of 20 shows how reusing the shared processing Gbytes/s, and a capacity of up to 128 Gbytes. pipeline results in higher throughput. Figure 2 shows a block diagram of the Nia- The 32 threads share a 3-Mbyte level-2 gara processor. cache. This cache is 4-way banked and pipelined for bandwidth; it is 12-way set- Sparc pipeline associative to minimize conflict misses from Here we describe the Sparc pipe implemen- the many threads. Commercial server code tation, which supports four threads. Each has data sharing, which can lead to high thread has a unique set of registers and instruc- coherence miss rates. In conventional SMP tion and store buffers. The thread group shares systems using discrete processors with coher- the L1 caches, translation look-aside buffers ent system interconnects, coherence misses go (TLBs), execution units, and most pipeline out over low-frequency off-chip buses or links, registers. We implemented a single-issue and can have high latencies. The Niagara pipeline with six stages (fetch, thread select, design with its shared on-chip cache elimi- decode, execute, memory, and write back). nates these misses and replaces them with low- In the fetch stage, the instruction cache and latency shared-cache communication. instruction TLB (ITLB) are accessed. The fol- The crossbar interconnect provides the lowing stage completes the cache access by communication link between Sparc pipes, L2 selecting the way. The critical path is set by cache banks, and other shared resources on the 64-entry, fully associative ITLB access. A the CPU; it provides more than 200 Gbytes/s thread-select multiplexer determines which of

MARCH–APRIL 2005 23

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. HOT CHIPS 16

Sparc pipe DDR 4-way MT Dram control Channel 0 L2 B0 Sparc pipe 4-way MT

Sparc pipe DDR 4-way MT Dram control Channel 1 L2 B1 Sparc pipe 4-way MT

Sparc pipe DDR 4-way MT Crossbar Dram control Channel 2 L2 B2 Sparc pipe 4-way MT

Sparc pipe DDR 4-way MT Dram control Channel 3 L2 B3 Sparc pipe 4-way MT

I/O and shared functions I/O interface

Figure 2. Niagara block diagram.

the four thread program counters (PC) should instructions before the register file is updat- perform the access. The pipeline fetches two ed. ALU and shift instructions have single- instructions each cycle. A predecode bit in the cycle latency and generate results in the cache indicates long-latency instructions. execute stage. Multiply and divide operations In the thread-select stage, the thread-select are long latency and cause a thread switch. multiplexer chooses a thread from the avail- The load store unit contains the data TLB able pool to issue an instruction into the (DTLB), data cache, and store buffers. The downstream stages. This stage also maintains DTLB and data cache access take place in the the instruction buffers. Instructions fetched memory stage. Like the fetch stage, the criti- from the cache in the fetch stage can be insert- cal path is set by the 64-entry, fully associa- ed into the instruction buffer for that thread tive DTLB access. The load-store unit if the downstream stages are not available. contains four 8-entry store buffers, one per Pipeline registers for the first two stages are thread. Checking the physical tags in the store replicated for each thread. buffer can indicate read after write (RAW) Instructions from the selected thread go hazards between loads and stores. The store into the decode stage, which performs instruc- buffer supports the bypassing of data to a load tion decode and register file access. The sup- to resolve RAW hazards. The store buffer tag ported execution units include an arithmetic check happens after the TLB access in the logic unit (ALU), shifter, multiplier, and a early part of write back stage. Load data is divider. A bypass unit handles instruction available for bypass to dependent instructions results that must be passed to dependent late in the write back stage. Single-cycle

24 IEEE MICRO

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. Fetch Thread select Decode Execute Memory Writeback

Register file × 4

ICache Instruction DCache ITLB buffer × 4 Thread ALU MUL DTLB Crossbar select Decode store interface Mux Shifter × DIV buffers 4

Instruction type Thread selects Thread Misses select logic Traps and interrupts Resource conflicts PC Thread logic select × 4 Mux

Figure 3. Sparc pipeline block diagram. Four threads share a six-stage single-issue pipeline with local instruction and data caches. Communication with the rest of the machine occurs through the crossbar interface. instructions such as ADD will update the reg- Pipeline interlocks and scheduling ister file in the write back stage. For single-cycle instructions such as ADD, The thread-select logic decides which thread Niagara implements full bypassing to younger is active in a given cycle in the fetch and thread- instructions from the same thread to resolve select stages. As Figure 3 shows, the thread-select RAW dependencies. As mentioned before, load signals are common to fetch and thread-select instructions have a three-cycle latency before stages. Therefore, if the thread-select stage the results of the load are visible to the next chooses a thread to issue an instruction to the instruction. Such long-latency instructions can decode stage, the F stage also selects the same cause pipeline hazards; resolving them requires instruction to access the cache. The thread-select stalling the corresponding thread until the haz- logic uses information from various pipeline ard clears. So, in the case of a load, the next stages to decide when to select or deselect a instruction from the same thread waits for two thread. For example, the thread-select stage can cycles for the hazards to clear. determine instruction type using a predecode In a multithreaded pipeline, threads com- bit in the instruction cache, while some traps peting for shared resources also encounter are only detectable in later pipeline stages. structural hazards. Resources such as the ALU Therefore, instruction type can cause deselec- that have a one-instruction-per-cycle through- tion of a thread in the thread-select stage, while put present no hazards, but the divider, which a late trap detected in the memory stage needs has a throughput of less than one instruction to flush all younger instructions from the thread per cycle, presents a scheduling problem. In and deselect itself during trap processing. this case, any thread that must execute a DIV

MARCH–APRIL 2005 25

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. HOT CHIPS 16

Cycles become unavailable because of pipeline stalls such as cache misses, traps, and resource con- flicts. The thread scheduler assumes that loads St0-sub Dt0-sub Et0-sub Mt0-sub Wt0-sub are cache hits, and can therefore issue a depen- dent instruction from the same thread specu-

Ft0-sub St1-sub Dt1-sub Et01ub Mt1-sub Wt1-sub latively. However, such a speculative thread is assigned a lower priority for instruction issue as compared to a thread that can issue a non- Ft1-ld St2-ld Dt2-ld Et2-ld Mt2-ld Wt2-ld Instructions speculative instruction. Figure 4 indicates the operation when all

Ft2-br St3-add Dt3-add Et3-add Mt3-add threads are available. In the figure, reading from left to right indicates the progress of an instruction through the pipeline. Reading F S D E t3-add to-add to-add to-add from top to bottom indicates new instructions fetched into the pipe from the instruction

Figure 4. Thread selection: all threads available. cache. The notation St0-sub refers to a Subtract instruction from thread 0 in the S stage of the pipe. In the example, the t0-sub is issued down Cycles the pipe. As the other three threads become available, the thread state machine selects thread1 and deselects thread0. In the second St0-ld Dt0-ld Et0-ld Mt0-ld Wt0-ld cycle, similarly, the pipeline executes the t1- sub and selects t2-ld (load instruction from

Ft0-add St1-sub Dt1-sub Et01ub Mt1-sub Wt1-sub thread 2) for issue in the following cycle. When t3-add is in the S stage, all threads have been executed, and for the next cycle the pipeline Ft1-ld St1-ld Dt1-ld Et1-ld Mt21 Wt21 Instructions selects the least recently used thread, thread0. When the thread-select stage chooses a thread

Ft1-br St0-add Dt0-add Et0-add Mt0-add for execution, the fetch stage chooses the same thread for instuction cache access. Figure 5 indicates the operation when only two threads are available. Here thread0 and thread1 are available, while thread2 and Figure 5. Thread selection: only two threads available. The ADD instruction thread3 are not. The t0-ld in the thread-select from thread0 is speculatively switched into the pipeline before the hit/miss stage in the example is a long-latency opera- for the load instruction has been determined. tion. Therefore it causes the deselection of thread0. The t0-ld itself, however, issues down the pipe. In the second cycle, since thread1 is instruction has to wait until the divider is free. available, the thread scheduler switches it in. The thread scheduler guarantees fairness in At this time, there are no other threads avail- accessing the divider by giving priority to the able and the t1-sub is a single-cycle opera- least recently executed thread. Although the tion, so thread1 continues to be selected for divider is in use, other threads can use free the next cycle. The subsequent instruction is resources such as the ALU, load-store unit, a t1-ld and causes the deselection of thread1 and so on. for the fourth cycle. At this time only thread0 is speculatively available and therefore can be Thread selection policy selected. If the first t0-ld was a hit, data can The thread selection policy is to switch bypass to the dependent t0-add in the exe- between available threads every cycle, giving cute stage. If the load missed, the pipeline priority to the least recently used thread. flushes the subsequent t0-add to the thread- Threads can become unavailable because of select stage instruction buffer, and the long-latency instructions such as loads, instruction reissues when the load returns branches, and multiply and divide. They also from the L2 cache.

26 IEEE MICRO

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. Integer register file Architectural set Working set The integer register file has three read and (compact SRAM cells) (fast RF cells) two write ports. The read ports correspond to those of a single-issue machine; the third read port handles stores and a few other three Outs[0-7] source instructions. The two write ports han- w(n+1) dle single-cycle and long-latency operations. Long-latency operations from various threads Call Locals[0-7] within the group (load, multiply, and divide) can generate results simultaneously. These Outs[0-7] Ins[0-7] Outs[0-7] instructions will arbitrate for the long-laten- w(n) Locals[0-7] cy port of the register file for write backs. Sparc Locals[0-7] V9 architecture specifies the register window Outs[0-7] Ins[0-7] Transfer implementation shown in Figure 6. A single port Ins[0-7] window consists of eight in, local, and out reg- w(n−1) Locals[0-7] isters; they are all visible to a thread at a given Return time. The out registers of a window are Ins[0-7] addressed as the in registers of the next sequential window, but are the same physical Read/write registers. Each thread has eight register win- access from pipe dows. Four such register sets support each of the four threads, which do not share register space among themselves. The register file con- Figure 6. Windowed register file. W(n -1), w(n), and w(n + 1) are neighboring sists of a total of 640 64-bit registers and is a register windows. Pipeline accesses occur in the working set while window 5.7 Kbyte structure. Supporting the many changing events cause data transfer between the working set and old or new multiported registers can be a major challenge architectural windows. This structure is replicated for each of four threads, from both area and access time considerations. We have chosen an innovative implementa- tion to maintain a compact footprint and a thread can read the register file in a given cycle. single-cycle access time. Procedure calls can request a new window, Memory subsystem in which case the visible window slides up, The L1 instruction cache is 16 Kbyte, with the old outputs becoming the new 4-way set-associative with a block size of 32 inputs. Return from a call causes the window bytes. We implement a random replacement to slide down. We take advantage of this char- scheme for area savings, without incurring sig- acteristic to implement a compact register file. nificant performance cost. The instruction The set of registers visible to a thread is the cache fetches two instructions each cycle. If working set, and we implement it using fast the second instruction is useful, the instruc- register file cells. The complete set of registers tion cache has a free slot, which the pipeline is the architectural set; we implement it using can use to handle a line fill without stalling. compact six-transistor SRAM cells. A trans- The L1 data cache is 8 Kbytes, 4-way set- fer port links the two register sets. A window associative with a line size of 16 bytes, and changing event triggers deselection of the implements a write-through policy. Even thread and the transfer of data between the though the L1 caches are small, they signifi- old and new windows. Depending on the cantly reduce the average memory access time event type, the data transfer takes one or two per thread with miss rates in the range of 10 cycles. When the transfer is complete, the percent. Because commercial server applica- thread can be selected again. This data trans- tions tend to have large working sets, the L1 fer is independent of operations to registers caches must be much larger to achieve signif- from other threads, therefore operations on icantly lower miss rates, so this sort of trade- the register file from other threads can con- off is not favorable for area. However, the four tinue. In addition, the registers of all threads threads in a thread group effectively hide the share the read circuitry because only one latencies from L1 and L2 misses. Therefore,

MARCH–APRIL 2005 27

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. HOT CHIPS 16

Hardware multithreading primer A multithreaded processor is one that allows more than one thread of sharing a next-level cache and system interfaces. Here, each processor execution to exist on the CPU at the same time. To software, a dual-thread- core executes instructions from a thread independently of the other thread, ed processor looks like two distinct CPUs, and the operating system takes and they interact through shared memory. This has been an attractive advantage of this by scheduling two threads of execution on it. In most option for chip designers because of the availability of cores from earlier cases, though, the threads share CPU pipeline resources, and hardware processor generations, which, when shrunk down to present-day process uses various means to manage the allocation of resources to these threads. technologies, are small enough for aggregation onto a single die. The industry uses several terms to describe the variants of multi- threading implemented in hardware; we show some in Figure A. In a single-issue, single-thread machine (included here for reference), hardware does not control thread scheduling on the pipeline. Single- (a) thread machines support a single context in hardware; therefore a thread switch by the operating system incurs the overhead of saving and retriev- ing thread states from memory. (b) Switch overhead Coarse-grained multithreading, or switch-on-event multithreading, is when a thread has full use of the CPU resource until a long-latency event such as miss to DRAM occurs, in which case the CPU switches to anoth- (c) er thread. Typically, this implementation has a context switch cost asso- ciated with it, and thread switches occur when event latency exceeds a specified threshold. Fine grained multithreading is sometimes called interleaved multi- (d) threading; in it, thread selection typically happens at a cycle boundary. The selection policy can be simply to allocate resources to a thread on a round-robin basis with threads becoming unavailable for longer periods of time on events like memory misses. The preceding types time slice the processor resources, so these imple- (e) Thread 0 mentations are also called time-sliced or vertical multithreaded. An Thread 1 approach that schedules instructions from different threads on different Unused cycle functional units at the same time is called simultaneous multithreading (SMT) or, alternately, horizontal multithreading. SMT typically works on Figure A. Variants of multithreading implementation hardware: superscalar processors, which have hardware for scheduling across mul- single-issue, single-thread pipeline (a); single-issue, 2 threads, tiple pipes. coarse-grain threading (b); single-issue, 2 threads, fine-grain Another type of implementation is chip multiprocessing, which simply threading (c); dual-issue, 2 threads, simultaneous threading (d); calls for instantiating single-threaded processor cores on a die, perhaps and single-issue, 2 threads, chip multiprocessing (e).

the cache sizes are a good trade-off between line and data is then returned to the L1 cache. miss rates, area, and the ability of other threads The directory thus maintains a sharers list at in the group to hide latency. L1-line granularity. A subsequent store from Niagara uses a simple cache coherence pro- a different or same L1 cache will look up the tocol. The L1 caches are write through, with directory and queue up invalidates to the L1 allocate on load and no-allocate on stores. L1 caches that have the line. Stores do not update lines are either in valid or invalid states. The the local caches until they have updated the L2 cache maintains a directory that shadows L2 cache. During this time, the store can pass the L1 tags. The L2 cache also interleaves data data to the same thread but not to other across banks at a 64-byte granularity. A load threads; therefore, a store attains global visi- that missed in an L1 cache (load miss) is deliv- bility in the L2 cache. The crossbar establish- ered to the source bank of the L2 cache along es memory order between transactions from with its replacement way from the L1 cache. the same and different L2 banks, and guaran- There, the load miss address is entered in the tees delivery of transactions to L1 caches in the corresponding L1 tag location of the directo- same order. The L2 cache follows a copy-back ry, the L2 cache is accessed to get the missing policy, writing back dirty evicts and dropping

28 IEEE MICRO

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. clean evicts. Direct memory access from I/O tecture Based on Single-Chip Multiprocess- devices are ordered through the L2 cache. Four ing,” Proc. 27th Ann. Int’l Symp. Computer channels of DDR2 DRAM provide in excess Architecture (ISCA 00), IEEE CS Press, 2000, of 20 Gbytes/s of memory bandwidth. pp. 282-293. 6. S. Kapil, H. McGhan, and J. Lawrendra, “A iagara systems are presently undergoing test- Chip Multithreaded Processor for Network- Ning and bring up. We have run existing Facing Workloads,” IEEE Micro, vol. 24, no. multithreaded application software written for 2, Mar.-Apr. 2004, pp. 20-30. Solaris without modification on laboratory sys- 7. J. Hart et al., “Implementation of a 4th-Gen- tems. The simple pipeline requires no special com- eration 1.8 GHz Dual Core Sparc V9 Micro- piler optimizations for instruction scheduling. On processor,” Proc. Int’l Solid-State Circuits real commercial applications, we have observed Conf. (ISSCC 05), IEEE Press, 2005, the performance in lab systems to scale almost lin- http://www.isscc.org/isscc/2005/ap/ early with the number of threads, demonstrating ISSCC2005AdvanceProgram.pdf. that there are no bandwidth bottlenecks in the design. The Niagara processor represents the first Poonacha Kongetira is a director of engi- in a line of microprocessors that Sun designed neering at Sun Microsystems and was part of with many hardware threads to provide high the development team for the Niagara proces- throughput and high performance per watt on sor. His research interests include computer commercial server applications. The availability architecture and methodologies for SoC of a thread-rich architecture opens up new design. Kongetira has a MS from Purdue Uni- avenues for developers to efficiently increase appli- versity and BS from Birla Institute of Tech- cation performance. This architecture is a para- nology and Science, both in electrical digm shift in the way that microprocessors have engineering. He is a member of the IEEE. been designed thus far. MICRO Kathirgamar Aingaran is a senior staff engineer Acknowledgments at Sun Microsystems and was part of the devel- This article represents the work of the very opment team for Niagara. His research inter- talented and dedicated Niagara development ests include architectures for low power and low team, and we are privileged to present their design complexity. Aingaran has an MS from work. and a BS from the Univer- sity of Southern California, both in electrical References engineering. He is a member of the IEEE. 1. S.R. Kunkel et al., “A Performance Method- ology for Commercial Servers,” IBM J. Kunle Olukotun is an associate professor of Research and Development, vol. 44, no. 6, and 2000, pp. 851-872. at Stanford University. His research interests 2. L. Barroso, J. Dean, and U. Hoezle, “Web include , parallel pro- Search for a Planet: The Architecture of the gramming environments, and formal hard- Google Cluster,” IEEE Micro, vol 23, no. 2, ware verification. Olukotun has a PhD in Mar.-Apr. 2003, pp. 22-28. computer science and engineering from the 3. K. Olukotun et al., “The Case for a Single . He is a member of Chip Multiprocessor,” Proc. 7thConf. Archi- the IEEE and the ACM. tectural Support for Programming Lan- guages and Operating Systems (ASPLOS Direct questions and comments about this VII), 1996, pp. 2-11. article to Poonacha Kongetira,MS: USUN05- 4. J. Laudon, A. Gupta, and M. Horowitz, “Inter- 215, 910 Hermosa Court, Sunnyvale, CA leaving: A Multithreading Technique Target- 94086; [email protected]. ing Multiprocessors and Workstations,” Proc. 6th Int’l Conf. Architectural Support for Pro- gramming Languages and Operating Systems For further information on this or any other (ASPLOS VI), ACM Press, 1994, pp. 308-316. computing topic, visit our Digital Library at 5. L. Barroso et al., “Piranha: A Scalable Archi- http://www.computer.org/publications/dlib.

MARCH–APRIL 2005 29

Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply.