Niagara: a 32-Way Multithreaded Sparc Processor

NIAGARA: A 32-WAY MULTITHREADED SPARC PROCESSOR THE NIAGARA PROCESSOR IMPLEMENTS A THREAD-RICH ARCHITECTURE DESIGNED TO PROVIDE A HIGH-PERFORMANCE SOLUTION FOR COMMERCIAL SERVER APPLICATIONS. THE HARDWARE SUPPORTS 32 THREADS WITH A MEMORY SUBSYSTEM CONSISTING OF AN ON-BOARD CROSSBAR, LEVEL-2 CACHE, AND MEMORY CONTROLLERS FOR A HIGHLY INTEGRATED DESIGN THAT EXPLOITS THE THREAD-LEVEL PARALLELISM INHERENT TO SERVER APPLICATIONS, WHILE TARGETING LOW LEVELS OF POWER CONSUMPTION. Over the past two decades, micro- application performance by improving processor designers have focused on improv- throughput, the total amount of work done ing the performance of a single thread in a across multiple threads of execution. This is desktop processing environment by increas- especially effective in commercial server appli- ing frequencies and exploiting instruction cations such as databases1 and Web services,2 level parallelism (ILP) using techniques such which tend to have workloads with large as multiple instruction issue, out-of-order amounts of thread level parallelism (TLP). issue, and aggressive branch prediction. The In this article, we present the Niagara Poonacha Kongetira emphasis on single-thread performance has processor’s architecture. This is an entirely shown diminishing returns because of the lim- new implementation of the Sparc V9 archi- Kathirgamar Aingaran itations in terms of latency to main memory tectural specification, which exploits large and the inherently low ILP of applications. amounts of on-chip parallelism to provide Kunle Olukotun This has led to an explosion in microproces- high throughput. Niagara supports 32 hard- sor design complexity and made power dissi- ware threads by combining ideas from chip Sun Microsystems pation a major concern. multiprocessors3 and fine-grained multi- For these reasons, Sun Microsystems’ Nia- threading.4 Other studies5 have also indicated gara processor takes a radically different the significant performance gains possible approach to microprocessor design. Instead using this approach on multithreaded work- of focusing on the performance of single or loads. The parallel execution of many threads dual threads, Sun optimized Niagara for mul- effectively hides memory latency. However, tithreaded performance in a commercial serv- having 32 threads places a heavy demand on er environment. This approach increases the memory system to support high band- 0272-1732/05/$20.00 ” 2005 IEEE Published by the IEEE computer Society 21 Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. HOT CHIPS 16 Table 1. Commercial server applications. Instruction-level Thread-level Working Data Benchmark Application category parallelism parallelism set sharing Web99 Web server Low High Large Low JBB Java application server Low High Large Medium TPC-C Transaction processing Low High Large High SAP-2T Enterprise resource planning Medium High Medium Medium SAP-3T Enterprise resource planning Low High Large High TPC-H Decision support system High High Large Medium width. To provide this bandwidth, a crossbar a server running these applications is the sus- interconnects scheme routes memory refer- tained throughput of client requests. ences to a banked on-chip level-2 cache that Furthermore, the deployment of servers all threads share. Four independent on-chip commonly takes place in high compute den- memory controllers provide in excess of 20 sity installations such as data centers, where Gbytes/s of bandwidth to memory. supplying power and dissipating server- Exploiting TLP also lets us improve per- generated heat are very significant factors in formance significantly without pushing the the center’s cost of operating. Experience at envelope on CPU clock frequency. This and Google shows a representative power density the sharing of CPU pipelines among multi- requirement of 400 to 700 W/sq. foot for ple threads enable an area- and power- racked server clusters.2 This far exceeds the typ- efficient design. Designers expect Niagara to ical power densities of 70 to 150 W/foot2 sup- dissipate about 60 W of power, making it ported by commercial data centers. It is very attractive for high compute density envi- possible to reduce power consumption by sim- ronments. In data centers, for example, power ply running the ILP processors in server clus- supply and air conditioning costs have ters at lower clock frequencies, but the become very significant. Data center racks proportional loss in performance makes this often cannot hold a complete complement of less desirable. This situation motivates the servers because this would exceed the rack’s requirement for commercial servers to improve power supply envelope. performance per watt. These requirements We designed Niagara to run the Solaris have not been efficiently met using machines operating system, and existing Solaris applica- optimized for single-thread performance. tions will run on Niagara systems without Commercial server applications tend to have modification. To application software, a Nia- low ILP because they have large working sets gara processor will appear as 32 discrete proces- and poor locality of reference on memory sors with the OS layer abstracting away the access; both contribute to high cache-miss hardware sharing. Many multithreaded appli- rates. In addition, data-dependent branches cations currently running on symmetric mul- are difficult to predict, so the processor must tiprocessor (SMP) systems should realize discard work done on the wrong path. Load- performance improvements. This is consistent load dependencies are also present, and are not with observations from previous multithread- detectable in hardware at issue time, resulting ed-processor development at Sun6,7 and from in discarded work. The combination of low Niagara chips and systems, which are under- available ILP and high cache-miss rates caus- going bring-up testing in the laboratory. es memory access time to limit performance. Recently, the movement of many retail and Therefore, the performance advantage of using business processes to the Web has triggered a complex ILP processor over a single-issue the increasing use of commercial server appli- processor is not significant, while the ILP cations (Table 1). These server applications processor incurs the costs of high power and exhibit large degrees of client request-level par- complexity, as Figure 1 shows. allelism, which servers using multiple threads However, server applications tend to have can exploit. The key performance metric for large amounts of TLP. Therefore, shared- 22 IEEE MICRO Authorized licensed use limited to: Carnegie Mellon Libraries. Downloaded on January 28, 2010 at 22:25 from IEEE Xplore. Restrictions apply. memory machines with discrete single-thread- Single ed processors and coherent interconnect have issue CMC M C M tended to perform well because they exploit ILP CMC M C M TLP. However, the use of an SMP composed of multiple processors designed to exploit ILP TLP CM is neither power efficient nor cost-efficient. A (on shared Time saved CM more efficient approach is to build a machine single issue pipeline) using simple cores aggregated on a single die, CM with a shared on-chip cache and high bandwidth to large off-chip memory, thereby Memory latency Compute latency aggregating an SMP server on a chip. This has the added benefit of low-latency communi- Figure 1. Behavior of processors optimized for TLP and ILP on cation between the cores for efficient data commercial server workloads. In comparison to the single- sharing in commercial server applications. issue machine, the ILP processor mainly reduces compute time, so memory access time dominates application perfor- Niagara overview mance. In the TLP case, multiple threads share a single-issue The Niagara approach to increasing pipeline, and overlapped execution of these threads results in throughput on commercial server applications higher performance for a multithreaded application. involves a dramatic increase in the number of threads supported on the processor and a memory subsystem scaled for higher band- of bandwidth. A two-entry queue is available widths. Niagara supports 32 threads of exe- for each source-destination pair, and it can cution in hardware. The architecture queue up to 96 transactions each way in the organizes four threads into a thread group; the crossbar. The crossbar also provides a port for group shares a processing pipeline, referred to communication with the I/O subsystem. as the Sparc pipe. Niagara uses eight such Arbitration for destination ports uses a sim- thread groups, resulting in 32 threads on the ple age-based priority scheme that ensures fair CPU. Each SPARC pipe contains level-1 scheduling across all requestors. The crossbar caches for instructions and data. The hard- is also the point of memory ordering for the ware hides memory and pipeline stalls on a machine. given thread by scheduling the other threads The memory interface is four channels of in the group onto the SPARC pipe with a zero dual-data rate 2 (DDR2) DRAM, supporting cycle switch penalty. Figure 1 schematically a maximum bandwidth in excess of 20 shows how reusing the shared processing Gbytes/s, and a capacity of up to 128 Gbytes. pipeline results in higher throughput. Figure 2 shows a block diagram of the Nia- The 32 threads share a 3-Mbyte level-2 gara processor. cache. This cache is 4-way banked and pipelined for bandwidth; it is 12-way set- Sparc pipeline associative to minimize conflict misses from Here we describe the Sparc pipe implemen- the many threads. Commercial server code tation, which supports four threads. Each has data sharing, which can lead to high thread has a unique set of registers and instruc- coherence miss rates. In conventional SMP tion and store buffers. The thread group shares systems using discrete processors with coher- the L1 caches, translation look-aside buffers ent system interconnects, coherence misses go (TLBs), execution units, and most pipeline out over low-frequency off-chip buses or links, registers. We implemented a single-issue and can have high latencies.

Niagara: a 32-Way Multithreaded Sparc Processor

Spreading Excellence Report

A Compiler-Compiler for DSL Embedding

(URMD) Grad Cohort Workshop

Entrepreneurship Opportunities & Skills

Implementing and Evaluating Nested Parallel Transactions in Software Transactional Memory

Ecpe Connections

Kunle Olukotun Cadence Design Systems Professor and Professor of Electrical Engineering

Energy-Efficient Abundant-Data Computing: the N3XT 1,000X

News from SCS Networks

Multicore Cpus: Processor Proliferation - IEEE Spectrum 2/15/11 1:51 PM

University of Copenhagen

10003.Demeyerromain.2604.Pdf (0.1