Niagara: a 32-Way Multithreaded SPARC

P. Kongetira, K. Aingaran, K. Olukotun

Presented by Bogdan Romanescu

Goal

• Commercial server applications:
  – High thread-level parallelism (TLP)
    • Large numbers of parallel client requests
  – Low instruction-level parallelism (ILP)
    • High cache miss rates
    • Many unpredictable branches
    • Frequent load-load dependencies
• Power, cooling, and space are major concerns for data centers

Sun’s Solution

• UltraSPARC T1 processor
• “the highest-throughput and most eco-responsible processor ever created”

• Multicore
• Fine-grain multithreading within core
• Simple pipelines
• Small L1 cache
• Shared L2
• Metric: Performance/Watt

Architecture

SPARC pipe
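
As a rough illustration of fine-grain multithreading in a single-issue pipe, the sketch below picks one ready thread per cycle, prefers the least recently issued one, and skips any thread that is waiting on a long-latency event. The `Thread` class, the `stall_until` field, and the 5-cycle miss in the example are hypothetical and chosen only for illustration; this is not Sun's actual select logic.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    stall_until: int = 0   # first cycle at which the thread is ready again
    last_issue: int = -1   # cycle of the most recent issue (used for LRU)

def select_thread(threads, cycle):
    """Return the least recently issued ready thread, or None if all are stalled."""
    ready = [t for t in threads if t.stall_until <= cycle]
    if not ready:
        return None
    return min(ready, key=lambda t: (t.last_issue, t.tid))

def run(threads, cycles, misses):
    """Simulate `cycles` cycles; `misses` maps (tid, cycle) to a stall length."""
    trace = []
    for cycle in range(cycles):
        t = select_thread(threads, cycle)
        if t is None:
            trace.append(None)          # bubble: no thread is ready
            continue
        t.last_issue = cycle
        stall = misses.get((t.tid, cycle), 0)
        if stall:
            t.stall_until = cycle + 1 + stall
        trace.append(t.tid)
    return trace

threads = [Thread(tid) for tid in range(4)]   # 4 threads per core
trace = run(threads, 8, {(1, 1): 5})          # thread 1 misses at cycle 1
print(trace)  # [0, 1, 2, 3, 0, 2, 3, 1]
```

While thread 1 waits, the other three threads keep the pipe busy, which is the point of trading single-thread speed for throughput.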

• UltraSPARC II style
• Single issue, 6 stages: F, S, D, E, M, W (fetch, thread select, decode, execute, memory, write back)
• Shared units:
  – L1 $
  – TLB
  – Execution (X) units
  – Pipe registers
• Hazards:
  – Data
  – Structural

Integer Register file
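
The register windows on this slide can be pictured with a small model in which a caller's out registers and a callee's in registers share the same physical slots. This is a minimal sketch of the generic SPARC window overlap, not of Niagara's register file cell; the function names are hypothetical, and the direction in which `save` moves the window pointer is an arbitrary modelling choice.

```python
NWINDOWS = 8          # windows per thread, as on the slide
REGS_PER_WINDOW = 16  # 8 ins + 8 locals are unique to a window; outs are shared

def physical_index(cwp, reg):
    """Map visible register r8..r31 in window `cwp` to a slot in the windowed file.

    r8-r15 are outs, r16-r23 are locals, r24-r31 are ins.
    Globals r0-r7 are not windowed and are left out of this model.
    """
    if not 8 <= reg <= 31:
        raise ValueError("globals r0-r7 live outside the windowed file")
    if reg >= 24:                                   # ins
        return cwp * REGS_PER_WINDOW + (reg - 24)
    if reg >= 16:                                   # locals
        return cwp * REGS_PER_WINDOW + 8 + (reg - 16)
    # outs alias the ins of the window that save() would switch to
    callee = (cwp - 1) % NWINDOWS
    return callee * REGS_PER_WINDOW + (reg - 8)

def save(cwp):
    """Enter a new window (procedure call)."""
    return (cwp - 1) % NWINDOWS

def restore(cwp):
    """Return to the previous window (procedure return)."""
    return (cwp + 1) % NWINDOWS

caller = 3
callee = save(caller)
# The caller's first out register (r8) is the callee's first in register (r24).
print(physical_index(caller, 8) == physical_index(callee, 24))  # True
```

Because arguments are passed through the overlapping slots, a call only changes the window pointer and does not copy registers.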

• One register file / thread
• SPARC window: in, out, local registers
• Highly integrated cell structure to support 4 threads:
  – 8 windows of 32 locations / thread
  – 3 read ports + 2 write ports
  – Read/write: single cycle latency
• 1 Active Window Cell (copy of the architectural set window)

Thread scheduling

• Thread selection based on:
  – Previous long latency instruction in pipe
  – Instruction type
  – LRU status
• Select & Fetch coupled

Memory
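
As a quick aid for reading the cache figures on this slide, the number of sets follows from capacity / (ways × line size). The sketch below applies that formula; the line sizes (32 B, 16 B, 64 B) are assumptions for illustration, since the slide does not state them, and `num_sets` is a hypothetical helper.

```python
KB = 1024
MB = 1024 * KB

def num_sets(size_bytes, ways, line_bytes):
    """Sets = capacity / (associativity * line size)."""
    sets, rem = divmod(size_bytes, ways * line_bytes)
    if rem:
        raise ValueError("capacity is not a multiple of ways * line size")
    return sets

# (capacity, ways, assumed line size in bytes)
caches = {
    "L1 I$": (16 * KB, 4, 32),
    "L1 D$": (8 * KB, 4, 16),
    "L2 $":  (3 * MB, 12, 64),
}
for name, (size, ways, line) in caches.items():
    print(name, num_sets(size, ways, line), "sets")

L2_BANKS = 4
bank_bytes = 3 * MB // L2_BANKS                        # capacity of one L2 bank
l2_sets_per_bank = num_sets(3 * MB, 12, 64) // L2_BANKS
print(bank_bytes // KB, "KB per bank,", l2_sets_per_bank, "sets per bank")
```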

• 16 KB 4-way set assoc. I$ / core
• 8 KB 4-way set assoc. D$ / core
  – Write-through
  – Allocate on LD
  – No-allocate on ST
• 3 MB 12-way set assoc. L2 $, shared
  – 4 × 768 KB independent banks
  – 2 cycle throughput, 8 cycle latency
  – Direct link to DRAM & JBus
  – Manages cache coherence for the 8 cores
  – CAM-based directory

Performance

Test \ Architecture                                          | T2000  | IBM p5-550 with 2 dual-core Power5 chips | Dell PowerEdge
SPECjbb2005 (Java server software), business operations/sec  | 63,378 | 61,789                                   | 24,208 (SC1425 with dual single-core Xeon)
SPECweb2005 (Web server performance)                         | 14,001 | 7,881                                    | 4,850 (2850 with two dual-core Xeon processors)
NotesBench (Lotus Notes performance)                         | 16,061 | 14,740                                   | -

“Home run”?
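
To put the benchmark scores from the Performance slide in proportion, the sketch below divides the T2000 result by each competitor's result. The scores are copied from that slide; the `speedups` helper and the dictionary layout are hypothetical.

```python
# Scores as listed on the Performance slide (higher is better).
scores = {
    "SPECjbb2005": {"T2000": 63378, "IBM p5-550": 61789, "Dell PowerEdge": 24208},
    "SPECweb2005": {"T2000": 14001, "IBM p5-550": 7881, "Dell PowerEdge": 4850},
    "NotesBench":  {"T2000": 16061, "IBM p5-550": 14740},
}

def speedups(results, baseline="T2000"):
    """Return the baseline score divided by each competitor's score."""
    base = results[baseline]
    return {name: base / s for name, s in results.items() if name != baseline}

for test, results in scores.items():
    for rival, ratio in speedups(results).items():
        print(f"{test}: T2000 / {rival} = {ratio:.2f}x")
```

On these numbers the T2000 is roughly level with the IBM system on SPECjbb2005 and NotesBench (about 1.03x and 1.09x), ahead on SPECweb2005 (about 1.78x), and about 2.6x to 2.9x ahead of the Dell systems.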

• Relatively slow single-thread performance
• Poor floating-point performance
• Lack of software support (Sun Fire T2000 does not support Linux or Windows)
• Price
• Concurrency counterattack
  – no place as a general-purpose computer running databases
  – small low-end market segment?
• Niagara II & The “”
  – multiprocessor & enhanced single thread support

References

• [1] P. Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE Micro, vol. 25, pp. 21-29, Mar. 2005.
• [2] A. S. Leon et al., “A Power-Efficient High-Throughput 32-Thread SPARC Processor,” ISSCC 2006, Session 5, Processors.
• [3] S. Chaudhry, S. Yip, P. Caprioli, and M. Tremblay, “High Performance Throughput Computing,” IEEE Micro, vol. 25, issue 3, 2005.
• [4] http://opensparc.sunsource.net/nonav/opensparct1.html
• [5] http://www.sun.com/processors/UltraSPARC-T1/features.xml
• [6] http://www.sun.com/servers/coolthreads/t1000/benchmarks.jsp
• [7] http://news.com.com/Sun+begins+Sparc+phase+of+server+overhaul/2163-1010_3-5983365.html
• [8] http://h71028.www7.hp.com/ERC/cache/280124-0-0-0-121.html