HPC on the UltraSPARC CMT Processors
Deepak Jeevan Kumar, Verdi March, Henry Kasim, Simon See
{deepak.jeevankumar, verdi.march, henry.kasim, simon.see}@sun.com
Version 2.7b – 16th September 2008

1. Will the UltraSPARC Tx be Suitable for HPC Workloads?

This question is a Holy Grail currently being pursued not only by SPARC fans but also by hard-core x64 fans. Why? The T1, T2 and T2 Plus are enigmatic processors: they offer a very large number of hardware threads (32 or 64 threads) and the highest core count on a single die (8 cores). At the same time, they have many seemingly “anti-HPC” characteristics: they come with just 1 or 2 physical CPUs per server, run at a low “non-HPC” clock speed (1.2–1.4 GHz), have room for just one FPU per core, and offer relatively weak SIMD support compared to x64, which supports the SSE instruction set family.

To verify whether these characteristics hinder the performance of actual floating-point-intensive applications in practice, we summarize and analyze the public SPECfp_rate2006 results of four real dual-chip systems:
● A system with two Sun UltraSPARC T2 Plus CPU chips (eight cores per chip)
● Three dual-chip (aka dual-socket) systems representing the alternative state-of-the-art CPU architectures, namely quad-core AMD Opteron, dual-core IBM Power 6, and quad-core Intel Xeon. The representative for each CPU architecture is the dual-chip system of that CPU that attains the highest SPECfp_rate2006. See the References and Acknowledgments section for more information.

Our two major findings are:
1. The T2 Plus system achieves the highest SPECfp_rate2006 result among dual-chip systems. The SPECfp_rate2006 metric represents the overall performance of a system based on 17 benchmark applications. See the References and Acknowledgments section for more information.
2. Of the 17 benchmarks comprising the SPECfp_rate2006 metric, the T2 Plus system achieves top-two performance on 13 applications (76%). By comparison, the Opteron, Power 6, and Xeon systems achieve top-two performance on only 5 (29%), 9 (53%), and 7 (41%) applications, respectively.

This demonstrates that CMT and high memory bandwidth address the needs of existing HPC applications, and it therefore justifies Sun's commitment to balanced system design rather than to raw floating-point capability alone.

2. Processors and Servers in the UltraSPARC Tx Family

We are currently in the third generation of the UltraSPARC Tx family of processors, also known as CMT (Chip Multithreading) processors or CoolThreads processors (http://www.sun.com/coolthreads). Table 1 shows the Sun servers based on this family.

Table 1

| Processor | Server | Max. GHz | # sockets / server | Max. cores / server | Rack Units |
|---|---|---|---|---|---|
| UltraSPARC T1 | T1000 | 1.0 | 1 | 8 | 1 RU |
| UltraSPARC T1 | Sun Fire T2000 | 1.4 | 1 | 8 | 2 RU |
| UltraSPARC T1 | T6300 | 1.4 | 1 | 8 | Blade |
| UltraSPARC T2 | Sun SPARC Enterprise T5120 | 1.4 | 1 | 8 | 1 RU |
| UltraSPARC T2 | Sun SPARC Enterprise T5220 | 1.4 | 1 | 8 | 2 RU |
| UltraSPARC T2 | Sun Blade T6320 | 1.4 | 1 | 8 | Blade |
| UltraSPARC T2 Plus | Sun SPARC Enterprise T5140 | 1.2 | 2 | 16 | 1 RU |
| UltraSPARC T2 Plus | Sun SPARC Enterprise T5240 | 1.4 | 2 | 16 | 2 RU |

Note: The number of sockets refers to the number of chips.

3. Theoretical Peak Floating-Point Performance of Dual-Socket (aka Dual-Chip) Servers

Table 2 summarizes the peak floating-point performance (Rpeak) of the dual-socket (aka dual-chip) servers available today with the fastest CPUs from different vendors. As mentioned in Section 1, each of these systems is chosen because it attains the highest SPECfp_rate2006 for its CPU architecture.

Table 2

| Server | Processor | GHz | # cores / server | Rpeak (GFLOPS) |
|---|---|---|---|---|
| Sun SPARC Enterprise T5240 | 2 x UltraSPARC T2 Plus | 1.4 | 16 | 22.40 |
| 3DBOXX 8400 Special Edition | 2 x Xeon X5482 | 3.2 | 8 | 96.00 |
| IBM BladeCenter LS22 | 2 x Opteron 2356 | 2.3 | 8 | 73.60 |
| IBM System p 570 | 2 x Power 6 | 4.7 | 4 | 75.2 |
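The text does not say how the Rpeak values were derived. A common convention is Rpeak = cores × clock (GHz) × floating-point operations per cycle per core. The following minimal sketch assumes that convention and back-computes the implied operations per cycle from the Table 2 figures as given; the function name and the formula are assumptions for illustration, not something stated in this document.

```python
# Rows copied from Table 2: (server, GHz, cores per server, Rpeak in GFLOPS).
TABLE_2 = [
    ("Sun SPARC Enterprise T5240", 1.4, 16, 22.40),
    ("3DBOXX 8400 Special Edition", 3.2, 8, 96.00),
    ("IBM BladeCenter LS22", 2.3, 8, 73.60),
    ("IBM System p 570", 4.7, 4, 75.2),
]


def implied_flops_per_cycle(ghz, cores, rpeak_gflops):
    """Back out flops per cycle per core.

    Assumption (not stated in the text): Rpeak = cores * GHz * flops_per_cycle.
    """
    return rpeak_gflops / (cores * ghz)


implied = {
    server: round(implied_flops_per_cycle(ghz, cores, rpeak), 2)
    for server, ghz, cores, rpeak in TABLE_2
}

for server, value in implied.items():
    print(f"{server}: {value} flops/cycle/core")
```

Under this assumption the T2 Plus row implies one operation per cycle per core, consistent with the single FPU per core mentioned in Section 1, while the other rows imply roughly four. The Xeon row gives a non-integer value, which suggests its Rpeak was computed with a different clock or convention; the source does not say.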

Does this low Rpeak of 22.40 GFLOPS mean much in real world applications? To answer this question, we first analyze the SPECfp_rate2006 results.

4. SPECfp_rate2006 on the UltraSPARC T2 Plus

4.1 Analysis of the Peak Results – UltraSPARC T2 Plus Beats Power 6

As a first step, we took a quick look at SPECfp_rate2006 and were pleasantly surprised. Table 3 below summarizes the peak SPECfp_rate2006 results.

Table 3

| Server | Processor | GHz | # cores / server | SPECfp_rate2006 |
|---|---|---|---|---|
| Sun SPARC Enterprise T5240 | 2 x UltraSPARC T2 Plus | 1.4 | 16 | 119.53 |
| 3DBOXX WORKSTATION 8400 Special Edition | 2 x Xeon X5482 | 3.2 | 8 | 88.7 |
| IBM BladeCenter LS22 | 2 x Opteron 2356 | 2.3 | 8 | 94.7 |
| IBM System p 570 | 2 x Power 6 | 4.7 | 4 | 116 |

The peak score of 119 is the highest SPECfp_rate2006 score among all dual-chip servers. Now let us analyze Table 4, which divides the SPECfp_rate2006 number by the Rpeak. Since SPECfp_rate2006 is not measured in GFLOPS, a ratio by itself does not carry any information; it is meaningful only in comparison with the ratio of another system.

Table 4

| Server | Processor | GHz | # cores / server | SPECfp_rate2006 / Rpeak |
|---|---|---|---|---|
| Sun SPARC Enterprise T5240 | 2 x UltraSPARC T2 Plus | 1.4 | 16 | 5.34 |
| 3DBOXX WORKSTATION 8400 Special Edition | 2 x Xeon X5482 | 3.2 | 8 | 0.92 |
| IBM BladeCenter LS22 | 2 x Opteron 2356 | 2.3 | 8 | 1.29 |
| IBM System p 570 | 2 x Power 6 | 4.7 | 4 | 1.54 |

The UltraSPARC T2 Plus chip evidently does an excellent job of squeezing the maximum performance out of its FPUs. The 128 threads and high per-core memory bandwidth result in a significantly higher SPECfp_rate2006 per unit of Rpeak: 3.5–5.8 times that of the other three systems. In absolute SPECfp_rate2006 it even beats the IBM p570 (2 x Power 6), which has about 3.4 times its peak performance (Rpeak).
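The arithmetic behind Table 4 and the 3.5–5.8 range can be reproduced with a few lines of Python. This is only an illustrative sketch: the values are copied from Tables 2 and 3, and the variable names are ours.

```python
# (SPECfp_rate2006 from Table 3, Rpeak in GFLOPS from Table 2) per system.
SYSTEMS = {
    "2 x UltraSPARC T2 Plus": (119.53, 22.40),
    "2 x Xeon X5482": (88.7, 96.00),
    "2 x Opteron 2356": (94.7, 73.60),
    "2 x Power 6": (116.0, 75.2),
}

# Table 4: SPECfp_rate2006 divided by Rpeak.
ratios = {name: score / rpeak for name, (score, rpeak) in SYSTEMS.items()}

# How many times larger the T2 Plus ratio is than each other system's ratio.
t2_plus = ratios["2 x UltraSPARC T2 Plus"]
relative = {
    name: t2_plus / ratio
    for name, ratio in ratios.items()
    if name != "2 x UltraSPARC T2 Plus"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
for name, factor in relative.items():
    print(f"T2 Plus ratio is {factor:.1f}x that of {name}")
```

The smallest factor is against the Power 6 system and the largest against the Xeon system, which gives the two ends of the quoted range.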

4.2 Analysis of Individual SPECfp_rate2006 Test Results

To find out whether the UltraSPARC T2 Plus chips are suitable for particular kinds of floating-point workloads, we analyzed the SPECfp_rate2006 results of the 4 different dual-chip servers. These results comprise the results of 17 applications and their geometric mean (Table 5).

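Table 5

| Benchmark | 2 x UltraSPARC T2 Plus | 2 x Xeon X5482 | 2 x Opteron 2356 | 2 x Power 6 | Highest-to-Lowest |
|---|---|---|---|---|---|
| Abbreviation | S | X | O | P | |
| SPECfp_rate2006 | 119 | 88.7 | 94.7 | 116 | S-P-O-X |
| 410.bwaves | 151 | 44.3 | 96.7 | 188 | P-S-O-X |
| 416.gamess | 106 | 191 | 123 | 92.8 | X-O-S-P |
| 433.milc | 146 | 30.9 | 73.1 | 86.1 | S-P-O-X |
| 434.zeusmp | 107 | 84.5 | 95.8 | 135 | P-S-O-X |
| 435.gromacs | 104 | 162 | 108 | 78.6 | X-O-S-P |
| 436.cactusADM | 105 | 112 | 100 | 121 | P-X-S-O |
| 437.leslie3d | 102 | 38.0 | 62.8 | 113 | P-S-O-X |
| 444.namd | 117 | 136 | 96.4 | 99.5 | X-S-P-O |
| 447.dealII | 193 | 151 | 166 | 159 | S-O-P-X |
| 450.soplex | 117 | 48.4 | 61.8 | 119 | P-S-O-X |
| 453.povray | 200 | 254 | 137 | 99.9 | X-S-O-P |
| 454.calculix | 77.7 | 179 | 115 | 112 | X-O-P-S |
| 459.GemsFDTD | 87.8 | 35.9 | 59.5 | 88.8 | P-S-O-X |
| 465.tonto | 127 | 139 | 117 | 112 | X-S-O-P |
| 470.lbm | 85.3 | 52.1 | 59.7 | 189 | P-S-O-X |
| 481.wrf | 129 | 69.7 | 106 | 82.3 | S-O-P-X |
| 482.sphinx3 | 148 | 105 | 102 | 174 | P-S-X-O |

LEGEND: In each row, the first letter of the Highest-to-Lowest column is the system with the highest result and the last letter is the system with the lowest result.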

From the above table, we observe the following:
● UltraSPARC T2 Plus scores the highest SPECfp_rate2006 result, beating the Power 6 system by 2.6%.
● UltraSPARC T2 Plus achieves good performance on the majority of the 17 applications, as detailed below:
  ○ UltraSPARC T2 Plus is the best performer in 3 of the 17 tests (milc, dealII and wrf).
  ○ UltraSPARC T2 Plus is 1st or 2nd in 13 of the 17 tests (76%).
  ○ UltraSPARC T2 Plus is 3rd or 4th in 4 of the 17 tests (gamess, gromacs, cactusADM and calculix).
  ○ UltraSPARC T2 Plus scores the lowest rate in only 1 of the 17 tests (calculix).
● UltraSPARC T2 Plus compares favorably with the other three systems although it has a lower theoretical peak performance (Rpeak):
  ○ UltraSPARC T2 Plus beats both x64 systems (Opteron and Xeon) in 10 of the 17 tests (bwaves, milc, zeusmp, leslie3d, dealII, soplex, GemsFDTD, lbm, wrf, sphinx3).
    ■ UltraSPARC T2 Plus beats the Xeon system in 10 of the 17 tests.
    ■ UltraSPARC T2 Plus beats the Opteron system in 14 of the 17 tests.
  ○ UltraSPARC T2 Plus beats the Power 6 system in 8 of the 17 tests (gamess, milc, gromacs, namd, dealII, povray, tonto, wrf).
● Xeon performs the best in 6 of the 17 tests, especially in molecular dynamics applications.
● The AMD Opteron never emerges as the best performer.
● Power 6 is the worst performer in 4 tests (gamess, gromacs, povray, tonto).
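The pairwise and best/worst counts in the observations above follow directly from the Table 5 numbers. The sketch below recomputes them; the helper names `wins_over` and `count_rank` are ours and purely illustrative.

```python
# SPECfp_rate2006 component results copied from Table 5, in the order (S, X, O, P):
# S = UltraSPARC T2 Plus, X = Xeon X5482, O = Opteron 2356, P = Power 6.
RESULTS = {
    "410.bwaves": (151, 44.3, 96.7, 188),
    "416.gamess": (106, 191, 123, 92.8),
    "433.milc": (146, 30.9, 73.1, 86.1),
    "434.zeusmp": (107, 84.5, 95.8, 135),
    "435.gromacs": (104, 162, 108, 78.6),
    "436.cactusADM": (105, 112, 100, 121),
    "437.leslie3d": (102, 38.0, 62.8, 113),
    "444.namd": (117, 136, 96.4, 99.5),
    "447.dealII": (193, 151, 166, 159),
    "450.soplex": (117, 48.4, 61.8, 119),
    "453.povray": (200, 254, 137, 99.9),
    "454.calculix": (77.7, 179, 115, 112),
    "459.GemsFDTD": (87.8, 35.9, 59.5, 88.8),
    "465.tonto": (127, 139, 117, 112),
    "470.lbm": (85.3, 52.1, 59.7, 189),
    "481.wrf": (129, 69.7, 106, 82.3),
    "482.sphinx3": (148, 105, 102, 174),
}
SYSTEMS = "SXOP"


def wins_over(a, b):
    """Benchmarks on which system `a` scores strictly higher than system `b`."""
    i, j = SYSTEMS.index(a), SYSTEMS.index(b)
    return [name for name, row in RESULTS.items() if row[i] > row[j]]


def count_rank(position):
    """Per-system count of benchmarks at `position` (0 = highest, -1 = lowest)."""
    counts = dict.fromkeys(SYSTEMS, 0)
    for row in RESULTS.values():
        ordered = sorted(SYSTEMS, key=lambda s: row[SYSTEMS.index(s)], reverse=True)
        counts[ordered[position]] += 1
    return counts


beats_both_x64 = sorted(set(wins_over("S", "X")) & set(wins_over("S", "O")))

print("S beats X:", len(wins_over("S", "X")))
print("S beats O:", len(wins_over("S", "O")))
print("S beats P:", len(wins_over("S", "P")))
print("S beats both x64:", len(beats_both_x64))
print("Best performer counts:", count_rank(0))
print("Worst performer counts:", count_rank(-1))
```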

Table 6 shows the applications on which each system achieves a top-two SPECfp_rate2006 result. As can be observed, the T2 Plus system achieves top-two performance on 13 of the 17 applications (76%), whereas the Opteron, Power 6, and Xeon systems achieve top-two performance on only 5 (29%), 9 (53%), and 7 (41%) applications, respectively.

Table 6

| Benchmark | 2 x UltraSPARC T2 Plus | 2 x Xeon X5482 | 2 x Opteron 2356 | 2 x Power 6 | Highest-to-Lowest |
|---|---|---|---|---|---|
| Abbreviation | S | X | O | P | |
| SPECfp_rate2006 | 119* | 88.7 | 94.7 | 116* | S-P-O-X |
| 410.bwaves | 151* | 44.3 | 96.7 | 188* | P-S-O-X |
| 416.gamess | 106 | 191* | 123* | 92.8 | X-O-S-P |
| 433.milc | 146* | 30.9 | 73.1 | 86.1* | S-P-O-X |
| 434.zeusmp | 107* | 84.5 | 95.8 | 135* | P-S-O-X |
| 435.gromacs | 104 | 162* | 108* | 78.6 | X-O-S-P |
| 436.cactusADM | 105 | 112* | 100 | 121* | P-X-S-O |
| 437.leslie3d | 102* | 38.0 | 62.8 | 113* | P-S-O-X |
| 444.namd | 117* | 136* | 96.4 | 99.5 | X-S-P-O |
| 447.dealII | 193* | 151 | 166* | 159 | S-O-P-X |
| 450.soplex | 117* | 48.4 | 61.8 | 119* | P-S-O-X |
| 453.povray | 200* | 254* | 137 | 99.9 | X-S-O-P |
| 454.calculix | 77.7 | 179* | 115* | 112 | X-O-P-S |
| 459.GemsFDTD | 87.8* | 35.9 | 59.5 | 88.8* | P-S-O-X |
| 465.tonto | 127* | 139* | 117 | 112 | X-S-O-P |
| 470.lbm | 85.3* | 52.1 | 59.7 | 189* | P-S-O-X |
| 481.wrf | 129* | 69.7 | 106* | 82.3 | S-O-P-X |
| 482.sphinx3 | 148* | 105 | 102 | 174* | P-S-X-O |

LEGEND: * = Top-Two Performance; unmarked = Bottom-Two Performance
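The top-two counts can also be read straight off the Highest-to-Lowest column. The following sketch tallies the first two letters of each ordering for the 17 component benchmarks (the overall SPECfp_rate2006 row is excluded); the names used are illustrative.

```python
from collections import Counter

# Highest-to-Lowest column for the 17 component benchmarks.
# S = UltraSPARC T2 Plus, X = Xeon X5482, O = Opteron 2356, P = Power 6.
ORDERINGS = {
    "410.bwaves": "P-S-O-X",
    "416.gamess": "X-O-S-P",
    "433.milc": "S-P-O-X",
    "434.zeusmp": "P-S-O-X",
    "435.gromacs": "X-O-S-P",
    "436.cactusADM": "P-X-S-O",
    "437.leslie3d": "P-S-O-X",
    "444.namd": "X-S-P-O",
    "447.dealII": "S-O-P-X",
    "450.soplex": "P-S-O-X",
    "453.povray": "X-S-O-P",
    "454.calculix": "X-O-P-S",
    "459.GemsFDTD": "P-S-O-X",
    "465.tonto": "X-S-O-P",
    "470.lbm": "P-S-O-X",
    "481.wrf": "S-O-P-X",
    "482.sphinx3": "P-S-X-O",
}

# Count how often each system appears in the first two positions.
top_two = Counter()
for order in ORDERINGS.values():
    top_two.update(order.split("-")[:2])

for system in "SXOP":
    share = 100 * top_two[system] / len(ORDERINGS)
    print(f"{system}: {top_two[system]} of {len(ORDERINGS)} ({share:.0f}%)")
```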

Our findings indicate that, in many cases, actual performance is not hindered by the perceived lack of floating-point capability of the UltraSPARC T2 Plus. They further emphasize that the future of HPC lies in balancing raw floating-point performance with massive parallelism and high memory bandwidth.

4.2.1 Tests where UltraSPARC T2 Plus Beats x64 (Xeon and Opteron)

Let us now focus only on the 10 tests where the UltraSPARC T2 Plus server beats both x64 servers (Table 7). From this table, we observe that:
● UltraSPARC T2 Plus does particularly well on the CFD applications (bwaves, leslie3d, lbm), beating the Xeon system by 64–241% and the Opteron system by 43–62%.
● UltraSPARC T2 Plus does well in sparse linear algebra (soplex), a result that is further strengthened by the work done at UCB (see Section 7).

Table 7

| Application Name | Application Category | Description | UltraSPARC T2 Plus vs Xeon | UltraSPARC T2 Plus vs Opteron |
|---|---|---|---|---|
| 410.bwaves | Computational Fluid Dynamics | Navier-Stokes, Bi-CGstab algorithm | +241% | +56% |
| 433.milc | Physics / Quantum Chromodynamics | Lattice Gauge Theory, serial version of the su3imp program | +372% | +100% |
| 434.zeusmp | Physics / Magnetohydrodynamics | ZEUS-MP solves problems in three spatial dimensions with a wide variety of boundary conditions. | +27% | +12% |
| 437.leslie3d | Computational Fluid Dynamics | Strongly-conservative, finite-volume algorithm with the MacCormack Predictor-Corrector time integration scheme | +168% | +62% |
| 447.dealII | PDE using Adaptive FEM | Solves an equation (a Helmholtz-type equation with non-constant coefficients) that is at the heart of solvers for a wide variety of applications. | +28% | +16% |
| 450.soplex | Simplex Linear Program (LP) Solver | Sparse Linear Algebra, linear program using the Simplex algorithm | +142% | +89% |
| 459.GemsFDTD | Computational Electromagnetics | Solves the Maxwell equations in 3D in the time domain using the finite-difference time-domain (FDTD) method. Fast Fourier transforms (FFT) are employed in the post-processing of the NFT_mod. | +145% | +48% |
| 470.lbm | Computational Fluid Dynamics, Lattice Boltzmann Method | “Lattice Boltzmann Method” (LBM) is used to simulate incompressible fluids in 3D | +64% | +43% |
| 481.wrf | Weather Forecasting | 3-dimensional variational (3DVAR) data assimilation system | +85% | +22% |
| 482.sphinx3 | Speech Recognition | Sphinx-3 is a widely known speech recognition system from Carnegie Mellon University. | +41% | +45% |
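The percentages in Table 7 are consistent with a simple relative gain, (T2 Plus score / other score − 1) × 100, rounded to the nearest integer. The sketch below applies that formula to the Table 5 scores for the ten benchmarks; the formula is inferred from the numbers rather than stated in the text, and `gain_percent` is an illustrative name.

```python
# Scores copied from Table 5, in the order (UltraSPARC T2 Plus, Xeon, Opteron).
SCORES = {
    "410.bwaves": (151, 44.3, 96.7),
    "433.milc": (146, 30.9, 73.1),
    "434.zeusmp": (107, 84.5, 95.8),
    "437.leslie3d": (102, 38.0, 62.8),
    "447.dealII": (193, 151, 166),
    "450.soplex": (117, 48.4, 61.8),
    "459.GemsFDTD": (87.8, 35.9, 59.5),
    "470.lbm": (85.3, 52.1, 59.7),
    "481.wrf": (129, 69.7, 106),
    "482.sphinx3": (148, 105, 102),
}


def gain_percent(t2_plus, other):
    """Relative advantage of the T2 Plus score over `other`, in whole percent."""
    return round((t2_plus / other - 1) * 100)


# benchmark -> (gain vs Xeon, gain vs Opteron)
gains = {
    name: (gain_percent(s, x), gain_percent(s, o))
    for name, (s, x, o) in SCORES.items()
}

for name, (vs_xeon, vs_opteron) in gains.items():
    print(f"{name}: +{vs_xeon}% vs Xeon, +{vs_opteron}% vs Opteron")
```

The same formula applied to the Power 6 column of Table 5 gives the figures listed in Table 8.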

4.2.2 Tests where UltraSPARC T2 Plus Beats Power 6

Let us now focus only on the 8 applications where the UltraSPARC T2 Plus beats Power 6 (Table 8). Notably, the UltraSPARC T2 Plus achieves a higher SPECfp_rate2006 than the Power 6 system in the molecular dynamics applications (gromacs and namd).

Table 8

| Application Name | Application Category | Description | UltraSPARC T2 Plus vs Power 6 |
|---|---|---|---|
| 416.gamess | Quantum Chemical Computations | Self-consistent field (SCF) computation using the direct SCF method | +14% |
| 433.milc | Physics / Quantum Chromodynamics | Lattice Gauge Theory, serial version of the su3imp program | +70% |
| 435.gromacs | Chemistry / Molecular Dynamics | Simulation of the Newtonian equations of motion for systems with hundreds to millions of particles. | +32% |
| 444.namd | Scientific, Structural Biology, Classical Molecular Dynamics Simulation | NAMD is a parallel program for the simulation of large biomolecular systems. | +18% |
| 447.dealII | PDE using Adaptive FEM | Solves an equation (a Helmholtz-type equation with non-constant coefficients) that is at the heart of solvers for a wide variety of applications. | +21% |
| 453.povray | Computer Visualization | POV-Ray is a ray-tracer. | +100% |
| 465.tonto | Quantum Crystallography | The profiles of Tonto calculations are typical of many ab initio quantum chemistry packages. That is, a large portion is dedicated to the evaluation of integrals between products of Gaussian basis functions. | +13% |
| 481.wrf | Weather Forecasting | 3-dimensional variational (3DVAR) data assimilation system | +57% |

5. HPC Benchmarks on the UltraSPARC T2 – Analysis from

“Evaluating UltraSPARC T2 Throughput Performance using the PEAS suite” by Dr. Ruud van der Pas, Distinguished Engineer, Sun Microsystems
● PDF: https://events-at-sun.com/hpcreno/presentations/Ruud_van_der_PAS_UltraSPARC_T2.pdf
● Blog: http://blogs.sun.com/ruud/

“Performance Results for the UltraSPARC T2 Processor on HPC Workloads” by Partha Thirumalai, Distinguished Engineer, Sun Microsystems
● PDF: https://events-at-sun.com/hpcreno/presentations/UltraSPARC_T2_HPC_Performance.pdf
● Youtube: https://hpc.sun.com/blog/richbruecknersuncom/consortium-video-performance-results- -t2-processor-hpc-workloads

6. HPC Benchmarks on the UltraSPARC T2 Done at the Aachen University

“UltraSPARC T2 for HPC”
● PDF: https://events-at-sun.com/hpcreno/presentations/UltraSPARC_T2_HPC.pdf
● Webpage: http://www.rz.rwth-aachen.de/ca/k/raw/?lang=en

7. HPC Benchmarks on the UltraSPARC T2 Done at UCB

“Optimization of Sparse Matrix-vector Multiplication on Emerging Multicore Platforms”
● PDF: http://cacs.usc.edu/education/cs596/Williams-OptMultiCore-SC07.pdf

References and Acknowledgments

● Brian L. Whitney (Senior Staff Engineer, Sun Microsystems) and Denis J. Sheahan (Distinguished Engineer, Sun Microsystems) for assisting with regard to the SPEC benchmark and UltraSPARC T2 Plus processor.
● SPEC and SPECfp are registered trademarks of Standard Performance Evaluation Corporation. Further information on SPEC may be found at http://www.spec.org.
● The presented SPECfp_rate2006 results were obtained from the SPEC website on September 16, 2008.
  ○ Sun SPARC Enterprise T5240 (Sun UltraSPARC T2 Plus): http://spec.org/cpu2006/results/res2008q2/cpu2006-20080407-04061.html
  ○ 3DBOXX WORKSTATION 8400 Special Edition (Intel Xeon X5482): http://www.spec.org/cpu2006/results/res2008q1/cpu2006-20071226-02920.html
  ○ IBM BladeCenter LS22 (AMD Opteron 2356): http://www.spec.org/cpu2006/results/res2008q3/cpu2006-20080623-04663.html
  ○ IBM System p 570 (IBM Power 6): http://spec.org/cpu2006/results/res2007q4/cpu2006-20071030-02421.html
