POWER7 Performance Overview

SPXXL 2010 – San Francisco, 9 -16 May 2010

Raj Panda

256 Core Nodes Water Cooled

4U 32 Core Air Cooled

Blades

POWER7

. Core options: 8 (for HPC)
. 567mm2 Technology:
  – 45nm lithography, Cu, SOI, eDRAM
. Transistors: 1.2 B
  – Equivalent function of 2.7B
  – eDRAM efficiency
. Eight processor cores
  – 12 execution units per core
  – 4 Way SMT per core
  – 32 Threads per chip
  – 256 KB L2 per core
. 32MB on chip eDRAM shared L3
. Dual DDR3 Memory Controllers
  – 100 GB/s Memory bandwidth per chip
. Scalability up to 32 Sockets
  – 360 GB/s SMP bandwidth/chip
  – 20,000 coherent operations in flight
. Advanced pre-fetching Data and Instruction

[Die diagram: eight P7 cores, each with an L2, around the L3 Cache (32MB); Memory Interface; Memory++]

POWER7: Core

64-bit PowerPC architecture v2.07
Out of Order Execution
Execution Units
• 2 Fixed Point Units
• 2 Load Store Units
• 4 Double Precision Floating Point Units
• 1 VMX Unit
• 1 Decimal Floating Point Unit
• 1 Branch
• 1 Condition Register
• 6 Wide Dispatch
• Units include distributed Recovery Function

[Core floorplan labels: DFU, ISU, VSX FPU, FXU, IFU, CRU/BRU, LSU, L2 Cache]

. POWER7 continues to support VMX
. Extends SIMD support with VSX
  – 2 VSX units that can each handle 2 Double-Precision FP instructions
  – 8 FLOPS per cycle
  – VSX units can also handle 4 Single Precision instructions per cycle
  – VSX instruction set support for vector and scalar instructions

. 64 Entry Vector/Scalar Register File
  – 128-bit wide registers
  – Used for 32b/64b scalar as well as 4x32b/2x64b SIMD instructions
. Floating point instructions and issue rates:
  – Up to two instructions can be issued to the VSU in a given cycle, one for each pipeline
  – Instructions executed by pipe0 can be a 128-bit simple fixed point operation (Altivec), 128-bit complex fixed point operation (Altivec), 4-way SIMD single-precision FPU operation (Altivec), a 2-way SIMD double-precision FPU operation (VSX) or a scalar floating point (single or double precision) operation.
  – Instructions executed by pipe1 can be a 128-bit permute (Altivec or VSX permute), a store, a scalar floating point (single or double precision) operation, or a 2-way SIMD double-precision FPU operation (VSX).
  – So two VSX instructions can execute simultaneously, each handling 2 double-precision FP operations. Since each operation can be a FP multiply-add (FMA), that gives a peak of 2x2x2=8 double-precision FP operations per cycle.
  – Unlike previous implementations, the scalar and vector FP operations are all executed within the VSU.
  – Because there are two scalar FXU pipelines independent of the VSU, two additional FXU operations, for logical operations and/or array indexing, can be executed at the same time as VSU operations
. Floating Point Operations are ANSI/IEEE standard 754-1985 Compliant

Scalar: 4 ops/cycle (DP or SP)

Vector (VSX): 8 ops/cycle (DP)

Vector (Altivec): 8 ops/cycle (SP)
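The per-cycle peaks above are products of pipeline count, SIMD lanes per pipeline, and operations per instruction (an FMA counts as two). The following is a minimal sketch of that arithmetic in C. The function names are illustrative only and do not come from any IBM tool; the 3.3 GHz clock and 32-core count used in the usage note are the Power 755 figures quoted elsewhere in this deck.

```c
#include <assert.h>

/* Peak FLOPs per cycle for one core:
 * pipelines x SIMD lanes per pipeline x operations per instruction
 * (an FMA counts as 2 operations: one multiply and one add). */
int peak_flops_per_cycle(int pipelines, int lanes, int ops_per_instr)
{
    assert(pipelines > 0 && lanes > 0 && ops_per_instr > 0);
    return pipelines * lanes * ops_per_instr;
}

/* Peak GFLOPS for a node: FLOPs/cycle x clock in GHz x core count. */
double peak_gflops(int flops_per_cycle, double clock_ghz, int cores)
{
    assert(flops_per_cycle > 0 && clock_ghz > 0.0 && cores > 0);
    return flops_per_cycle * clock_ghz * cores;
}
```

With these helpers, peak_flops_per_cycle(2, 2, 2) gives the 8 DP ops/cycle VSX figure, peak_flops_per_cycle(2, 1, 2) the 4 ops/cycle scalar figure, and peak_flops_per_cycle(1, 4, 2) the 8 SP ops/cycle Altivec figure (4-way SP SIMD is listed for pipe0 only). peak_gflops(8, 3.3, 32) evaluates to about 844.8, in line with the 844 GFLOPS peak quoted for the Power 755.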

. Using compiler: compiler versions that recognize the POWER7 architecture are XL C/C++ 11.1 and XL Fortran 13.1.
  – For C:
    • xlc -qarch=pwr7 -qtune=pwr7 -O3 -qhot -qsimd
  – For Fortran:
    • xlf -qarch=pwr7 -qtune=pwr7 -O3 -qhot
. Using ESSL libraries with vectorization support:
  – Select routines have vector analogs in the library
  – Key FFT, BLAS routines

. Loop carried data dependencies
    for (i = 1; i < N; i++)
        dc[i] = dc[i-1] + af[i];   /* each iteration needs the previous result */

. Unresolved aliases
    double sub(double *a, int *b, double *c) {
        for (int i = 0; i < N; i++)
            a[i] = a[b[i]] + c[i];   /* a[b[i]] may alias an element written by another iteration */
        :
    }

. Non-stride-1 accesses
    for (i = 0; i < N; i += 4)
        a[i] = b[i] + c[i];
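By contrast, a loop with none of the three inhibitors above is a good candidate for automatic SIMD code generation. The following is a minimal sketch, assuming C99 `restrict` is available. The function name `vadd` is hypothetical, and whether a given compiler actually vectorizes the loop depends on the compiler and the options used.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise add with unit-stride accesses and no loop-carried dependence.
 * The restrict qualifiers promise the compiler that a, b and c do not
 * overlap, which removes the unresolved-alias obstacle. */
void vadd(double *restrict a, const double *restrict b,
          const double *restrict c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];   /* iteration i touches only element i */
}
```

Each iteration is independent and reads and writes consecutive elements, so two doubles fit naturally into one 128-bit register per operation. The caller remains responsible for honoring the no-overlap promise made by `restrict`.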

NOTES:
1. Non-library version of FFT
2. Library versions (ESSL, FFTW) would show similar behavior with increasing vector length

. SMT is a processor technology that allows separate instruction streams (threads) to run concurrently on the same physical processor, improving overall throughput

. P7 supports 2-way and 4-way SMT
  – SMT4 gains for commercial applications range from 1.7x to 2.2x of single-thread performance
  – SMT4 gains are limited by resource constraints such as fewer FP rename registers per thread, fewer instruction buffer entries per thread, …

. Not all applications benefit from SMT
  – Cases where the performance may not improve and may even degrade
    • applications with execution-unit-limited performance (LINPACK for example)
    • applications that consume all the chip's memory bandwidth (STREAM for example)
. SMT gain on P7
  – Less than on P6 due to out-of-order architecture

. Capacity-oriented workload: lots of serial jobs
  – Both SMT2 and SMT4 are likely to improve performance
. Capability-oriented workload: turnaround time is critical
  – SMT2 may show some benefit
  – SMT4 is very unlikely to show benefit
. Capability-capacity: lots of parallel jobs
  – SMT2: may show up to 10% benefit
  – SMT4: may show a few percent more than with SMT2
. Above heuristics are very dependent on
  – performance characteristics of the individual apps in the job stream
  – mileage may vary but worth trying

Architecture:        POWER7, 4 Processor Sockets = 32 Cores, 8 Core @ 3.3 GHz
DDR3 Memory:         128 GB / 256 GB, 32 DIMM Slots
DASD / Bays:         Up to 8 SFF SAS DASD (2.4TB), 73 / 146 / 300GB @ 15K (Opt: RAID)
Expansion:           PCIe x8: 3 Slots (1 shared), PCI-X DDR: 2 Slots, GX++ Bus
Integrated Ports:    3 USB, 2 Serial, 2 HMC
Integrated Ethernet: Quad 1Gb Copper (Opt: Dual 10Gb Cu or Fiber)
Media Bays:          1 Slim-line (No tape support)
Cluster:             Ethernet or IB-DDR (64 nodes)
Redundant Power:     Yes (AC or DC Power), Single phase 240vac or -48 VDC
Operating systems:   AIX 5.3 / 6.1, RHEL / SLES
Certifications:      NEBS / ETSI for harsh environments

4U x 28.8” depth
Up to 8.4 TFlops per Rack (10 nodes per Rack)

                           Power 755   Power 575
Cores/chip                 8           2
Total cores                32          32
Frequency                  3.3 GHz     4.7 GHz
Memory (max)               256 GB      256 GB
Cooling                    Air         Water
Cores/rack                 320         448
Rack type                  19”         24”
Power (Watts) (Linpack)    1650        5400
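The energy and per-rack figures in this deck can be cross-checked from the table with a few divisions. The following is a minimal sketch in C; the helper names are hypothetical, and the 844.8 GFLOPS node peak used in the usage note is an assumption derived from 8 FLOPs/cycle x 3.3 GHz x 32 cores rather than a number stated in the table.

```c
#include <assert.h>

/* Ratio of two power draws in watts; a value below 1 means node A draws less. */
double power_ratio(double watts_a, double watts_b)
{
    assert(watts_a > 0.0 && watts_b > 0.0);
    return watts_a / watts_b;
}

/* Average power per core for one node. */
double watts_per_core(double node_watts, int cores)
{
    assert(node_watts > 0.0 && cores > 0);
    return node_watts / cores;
}

/* Peak TFLOPS for a rack, given node count and per-node peak in GFLOPS. */
double rack_peak_tflops(int nodes, double node_peak_gflops)
{
    assert(nodes > 0 && node_peak_gflops > 0.0);
    return nodes * node_peak_gflops / 1000.0;
}
```

power_ratio(1650, 5400) is about 0.31, which matches the "1/3 of the energy consumption" claim for equal core counts. watts_per_core gives roughly 52 W per core for the Power 755 against roughly 169 W for the Power 575 under Linpack. rack_peak_tflops(10, 844.8) is about 8.4, consistent with the "up to 8.4 TFlops per rack" figure for 10 nodes.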

Each Power 755 node offers the same core count as Power 575 with:

. 40-50% Improvement in Performance
. Air Cooling vs. Water Cooling
. 1/3 of the Energy Consumption
. 37% Improvement in floor space for a 64 node configuration
. Green500 ~ 592 MFlops/Watt

Power 755 vs Power 575: Standard Benchmarks

                                  Power 755, AIX 6.1
Peak performance (GFLOPS) (*)     844
STREAM (triad) (GB/s)             122
Linpack (HPL) (GFLOPS)            820
SPECfp_rate2006                   825
SPECint_rate2006                  1010

(*) – at nominal frequency of 3.3 GHz. Using DPS-FP mode, 755 can be run at a higher frequency

Source: Balaji Atyam, Carlos Sosa, Tony Pirraglia

• Small DFT: Density Functional Theory frequency calculation on a small molecule
• Medium Force: Split Basis Function force calculation on a medium sized molecule
• Large DFT: Density Functional Theory frequency calculation on a large molecule
• Binary built with XLF V9, VAC v7, -qarch=pwr4 -qtune=pwr4

PRELIMINARY results; final to be published by Gaussian, Inc.

Source: Balaji Atyam, Tony Pirraglia

1. ABAQUS was tested using the standard version of ESSL and the VSX enabled (pre-GA) version of ESSL
2. ABAQUS Standard Benchmark cases (S2a, S4a, etc.) were used
3. ABAQUS uses the DGEMM routine from ESSL
4. Performance benefit varies depending on the size of matrices and the DGEMM content
5. There are benchmark cases (e.g. S5) with no performance improvement with VSX
6. VSX exploitation REQUIRES POWER7 capable ESSL and XLF/XLC runtime environments.

HMMER benchmark

cores   Scalar (secs)   Vector (Altivec) (secs)   Vector/Scalar Ratio
1       2879            502                       5.74
2       1442            252                       5.72
4       724             140                       5.17
8       365             80                        4.56
16      205             59                        3.47

Notes:
. table entries are elapsed time in seconds
. Power 755 with 32 cores at 3.3 GHz, xlc 11.1 beta compiler
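The ratio column in the HMMER table is the scalar elapsed time divided by the vector elapsed time, and the drop in that ratio at higher core counts can be read alongside how well each build scales. The following is a minimal sketch in C that reproduces both quantities from the table entries; the function names are hypothetical.

```c
#include <assert.h>

/* Speedup of the vector build over the scalar build at a given core count:
 * scalar elapsed time divided by vector elapsed time. */
double vector_speedup(double scalar_secs, double vector_secs)
{
    assert(scalar_secs > 0.0 && vector_secs > 0.0);
    return scalar_secs / vector_secs;
}

/* Parallel efficiency on n cores: T(1) / (n * T(n)); 1.0 is ideal scaling. */
double parallel_efficiency(double t1_secs, double tn_secs, int n)
{
    assert(t1_secs > 0.0 && tn_secs > 0.0 && n > 0);
    return t1_secs / (n * tn_secs);
}
```

vector_speedup(2879, 502) is about 5.74 and vector_speedup(205, 59) about 3.47, matching the first and last rows. At 16 cores, parallel_efficiency(2879, 205, 16) is about 0.88 for the scalar build but parallel_efficiency(502, 59, 16) only about 0.53 for the Altivec build, so the faster vector code scales less well as cores are added.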

POWER7 Blades

Architecture:            POWER7, 8 cores per socket; single wide (8 Cores) or double wide (16 Cores)
L2 & L3 Cache:           On Chip
DDR3 Memory:             Up to 128GB / 256 GB
DASD / Bays:             0 - 2 SSD per side, 0 - 1 SAS per side
Daughter Card Options:   Legacy, SFF, or High speed PCIe
Integrated Options:      Dual Port 10/100/1000 Ethernet, USB
Fiber Support:           Yes (via BladeCenter)
Media Bays:              1 (BladeCenter)
Clustering:              10Gb Ethernet
Redundant Power:         Yes (BladeCenter)
Redundant Cooling:       Yes (BladeCenter)
Service Processor:       Yes

Up to 2.7 TF / BladeCenter (14 Blades per Chassis)
10.75 TF / Rack

                          JS22     JS23     PS701
CPU                       Power6   Power6   Power7
Core frequency (GHz)      4        4.2      3
Cores/Single wide blade   4        4        8
RAM                       4x4GB    8x4GB    16x4GB
DIMM speed (MHz)          667      677      1066

. Active Energy Manager is configurable using IBM Systems Director
. Offers 3 modes of energy management
. Static Power Saver (SPS)
  – Static: Active processor frequency set at 30% below nominal (2.31GHz)
  – Folding will set idle cores to Nap (1.65 GHz) or to Sleep (0 GHz)
  – Maximum energy savings, used for long periods of low utilization
. Dynamic Power Saver (DPS)
  – Processor frequency is set based on processor core utilization
  – Un-utilized cores set to 1.65 GHz and ramped up as utilization increases to maximum 90% of nominal frequency (2.97 GHz)
  – This feature prefers power savings over performance
. Dynamic Power Saver – Favor Performance (DPS-FP)
  – Processor frequency is set based on processor core utilization
  – Un-utilized cores set to 1.65 GHz and ramped up as utilization increases to maximum 107% of nominal frequency (3.53 GHz)
  – This feature prefers maximum performance over power savings

Correlation of performance and power consumption

SPEC Benchmark   Performance Characteristic
416.gamess       Core intensive
433.milc         Mem. bandwidth intensive
435.gromacs      Core intensive
437.leslie3d     Mem. bandwidth intensive
444.namd         Core intensive
459.GemsFDTD     Mem. bandwidth intensive

. Linpack Benchmark with Active Energy Manager
. AEM “Over-Clocking” support with DPS-FP
  – Dynamic Power Saver - Favor Performance
. Test environment
  – Power 755 32core @ 3.3GHz
  – XLC V11.0 beta
  – ESSL V5.1 beta
  – PE V5.2
. In case of AEM=OFF
  – 692.7 Gflops* (82.0% efficiency)
. In case of AEM=ON (DPS-FP)
  – 753.6 Gflops* (89.2% efficiency)
. Using AEM: 8.8% Performance Gain
  – 753.6 Gflops / 692.7 Gflops

* Power 755 performance projected from actual Power 750 results
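The efficiency and gain percentages quoted for the Linpack runs can be reproduced with two ratios. The following is a minimal sketch in C; the function names are hypothetical, and the 844.8 GFLOPS peak used in the usage note is an assumption (8 FLOPs/cycle x 3.3 GHz nominal x 32 cores), since the slide does not state the Rpeak it used.

```c
#include <assert.h>

/* Linpack efficiency: measured Rmax as a fraction of theoretical Rpeak. */
double hpl_efficiency(double rmax_gflops, double rpeak_gflops)
{
    assert(rmax_gflops > 0.0 && rpeak_gflops > 0.0);
    return rmax_gflops / rpeak_gflops;
}

/* Relative gain of `after` over `before`, expressed in percent. */
double percent_gain(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}
```

hpl_efficiency(692.7, 844.8) is about 0.820 and hpl_efficiency(753.6, 844.8) about 0.892, which suggests both efficiencies were computed against the peak at nominal frequency even though DPS-FP raises the clock. percent_gain(692.7, 753.6) is about 8.8.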

Power 755 cluster delivers superior Linpack Mflops per watt performance compared to Sun 4-socket x86 cluster

1.7 times the Mflops/watt vs Sun X6440 cluster

IBM Power 755

. Power 755 was a 16 node (512 core) POWER7 3.3 GHz cluster with InfiniBand: 11.15 TFlops Linpack Rmax / 13.517 TFlops Rpeak, 18.8 kW (Source: IBM measurements)
. Sun X6440 was a 6440 core 2.5 GHz Opteron cluster with InfiniBand: 51.88 TFlops Linpack Rmax / 64.64 TFlops Rpeak, 152 kW (Source: November 2009 Little Green500 list www.green500.org)
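The Mflops/watt comparison follows directly from the Rmax and power figures above. The following is a minimal sketch of the unit conversion in C; the function name is hypothetical.

```c
#include <assert.h>

/* Energy efficiency in MFLOPS per watt from Rmax in TFLOPS and power in kW.
 * 1 TFLOPS = 1.0e6 MFLOPS and 1 kW = 1.0e3 W. */
double mflops_per_watt(double rmax_tflops, double power_kw)
{
    assert(rmax_tflops > 0.0 && power_kw > 0.0);
    return (rmax_tflops * 1.0e6) / (power_kw * 1.0e3);
}
```

mflops_per_watt(11.15, 18.8) is about 593 for the Power 755 cluster and mflops_per_watt(51.88, 152.0) about 341 for the Sun cluster. Their ratio is about 1.74, which is the basis of the "1.7 times" claim and is close to the ~592 MFlops/Watt Green500 figure quoted earlier in the deck.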

[Chart: Megaflops/watt by system; number shown in column is November 2009 TOP500 rank; callout: "Power7 is there !!!"]

Rank  Site       Mfgr  System                              Rmax    MF/w  Relative  Power
1     ORNL       Cray  Jaguar XT5 HE 2.6 GHz 6C Opteron    1759    253   1.76      6.9 MW
2     LANL       IBM   Roadrunner QS22/LS21                1042    444   1         2.3 MW
3     U of Tenn  Cray  Kraken XT5 HE 2.6 GHz 6C Opteron *  831.7   253   1.76
4     Juelich    IBM   Blue Gene/P                         825.5   364   1.22      2.2 MW
5     NUDT       Self  Intel Nehalem/AMD Radeon GPU *      563.1
6     NASA Ames  SGI   QC 3.0 Xeon                         544.3   232   1.92
7     LLNL       IBM   Blue Gene/L                         478.2   205   2.16
8     ANL        IBM   Blue Gene/P                         458.6   364   1.22
9     TACC       Sun   2.3 GHz QC Opteron                  433.2   217   2.05
10    Sandia     Sun   Red Sky 2.93 Nehalem *              423.9

Source: www.top500.org
Notes: * Kraken power is scaled down from Jaguar; NUDT and Sandia/Sun did not provide power numbers
Mflops/watt is calculated by dividing “Rmax” by “Power”; Green500 will not publish until a later date

Summary

. Power 755 targets Divisional and Departmental HPC Segments
  – 4S single node systems
  – Clusters with GigE and IB networks
. Power 755 provides an ideal migration path for currently deployed systems such as:
  – System p5 550 and Power 550 (POWER6)
  – System p5 575 and Power 575 (POWER6)
  – JS21 Blade Clusters
. Power 755 brings great improvement over Power 575 with:
  – 2 x improvement in price performance
  – 3 x improvement in power consumption (Air cooled vs. Water cooled)
  – 1.7 x improvement in density
  – But, with lower interconnect bandwidth
. Application segments
  – Weather
  – Reservoir modeling
  – Financial Services
  – Computational Chemistry/Molecular Dynamics

...any Questions?