IBM Systems & Technology Group, IBM India May 2011

SPECfp®2006 CPI Stack & SMT Benefits Analysis on POWER7 systems

Satish Kumar Sadasivam, Prathiba Kumar
IBM India Systems and Technology Labs, Bangalore
Email: [email protected], [email protected]

© 2011 IBM Corporation

POWER7 Processor Chip

• Cores: 8 (4 / 6 core options)
• Die size: 567 mm²
• Technology:
– 45nm lithography, Cu, SOI, eDRAM
• Transistors: 1.2 B
– Equivalent function of 2.7B
– eDRAM efficiency
• Eight processor cores
– 12 execution units per core
– 4 Way SMT per core, up to 4 threads per core
– 32 Threads per chip
– L1: 32 KB I Cache / 32 KB D Cache
– L2: 256 KB per core
– L3: Shared 32MB on chip eDRAM
• Dual DDR3 Memory Controllers
– 100 GB/s Memory bandwidth per chip
• Scalability up to 32 Sockets
– 360 GB/s SMP bandwidth/chip
– 20,000 coherent operations in flight
• Binary Compatibility with POWER6

[Die diagram: eight POWER7 cores, each with its own L2 cache and FAST L3 region, arranged around the L3 cache and chip interconnect; memory controllers MC0 and MC1; local SMP links; remote SMP & I/O links.]

POWER7 PMU

• POWER7 has powerful on-chip performance instrumentation, which forms the Performance Monitoring Unit (PMU).
• 8 chip-level, self-contained PMUlets are distributed across the chip.
• PMU Events:
– PMU events span a wide spectrum.
– They allow monitoring of program characteristics (types of instructions, etc.) and of architectural and microarchitectural behaviour (FP overflow, exceptions, reorder queue full, etc.).

• Performance Monitor Counters (PMCs):
– Each thread has 6 Performance Monitor Counters (PMCs).
– Each PMC is 32 bits wide.
– By connecting adjacent PMCs, PMC1-4 can be used as a 32 x N bit counter (N = 1 to 4).
– 4 64-bit core-level counters: 2 programmable and 2 non-programmable.
• Programming the PMCs:
– PMC1-4 are programmable.
– PMC5 is a dedicated counter for Run Instructions (finished PowerPC instructions gated by the run latch).
– PMC6 is a dedicated counter for Run Cycles (cycles gated by the run latch).
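As a rough illustration of how the counters above might be consumed in software, the sketch below combines chained 32-bit PMC readings into one wide value and derives a CPI from a Run Cycles reading (PMC6) and a Run Instructions reading (PMC5). The function names, the most-significant-first word order, and the sample readings are assumptions made for this example; the slides do not specify how chained counters are read out.

```python
def chain_pmcs(words):
    """Combine 32-bit PMC readings into one wide count.

    Assumption: `words` is ordered most significant first.
    """
    value = 0
    for word in words:
        if not 0 <= word < 2**32:
            raise ValueError("each PMC reading must fit in 32 bits")
        value = (value << 32) | word
    return value


def cpi(run_cycles, run_instructions):
    """Cycles per instruction from the two run-latch-gated counters."""
    if run_instructions <= 0:
        raise ValueError("run_instructions must be positive")
    return run_cycles / run_instructions


# Hypothetical readings, not measurements from the slides.
wide = chain_pmcs([0x1, 0x10])      # two chained PMCs form a 64-bit count
ratio = cpi(run_cycles=1_440, run_instructions=1_000)
```

Chaining matters for long runs because a single 32-bit counter wraps after about 4.29 billion events.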

SPECfp®_rate2006 Results

[Bar chart: SPECfp_rate2006 results in ST, SMT2 and SMT4 modes for 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf and 482.sphinx3; horizontal axis from 0 to 600.]

SPECint, SPECfp and SPECrate are trademarks of the Standard Performance Evaluation Corp (SPEC).

Measurements are on a single chip experimental POWER7 system

Benchmarks' CPI (thread level) in each SMT mode

Benchmark   Workload   ST CPI   SMT2 CPI   SMT4 CPI
bwaves      0          1.44     2.92       5.94
gamess      0          0.70     1.08       1.93
gamess      1          0.58     0.92       1.66
gamess      2          0.58     0.91       1.61
milc        0          2.03     4.19       8.55
zeusmp      0          0.64     1.15       2.27
gromacs     0          0.65     1.07       1.89
cactusADM   0          0.96     1.76       3.53
leslie3d    0          1.14     2.46       5.07
namd        0          0.78     1.14       1.76
dealII      0          0.93     1.34       2.31
soplex      0          2.45     5.54       13.03
soplex      1          2.07     3.81       7.39
povray      0          0.85     1.25       2.11
calculix    0          0.56     0.92       1.67
GemsFDTD    0          2.11     4.32       8.66
tonto       0          0.72     1.17       2.32
lbm         0          1.17     2.30       4.69
wrf         0          0.65     1.15       2.29
sphinx3     0          1.08     2.32       4.54

Calculating SMT gains

• SMT gains are calculated in terms of CPI
• SMT Gain(%) Formula:

\[ \text{SMT Gain}(\%) = 100 \times \frac{x \cdot \mathrm{CPI}_{ST}}{\mathrm{CPI}_{SMT}} - 100 \]

where
– \(x\) is the number of SMT threads
– \(\mathrm{CPI}_{ST} = \text{cycles}_{ST} / \text{instructions}_{ST}\)
– \(\mathrm{CPI}_{SMT} = \text{cycles}_{SMT} / \text{instructions}_{SMT}\)

Example: gromacs
ST CPI = 0.65; SMT4 CPI = 1.89
SMT Gain(%) = 100 * (4 * 0.65 / 1.89) - 100, which is about 37.57% with the rounded CPIs shown here. The table lists 36.87%, presumably because it was computed from unrounded CPI values.

Benchmark   Workload   SMT2 % Gain   SMT4 % Gain
bwaves      0          -0.97         -2.70
gamess      0          28.56         43.87
gamess      1          25.99         39.88
gamess      2          27.09         43.18
milc        0          -3.01         -4.92
zeusmp      0          12.19         13.40
gromacs     0          20.61         36.87
cactusADM   0          8.54          8.36
leslie3d    0          -7.47         -10.05
namd        0          37.22         78.54
dealII      0          38.59         60.90
soplex      0          -11.46        -24.79
soplex      1          8.81          12.27
povray      0          36.42         61.59
calculix    0          22.14         34.30
GemsFDTD    0          -2.37         -2.54
tonto       0          23.67         24.93
lbm         0          1.71          -0.29
wrf         0          13.07         13.18
sphinx3     0          -6.52         -4.43
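The gain formula can be expressed as a small helper. This is a minimal sketch: the function name and argument names are chosen for this example, and the inputs are the two-decimal CPIs from the slides, so the results can differ slightly from the tabulated gains.

```python
def smt_gain_percent(cpi_st, cpi_smt, threads):
    """SMT gain in percent: 100 * (threads * CPI_ST / CPI_SMT) - 100.

    cpi_st  -- thread-level CPI in single-thread (ST) mode
    cpi_smt -- thread-level CPI in the SMT mode being evaluated
    threads -- number of SMT threads (2 for SMT2, 4 for SMT4)
    """
    if cpi_smt <= 0:
        raise ValueError("cpi_smt must be positive")
    return 100.0 * (threads * cpi_st / cpi_smt) - 100.0


# Rounded CPIs taken from the per-benchmark CPI table.
gromacs_smt4 = smt_gain_percent(0.65, 1.89, threads=4)
milc_smt2 = smt_gain_percent(2.03, 4.19, threads=2)
```

A positive result means the core completes more instructions per cycle with all threads running than in ST mode; a negative result means the per-thread CPI grew by more than the thread count.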

SMT Gains ordering

[Bar chart: SMT2 Gain (%) per benchmark workload, ordered by gain; vertical axis from -20 to 50.]

[Bar chart: SMT4 Gain (%) per benchmark workload, ordered by gain; vertical axis from -40 to 100.]

SMT Gain categories

• Benchmark SMT Gain Categories:
– High Gain
– Medium Gain
– Minimal Gain/No Loss
– Loss

Benchmark   Workload   SMT2 % Gain   SMT4 % Gain
bwaves      0          -0.97         -2.70
gamess      0          28.56         43.87
gamess      1          25.99         39.88
gamess      2          27.09         43.18
milc        0          -3.01         -4.92
zeusmp      0          12.19         13.40
gromacs     0          20.61         36.87
cactusADM   0          8.54          8.36
leslie3d    0          -7.47         -10.05
namd        0          37.22         78.54
dealII      0          38.59         60.90
soplex      0          -11.46        -24.79
soplex      1          8.81          12.27
povray      0          36.42         61.59
calculix    0          22.14         34.30
GemsFDTD    0          -2.37         -2.54
tonto       0          23.67         24.93
lbm         0          1.71          -0.29
wrf         0          13.07         13.18
sphinx3     0          -6.52         -4.43
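One way to assign a workload to one of the four categories is a simple threshold test on its gain. The cut-off values below are hypothetical: the slides name the categories but do not state where the boundaries lie, so treat this as a sketch of the idea rather than the authors' classification.

```python
# Hypothetical cut-offs (percent); not taken from the slides.
HIGH_GAIN_MIN = 30.0
MEDIUM_GAIN_MIN = 10.0
LOSS_MAX = -1.0


def gain_category(gain_percent):
    """Map an SMT gain percentage to one of the four category names."""
    if gain_percent >= HIGH_GAIN_MIN:
        return "High Gain"
    if gain_percent >= MEDIUM_GAIN_MIN:
        return "Medium Gain"
    if gain_percent > LOSS_MAX:
        return "Minimal Gain/No Loss"
    return "Loss"


# A few SMT4 gains from the table, keyed by benchmark_workload.
smt4_gains = {"namd_0": 78.54, "wrf_0": 13.18, "lbm_0": -0.29, "soplex_0": -24.79}
categories = {name: gain_category(gain) for name, gain in smt4_gains.items()}
```

With different cut-offs, borderline workloads such as bwaves or cactusADM would move between categories, so the thresholds should be chosen to match the analysis at hand.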

CPI Stack for POWER7

– Provides the breakdown of average cycles-per-instruction in terms of CPU resource usage
– Valuable way of understanding performance aspects of a program execution

Cycles (PM_RUN_CYC)
  Completion Stall Cycles (PM_CMPLU_STALL)
    Stall by FXU (PM_CMPLU_STALL_FXU)
      FXU Multi-Cycle Instruction (PM_CMPLU_STALL_DIV)
      FXU Other (PM_CMPLU_STALL_FXU_OTHER)
    Stall by VSU (PM_CMPLU_STALL_VSU) (C3A + C3B + C3C)
      Stall by Scalar (PM_CMPLU_STALL_SCALAR)
        Stall by Scalar Long (PM_CMPLU_STALL_SCALAR_LONG)
        Stall by Scalar Other (PM_CMPLU_STALL_SCALAR_OTHER)
      Stall by Vector (PM_CMPLU_STALL_VECTOR)
        Stall by Vector Long (PM_CMPLU_STALL_VECTOR_LONG)
        Stall by Vector Other (PM_CMPLU_STALL_VECTOR_OTHER)
      Stall by DFU (PM_CMPLU_STALL_DFP)
    Stall by LSU (PM_CMPLU_STALL_LSU)
      Stall by Reject (PM_CMPLU_STALL_REJECT)
        Translation Stall (PM_CMPLU_STALL_ERAT_MISS)
        Other Reject (PM_CMPLU_STALL_REJECT_OTHER)
      Stall by D-Cache Miss (PM_CMPLU_STALL_DCACHE_MISS)
      Stall by Store <C1C> (PM_CMPLU_STALL_STORE)
      LSU Other (PM_CMPLU_STALL_LSU_OTHER)
    Stall due to SMT (PM_CMPLU_STALL_THRD)
    Stall due to IFU (PM_CMPLU_STALL_IFU)
      Stall due to BRU (PM_CMPLU_STALL_BRU)
      Other IFU Stall (PM_CMPLU_STALL_IFU_OTHER)
    Other Stall (PM_CMPLU_STALL_OTHER)
  GCT Empty Cycles (PM_GCT_NOSLOT_CYC)
    GCT Empty due to Icache Miss (PM_GCT_NOSLOT_IC_MISS)
    GCT Empty due to Branch Mispredict (PM_GCT_NOSLOT_BR_MPRED)
    GCT Empty due to Branch Mispredict and Icache Miss (PM_GCT_NOSLOT_BR_MPRED_IC_MISS)
    GCT Empty Other (PM_GCT_EMPTY_OTHER)
  Completion Cycles (Group Completed) (PM_GRP_CMPL)
    Base Completion Cycles (PM_1PLUS_PPC_CMPL)
    Overhead of expansion

CPI Stack for SPEC FP benchmarks

[Stacked bar chart: CPI Stack - ST, one bar per benchmark workload (bwaves, gamess x3, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex x2, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3); horizontal axis: CPI from 0.0 to 2.5. Stack components: FXU STALL, VSU_SCALAR_STALL, VSU_VECTOR_STALL, VSU_DFU_STALL, LSU_REJECT_STALL, LSU_DCACHE_STALL, LSU_STORE_STALL, LSU_OTHERS_STALL, IFU_STALL, SMT_STALL, OTHER_CMPL_STALL, GCT_IC_MISS, GCT_BR_MP, GCT_BR_MP_IC_MISS, GCT_OTHERS, GRP_CMP.]
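A CPI stack like the one plotted here can be built by dividing each component's cycle count by the number of completed instructions. The sketch below shows that conversion for the LSU branch of the stack. It assumes the "other" bucket is obtained by subtracting the measured children from the parent event, which the slides do not state, and all counts are made up for illustration.

```python
def cpi_stack(cycle_counts, instructions):
    """Convert per-component cycle counts into CPI contributions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return {name: cycles / instructions for name, cycles in cycle_counts.items()}


def remainder(parent_cycles, child_cycles):
    """Parent cycles not accounted for by the measured child events."""
    return max(parent_cycles - sum(child_cycles), 0)


# Made-up counts for one workload (not measurements from the slides).
instructions = 800_000
stall_lsu = 300_000                      # parent: PM_CMPLU_STALL_LSU
lsu_children = {
    "LSU_REJECT_STALL": 40_000,          # PM_CMPLU_STALL_REJECT
    "LSU_DCACHE_STALL": 200_000,         # PM_CMPLU_STALL_DCACHE_MISS
    "LSU_STORE_STALL": 20_000,           # PM_CMPLU_STALL_STORE
}
# Assumption: the "other" bucket is whatever the children do not explain.
lsu_children["LSU_OTHERS_STALL"] = remainder(stall_lsu, lsu_children.values())

lsu_stack = cpi_stack(lsu_children, instructions)
lsu_cpi = sum(lsu_stack.values())        # equals stall_lsu / instructions
```

Because the children partition the parent, their CPI contributions add up to the parent's contribution, which is what lets the components be drawn as one stacked bar.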

CPI stack for different benchmarks…

[Four stacked bar charts titled "CPI Stack", each showing ST, SMT2 and SMT4 bars with the same stack components as the ST chart (FXU STALL through GRP_CMP); horizontal axis: CPI. Panels: bwaves (0 to 7), milc (0 to 10), calculix (0 to 2.5), povray (0 to 2.0).]

Benchmark Characteristics – Cache Distribution

[Stacked bar chart: Cache Distribution, showing the share of PM_DATA_FROM_L2, PM_DATA_FROM_L3 and PM_DATA_FROM_LMEM for each benchmark workload (bwaves_0 through sphinx3_0); horizontal axis from 0% to 100%.]

Benchmark Characteristics – Load miss rates

[Bar chart: Load Miss Rates for each benchmark workload (bwaves_0 through sphinx3_0); horizontal axis: Load Miss Rate from 0.00 to 8.00.]

Benchmark characteristics – Instruction distribution

[Stacked bar chart: Instruction Distribution for each benchmark workload (bwaves_0 through sphinx3_0), split into PM_BRU_FIN, LD_INST_FIN, PM_ST_FIN, FX_INST_FIN and PM_VSU_FIN; vertical axis from 0% to 100%.]

High SMT Gains - namd

namd          ST     SMT2    SMT4
CPI           0.78   1.14    1.76
SMT Gain(%)   -      37.22   78.54

[Stacked bar chart: CPI Stack for namd in ST, SMT2 and SMT4 modes, with the same stack components as the ST chart (FXU STALL through GRP_CMP); horizontal axis: CPI from 0.0 to 2.0.]

Negative SMT Gains Benchmark: milc

milc          ST     SMT2    SMT4
CPI           2.03   4.19    8.55
SMT Gain(%)   -      -3.01   -4.92

[Stacked bar chart: CPI Stack for milc in ST, SMT2 and SMT4 modes, with the same stack components as the ST chart (FXU STALL through GRP_CMP); horizontal axis: CPI from 0 to 10.]

Acknowledgements

• Alan Mackay for all help related to SPEC CPU2006
• Rajeev Indukuru for generous assistance with leveraging the PMU

Special Notices

• IBM, the IBM logo, ibm.com, AIX, AIX (logo), AIX 6 (logo), AS/400, Active Memory, BladeCenter, Blue Gene, CacheFlow, ClusterProven, DB2, ESCON, i5/OS, i5/OS (logo), IBM Business Partner (logo), IntelliStation, LoadLeveler, Lotus, Lotus Notes, Notes, Operating System/400, OS/400, PartnerLink, PartnerWorld, PowerPC, pSeries, Rational, RISC System/6000, RS/6000, THINK, Tivoli, Tivoli (logo), Tivoli Management Environment, WebSphere, xSeries, z/OS, zSeries, AIX 5L, Chiphopper, Chipkill, Cloudscape, DB2 Universal Database, DS4000, DS6000, DS8000, EnergyScale, Enterprise Workload Manager, General Purpose File System, GPFS, HACMP, HACMP/6000, HASM, IBM Systems Director Active Energy Manager, iSeries, Micro-Partitioning, POWER, PowerExecutive, PowerVM, PowerVM (logo), PowerHA, Power Architecture, Power Everywhere, Power Family, POWER Hypervisor, Power Systems, Power Systems (logo), Power Systems Software, Power Systems Software (logo), POWER2, POWER3, POWER4, POWER4+, POWER5, POWER5+, POWER6, POWER7, pureScale, System i, System p, System p5, System Storage, System z, Tivoli Enterprise, TME 10, TurboCore, Workload Partitions Manager and X-Architecture are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml
• The Power Architecture and Power.org wordmarks and the Power and Power.org logos and related marks are trademarks and service marks licensed by Power.org.
• SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC).
• For a definition/explanation of each benchmark and the full list of detailed results, visit the Web site of the benchmark consortium or benchmark vendor.
– SPEC http://www.spec.org
• The IBM benchmark results shown herein were derived using particular, well configured, development-level and generally-available systems. Buyers should consult other sources of information to evaluate the performance of systems they are considering buying and should consider conducting application oriented testing. For additional information about the benchmarks, values and systems tested, contact your local IBM office or IBM authorized reseller or access the Web site of the benchmark consortium or benchmark vendor.
• IBM benchmark results can be found in the IBM Power Systems Performance Report at http://www.ibm.com/systems/p/hardware/system_perf.html .

Thank you!