SPECfp®2006 CPI Stack & SMT Benefits Analysis on POWER7

IBM Systems & Technology Group, IBM India
May 2011

SPECfp®2006 CPI Stack & SMT Benefits Analysis on POWER7 systems

Satish Kumar Sadasivam, Prathiba Kumar
IBM India Systems and Technology Labs, Bangalore
Email: [email protected], [email protected]
© 2011 IBM Corporation

POWER7 Processor Chip

Cores: 8 (4 / 6 core options)
Die size: 567 mm²
Technology:
– 45nm lithography, Cu, SOI, eDRAM
Transistors: 1.2 B
– Equivalent function of 2.7B
– eDRAM efficiency
Eight processor cores
– 12 execution units per core
– 4 Way SMT per core, up to 4 threads per core
– 32 Threads per chip
– L1: 32 KB I Cache / 32 KB D Cache
– L2: 256 KB per core
– L3: Shared 32MB on chip eDRAM
Dual DDR3 Memory Controllers
– 100 GB/s Memory bandwidth per chip
Scalability up to 32 Sockets
– 360 GB/s SMP bandwidth/chip
– 20,000 coherent operations in flight
Binary Compatibility with POWER6

[Figure: POWER7 die floorplan. Eight POWER7 cores, each with its own L2 cache, surround the fast L3 region (L3 cache and chip interconnect); memory controllers MC0 and MC1, local SMP links, and remote SMP & I/O links sit at the edges.]

POWER7 PMU

POWER7 has extensive on-chip performance instrumentation, which forms the Performance Monitoring Unit (PMU).
8 chip-level, self-contained PMUlets are distributed across the chip.
PMU Events:
– PMU events span a wide spectrum.
– They allow monitoring of program characteristics (types of instructions, etc.) and of architectural and microarchitectural behaviour (FP overflow, exceptions, reorder queue full, etc.).
Performance Monitor Counters (PMCs):
– Each thread has 6 Performance Monitor Counters (PMCs).
– Each PMC is 32 bits wide.
– By connecting adjacent PMCs, PMC1-4 can be used as a 32 x N bit counter (N = 1 to 4).
– 4 64-bit core-level counters: 2 programmable and 2 non-programmable.
Programming the PMCs:
– PMC1-4 are programmable.
– PMC5 is a dedicated counter for Run Instructions (finished PowerPC instructions gated by the run latch).
– PMC6 is a dedicated counter for Run Cycles (cycles gated by the run latch).

SPECfp®_rate2006 Results

[Figure: horizontal bar chart of SPECfp_rate2006 results in ST, SMT2 and SMT4 modes for 410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf, 482.sphinx3 and the overall SPECfp_rate2006 score; axis from 0 to 600.]

SPECint, SPECfp and SPECrate are trademarks of the Standard Performance Evaluation Corp (SPEC).
Measurements are on a single chip experimental POWER7 system.

Benchmarks’ CPI (thread level) in each SMT mode

Benchmark    Workload   ST CPI   SMT2 CPI   SMT4 CPI
bwaves       0          1.44     2.92       5.94
gamess       0          0.70     1.08       1.93
gamess       1          0.58     0.92       1.66
gamess       2          0.58     0.91       1.61
milc         0          2.03     4.19       8.55
zeusmp       0          0.64     1.15       2.27
gromacs      0          0.65     1.07       1.89
cactusADM    0          0.96     1.76       3.53
leslie3d     0          1.14     2.46       5.07
namd         0          0.78     1.14       1.76
dealII       0          0.93     1.34       2.31
soplex       0          2.45     5.54       13.03
soplex       1          2.07     3.81       7.39
povray       0          0.85     1.25       2.11
calculix     0          0.56     0.92       1.67
GemsFDTD     0          2.11     4.32       8.66
tonto        0          0.72     1.17       2.32
lbm          0          1.17     2.30       4.69
wrf          0          0.65     1.15       2.29
sphinx3      0          1.08     2.32       4.54

Calculating SMT gains

SMT gains are calculated in terms of CPI.

SMT Gain(%) Formula:
SMT Gain(%) = 100 * (x * CPI_ST / CPI_SMT) - 100
where,
x is the number of SMT threads
CPI_ST = (cycles_ST / instructions_ST)
CPI_SMT = (cycles_SMT / instructions_SMT)

Example: gromacs
ST CPI = 0.65; SMT4 CPI = 1.89
SMT Gain(%) = 100 * (4 * 0.65 / 1.89) - 100 = 36.87%
(With the two-decimal CPIs shown, the expression evaluates to about 37.57%; the 36.87% reported in the table presumably comes from unrounded CPI values.)

Benchmark    Workload   SMT2 % Gain   SMT4 % Gain
bwaves       0          -0.97         -2.70
gamess       0          28.56         43.87
gamess       1          25.99         39.88
gamess       2          27.09         43.18
milc         0          -3.01         -4.92
zeusmp       0          12.19         13.40
gromacs      0          20.61         36.87
cactusADM    0          8.54          8.36
leslie3d     0          -7.47         -10.05
namd         0          37.22         78.54
dealII       0          38.59         60.90
soplex       0          -11.46        -24.79
soplex       1          8.81          12.27
povray       0          36.42         61.59
calculix     0          22.14         34.30
GemsFDTD     0          -2.37         -2.54
tonto        0          23.67         24.93
lbm          0          1.71          -0.29
wrf          0          13.07         13.18
sphinx3      0          -6.52         -4.43

SMT Gains ordering

[Figure: two bar charts, "SMT2 Gain" and "SMT4 Gain", ordering the benchmarks from highest gain (dealII_0 and namd_0 lead) to lowest (soplex_0 is last in both).]

SMT Gain categories

Benchmark SMT Gain Categories:
– High Gain
– Medium Gain
– Minimal Gain/No Loss
– Loss

The slide repeats the SMT2 and SMT4 gain table above, grouped into these categories.

CPI Stack for POWER7

– Provides the breakdown of average cycles-per-instruction in terms of CPU resource usage
– A valuable way of understanding the performance aspects of a program's execution

Cycles (PM_RUN_CYC)
  Completion Stall Cycles <C> (PM_CMPLU_STALL)
    Stall by LSU <C1> (PM_CMPLU_STALL_LSU)
      Stall by Reject <C1A> (PM_CMPLU_STALL_REJECT)
        Translation Stall <C1A1> (PM_CMPLU_STALL_ERAT_MISS)
        Other Reject <C1A2: C1A - C1A1> (PM_CMPLU_STALL_REJECT_OTHER)
      Stall by D-Cache Miss <C1B> (PM_CMPLU_STALL_DCACHE_MISS)
      Stall by Store <C1C> (PM_CMPLU_STALL_STORE)
      LSU Other <C1D: C1 - C1A - C1B - C1C> (PM_CMPLU_STALL_LSU_OTHER)
    Stall by FXU <C2> (PM_CMPLU_STALL_FXU)
      FXU Multi-Cycle Instruction <C2A> (PM_CMPLU_STALL_DIV)
      FXU Other <C2 - C2A> (PM_CMPLU_STALL_FXU_OTHER)
    Stall by VSU <C3: C3A + C3B + C3C> (PM_CMPLU_STALL_VSU)
      Stall by DFU <C3A> (PM_CMPLU_STALL_DFP)
      Stall by Vector <C3B> (PM_CMPLU_STALL_VECTOR)
        Stall by Vector Long <C3B1> (PM_CMPLU_STALL_VECTOR_LONG)
        Stall by Vector Other <C3B2: C3B - C3B1> (PM_CMPLU_STALL_VECTOR_OTHER)
      Stall by Scalar <C3C> (PM_CMPLU_STALL_SCALAR)
        Stall by Scalar Long <C3C1> (PM_CMPLU_STALL_SCALAR_LONG)
        Stall by Scalar Other <C3C2: C3C - C3C1> (PM_CMPLU_STALL_SCALAR_OTHER)
    Stall due to SMT <C4> (PM_CMPLU_STALL_THRD)
    Stall due to IFU <C5> (PM_CMPLU_STALL_IFU)
      Stall due to BRU <C5A> (PM_CMPLU_STALL_BRU)
      Other IFU Stall <C5B: C5 - C5A> (PM_CMPLU_STALL_IFU_OTHER)
    Other Stall <C6: C - C1 - C2 - C3 - C4 - C5> (PM_CMPLU_STALL_OTHER)
  GCT Empty Cycles <B> (PM_GCT_NOSLOT_CYC)
    GCT Empty due to Icache Miss <B1> (PM_GCT_NOSLOT_IC_MISS)
    GCT Empty due to Branch Mispredict <B2> (PM_GCT_NOSLOT_BR_MPRED)
    GCT Empty due to Branch Mispredict and Icache Miss <B3> (PM_GCT_NOSLOT_BR_MPRED_IC_MISS)
    GCT Empty Other <B - B1 - B2 - B3> (PM_GCT_EMPTY_OTHER)
  Completion Cycles (Group Completed) <A> (PM_GRP_CMPL)
    Base Completion Cycles <A1> (PM_1PLUS_PPC_CMPL)
    Overhead of expansion <A - A1>

CPI Stack for SPEC FP benchmarks

[Figure: "CPI Stack - ST". One stacked horizontal bar per benchmark workload (bwaves, gamess x3, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex x2, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3); CPI axis from 0.0 to 2.5. Stack components: FXU_STALL, VSU_SCALAR_STALL, VSU_VECTOR_STALL, VSU_DFU_STALL, LSU_REJECT_STALL, LSU_DCACHE_STALL, LSU_STORE_STALL, LSU_OTHERS_STALL, IFU_STALL, SMT_STALL, OTHER_CMPL_STALL, GCT_IC_MISS, GCT_BR_MP, GCT_BR_MP_IC_MISS, GCT_OTHERS, GRP_CMP.]

CPI stack for different benchmarks…

[Figure: four "CPI Stack" panels (bwaves, milc, calculix, povray), each with one stacked bar for ST, SMT2 and SMT4, using the same stack components as the ST chart; the horizontal axis is CPI.]

Benchmark Characteristics – Cache Distribution

[Figure: "Cache Distribution". Per-benchmark share of PM_DATA_FROM_L2, PM_DATA_FROM_L3 and PM_DATA_FROM_LMEM, on a 0% to 100% axis, for bwaves_0 through sphinx3_0.]

Benchmark Characteristics – Load miss rates

[Figure: "Load Miss Rates". One bar per benchmark workload, bwaves_0 through sphinx3_0; Load Miss Rate axis from 0.00 to 8.00.]

Benchmark characteristics – Instruction distribution

[Figure: "Instruction Distribution". Stacked bars per benchmark workload on a 0% to 100% axis; the extracted text covers bwaves_0 through namd_0 and does not preserve the legend.]
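
The SMT gain formula from the "Calculating SMT gains" slide can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: the function name `smt_gain_percent` and the dictionary layout are assumptions, and the CPI values are the two-decimal figures from the CPI table, so the results differ slightly from the gain table wherever that table was computed from unrounded values.

```python
def smt_gain_percent(threads: int, cpi_st: float, cpi_smt: float) -> float:
    """SMT Gain(%) = 100 * (x * CPI_ST / CPI_SMT) - 100, with x = number of SMT threads."""
    if cpi_smt <= 0:
        raise ValueError("cpi_smt must be positive")
    return 100.0 * (threads * cpi_st / cpi_smt) - 100.0


# Thread-level CPIs (ST, SMT2, SMT4) copied from the CPI table, rounded to two decimals.
cpi_table = {
    "gromacs_0": (0.65, 1.07, 1.89),
    "namd_0": (0.78, 1.14, 1.76),
    "soplex_0": (2.45, 5.54, 13.03),
}

for name, (st, smt2, smt4) in cpi_table.items():
    gain2 = smt_gain_percent(2, st, smt2)
    gain4 = smt_gain_percent(4, st, smt4)
    print(f"{name:10s} SMT2 gain {gain2:7.2f}%   SMT4 gain {gain4:7.2f}%")
```

A positive value means that the x threads together complete more instructions per cycle than the single thread did in ST mode; a negative value, as for soplex_0, means that running in SMT mode lowered aggregate throughput.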

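The derived ("other") entries in the CPI stack tree are differences between counter values, and each component becomes a CPI contribution once it is divided by the number of completed instructions. The sketch below illustrates that arithmetic for the sixteen components shown in the CPI stack charts. The function name `cpi_stack` is an assumption, the event counts are made-up numbers chosen only to exercise the code, and how the counters are actually collected is outside its scope.

```python
def cpi_stack(counts: dict, instructions: int) -> dict:
    """Return per-component CPI contributions from raw PMU event counts."""
    c = counts
    # LSU Other = C1 - C1A - C1B - C1C
    lsu_other = (c["PM_CMPLU_STALL_LSU"] - c["PM_CMPLU_STALL_REJECT"]
                 - c["PM_CMPLU_STALL_DCACHE_MISS"] - c["PM_CMPLU_STALL_STORE"])
    # Other Stall = C - C1 - C2 - C3 - C4 - C5
    other_stall = (c["PM_CMPLU_STALL"] - c["PM_CMPLU_STALL_LSU"]
                   - c["PM_CMPLU_STALL_FXU"] - c["PM_CMPLU_STALL_VSU"]
                   - c["PM_CMPLU_STALL_THRD"] - c["PM_CMPLU_STALL_IFU"])
    # GCT Empty Other = B - B1 - B2 - B3
    gct_other = (c["PM_GCT_NOSLOT_CYC"] - c["PM_GCT_NOSLOT_IC_MISS"]
                 - c["PM_GCT_NOSLOT_BR_MPRED"] - c["PM_GCT_NOSLOT_BR_MPRED_IC_MISS"])
    cycles = {
        "FXU_STALL": c["PM_CMPLU_STALL_FXU"],
        "VSU_SCALAR_STALL": c["PM_CMPLU_STALL_SCALAR"],
        "VSU_VECTOR_STALL": c["PM_CMPLU_STALL_VECTOR"],
        "VSU_DFU_STALL": c["PM_CMPLU_STALL_DFP"],
        "LSU_REJECT_STALL": c["PM_CMPLU_STALL_REJECT"],
        "LSU_DCACHE_STALL": c["PM_CMPLU_STALL_DCACHE_MISS"],
        "LSU_STORE_STALL": c["PM_CMPLU_STALL_STORE"],
        "LSU_OTHERS_STALL": lsu_other,
        "IFU_STALL": c["PM_CMPLU_STALL_IFU"],
        "SMT_STALL": c["PM_CMPLU_STALL_THRD"],
        "OTHER_CMPL_STALL": other_stall,
        "GCT_IC_MISS": c["PM_GCT_NOSLOT_IC_MISS"],
        "GCT_BR_MP": c["PM_GCT_NOSLOT_BR_MPRED"],
        "GCT_BR_MP_IC_MISS": c["PM_GCT_NOSLOT_BR_MPRED_IC_MISS"],
        "GCT_OTHERS": gct_other,
        "GRP_CMP": c["PM_GRP_CMPL"],
    }
    return {name: value / instructions for name, value in cycles.items()}


# Hypothetical counts (not measured data), chosen so the components add up to 1000 cycles.
sample_counts = {
    "PM_CMPLU_STALL": 600,
    "PM_CMPLU_STALL_LSU": 300, "PM_CMPLU_STALL_REJECT": 50,
    "PM_CMPLU_STALL_DCACHE_MISS": 180, "PM_CMPLU_STALL_STORE": 20,
    "PM_CMPLU_STALL_FXU": 40,
    "PM_CMPLU_STALL_VSU": 150, "PM_CMPLU_STALL_SCALAR": 100,
    "PM_CMPLU_STALL_VECTOR": 40, "PM_CMPLU_STALL_DFP": 10,
    "PM_CMPLU_STALL_THRD": 10, "PM_CMPLU_STALL_IFU": 30,
    "PM_GCT_NOSLOT_CYC": 100, "PM_GCT_NOSLOT_IC_MISS": 20,
    "PM_GCT_NOSLOT_BR_MPRED": 50, "PM_GCT_NOSLOT_BR_MPRED_IC_MISS": 5,
    "PM_GRP_CMPL": 300,
}

stack = cpi_stack(sample_counts, instructions=1000)
for name, cpi in stack.items():
    print(f"{name:18s} {cpi:.3f}")
print(f"{'TOTAL':18s} {sum(stack.values()):.3f}")
```

Because every derived entry is a remainder, the components within a group always add back up to the parent counter, which is what lets the chart stack them into a single bar per benchmark.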