SPECfp®2006 CPI Stack & SMT Benefits Analysis on POWER7

IBM Systems & Technology Group, IBM India
May 2011

SPECfp®2006 CPI Stack & SMT Benefits Analysis on POWER7 systems

Satish Kumar Sadasivam
Prathiba Kumar
IBM India Systems and Technology Labs, Bangalore
Email: [email protected], [email protected]

© 2011 IBM Corporation


POWER7 Processor Chip

Cores: 8 (4 / 6 core options)
567 mm² Technology:
– 45nm lithography, Cu, SOI, eDRAM
Transistors: 1.2 B
– Equivalent function of 2.7B
– eDRAM efficiency
Eight processor cores
– 12 execution units per core
– 4-way SMT per core: up to 4 threads per core
– 32 threads per chip
– L1: 32 KB I Cache / 32 KB D Cache
– L2: 256 KB per core
– L3: Shared 32MB on chip eDRAM
Dual DDR3 Memory Controllers
– 100 GB/s memory bandwidth per chip
Scalability up to 32 Sockets
– 360 GB/s SMP bandwidth/chip
– 20,000 coherent operations in flight
Binary Compatibility with POWER6

[Figure: POWER7 die layout. Eight POWER7 cores, each with its own L2 cache, surround the fast L3 region (L3 cache and chip interconnect); memory controllers MC0 and MC1, local SMP links, and remote SMP & I/O links sit at the chip edges.]


POWER7 PMU

POWER7 has powerful performance instrumentation on the chip, forming the Performance Monitoring Unit (PMU).
Eight chip-level, self-contained PMUlets are distributed across the chip.

PMU Events:
– PMU events span a wide spectrum of events
– They allow monitoring of program characteristics (types of instructions, etc.) and of architectural and microarchitectural behaviour (FP overflow, exceptions, reorder queue full, etc.)

Performance Monitor Counters (PMCs):
– Each thread has 6 Performance Monitor Counters (PMCs).
– Each PMC is 32 bits wide.
– By connecting adjacent PMCs, PMC1-4 can be used as a 32 x N bit counter (N = 1 to 4).
– Four 64-bit core-level counters: 2 programmable and 2 non-programmable

Programming the PMCs:
– PMC1-4 are programmable.
– PMC5 is a dedicated counter for Run Instructions (finished PowerPC instructions gated by the run latch).
– PMC6 is a dedicated counter for Run Cycles (cycles gated by the run latch).


SPECfp®_rate2006 Results

[Figure: horizontal bar chart of SPECfp_rate2006 results in ST, SMT2 and SMT4 modes, with one group of bars per benchmark (410.bwaves, 416.gamess, 433.milc, 434.zeusmp, 435.gromacs, 436.cactusADM, 437.leslie3d, 444.namd, 447.dealII, 450.soplex, 453.povray, 454.calculix, 459.GemsFDTD, 465.tonto, 470.lbm, 481.wrf, 482.sphinx3) plus the overall SPECfp_rate2006 score; axis from 0 to 600.]

SPECint, SPECfp and SPECrate are trademarks of the Standard Performance Evaluation Corp (SPEC).
Measurements are on a single chip experimental POWER7 system.


Benchmarks' CPI (Thread level) in each SMT mode

Benchmark   Workload   ST CPI   SMT2 CPI   SMT4 CPI
bwaves      0          1.44     2.92       5.94
gamess      0          0.70     1.08       1.93
gamess      1          0.58     0.92       1.66
gamess      2          0.58     0.91       1.61
milc        0          2.03     4.19       8.55
zeusmp      0          0.64     1.15       2.27
gromacs     0          0.65     1.07       1.89
cactusADM   0          0.96     1.76       3.53
leslie3d    0          1.14     2.46       5.07
namd        0          0.78     1.14       1.76
dealII      0          0.93     1.34       2.31
soplex      0          2.45     5.54       13.03
soplex      1          2.07     3.81       7.39
povray      0          0.85     1.25       2.11
calculix    0          0.56     0.92       1.67
GemsFDTD    0          2.11     4.32       8.66
tonto       0          0.72     1.17       2.32
lbm         0          1.17     2.30       4.69
wrf         0          0.65     1.15       2.29
sphinx3     0          1.08     2.32       4.54


Calculating SMT gains

SMT gains are calculated in terms of CPI.

SMT Gain(%) Formula:
  SMT Gain(%) = 100 * (x * CPI_ST / CPI_SMT) - 100
where,
  x is the number of SMT threads
  CPI_ST = (cycles_ST / instructions_ST)
  CPI_SMT = (cycles_SMT / instructions_SMT)

Example: gromacs
  ST CPI = 0.65; SMT4 CPI = 1.89
  SMT Gain(%) = 100 * (4 * 0.65 / 1.89) - 100 = 36.87%
  (The CPI values shown are rounded to two decimals; evaluated with exactly these rounded values the expression gives about 37.57%, so the 36.87% in the table presumably comes from the unrounded CPIs.)

Benchmark   Workload   SMT2 % Gain   SMT4 % Gain
bwaves      0          -0.97         -2.70
gamess      0          28.56         43.87
gamess      1          25.99         39.88
gamess      2          27.09         43.18
milc        0          -3.01         -4.92
zeusmp      0          12.19         13.40
gromacs     0          20.61         36.87
cactusADM   0          8.54          8.36
leslie3d    0          -7.47         -10.05
namd        0          37.22         78.54
dealII      0          38.59         60.90
soplex      0          -11.46        -24.79
soplex      1          8.81          12.27
povray      0          36.42         61.59
calculix    0          22.14         34.30
GemsFDTD    0          -2.37         -2.54
tonto       0          23.67         24.93
lbm         0          1.71          -0.29
wrf         0          13.07         13.18
sphinx3     0          -6.52         -4.43


SMT Gains ordering

[Figure: two bar charts, "SMT2 Gain" and "SMT4 Gain", showing the workloads sorted in descending order of gain, from dealII_0 / namd_0 at the top down to soplex_0 at the bottom.]


SMT Gain categories

Benchmark SMT Gain Categories:
– High Gain
– Medium Gain
– Minimal Gain/No Loss
– Loss

[Table: the slide repeats the SMT2 / SMT4 % gain table from "Calculating SMT gains".]


CPI Stack for POWER7

– Provides the breakdown of average cycles per instruction in terms of CPU resource usage
– A valuable way of understanding the performance aspects of a program's execution

Stall components (PMU event names in parentheses):

Stall by FXU <C2> (PM_CMPLU_STALL_FXU)
    FXU Multi-Cycle Instruction <C2A> (PM_CMPLU_STALL_DIV)
    FXU Other <C2 - C2A> (PM_CMPLU_STALL_FXU_OTHER)
Stall by VSU <C3> (C3A + C3B + C3C) (PM_CMPLU_STALL_VSU)
    Stall by Scalar <C3C> (PM_CMPLU_STALL_SCALAR)
        Stall by Scalar Long <C3C1> (PM_CMPLU_STALL_SCALAR_LONG)
        Stall by Scalar Other <C3C2: C3C - C3C1> (PM_CMPLU_STALL_SCALAR_OTHER)
    Stall by Vector <C3B> (PM_CMPLU_STALL_VECTOR)
        Stall by Vector Long <C3B1> (PM_CMPLU_STALL_VECTOR_LONG)
        Stall by Vector Other <C3B2: C3B - C3B1> (PM_CMPLU_STALL_VECTOR_OTHER)
    Stall by DFU <C3A> (PM_CMPLU_STALL_DFP)
Cycles (PM_RUN_CYC)
    Completion Stall Cycles <C> (PM_CMPLU_STALL), which also includes the FXU <C2> and VSU <C3> stalls
        Stall by LSU <C1> (PM_CMPLU_STALL_LSU)
            Stall by Reject <C1A> (PM_CMPLU_STALL_REJECT)
                Translation Stall <C1A1> (PM_CMPLU_STALL_ERAT_MISS)
                Other Reject <C1A2: C1A - C1A1> (PM_CMPLU_STALL_REJECT_OTHER)
            Stall by D-Cache Miss <C1B> (PM_CMPLU_STALL_DCACHE_MISS)
            Stall by Store <C1C> (PM_CMPLU_STALL_STORE)
            LSU Other <C1D: C1 - C1A - C1B - C1C> (PM_CMPLU_STALL_LSU_OTHER)
        Stall due to SMT <C4> (PM_CMPLU_STALL_THRD)
        Stall due to IFU <C5> (PM_CMPLU_STALL_IFU)
            Stall due to BRU <C5A> (PM_CMPLU_STALL_BRU)
            Other IFU Stall <C5B: C5 - C5A> (PM_CMPLU_STALL_IFU_OTHER)
        Other Stall <C6: C - C1 - C2 - C3 - C4 - C5> (PM_CMPLU_STALL_OTHER)
    GCT Empty Cycles <B> (PM_GCT_NOSLOT_CYC)
        GCT Empty due to Icache Miss <B1> (PM_GCT_NOSLOT_IC_MISS)
        GCT Empty due to Branch Mispredict <B2> (PM_GCT_NOSLOT_BR_MPRED)
        GCT Empty due to Branch Mispredict and Icache Miss <B3> (PM_GCT_NOSLOT_BR_MPRED_IC_MISS)
        GCT Empty Other <B - B1 - B2 - B3> (PM_GCT_EMPTY_OTHER)
    Completion Cycles (Group Completed) <A> (PM_GRP_CMPL)
        Base Completion Cycles <A1> (PM_1PLUS_PPC_CMPL)
        Overhead of expansion <A - A1>


CPI Stack for SPEC FP benchmarks

[Figure: "CPI Stack - ST". Horizontal stacked bars of CPI (axis 0.0 to 2.5), one per workload from bwaves to sphinx3, split into the components FXU STALL, VSU_SCALAR_STALL, VSU_VECTOR_STALL, VSU_DFU_STALL, LSU_REJECT_STALL, LSU_DCACHE_STALL, LSU_STORE_STALL, LSU_OTHERS_STALL, IFU_STALL, SMT_STALL, OTHER_CMPL_STALL, GCT_IC_MISS, GCT_BR_MP, GCT_BR_MP_IC_MISS, GCT_OTHERS and GRP_CMP.]


CPI stack for different benchmarks…

[Figure: four panels (bwaves, milc, calculix, povray), each showing the CPI stack in ST, SMT2 and SMT4 modes with the same components as the ST CPI stack chart.]


Benchmark Characteristics – Cache Distribution

[Figure: "Cache Distribution". Stacked bars (0% to 100%), one per workload from bwaves_0 to sphinx3_0, showing the share of PM_DATA_FROM_L2, PM_DATA_FROM_L3 and PM_DATA_FROM_LMEM.]


Benchmark Characteristics – Load miss rates

[Figure: "Load Miss Rates". Horizontal bars of load miss rate (axis 0.00 to 8.00), one per workload from bwaves_0 to sphinx3_0.]


Benchmark characteristics – Instruction distribution

[Figure: "Instruction Distribution". Stacked bars (0% to 100%) per workload: bwaves_0, gamess_0, gamess_1, gamess_2, milc_0, zeusmp_0, gromacs_0, cactusADM_0, leslie3d_0, namd_0]
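
The SMT gain formula from the "Calculating SMT gains" slide can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the function name `smt_gain_percent` and its parameter names are assumptions, and the inputs are the rounded gromacs CPIs from the slides.

```python
def smt_gain_percent(cpi_st: float, cpi_smt: float, threads: int) -> float:
    """SMT Gain(%) = 100 * (x * CPI_ST / CPI_SMT) - 100, with x = threads.

    cpi_st  : thread-level CPI in single-thread (ST) mode
    cpi_smt : thread-level CPI in the SMT mode being compared
    threads : number of SMT threads (2 for SMT2, 4 for SMT4)
    """
    if cpi_st <= 0 or cpi_smt <= 0 or threads < 1:
        raise ValueError("CPIs must be positive and threads must be >= 1")
    return 100.0 * (threads * cpi_st / cpi_smt) - 100.0


# gromacs, using the rounded CPI values from the slides (ST 0.65, SMT4 1.89).
gromacs_smt4 = smt_gain_percent(0.65, 1.89, threads=4)
print(f"gromacs SMT4 gain: {gromacs_smt4:.2f}%")  # prints 37.57%
```

With the rounded CPIs the result is about 37.57%, slightly above the 36.87% quoted on the slide, which presumably was computed from unrounded counter data. A gain of 0% means the SMT mode delivers the same core throughput as ST; a negative value means throughput dropped.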

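Several CPI stack components on the "CPI Stack for POWER7" slide are obtained by subtraction rather than read directly (for example FXU Other = C2 - C2A). The sketch below shows that arithmetic under stated assumptions: the event names are the ones on the slide, but the helper names and all counter values are made up purely for illustration and are not measurements.

```python
from typing import Dict


def derived_stall_cycles(counts: Dict[str, int]) -> Dict[str, int]:
    """Compute the 'other' stall components by subtraction, as on the slide."""
    c = counts
    return {
        # FXU Other = C2 - C2A
        "FXU_OTHER": c["PM_CMPLU_STALL_FXU"] - c["PM_CMPLU_STALL_DIV"],
        # LSU Other = C1 - C1A - C1B - C1C
        "LSU_OTHER": (c["PM_CMPLU_STALL_LSU"]
                      - c["PM_CMPLU_STALL_REJECT"]
                      - c["PM_CMPLU_STALL_DCACHE_MISS"]
                      - c["PM_CMPLU_STALL_STORE"]),
        # Other IFU Stall = C5 - C5A
        "IFU_OTHER": c["PM_CMPLU_STALL_IFU"] - c["PM_CMPLU_STALL_BRU"],
        # Other Stall = C - C1 - C2 - C3 - C4 - C5
        "STALL_OTHER": (c["PM_CMPLU_STALL"]
                        - c["PM_CMPLU_STALL_LSU"]
                        - c["PM_CMPLU_STALL_FXU"]
                        - c["PM_CMPLU_STALL_VSU"]
                        - c["PM_CMPLU_STALL_THRD"]
                        - c["PM_CMPLU_STALL_IFU"]),
    }


def cpi_contribution(cycles: int, instructions: int) -> float:
    """Convert a cycle count into its share of CPI (cycles per instruction)."""
    return cycles / instructions


# Made-up counter values, for illustration only.
example_counts = {
    "PM_CMPLU_STALL": 700,
    "PM_CMPLU_STALL_LSU": 300,
    "PM_CMPLU_STALL_REJECT": 50,
    "PM_CMPLU_STALL_DCACHE_MISS": 150,
    "PM_CMPLU_STALL_STORE": 40,
    "PM_CMPLU_STALL_FXU": 100,
    "PM_CMPLU_STALL_DIV": 30,
    "PM_CMPLU_STALL_VSU": 200,
    "PM_CMPLU_STALL_THRD": 20,
    "PM_CMPLU_STALL_IFU": 50,
    "PM_CMPLU_STALL_BRU": 35,
}
example_instructions = 1000  # hypothetical completed-instruction count

derived = derived_stall_cycles(example_counts)
for name, cycles in derived.items():
    print(name, cycles, cpi_contribution(cycles, example_instructions))
```

Dividing each component by the same instruction count keeps the pieces additive, so the stacked components sum to the overall CPI.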