PERFORMANCE COUNTERS AND TOOLS OPENPOWER TUTORIAL, SC18, DALLAS

12 November 2018 Andreas Herten Forschungszentrum Jülich

Outline

Motivation
Performance Counters: Introduction, General Description, Counters on POWER9
Measuring Counters: perf, PAPI, GPUs
Conclusion

Goals of this session: get to know performance counters; measure counters on POWER9 → hands-on; additional material in appendix

Knuth

[…] premature optimization is the root of all evil. Yet we should not pass up our [optimization] opportunities […] – Donald Knuth

Full quote in appendix

Optimization and Measurement: Making educated decisions

Only optimize code after measuring its performance. Measure! Don't trust your gut!
Objectives: run time, cycles, operations per cycle (FLOP/s), usage of architecture features ($, (S)MT, SIMD, …)
Correlate measurements with code: measuring → hot spots / performance limiters
Iterative process of programming and measuring

Measurement: Two options for insight
Coarse: software timestamps to time the program / parts of the program; only good for a first glimpse, no insight into the inner workings
Detailed: performance counters to study the usage of the hardware architecture
Native: instructions, cycles, stalled cycles, branches, floating point operations, cache misses, cache hits, prefetches, flushes, CPU migrations, …
Derived: CPI, IPC

Performance Counters

Performance Monitoring Unit: Right next to the core

Part of the processor periphery, but with dedicated registers
History: first occurrence in the Intel Pentium, reverse-engineered 1994 (RDPMC) [2]; originally for chip developers, later embraced by software developers and tuners
Operation: certain events are counted at logic level, then aggregated into registers

Pros: low overhead; very specific requests possible, detailed information; information about CPU core, nest, cache, memory
Cons: processor-specific; hard to measure; limited amount of counter registers; compressed information content

Working with Performance Counters: Some caveats

Mind the clock rates! Modern processors have dynamic clock rates (CPUs, GPUs) → might skew results; some counters might not run at nominal clock rate
Limited counter registers; POWER9: 6 registers for hardware counters (PMC1 to PMC6) [3]
Cores, threads (OpenMP): absolutely possible, but complicates things slightly; pinning necessary → OMP_PROC_BIND, OMP_PLACES (sketch below); PAPI_thread_init()
Nodes (MPI): counters independent of MPI, but an aggregation tool is useful (Score-P, …)
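A minimal sketch of pinning an OpenMP run before measuring (thread count and application name are placeholders):

$ export OMP_NUM_THREADS=4
$ export OMP_PROC_BIND=close
$ export OMP_PLACES=cores
$ perf stat ./app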

Performance Counters on POWER9

POWER9 Compartments: Sources of PMU events

Core-Level: core / thread level; core pipeline analysis (frontend, branch prediction, execution units); behavior investigation (stalls, utilization, …)
Nest-Level: L3 cache, interconnect fabric, memory channels; analysis of main memory access, bandwidth, …

POWER9 Performance Counters

[Figure: CPI stack of PM_* hardware counters (POWER8 naming). PM_RUN_CYC (total run cycles, gated by the run latch) is decomposed into stalled cycles (PM_CMPLU_STALL, broken down by unit: BRU/CR, FXU, VSU vector/scalar, LSU rejects and cache/memory misses, thread blocked, LWSYNC/HWSYNC, ECC delay, …), cycles with nothing to dispatch (PM_GCT_NOSLOT_CYC: I-cache miss, branch mispredict, dispatch held), cycles waiting for the group to complete (PM_NTCG_ALL_FIN), and completed run instructions (PM_RUN_INST_CMPL). Related counters: PM_INST_CMPL (instructions completed), PM_LD_MISS_L1 / PM_ST_MISS_L1 (load/store missed L1 cache), PM_DATA_FROM_LL4 (local L4 hit), PM_VECTOR_FLOP_CMPL (vector FP instruction completed), PM_2FLOP_CMPL.]

Glossary: BRU Branching Unit, CR Condition Register, FXU Fixed-Point Unit, VSU Vector-Scalar Unit, LSU Load-Store Unit, LMQ Load Miss Queue, ERAT Effective to Real Address Translation, LWSYNC Lightweight Synchronize, HWSYNC Heavyweight Synchronize, ECC Error Correcting Code

Number of counters for POWER9: 959
See appendix for more on counters (CPI stack; resources)

Measuring Counters

Overview

perf: Linux's own tool (also called perf_events)
PAPI: C/C++ API
Score-P: measurement environment (appendix)
Likwid: set of command line utilities for detailed analysis
perf_event_open(): Linux system call from linux/perf_event.h (minimal sketch below)
… and many more solutions, usually relying on perf
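For completeness, a minimal sketch of using the raw perf_event_open() system call directly (counting retired instructions of the calling process; generic, not POWER-specific; error handling kept short):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

/* thin wrapper, since glibc provides no perf_event_open() symbol */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long count;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* count retired instructions */
    attr.disabled = 1;                          /* start disabled ... */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);        /* ... and enable around the region */

    volatile double x = 0.0;                    /* placeholder for the work to measure */
    for (int i = 0; i < 1000000; ++i) x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));            /* read the counter value */
    printf("instructions: %lld\n", count);
    close(fd);
    return 0;
}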

perf: Linux's own performance tool

Part of Linux kernel since 2009 (v. 2.6.31) Example usage: perf stat ./app

$ perf stat ./poisson2d

Performance counter stats for './poisson2d':

     65703.208586  task-clock (msec)         #    1.000 CPUs utilized
              355  context-switches          #    0.005 K/sec
                0  cpu-migrations            #    0.000 K/sec
           10,847  page-faults               #    0.165 K/sec
  228,425,964,399  cycles                    #    3.477 GHz                     (66.67%)
      299,409,593  stalled-cycles-frontend   #    0.13% frontend cycles idle    (50.01%)
  147,289,312,280  stalled-cycles-backend    #   64.48% backend cycles idle     (50.01%)
  323,403,983,324  instructions              #    1.42  insn per cycle
                                             #    0.46  stalled cycles per insn (66.68%)
   12,665,027,391  branches                  #  192.761 M/sec                   (50.00%)
        4,256,513  branch-misses             #    0.03% of all branches         (50.00%)

65.715156815 seconds time elapsed

perf: Linux's own performance tool

Part of Linux kernel since 2009 (v. 2.6.31) Usage: perf stat ./app Raw counter example: perf stat -e r600f4 ./app

$ perf stat -e r600f4 ./poisson2d

Performance counter stats for './poisson2d':

228,457,525,677 r600f4

65.761947405 seconds time elapsed

perf: Linux's own performance tool

Part of Linux kernel since 2009 (v. 2.6.31) Usage: perf stat ./app Raw counter example: perf stat -e r600f4 ./app More in appendix
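perf stat also accepts a comma-separated list of events, mixing generic and raw names; for example (a sketch, reusing the raw code from above):

$ perf stat -e instructions,cycles,r600f4 ./poisson2d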

PAPI: Measure where it hurts…

Performance Application Programming Interface: API for C/C++, Fortran
Goal: create a common (and easy) interface to performance counters
Two API layers (examples in appendix!)
High-Level API    Most commonly needed information, wrapped in convenient functions
Low-Level API     Access all the counters!
Command line utilities
papi_avail           List aliased, common counters; use papi_avail -e EVENT to get description and options for EVENT
papi_native_avail    List all possible counters, with details
Extendable by Component PAPI (GPU!)
Comparison to perf: instrument specific parts of the code, with different counters

PAPI: papi_avail

$ papi_avail
Available PAPI preset and user defined events plus hardware information.
--------------------------------------------------------------------------------
PAPI version             : 5.6.0.0
Operating system         : Linux 4.14.0-49.11.1.el7a.ppc64le
Vendor string and code   : IBM (3, 0x3)
Model string and code    : 8335-GTW (0, 0x0)
CPU revision             : 2.000000
CPU Max MHz              : 3800
CPU Min MHz              : 2300
Total cores              : 176
SMT threads per core     : 4
Cores per socket         : 22
Sockets                  : 2
Cores per NUMA region    : 176
NUMA regions             : 1
Number Hardware Counters : 5

PAPI: papi_avail

Max Multiplex Counters   : 384
Fast counter read (rdpmc): no
--------------------------------------------------------------------------------

================================ PAPI Preset Events ================================
Name         Code        Avail  Deriv  Description (Note)
PAPI_L1_DCM  0x80000000  Yes    Yes    Level 1 data cache misses
PAPI_L1_ICM  0x80000001  Yes    No     Level 1 instruction cache misses
PAPI_L2_DCM  0x80000002  Yes    No     Level 2 data cache misses
PAPI_L2_ICM  0x80000003  Yes    No     Level 2 instruction cache misses
PAPI_L3_DCM  0x80000004  Yes    Yes    Level 3 data cache misses
PAPI_L3_ICM  0x80000005  Yes    No     Level 3 instruction cache misses
PAPI_L1_TCM  0x80000006  No     No     Level 1 cache misses
PAPI_L2_TCM  0x80000007  No     No     Level 2 cache misses
PAPI_L3_TCM  0x80000008  No     No     Level 3 cache misses

PAPI: papi_avail

$ papi_avail -e PM_DATA_FROM_L3MISS
Available PAPI preset and user defined events plus hardware information.
--------------------------------------------------------------------------------
Event name:                PM_DATA_FROM_L3MISS
Event Code:                0x40000011
Number of Register Values: 0
Description:               |Demand LD - L3 Miss (not L2 hit and not L3 hit).|

Unit Masks:
Mask Info: |:u=0|monitor at user level|
Mask Info: |:k=0|monitor at kernel level|
Mask Info: |:h=0|monitor at hypervisor level|
Mask Info: |:period=0|sampling period|
Mask Info: |:freq=0|sampling frequency (Hz)|
Mask Info: |:excl=0|exclusive access|
Mask Info: |:mg=0|monitor guest execution|

PAPI: Notes on usage; tips

Important functions in the High-Level API (minimal sketch below)
PAPI_num_counters()    Number of available counter registers
PAPI_flops()           Real time, processor time, number of floating point operations, and MFLOP/s
PAPI_ipc()             Number of instructions and IPC (plus real/processor time)
PAPI_epc()             Counts of an arbitrary event (plus real/processor time)
Important functions in the Low-Level API
PAPI_add_event()        Add an aliased event to an event set
PAPI_add_named_event()  Add any event to an event set
PAPI_thread_init()      Initialize thread support in PAPI
Documentation online and in man pages (man papi_add_event)
All PAPI calls return a status code; check it! (Macros in appendix: C++, C)
Convert names of performance counters with libpfm4 (appendix)

→ http://icl.cs.utk.edu/papi/
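A minimal sketch of the High-Level API, assuming a PAPI 5.x installation (the measured loop is just a placeholder):

#include <stdio.h>
#include "papi.h"

int main(void)
{
    float rtime, ptime, ipc;
    long long ins;

    if (PAPI_num_counters() < 1) {                 /* hardware counters available? */
        fprintf(stderr, "No hardware counters available\n");
        return 1;
    }

    PAPI_ipc(&rtime, &ptime, &ins, &ipc);          /* first call starts the counters */

    volatile double x = 0.0;                       /* placeholder: region of interest */
    for (int i = 0; i < 10000000; ++i) x += i * 0.5;

    if (PAPI_ipc(&rtime, &ptime, &ins, &ipc) != PAPI_OK) {   /* second call reads them */
        fprintf(stderr, "PAPI_ipc failed\n");
        return 1;
    }
    printf("instructions: %lld, IPC: %.2f\n", ins, ipc);
    return 0;
}

Compile and link against PAPI, e.g. gcc -O2 example.c -lpapi (include/library paths may need -I/-L on your system).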

GPU Counters: A glimpse ahead

Counters built right in, grouped into domains by topic
NVIDIA differentiates between (more examples in appendix):
Event    Countable activity or occurrence on the GPU device. Examples: shared_store, generic_load, shared_atom
Metric   Characteristic calculated from one or more events. Examples: executed_ipc, flop_count_dp_fma, achieved_occupancy
Usually: access via nvprof / Visual Profiler, but exposed via CUPTI for 3rd-party tools
→ Afternoon session / appendix
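As a glimpse of the command line side, the metrics named above can be requested directly (a sketch; the application name is a placeholder, metric names are listed by nvprof --query-metrics):

$ nvprof --metrics executed_ipc,achieved_occupancy,flop_count_dp_fma ./app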

Conclusions: What we've learned

Large set of performance counters on POWER9 processors, right next to (inside) the core(s)
They provide detailed insight for performance analysis on many levels
Different measurement strategies and tools: perf, PAPI, Score-P; also on GPU

Thank you for your attention! [email protected]

Appendix
Knuth on Optimization
POWER9 Performance Counters
perf
PAPI Supplementary
Score-P
GPU Counters
Glossary
References

Appendix: Knuth on Optimization

Knuth on Optimization: The full quote, finally

There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97 % of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. – Donald Knuth in "Structured Programming with Go to Statements" [4]

Appendix: POWER9 Performance Counters

POWER Performance Counters

Further information on counters:
At the IBM website
PMU Events for POWER9 in the Linux kernel
JSON overview of OpenPOWER PMU events on GitHub
Events and groups supported on the POWER8 architecture
Derived metrics defined for the POWER8 architecture
Table 11-18 and Table D-1 of the POWER8 Processor User's Manual for the Single-Chip Module
OProfile: ppc64 POWER8 events, ppc64 POWER9 events
List available counters on the system (filtering example below):
With PAPI: papi_native_avail
With showevtinfo from libpfm's examples/ directory:
$ ./showevtinfo | grep -e "Name" -e "Desc" | sed "s/^.\+: //g" | paste -d'\t' - -
See next slide for the CPI stack visualization
Most important counters for OpenMP: PM_CMPLU_STALL_DMISS_L3MISS, PM_CMPLU_STALL_DMISS_REMOTE
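To locate specific counters quickly, the listing can simply be filtered, e.g. (a sketch):

$ papi_native_avail | grep PM_CMPLU_STALL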

POWER8 PMU Stack: CPI

[Figure: POWER8 CPI stack, the same counter hierarchy as shown in the main part (PM_RUN_CYC decomposed into stalled cycles, nothing-to-dispatch cycles, and completed instructions). Note: not all of these counters are available for POWER9!]

Appendix: perf

perf: Sub-commands

Sub-commands for perf:
perf list     List available counters (filtering example below)
perf stat     Run program; report performance data
perf record   Run program; sample and save performance data
perf report   Analyze saved performance data (appendix)
perf top      Like top, a live view of counters
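For example, the counters of the CPI stack can be found by filtering the listing (a sketch; event names depend on the kernel version):

$ perf list | grep -i pm_cmplu_stall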

perf: Tips and tricks

Option --repeat for statistical measurements: 1.239 seconds time elapsed ( +- 0.16% )
Restrict counters to certain privilege levels with -e counter:m, where m = u (user), k (kernel), or h (hypervisor)
perf modes: per-thread (default), per-process (-p PID), per-CPU (-a)
Other options (combined example below):

-d       More details
-d -d    Even more details
-B       Add thousands' delimiters
-x       Print machine-readable output
More info: web.eece.maine.edu/~vweaver/projects/perf_events/ and Brendan Gregg's examples on perf usage

→ https://perf.wiki.kernel.org/
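Combining the options above, a statistically repeated, user-level-only, machine-readable measurement might look like this (a sketch; the application is a placeholder):

$ perf stat --repeat 5 -e r600f4:u -d -x, ./app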

Deeper Analysis with perf. perf report: Zoom to main()

Usage: perf record ./app

$ perf record ./poisson2d
[ perf record: Woken up 41 times to write data ]
[nf_conntrack_ipv4] with build id ada66fe00acc82eac85be0969a935e3167b09c88 not found, continuing without symbols
[nf_conntrack] with build id 2911e97a3bde3302788e8388d1e3c19408ad86cf not found, continuing without symbols
[ebtables] with build id b0aa834b86d596edeb5a72d1ebf3936a98b17bcf not found, continuing without symbols
[ip_tables] with build id 23fe04e7292b66a2cc104e8c5b026b4b3a911cac not found, continuing without symbols
[bridge] with build id b7a0fcdbca63084c22e04fcf32e0584d04193954 not found, continuing without symbols
[ perf record: Captured and wrote 10.076 MB perf.data (263882 samples) ]

$ ll perf.data
-rw------- 1 aherten zam 10570296 Aug 26 19:24 perf.data

Deeper Analysis with perf. perf report: Zoom to main()

Samples: 263K of event 'cycles:ppp', Event count (approx.): 228605603717, Thread: poisson2d
Overhead  Command    Shared Object      Symbol
 93.00%   poisson2d  poisson2d          [.] main
  4.70%   poisson2d  libm-2.17.so       [.] __fmaxf
  1.84%   poisson2d  poisson2d          [.] 00000017.plt_call.fmax@@GLIBC_2.17
  0.21%   poisson2d  libm-2.17.so       [.] __exp_finite
  0.01%   poisson2d  [kernel.kallsyms]  [k] hrtimer_interrupt
  0.01%   poisson2d  [kernel.kallsyms]  [k] update_wall_time
  0.01%   poisson2d  libm-2.17.so       [.] __GI___exp
  0.01%   poisson2d  [kernel.kallsyms]  [k] task_tick_fair
  0.01%   poisson2d  [kernel.kallsyms]  [k] rcu_check_callbacks
  0.01%   poisson2d  [kernel.kallsyms]  [k] __hrtimer_run_queues
  0.01%   poisson2d  [kernel.kallsyms]  [k] __do_softirq
  0.01%   poisson2d  [kernel.kallsyms]  [k] _raw_spin_lock
  0.01%   poisson2d  [kernel.kallsyms]  [k] timer_interrupt
  0.01%   poisson2d  [kernel.kallsyms]  [k] update_process_times
  0.01%   poisson2d  [kernel.kallsyms]  [k] tick_sched_timer
  0.01%   poisson2d  [kernel.kallsyms]  [k] rcu_process_callbacks
  0.01%   poisson2d  poisson2d          [.] 00000017.plt_call.exp@@GLIBC_2.17
  0.01%   poisson2d  [kernel.kallsyms]  [k] ktime_get_update_offsets_now
  0.01%   poisson2d  [kernel.kallsyms]  [k] account_process_tick
  0.01%   poisson2d  [kernel.kallsyms]  [k] run_posix_cpu_timers
  0.00%   poisson2d  [kernel.kallsyms]  [k] trigger_load_balance
  0.00%   poisson2d  [kernel.kallsyms]  [k] scheduler_tick
  0.00%   poisson2d  [kernel.kallsyms]  [k] clear_user_page
  0.00%   poisson2d  [kernel.kallsyms]  [k] update_cfs_shares
  0.00%   poisson2d  [kernel.kallsyms]  [k] tick_do_update_jiffies64

Deeper Analysis with perf. perf report: Zoom to main()

main /gpfs/homeb/zam/aherten/NVAL/OtherProgramming/OpenPOWER-SC17/PAPI-Test/poisson2d
 0.00 │ lwz    r9,100(r31)
      │ mullw  r9,r10,r9
 1.01 │ extsw  r9,r9
      │ lwz    r10,140(r31)
 0.00 │ add    r9,r10,r9
      │ extsw  r9,r9
 0.00 │ rldicr r9,r9,3,60
 0.00 │ ld     r10,184(r31)
14.28 │ add    r9,r10,r9
 0.00 │ lfd    f12,0(r9)
 0.00 │ lwz    r10,136(r31)
 0.00 │ lwz    r9,100(r31)
 0.00 │ mullw  r9,r10,r9
      │ extsw  r9,r9
 1.32 │ lwz    r10,140(r31)
 0.00 │ add    r9,r10,r9
 0.00 │ extsw  r9,r9
      │ rldicr r9,r9,3,60
 0.00 │ ld     r10,168(r31)
      │ add    r9,r10,r9
22.54 │ lfd    f0,0(r9)
 0.01 │ fsub   f0,f12,f0
 0.00 │ fabs   f0,f0
 0.00 │ fmr    f2,f0
      │ lfd    f1,128(r31)
      │ bl     10000780 <00000017.plt_call.fmax@@GLIBC_2.17>
 1.03 │ ld     r2,24(r1)
Press 'h' for help on key bindings

Appendix: PAPI Supplementary

PAPI: High-Level API Usage: Source Code

// Setup
float realTime, procTime, mflops, ipc;
long long flpins, ins;

// Initial call
PAPI_flops(&realTime, &procTime, &flpins, &mflops);
PAPI_ipc(&realTime, &procTime, &ins, &ipc);

// Compute
mult(m, n, p, A, B, C);

// Finalize call
PAPI_flops(&realTime, &procTime, &flpins, &mflops);
PAPI_ipc(&realTime, &procTime, &ins, &ipc);

PAPI: Low-Level API Usage: Source Code

int EventSet = PAPI_NULL;
long long values[2];

// PAPI: Setup
PAPI_library_init(PAPI_VER_CURRENT);
PAPI_create_eventset(&EventSet);

// PAPI: Test availability of counters
PAPI_query_named_event("PM_CMPLU_STALL_VSU");
PAPI_query_named_event("PM_CMPLU_STALL_SCALAR");

// PAPI: Add counters
PAPI_add_named_event(EventSet, "PM_CMPLU_STALL_VSU");
PAPI_add_named_event(EventSet, "PM_CMPLU_STALL_SCALAR");

// PAPI: Start collection
PAPI_start(EventSet);

// Compute
do_something();

// PAPI: End collection
PAPI_CALL( PAPI_stop(EventSet, values), PAPI_OK );

// PAPI_CALL is a pre-processor macro for checking results — see next slides!

PAPI Error Macro: C++ (for easier status code checking)

#include "papi.h" #define PAPI_CALL( call, success ) \ { \ int err = call; \ if ( success != err) \ std::cerr << "PAPI error for " << #call << " in L" << __LINE__ << " of " << ,→ __FILE__ << ": " << PAPI_strerror(err) << std::endl; \ } // Second argument is code for GOOD, // e.g. PAPI_OK or PAPI_VER_CURRENT or … // … // Call like: PAPI_CALL( PAPI_start(EventSet), PAPI_OK );

PAPI Error Macro: C (for easier status code checking)

#include "papi.h" #define PAPI_CALL( call, success ) \ { \ int err = call; \ if ( success != err) \ fprintf(stderr, "PAPI error for %s in L%d of %s: %s\n", #call, __LINE__, ,→ __FILE__, PAPI_strerror(err)); \ } // Second argument is code for GOOD, // e.g. PAPI_OK or PAPI_VER_CURRENT or … // … // Call like: PAPI_CALL( PAPI_start(EventSet), PAPI_OK );

libpfm4: A helper library

Helper library for setting up counters and for interfacing with the perf kernel infrastructure
Used by PAPI to resolve counters
Handy as a translation layer: named counters → raw counters
Use the command line utility perf_examples/evt2raw to get the raw counter code for perf

$ ./evt2raw PM_CMPLU_STALL_VSU
r2d012
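The resulting raw code can then be handed straight to perf (a sketch):

$ perf stat -e r2d012 ./poisson2d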

→ http://perfmon2.sourceforge.net/docs_v4.html

Appendix: Score-P

Score-P: Introduction

Measurement infrastructure for profiling, event tracing, online analysis
Call-path profiles (CUBE4, TAU) and event traces (OTF2): output formats that serve as input for many analysis tools (Cube, Vampir, Periscope, Scalasca, TAU)
Hardware counter interface (PAPI, rusage, PERF, plugins)

From the Score-P User Manual [5], Section 1.4 (Score-P Software Architecture Overview): In order to allow the user to perform such an optimization of his code (typically written in Fortran, C, or C++ and implemented serially or parallelized via a multi-process, thread-parallel, or accelerator-based paradigm, or a combination thereof), the Score-P system provides a number of components that interact with each other and with external tools.

[Figure 1.2: Overview of the Score-P measurement system architecture and the tools interface. Analysis tools (Vampir, Scalasca, CUBE, TAU/TAUdb, Periscope) sit on top of the Score-P measurement infrastructure and its instrumentation wrapper, which covers process-level parallelism (MPI, SHMEM), thread-level parallelism (OpenMP, Pthreads), accelerator-based parallelism (CUDA, OpenCL), source code instrumentation (Compiler, PDT, User), and sampling interrupts (PAPI, PERF) of the application.]

In order to instrument an application, the user needs to recompile the application using the Score-P instrumentation command, which is added as a prefix to the original compile and link command lines. It automatically detects the programming paradigm by parsing the original build instructions and utilizes appropriate and configurable methods of instrumentation. These are currently:

• compiler instrumentation,

• MPI and SHMEM library interposition,

• source code instrumentation via the TAU instrumenter,

• OpenMP source code instrumentation using Opari2,

• Pthread instrumentation via GNU ld library wrapping,

• CUDA instrumentation.

While the first three of these methods are based on using tools provided externally, the Opari2 instrumenter for OpenMP programs is a part of the Score-P infrastructure itself. It is an extension of the well known and frequently used OpenMP Pragma And Region Instrumenter system (Opari) that has been successfully used in the past in combination with tools like Scalasca, VampirTrace and ompP. The fundamental concept of such a system is a source-to-source translation that automatically adds all necessary calls to a runtime measurement library.

Score-P Howto

Prefix the compiler executable with scorep

$ scorep clang++ -o app code.cpp

→ Adds instrumentation calls to the binary
Profiling output is stored to a file after the binary runs
Steer with environment variables at run time

$ export SCOREP_METRIC_PAPI=PAPI_FP_OPS,PM_CMPLU_STALL_VSU
$ ./app

⇒ Use different PAPI counters per run! Quick visualization with Cube; scoring with scorep-score

Score-P: Performance counter analysis with cube_dump. Usage: scorep-score -r FILE

Score-P: Performance counter analysis with cube_dump. Usage: cube_dump -m METRIC FILE
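A sketch of extracting one of the recorded PAPI metrics from the resulting profile (scorep-*/profile.cubex is the usual Score-P output location; adjust to your run):

$ cube_dump -m PM_CMPLU_STALL_VSU scorep-*/profile.cubex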

Score-P: Analysis with Cube

Appendix: GPU Counters

GPU: Example Events & Metrics

Name and NVIDIA description (quoted):
gld_inst_8bit: Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.
threads_launched: Number of threads launched on a multiprocessor.
inst_executed: Number of instructions executed, do not include replays.
shared_store: Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.
executed_ipc: Instructions executed per cycle
achieved_occupancy: Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
l1_cache_local_hit_rate: Hit rate in L1 cache for local loads and stores
gld_efficiency: Ratio of requested global memory load throughput to required global memory load throughput.
flop_count_dp: Number of double-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special)
stall_pipe_busy: Percentage of stalls occurring because a compute operation cannot be performed because the compute pipeline is busy

Measuring GPU Counters: Tools

CUPTI: C/C++ API through cupti.h
  Activity API: trace CPU/GPU activity
  Callback API: hooks for own functions
  Event / Metric API: read counters and metrics
  → Targets developers of profiling tools
PAPI: all PAPI instrumentation through PAPI-C, e.g. cuda:::device:0:threads_launched
Score-P: mature CUDA support; prefix nvcc compilation with scorep; set environment variable SCOREP_CUDA_ENABLE=yes; run, analyze (sketch below)
nvprof, Visual Profiler: NVIDIA's solutions
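A sketch of the Score-P path for a CUDA code, as described above (file and application names are placeholders):

$ scorep nvcc -o app app.cu
$ export SCOREP_CUDA_ENABLE=yes
$ ./app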

nvprof: GPU command-line measurements. Usage: nvprof --events A,B --metrics C,D ./app

nvprof: Useful hints

Useful parameters to nvprof:
--query-metrics        List all metrics
--query-events         List all events
--kernels name         Limit scope to kernel
--print-gpu-trace      Print timeline of invocations
--aggregate-mode off   No aggregation over all multiprocessors (average)
--csv                  Output a CSV
--export-profile       Store profiling information, e.g. for the Visual Profiler
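Putting a few of these together (the kernel name and output file are placeholders):

$ nvprof --kernels jacobi --metrics achieved_occupancy --csv ./app
$ nvprof --print-gpu-trace --export-profile profile.nvvp ./app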

Visual Profiler: Analysis experiments


Appendix: Glossary & References

Glossary I

CPI      Cycles per Instruction; a metric to determine the efficiency of an architecture or program.

IPC      Instructions per Cycle; a metric to determine the efficiency of an architecture or program.

MPI      The Message Passing Interface, an API definition for multi-node computing.

NVIDIA   US technology company creating GPUs.

OpenMP   Directive-based programming, primarily for multi-threaded machines.

PAPI     The Performance API, a C/C++ API for querying performance counters.

Glossary II

perf     Part of the Linux kernel which facilitates access to performance counters; comes with command line utilities.
POWER    CPU architecture from IBM, earlier: PowerPC. See also POWER8.
POWER8   Version 8 of IBM's POWER processor.
POWER9   The latest version of IBM's POWER processor.

Score-P  Collection of tools for instrumenting and subsequently scoring applications to gain insight into the program's performance.

References I

[2] Terje Mathisen. Pentium Secrets. URL: http://www.gamedev.net/page/resources/_/technical/general-programming/pentium-secrets-r213
[3] IBM. Power ISA™, Version 3.0 B. Chapter 9: Performance Monitor Facility. 2017. URL: https://wiki.raptorcs.com/w/images/c/cb/PowerISA_public.v3.0B.pdf
[4] Donald E. Knuth. "Structured Programming with Go to Statements". In: ACM Comput. Surv. 6.4 (Dec. 1974), pp. 261–301. ISSN: 0360-0300. DOI: 10.1145/356635.356640. URL: http://doi.acm.org/10.1145/356635.356640

References: Images, Graphics I

[1] Sabri Tuzcu. Time is money. Freely available at Unsplash. URL: https://unsplash.com/photos/r1EwRkllP1I
[5] Score-P Authors. Score-P User Manual. URL: http://www.vi-hps.org/projects/score-p/
