POWER8 Performance Analysis

Satish Kumar Sadasivam Senior Performance Engineer, Master Inventor IBM Systems and Technology Labs satsadas@in..com

#OpenPOWERSummit

Join the conversation at #OpenPOWERSummit

Overview
. POWER8 Overview
. Introduction to Performance Monitoring
. Performance Monitoring Features in POWER8
. What’s new in POWER8?
. POWER8 Pipeline
. CPI Stack overview – Stall Accounting Model
. Performance analysis
. CPI analysis
. Data source analysis
. Prefetch control & Prefetch effectiveness
. Application level performance analysis
. Marked event profiling & performance analysis
. Microarchitecture bottleneck analysis
. Core bottleneck analysis using trace tool and scroll pipe

POWER8 Processor

Improvements over POWER7

Cache Improvements

Cache Bandwidths

Memory Organization

Performance Instrumentation in P8
. Hardware performance monitoring is critical for evaluating the performance of applications and programs on complex cores such as POWER8
. POWER8 provides advanced instrumentation capabilities in two layers:
  • Core instrumentation
  • Nest level instrumentation

[Figure: two boxes labeled "Core Level Performance Monitoring" and "Nest Level Performance Monitoring"]

Core Level Performance Monitoring

. Key to root-causing performance bottlenecks at the core or thread level

. Facilitates monitoring of
  • Core pipeline efficiency – frontend, branch prediction, execution units, schedulers, etc.
  • Behavior metrics – stalls, execution rates, utilizations, thread prioritization & resource sharing

. Enables understanding and optimization of application performance at processor and compiler level.

Nest Level Instrumentation
. Instrumentation at
  • L3 cache
  • Interconnect fabric
  • Memory channels/controller
. Information is provided at the per-core and chip level (as opposed to the thread level for core-level counters)
. Significance & usefulness:
  • Bandwidth analysis
  • Key for analyzing performance in virtualized cloud environments
  • Can be used to monitor memory and chip-level characteristics so that cloud capacity can be provisioned effectively

What’s new in POWER8?
. Enhanced CPI Stack Cycle Accounting Model
. Hotness Table
. Branch History Rolling Buffer
. Event-Based Branches
. Prefetch effectiveness events
. Additional events to capture & analyze hardware level performance issues

POWER8 Microarchitecture

POWER8 Core Pipeline

Front end stalls: cycles in which a thread’s GCT was empty, i.e. the pipeline was empty for that thread.
Back end stalls: cycles in which the thread had GCT entries but no completion occurred.
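The two definitions above can be sketched as a per-cycle classification. This is only a toy model: the trace format (GCT entry count and groups completed per cycle) and the function names are assumptions made for illustration, not output of any POWER8 tool.

```python
from collections import Counter

def classify_cycle(gct_entries: int, groups_completed: int) -> str:
    """Classify one cycle of one thread (hypothetical trace format)."""
    if groups_completed > 0:
        return "completion"          # a group completed in this cycle
    if gct_entries == 0:
        return "front_end_stall"     # GCT empty: nothing in flight for the thread
    return "back_end_stall"          # GCT has entries but nothing completed

def summarize(trace):
    """Count cycles per category for a list of (gct_entries, groups_completed)."""
    return Counter(classify_cycle(g, c) for g, c in trace)

# Made-up six-cycle trace, for illustration only.
trace = [(0, 0), (2, 0), (2, 1), (1, 0), (0, 0), (3, 1)]
summary = summarize(trace)
print(dict(summary))
```

In this model every cycle falls into exactly one category, which is the property a completion-based accounting scheme relies on.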

POWER8 Group Formation

. Group formation:
  • After instruction fetch, instructions are formed into groups for dispatch and completion tracking
  • Thread priority logic selects up to 8 instructions from the instruction buffers for group formation in each cycle
  • Group formation is driven by group formation rules
. Global Completion Table (GCT)
. Completion-based performance bottleneck analysis

CPI Analysis
. The cycles-per-instruction (CPI) stack presents a picture of a typical instruction’s lifespan from fetch to completion
. Provides information to narrow down the bottleneck point(s) in the processor pipeline
. POWER8 features a completion-based CPI stack accounting model
. Time spent in execution is split into:
  • Group completion cycles
  • Stall cycles
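As a minimal sketch of how such a stack can be computed, the snippet below divides each cycle-count component by the number of completed instructions. The event names are the ones that appear in this deck's example CPI stack chart; the counter values are made up, and the assumption that these five components partition the total cycles is mine and should be checked against the PMU documentation.

```python
def cpi_stack(cycle_counts: dict, instructions_completed: int) -> dict:
    """Convert per-component cycle counts into CPI contributions."""
    if instructions_completed <= 0:
        raise ValueError("instructions_completed must be positive")
    return {event: cycles / instructions_completed
            for event, cycles in cycle_counts.items()}

# Made-up counter readings, for illustration only.
counts = {
    "PM_GRP_CMPL": 400_000,         # group completion cycles
    "PM_CMPLU_STALL": 1_200_000,    # completion stall cycles
    "PM_GCT_NOSLOT_CYC": 300_000,   # completion table empty
    "PM_CMPLU_STALL_THRD": 50_000,  # thread blocked
    "PM_NTCG_ALL_FIN": 50_000,      # waiting to complete
}
stack = cpi_stack(counts, instructions_completed=1_000_000)
total_cpi = sum(stack.values())

for event, cpi in sorted(stack.items(), key=lambda kv: -kv[1]):
    print(f"{event:22s} {cpi:6.3f}  ({100 * cpi / total_cpi:5.1f}% of CPI)")
```

Sorting by contribution puts the dominant component first, which is usually the place to start drilling into sub-components.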

POWER8 CPI Stack

Completion Stalls
  Stall due to BR or CR
    Stall due to Branch
    Stall due to CR
  Stall due to Fixed-Point
    Stall due to Fixed-Point Long
    Stall due to Fixed-Point (Other)
  Stall due to Vector/Scalar
    Stall due to Vector
      Stall due to Vector Long
      Stall due to Vector (other)
    Stall due to Scalar
      Stall due to Scalar Long
      Stall due to Scalar (other)
    Stall due to Vector/Scalar (other)
  Stall due to LSU
    Stall due to Dcache Miss
    Stall due to LSU Reject
    Stall due to Store Finish
    Stall due to Load Finish
    Stall due to Store Forward
    Stall due to Load/Store (other)
  Stall due to Next-to-Complete Flush
Cycles Waiting to Complete
Thread Blocked
  Blocked due to LWSync
  Blocked due to HWSync
  Blocked due to ECC Delay
  Blocked due to Flush
  Blocked due to COQ Full
  Thread Blocked (other)
Completion Table Empty
  Completion Table Empty due to IC Miss
    Completion Table Empty due to IC L3 Miss
    Completion Table Empty due to IC Miss (other)
  Completion Table Empty due to Branch Mispredict
  Completion Table Empty due to Branch Mispredict + IC Miss
  Completion Table Empty – Dispatch Held
    Dispatch Held due to Mapper
    Dispatch Held due to Store Queue
    Dispatch Held due to Issue Queue
    Dispatch Held (other)
  Completion Table Empty (Other)
Completion Cycles

CPI Stack – LSU Stalls

An Example of CPI Stack

[Figure: "CPI Stack", a stacked bar chart comparing Prefetch OFF and Prefetch ON. CPI axis from 0.000 to 3.000. Components: PM_CMPLU_STALL, PM_NTCG_ALL_FIN, PM_CMPLU_STALL_THRD, PM_GCT_NOSLOT_CYC, PM_GRP_CMPL.]

CPI Stack – Detailed Stall Distribution

[Figure: "Completion Stall Components", a stacked bar chart comparing Prefetch OFF and Prefetch ON. Axis from 0.000 to 4.000. Components: PM_CMPLU_STALL_BRU_CRU, PM_CMPLU_STALL_FXU, PM_CMPLU_STALL_VSU, PM_CMPLU_STALL_VECTOR, PM_CMPLU_STALL_SCALAR, PM_CMPLU_STALL_NTCG_FLUSH, PM_CMPLU_STALL_LSU, PM_CMPLU_STALL_DCACHE_MISS, PM_CMPLU_STALL_REJECT, PM_CMPLU_STALL_STORE, PM_CMPLU_STALL_LOAD_FINISH, PM_CMPLU_STALL_ST_FWD.]

Data Source Analysis
. Analysis of application data accesses across the cache & memory hierarchy is key to understanding the following:
  • Performance limiting factors & resource requirements of the application
  • Scaling capabilities (in multi-threaded scenarios)
. Cache hierarchy latencies:

Prefetch Controls
. Prefetch effects:
  • Positive:
    . Brings data closer to the core
    . Reduces memory access stalls
  • Possible negative effects:
    . Extra bandwidth consumption, choking other applications' memory accesses
    . Cache pollution
    . Increased power consumption
. POWER8 supports prefetching at the L1 and L3 levels
. DSCR Register (Power ISA v2.07)
  DPFD: Default Prefetch Depth
  SSE: Store Stream Enable
  SNSE: Stride-N Stream Enable
  LSD: Load Stream Disable
  URG: Depth Attainment Urgency
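To make the DSCR fields listed above more concrete, here is a small sketch that unpacks them from a register value. The bit positions are an assumption based on my reading of Power ISA v2.07 (DPFD in the three least significant bits, then SSE, SNSE, LSD, and a 3-bit URG field); they are not stated in this deck and should be verified against the ISA before use. The helper names are hypothetical.

```python
# (shift from least significant bit, width in bits); ASSUMED layout, verify
# against Power ISA v2.07 before relying on it.
DSCR_FIELDS = {
    "DPFD": (0, 3),   # Default Prefetch Depth
    "SSE":  (3, 1),   # Store Stream Enable
    "SNSE": (4, 1),   # Stride-N Stream Enable
    "LSD":  (5, 1),   # Load Stream Disable
    "URG":  (6, 3),   # Depth Attainment Urgency
}

def decode_dscr(value: int) -> dict:
    """Split a DSCR value into the fields named on the slide."""
    return {name: (value >> shift) & ((1 << width) - 1)
            for name, (shift, width) in DSCR_FIELDS.items()}

def encode_dscr(**fields: int) -> int:
    """Build a DSCR value from field values; rejects out-of-range values."""
    value = 0
    for name, field_value in fields.items():
        shift, width = DSCR_FIELDS[name]
        if not 0 <= field_value < (1 << width):
            raise ValueError(f"{name}={field_value} does not fit in {width} bit(s)")
        value |= field_value << shift
    return value

decoded = decode_dscr(0x1F)
print(decoded)
```

Keeping the layout in a single table means a correction to any bit position only has to be made in one place.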

Studying Prefetch Effectiveness
. POWER8 provides performance events to study prefetch effectiveness
. Counters indicate, at the time of eviction from the cache, whether prefetched cache lines were used or not used
. Counters available:
  • MEPF metrics are used to evaluate prefetch effectiveness in POWER8
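One way to turn used/unused-at-eviction counts into a single figure is the fraction of prefetched lines that were used before eviction. The sketch below assumes two hypothetical inputs, `used_lines` and `unused_lines`; the deck does not name the underlying events, so no real PMU event names are used here.

```python
from typing import Optional

def prefetch_effectiveness(used_lines: int, unused_lines: int) -> Optional[float]:
    """Fraction of evicted prefetched lines that were used before eviction.

    Returns None when no prefetched lines were evicted, since the ratio
    is undefined in that case.
    """
    if used_lines < 0 or unused_lines < 0:
        raise ValueError("counts must be non-negative")
    total = used_lines + unused_lines
    if total == 0:
        return None
    return used_lines / total

# Made-up counts, for illustration only.
ratio = prefetch_effectiveness(used_lines=750, unused_lines=250)
print(f"prefetch effectiveness: {ratio:.0%}")
```

A low ratio suggests the prefetcher is mostly consuming bandwidth and polluting the cache, which is the situation in which adjusting the prefetch controls is worth trying.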

Application Profiling Tools
. Marked Event Profiling:
  • Pinpoints performance-inhibiting behavior/bottlenecks to a specific instruction in application code
. Why necessary?
  • Non-marked events are best suited to studying performance metrics
  • In an out-of-order (OOO), superscalar, multiple-issue processor, profile data from non-marked events can only indicate the code “region” responsible for a performance bottleneck
  • Code “region” granularity can range from a few to tens of instructions
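The benefit of instruction-level attribution can be sketched with a small aggregation over samples. The sample format below, a list of (instruction address, event name) pairs, is a hypothetical stand-in for whatever a profiler actually emits; the addresses and counts are made up.

```python
from collections import Counter

# Made-up marked-event samples: (instruction address, event name).
samples = [
    (0x10000A4C, "PM_MRK_LD_MISS_L1"),
    (0x10000A4C, "PM_MRK_LD_MISS_L1"),
    (0x10000A50, "PM_MRK_LD_MISS_L1"),
    (0x10000A4C, "PM_MRK_LD_MISS_L1"),
    (0x10000A58, "PM_MRK_BR_MPRED_CMPL"),
    (0x10000A58, "PM_MRK_BR_MPRED_CMPL"),
]

def hottest_instructions(samples, event, top=3):
    """Return the addresses with the most samples for one marked event."""
    counts = Counter(addr for addr, ev in samples if ev == event)
    return counts.most_common(top)

hot_loads = hottest_instructions(samples, "PM_MRK_LD_MISS_L1")
for addr, count in hot_loads:
    print(f"{addr:#x}: {count} samples")
```

Because each sample carries the exact address of the marked instruction, the ranking points at individual instructions rather than at a surrounding code region.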

Example of Marked Event profiling

Marked Events – a non-exhaustive list
. PM_MRK_LD_MISS_L1
. PM_MRK_LD_MISS_L1_CYC
. PM_MRK_BR_MPRED_CMPL
. PM_MRK_BR_TAKEN_CMPL
. PM_MRK_DATA_FROM_MEM
. PM_MRK_LSU_REJECT
. PM_MRK_STCX_FAIL
. PM_MRK_GRP_IC_MISS
. PM_MRK_DTLB_MISS
. PM_MRK_ST_FWD
. PM_MRK_LSU_FLUSH
. PM_MRK_LSU_FLUSH_ULD
. PM_MRK_LSU_FLUSH_UST

Microarchitecture Analysis
. Deep-dive analysis to root-cause performance inhibitors at processor pipeline stages
. Tools used:
  • Itrace
  • Cycle Accurate Simulator

[Figure: analysis workflow. Steps shown: trace application with valgrind; generate qtrace; simppc; microarchitecture stats and Scrollpipe; analyze & optimize application code]

Tools for Microarchitecture Analysis
. IBM SDK for on Power
. IBM POWER8 Functional Simulator (systemsim)
. Valgrind framework provides application/program tracing capabilities (itrace)
. POWER8 Performance Simulator (sim_ppc)

https://www-304.ibm.com/webapp/set2/sas/f/lopdiags/sdklop.html

Thank You!
