
BlueGene External Performance Instrumentation Facility

Roy Musselman
Other Contributors: Dave Hermsmeier, Kurt Pinnow, Brent Swartz
Blue Gene Software Development
IBM Rochester, Minnesota

ScicomP 12 IBM System Scientific Computing User Group Boulder, CO, July 18-21, 2006

© 2006 IBM Corporation IBM Systems & Technology Group Rochester

Agenda

• Performance Monitor Tools Overview
  – PAPI
  – HPC Toolkit (LIBHPM)
  – External Performance Instrumentation Facility (EPIF)
• EPIF
  – Interface to the Hardware Performance Counters
  – Features
  – Operation
  – Commands
• Demo
  – Features and application example

2 Blue Gene External Performance Instrumentation Facility | ScicomP 12 © 2006 IBM Corporation IBM Systems & Technology Group Rochester

Performance Monitoring Tools Overview

• PAPI – Performance Application Programming Interface
  – Defines a standard interface for accessing performance counter hardware
  – Instrumentation and data collection are managed from within the application.
  – Available at: http://icl.cs.utk.edu/papi/index.html
• High Performance Computing (HPC) Toolkit
  – Developed by ACTC, IBM Research: http://www.research.ibm.com/actc
  – LIBHPM – Detailed hardware performance monitoring
  – Instrumentation and data collection are managed from within the application.
  – Packaged with other complementary tools to profile and visualize results
• BlueGene/L External Performance Instrumentation Facility (EPIF)
  – No change to the application is required; consequently, the counter data has no direct correlation to program execution.
  – Negligible impact on performance: uses the control system network to extract counter data asynchronously with the execution of the applications
  – Expanded with new function and now generally available in BlueGene/L V1R3.
• All three tools utilize the hardware performance counters implemented on BlueGene/L.


BlueGene’s Hardware Performance Counter Mechanism

• All three performance monitoring tools utilize this mechanism.
• Special logic within the Compute node taps into the various components.
  – Processors & FPUs, L2 and L3 hit/miss, torus and tree network events
  – 328 total unique events
• Up to 52 of the 328 events can be counted concurrently using 32-bit counter registers.
• At periodic intervals, the 32-bit counters are read and accumulated into 64-bit locations in SRAM.
• Current Limitations
  – The 32-bit counters may overflow, which is why the software accumulation is necessary.
  – Contention for limited FPU event counter resources
  – Only one type of Load/Store instruction count per processor
  – Only one type of FPU instruction count per processor
  – In V1R3, the derived FPU counters will sample the FPU instructions in a round-robin fashion across the processors
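The accumulation step described above can be illustrated with a minimal Python sketch. It assumes a free-running 32-bit counter that is read at least once per wrap period; the slides do not say whether the hardware counters are cleared on read, so the delta-based approach and the names `accumulate`, `MASK32`, and `MASK64` are illustrative assumptions, not the actual Blue Gene/L implementation.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def accumulate(total64, prev32, curr32):
    """Fold the change in a 32-bit counter into a 64-bit running total.

    The modular subtraction yields the correct increment even if the
    counter wrapped once between the two reads.
    """
    delta = (curr32 - prev32) & MASK32
    return (total64 + delta) & MASK64

# Hypothetical sequence of reads; the counter wraps between the 1st and 2nd read.
total = 0
prev = 0xFFFFFFF0
for curr in (0xFFFFFFFA, 0x00000005, 0x00000020):
    total = accumulate(total, prev, curr)
    prev = curr
print(total)
```

If the sampling interval were longer than one wrap period, increments would be lost, which is presumably why the reads are driven by a periodic timer.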


Compute Node Counters Monitor Hardware Events

[Figure: block diagram of the compute node chip. Two 440 CPUs (one usable as an I/O processor), each with a "Double FPU", 32k/32k L1, and an L2; a shared L3 directory for EDRAM (includes ECC) and a 4MB EDRAM L3 cache or shared memory; a multiported shared SRAM buffer; and a PLB (4:1). Interfaces: Gbit Ethernet; JTAG control access; torus (6 out and 6 in, each at 1.4 Gbit/s link); tree (3 out and 3 in, each at 2.8 Gbit/s link); global interrupt (4 global barriers or interrupts); 144-bit-wide DDR control with ECC (256MB DDR). Peak node performance: 5.6GF. Internal data path labels: 2.7GB/s, 5.5GB/s, 11GB/s, 22GB/s.]


EPIF’s Interface to the Hardware Performance Counters

• Prior to program invocation, the hardware counter logic on the Compute Node chip is programmed to capture the occurrences of a subset of hardware events:
  – E.g., L3 hits/misses, FPU operations, torus packet activity
  – The user can choose one of 22 possible predefined subsets (a.k.a. counter definition IDs), each consisting of up to 52 of the 328 possible events
  – Counter definition IDs 0:16 are identical to those used by LIBHPM
• The counter data is periodically read from the SRAM and retrieved by the service node via the control system network (JTAG).
• EPIF manages the collection, filtering, and storage of the counter data.
• File system storage required: 340KB per midplane for each sample that is preserved.
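As a rough planning aid, the storage figure above can be turned into a small estimate. This is a sketch only: the 340KB figure is the one quoted on the slide, while the function name, the example run (2 midplanes, 60 preserved samples), and the 1024KB-per-MB conversion are assumptions made for illustration.

```python
KB_PER_SAMPLE_PER_MIDPLANE = 340  # figure quoted on the slide

def storage_kb(midplanes, samples_kept):
    """Estimate file system storage (in KB) for the preserved samples."""
    if midplanes < 0 or samples_kept < 0:
        raise ValueError("counts must be non-negative")
    return KB_PER_SAMPLE_PER_MIDPLANE * midplanes * samples_kept

# Hypothetical run: 2 midplanes, 60 preserved samples.
kb = storage_kb(midplanes=2, samples_kept=60)
print(f"{kb} KB (about {kb / 1024:.1f} MB)")
```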


EPIF Key Features

• Easy to use
• Provides a non-intrusive mechanism for monitoring system and job performance characteristics.
  – No application changes are required other than relinking the application with the -lbgl_perfctr.rts library
  – The interval timer is used to trigger the counter sample and accumulate it to SRAM.
• Minimal performance impact on the running applications
  – Sampling of counter data is done with negligible performance impact.
  – Collection of data is done via the control system network (JTAG)
• EPIF provides the following:
  – A GUI to browse results
  – Storage of results to the external file system
  – Option to store results to the MMCS SQL database
  – Ability to filter and organize the counter and attribute data
  – CSV export for easy import into spreadsheets
  – Derived FPU counters for aggregate estimates of FLOP rates
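To make the idea of an aggregate FLOP-rate estimate concrete, here is a minimal sketch that divides the change in per-node floating-point operation counts between two samples by the sample interval. It is not EPIF's actual derivation; the function name, the per-node list layout, and the sample values are hypothetical, and because the derived FPU counters sample instructions round-robin in V1R3, any such figure should be read as an estimate.

```python
def aggregate_flop_rate(prev_counts, curr_counts, interval_s):
    """Estimate floating-point operations per second summed over all nodes.

    prev_counts and curr_counts hold one accumulated count per node,
    taken interval_s seconds apart.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    if len(prev_counts) != len(curr_counts):
        raise ValueError("both samples must cover the same nodes")
    delta = sum(c - p for p, c in zip(prev_counts, curr_counts))
    return delta / interval_s

# Hypothetical counts for two nodes, sampled 10 seconds apart.
rate = aggregate_flop_rate([1000, 2000], [4000, 6000], interval_s=10)
print(rate)
```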


Perfmon Operation

• Jobs are initiated as usual with either a default or a specified counter definition ID
  – Counter definition ID: a predefined subset of counters consisting of up to 52 of the 328 possible hardware events that can be monitored
  – Specified by the BGL_PERFMON environment variable
• One or more instances of perfmon can be started on the service node, each with its own set of parameters, including:
  – Sample interval
  – Attributes to filter the set of jobs to be monitored (e.g., user name, block id)
  – Sample type: detailed or summary
  – Destination of the collected data: a file system directory and, optionally, the MMCS SQL database
• Perfmon will monitor all running jobs except for the following:
  – Jobs that do not match the filter criteria used to initiate the perfmon application
  – Jobs that have not been linked with the performance counter library
  – Jobs that have been instrumented with other tools using the performance counters (e.g., PAPI or LIBHPM)
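The three exclusion rules above can be summarized in a short sketch. Everything here is hypothetical: the `Job` record, its field names, and the use of shell-style wildcard matching for block ids are assumptions made for illustration and do not describe perfmon's internal logic.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

@dataclass
class Job:
    user: str
    block_id: str
    linked_with_perfctr: bool  # relinked with the performance counter library
    uses_other_tool: bool      # instrumented with PAPI or LIBHPM

def is_monitored(job, users=None, block_patterns=None):
    """Return True if a perfmon instance with these filters would monitor the job."""
    if not job.linked_with_perfctr or job.uses_other_tool:
        return False
    if users is not None and job.user not in users:
        return False
    if block_patterns is not None and not any(
        fnmatchcase(job.block_id, pattern) for pattern in block_patterns
    ):
        return False
    return True

jobs = [
    Job("userid1", "R01", True, False),   # matches both filters
    Job("userid1", "R10", True, False),   # block id does not match
    Job("userid2", "R01", False, False),  # not linked with the counter library
    Job("userid2", "R01", True, True),    # already uses PAPI or LIBHPM
]
selected = [j for j in jobs if is_monitored(j, {"userid1", "userid2"}, ["R0*"])]
print([(j.user, j.block_id) for j in selected])
```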


EPIF Commands

• perfmon
  – Starts an instance of the performance monitor tool. Options control the collection of hardware counter data.
  – --username='(userid1,userid2,userid3)'
  – --block_id='(R0*,R0R1R2R3)'
  – --sample_type=d
• dsp_perfmon
  – Provides a simple GUI to view performance data and do some high-level distillation of the collected data
  – Works on data actively being collected and data that was previously collected
• ext_perfmon_data
  – Extracts performance data to CSV files for analysis by other tools. Many options are available to filter the extracted data.
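For scripted use, a perfmon invocation could be assembled as an argument list. The sketch below uses only the three options shown on the slide and copies their value syntax from it; the helper name `perfmon_argv` is hypothetical, the command is printed rather than executed, and the full option set should be taken from the Redbook rather than from this example.

```python
import shlex

def perfmon_argv(usernames=None, block_ids=None, sample_type=None):
    """Build an argument list for perfmon from the options shown on the slide."""
    argv = ["perfmon"]
    if usernames:
        argv.append("--username=(" + ",".join(usernames) + ")")
    if block_ids:
        argv.append("--block_id=(" + ",".join(block_ids) + ")")
    if sample_type:
        argv.append("--sample_type=" + sample_type)
    return argv

argv = perfmon_argv(
    usernames=["userid1", "userid2"],
    block_ids=["R0*", "R0R1R2R3"],
    sample_type="d",  # value taken from the slide
)
# shlex.join adds the shell quoting needed for the parentheses and the wildcard.
print(shlex.join(argv))
```

Passing the list directly to a process launcher (without a shell) would avoid the quoting question altogether.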


EPIF Commands (cont.)

• imp_perfmon_data
  – Imports collected performance data to the MMCS SQL performance database
• exp_perfmon_data
  – Exports performance data from the MMCS SQL performance database to the external file system, optionally deleting the data from the SQL database
• end_perfmon
  – Ends an instance of perfmon prior to the ending criteria specified on the perfmon command


EPIF complements other performance monitoring tools

• EPIF is not intended to be an all-inclusive, comprehensive set of performance tools.
• EPIF deals exclusively with the performance data obtainable from the hardware performance counters.
• EPIF does not replace PAPI or LIBHPM, which can be used to zero in on specific code segments.
• EPIF can be used by system administrators for real-time system and job activity monitoring (detecting hung jobs, summarizing job statistics).
• EPIF can be used by programmers with access to the service node for an aggregate view of application performance.
• Other data analysis and visualization tools can utilize the detailed data obtained from EPIF.


Demo of dsp_perfmon

Run: python dsp_perfmon.py
Then navigate to find and select the .mon file.

dsp_perfmon demo: List of jobs monitored by this perfmon instance

dsp_perfmon demo: List of filters and runtime attributes

dsp_perfmon demo: List of job and block attributes

dsp_perfmon demo: Explore via Samples/Nodes/Counters

dsp_perfmon demo: Extract Perfmon Data (right click on Sample 4)

dsp_perfmon demo: Extracted .csv file

dsp_perfmon demo: Extracted histogram data


Application Example

• An application was exhibiting very poor performance when executing multiple concurrent point-to-point MPI operations.
• Suspected network congestion.
• Needed a method to detect and visualize the torus network activity within the system.
• With no source code changes, the EPIF was used to capture the torus network packet transmission counters and export the specific data to file.
• A custom visualization tool was then used to colorize the various ranges of counter values and map them to the node locations to reveal the congested areas (hot spots).

X=0 (rows: Z, columns: Y)

Z\Y   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
  0  17 19 21 30 30 21 19 17 17 19 21 30 30 21 19 17
  1  19 22 25 30 30 25 22 19 19 22 25 30 30 25 22 19
  2  21 25 30 35 35 30 25 21 21 25 31 35 35 31 25 21
  3  30 30 35 44 44 35 31 30 30 31 35 44 44 35 31 30
  4  30 30 35 44 44 35 30 30 30 31 35 44 44 35 31 30
  5  21 25 31 35 35 30 25 21 21 25 30 35 35 31 25 21
  6  19 22 25 30 30 25 22 19 19 22 25 30 30 25 22 19
  7  17 19 21 30 30 21 19 17 17 19 21 30 30 21 19 17
  8  17 19 21 30 30 21 19 17 17 19 21 30 30 21 19 17
  9  19 22 25 30 30 25 22 19 19 22 25 30 30 25 22 19
 10  21 25 31 35 35 30 25 21 21 25 30 35 35 30 25 21
 11  30 31 35 44 44 35 30 30 30 30 35 44 44 35 30 30
 12  30 31 35 44 44 35 30 30 30 31 35 44 44 35 30 30
 13  21 25 31 35 35 31 25 21 21 25 31 35 35 30 25 21
 14  19 22 25 30 30 25 22 19 19 22 25 30 30 25 22 19
 15  17 19 21 30 30 21 19 17 17 19 21 30 30 21 19 17
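The colorizing step can be approximated in plain text by binning each value into a range and printing one marker per node position. The sketch below is not the custom visualization tool mentioned above: the range boundaries, the markers, and the function names are hypothetical, and only two rows of the X=0 plane (Z=0 and Z=3) are copied from the slide to keep the example short.

```python
# Two rows of the X=0 plane copied from the slide (list index = Y coordinate).
PLANE_X0 = {
    0: [17, 19, 21, 30, 30, 21, 19, 17, 17, 19, 21, 30, 30, 21, 19, 17],
    3: [30, 30, 35, 44, 44, 35, 31, 30, 30, 31, 35, 44, 44, 35, 31, 30],
}

# Hypothetical range boundaries, checked from highest to lowest.
BINS = [(40, "#"), (30, "+"), (20, "."), (0, "-")]

def symbol(value):
    """Return the marker of the highest range whose lower bound the value reaches."""
    for lower_bound, mark in BINS:
        if value >= lower_bound:
            return mark
    raise ValueError("counter values are expected to be non-negative")

def render(plane):
    """Map each Z row to a string with one marker per Y position."""
    return {z: "".join(symbol(v) for v in row) for z, row in plane.items()}

def hot_spots(plane, threshold=40):
    """List (z, y) coordinates whose value is at or above the threshold."""
    return [
        (z, y)
        for z, row in plane.items()
        for y, v in enumerate(row)
        if v >= threshold
    ]

for z, line in render(PLANE_X0).items():
    print(f"Z={z:2d} {line}")
print(hot_spots(PLANE_X0))
```

A graphical tool would map the same ranges to colors instead of characters; the binning logic stays the same.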


Future Development

• We believe that this style of external instrumentation has great potential in high-performance computing.
• The definition of future functionality is currently being considered.
• We solicit feedback and suggestions to
  – Evaluate and experiment with the current facility
  – Influence future design
  – Help us to provide the functionality that is most important to the high-performance computing community.
• We encourage other analysis and visualization tool developers to consider the possibilities of utilizing the data provided by EPIF for enhancements to their offerings.


Resources: Support Website

• http://www-03.ibm.com/servers/eserver/support/bluegene/index.html


Resources: Redbooks

Detailed documentation of the External Performance Instrumentation Facility is available in the Redbook entitled: “Blue Gene/L: Performance Analysis Tools”
