Performance Counter Monitoring

Dr. Roman Dementiev, Senior Application Engineer, Software and Services Group

14 July 2010

Software & Services Group

Legal Disclaimer

Intel may make changes to specifications and product descriptions at any time, without notice.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations. Virtualization technology-enabled BIOS and VMM applications are currently in development.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

Lead-free: 45nm product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). Some EU RoHS exemptions for lead may apply to other components used in the product package. Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below 900 PPM bromine and 900 PPM chlorine.

Intel, Intel Xeon, Intel Core, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

© 2009 Standard Performance Evaluation Corporation (SPEC) logo is reprinted with permission.

Agenda

• CPU Utilization Monitoring
• Performance Monitoring Units (PMU) in Processors
• Offline analysis with PMU: Intel® VTune™ Performance Analyzer
• Online Dynamic Processor Monitoring NEW!

Operating System CPU Utilization Meter

• The best-known meter; exists on almost every OS
– Shows how long the OS was in the idle/sleep loop
– Worked well with the CPUs of the 1980s
• But OS CPU meters ignore
– memory access stalls
– synchronisation/locking
– CPU I/O
– simultaneous multithreading (SMT): Intel® Hyper-Threading
– etc.
• How do I find out what keeps the processor busy? Or is my software just wasting compute cycles?

Existing OS CPU meters cannot predict the capacity of modern hardware
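To make the limitation concrete, here is a minimal sketch of what an idle-loop based meter computes. The `TickSnapshot` struct and `os_cpu_utilization` function are hypothetical names used only for illustration; real operating systems expose such tick counters through their own interfaces.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of OS-level tick counters (illustrative names only).
struct TickSnapshot {
    std::uint64_t idle_ticks;   // ticks the OS spent in its idle/sleep loop
    std::uint64_t total_ticks;  // all ticks elapsed so far
};

// Utilization as a classic OS meter sees it: the share of time NOT spent idle.
double os_cpu_utilization(const TickSnapshot& before, const TickSnapshot& after) {
    assert(after.total_ticks >= before.total_ticks);
    assert(after.idle_ticks >= before.idle_ticks);
    const std::uint64_t total = after.total_ticks - before.total_ticks;
    if (total == 0) {
        return 0.0;  // no time elapsed, nothing to report
    }
    const std::uint64_t idle = after.idle_ticks - before.idle_ticks;
    return 1.0 - static_cast<double>(idle) / static_cast<double>(total);
}
```

Note that nothing in this formula distinguishes a core that is retiring instructions from one that is stalled on memory or spinning on a lock: both count as "not idle".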

CPU Utilization Meter in Hardware?

• Modern CPU systems are very complex and consist of many units/resources that influence computation speed

[Figure: diagram labeled SYSTEM, SOCKET (CPU), CORE]

Performance Monitoring Units (PMUs)

• Intel® processors have Performance Monitoring Units (PMUs) that can be programmed to count many performance-related events
– One PMU per logical core (elapsed cycles, L1 and L2 cache events, TLB events, processed instructions; there are hundreds of events)
– One PMU in the uncore (L3 cache, Intel® QPI events)

Programming PMUs

• PMUs are programmed by reading and writing Model Specific Registers (MSRs)
• Much of the hardware and many of the events are platform specific
• The core PMU is enumerated in CPUID leaf 0AH:
– Number of fully programmable counters (4 per logical core); each counter is assigned to count a certain event
– Number of fixed-function counters (3 per logical core): core clocks counter, reference clocks counter, instruction counter
• Some uncore and core programmable counters can only be programmed with certain types of events
• Other tricky restrictions apply; the restrictions are documented in the event list
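The enumeration in CPUID leaf 0AH can be sketched as a small decoder. The bit positions follow the architectural performance monitoring leaf as described in the Intel® 64 and IA-32 Software Developer's Manual; the struct and function names are hypothetical, and the register values are passed in so that the decoding logic can be read (and tested) without executing CPUID. On GCC or Clang the raw values could be obtained with `__get_cpuid(0x0A, &eax, &ebx, &ecx, &edx)` from `<cpuid.h>`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the decoded core PMU capabilities.
struct CorePmuInfo {
    unsigned version;         // architectural performance monitoring version
    unsigned gp_counters;     // programmable counters per logical core
    unsigned gp_width;        // bit width of the programmable counters
    unsigned fixed_counters;  // fixed-function counters per logical core
    unsigned fixed_width;     // bit width of the fixed-function counters
};

// Decode the EAX and EDX values returned by CPUID leaf 0AH.
CorePmuInfo decode_cpuid_leaf_0a(std::uint32_t eax, std::uint32_t edx) {
    CorePmuInfo info{};
    info.version = eax & 0xFFu;               // EAX[7:0]
    info.gp_counters = (eax >> 8) & 0xFFu;    // EAX[15:8]
    info.gp_width = (eax >> 16) & 0xFFu;      // EAX[23:16]
    // The fixed-function fields are only meaningful when version > 1.
    info.fixed_counters = edx & 0x1Fu;        // EDX[4:0]
    info.fixed_width = (edx >> 5) & 0xFFu;    // EDX[12:5]
    assert(info.gp_width <= 64 && info.fixed_width <= 64);
    return info;
}
```

For example, the register values EAX = 0x07300403 and EDX = 0x00000603 decode to version 3 with four 48-bit programmable counters and three 48-bit fixed-function counters, which matches the counts quoted on the slide.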

Processor Performance Counters

Publicly documented on intel.com:
• David Levinthal, “Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors”
  http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2
  http://www.intel.com/products/processor/manuals/
• Intel® Xeon® Processor 7500 Series Uncore Programming Guide
  http://www.intel.com/Assets/en_US/PDF/designguide/323535.pdf
• Peggy Irelan and Shihjong Kuo, “Performance Monitoring Unit Sharing Guide”
  http://software.intel.com/file/20476

Intel® Hyper-Threading Technology-specific:
• Drysdale, Gillespie, Valles, “Performance Insights to Intel® Hyper-Threading Technology”
  http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/
• Gillespie, Drysdale, “Intel® Hyper-Threading Technology: Analysis of the HT Effects on a Server Transactional Workload”
  http://software.intel.com/en-us/articles/intel-hyper-threading-technology-analysis-of-the-ht-effects-on-a-server-transactional-workload/

PMU Sampling Mode: The Statistical Method of Finding Hotspots

• A sampling collector (such as VTune™ Performance Analyzer or Intel® Performance Tuning Utility) works as follows:
– The PMU periodically interrupts the processor
  • Triggered by the occurrence of a certain number of events
– The collector records the execution context
  • Execution address in memory (CS:IP)
  • Operating system process and thread ID
  • Executable module loaded at that address
– If you have symbols for the module, post-processing can identify the function or method at the memory address
– Line numbers from the symbol file can direct you to the relevant line of source code
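The post-processing step, mapping sampled addresses back to functions, can be sketched as follows. This is an illustrative simplification and not how any particular tool is implemented: the symbol table is a hypothetical map from function start address to name, and because it carries no end addresses, any address past the last start is attributed to the last function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol table: function start address -> function name.
using SymbolTable = std::map<std::uint64_t, std::string>;

// Count how many sampled instruction pointers fall into each function.
std::map<std::string, std::size_t> attribute_samples(
    const SymbolTable& symbols, const std::vector<std::uint64_t>& sampled_ips) {
    std::map<std::string, std::size_t> hits;
    for (std::uint64_t ip : sampled_ips) {
        // First function that starts strictly after ip ...
        auto it = symbols.upper_bound(ip);
        if (it == symbols.begin()) {
            ++hits["<unknown>"];  // ip lies before the first known function
        } else {
            --it;                 // ... so the one before it contains ip
            ++hits[it->second];
        }
    }
    // Invariant: every sample is attributed exactly once.
    std::size_t total = 0;
    for (const auto& entry : hits) total += entry.second;
    assert(total == sampled_ips.size());
    return hits;
}
```

The functions with the most hits are the statistical hotspots: since samples are triggered every N occurrences of an event, a function's share of samples approximates its share of that event.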

Introducing Intel® VTune™ Performance Analyzer

• Helps identify and characterize performance issues by:
– Collecting performance data from the system running your application
– Organizing and displaying the data in a variety of interactive views, from system-wide down to the source code or processor instruction perspective
– Identifying potential performance issues and suggesting improvements
– Providing application profiling information
– Providing a tuning assistant and an extensive help system
• Besides PMU-based sampling analysis, it can also produce call graphs (not covered here)

Just a few things you can do with processor performance events

• Check whether your software:
– is NUMA-optimized (local vs. remote memory accesses)
– is cache-local or not
– is memory-bandwidth bound or not
– is branchy or not (branch mispredictions)
– has “bad” long-latency instructions on the critical path
– has performance bugs in multithreaded code (false sharing, …)
– exploits instruction parallelism well or not
• See also the article “Using Intel® VTune™ Performance Analyzer to Optimize Software for the Intel® Core™ i7 Processor Family”
  http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-for-the-intelr-coretm-i7-processor-family/

DEMO

• Intel® VTune™ Performance Analyzer in action!

Offline Analysis: VTune™ Performance Analyzer Sampling Collector

Select Event: Clock ticks, L2/L3 cache misses, branch mispredictions, etc.

Offline Analysis: Intel® VTune™ Performance Analyzer Sampling Collector

Hotspot view of one module for all OS processes and threads grouped by function (or method).

Sampling Source View Displays Source Code Annotated with Performance Data

PMU Counting Mode

• No interrupts are generated
• The application (periodically) reads the number of occurred events from the PMU counters
• Very small overhead
• Advanced online use cases become possible (see the following slides)
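In counting mode the application works with differences between two raw counter readings. Hardware counters have a finite width and eventually wrap around, so a wrap-safe difference is a useful building block. The sketch below is illustrative: the function name is hypothetical, the counter width is a parameter (it would come from the CPUID enumeration), and it assumes the counter wrapped at most once between the two reads.

```cpp
#include <cassert>
#include <cstdint>

// Number of events between two raw reads of a counter that is
// `width_bits` wide. Assumes at most one wrap-around between the reads.
std::uint64_t counter_delta(std::uint64_t before, std::uint64_t after,
                            unsigned width_bits) {
    assert(width_bits >= 1 && width_bits <= 64);
    const std::uint64_t mask =
        (width_bits == 64) ? ~std::uint64_t{0}
                           : ((std::uint64_t{1} << width_bits) - 1);
    // Unsigned subtraction wraps modulo 2^64; masking reduces it modulo
    // 2^width_bits, which yields the correct delta across a wrap.
    return (after - before) & mask;
}
```

Reading often enough that a counter cannot wrap twice between reads keeps this assumption valid; with 48-bit counters that leaves a very generous interval.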

Online Performance Counter Monitoring: Access Intel® CPU Counters* in Your Program

Terminology:
• A system consists of several sockets (= CPUs)
• A socket has a number of (logical) cores

Usage pattern:
1. Save the counter state for {core, socket, system} into state object 1
2. Run user code or experiment
3. Save the counter state for {core, socket, system} into state object 2
4. Using state objects 1 and 2, compute performance/utilization metrics

Caution: the OS may schedule different user threads on the same core (context switches).

Access not only core counters (clock ticks, L2 cache misses, etc.) but also uncore counters* (Intel® memory controllers, Intel® QPI, etc.) NEW!

* Implemented for Intel® Core™ i7, Xeon® 5500, 5600 and 7500 Processor Series (based on microarchitecture codenamed Nehalem/Westmere)

Example C++ code

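    #include <iostream>
    // plus the header of the monitoring library (not named on the slide)

    Monitor * m = Monitor::getInstance();
    if (m->good()) m->program();  // program the counters

    SystemCounterState before_sstate, after_sstate;
    before_sstate = getSystemCounterState();  // state object 1
    // [run your code here]
    after_sstate = getSystemCounterState();   // state object 2

    // Metrics are computed from the pair of states
    std::cout << "IPC: " << getIPC(before_sstate, after_sstate)
              << " L3 cache hit ratio: " << getL3CacheHitRatio(before_sstate, after_sstate)
              << " Bytes read: " << getBytesReadFromMC(before_sstate, after_sstate)
              /* << [and so on] … */ << std::endl;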
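How such metrics can be derived from two snapshots is sketched below. This is not the library's implementation: `CounterSnapshot` and the two functions are hypothetical stand-ins, and the formulas are simply the usual definitions (instructions per cycle, and hits divided by all L3 accesses).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of a few event counts (illustrative names only).
struct CounterSnapshot {
    std::uint64_t instructions;  // instructions retired
    std::uint64_t core_cycles;   // core clock cycles
    std::uint64_t l3_hits;       // L3 cache hits
    std::uint64_t l3_misses;     // L3 cache misses
};

// Instructions per cycle over the interval between two snapshots.
double ipc_between(const CounterSnapshot& before, const CounterSnapshot& after) {
    assert(after.core_cycles >= before.core_cycles);
    assert(after.instructions >= before.instructions);
    const std::uint64_t cycles = after.core_cycles - before.core_cycles;
    if (cycles == 0) return 0.0;
    return static_cast<double>(after.instructions - before.instructions) /
           static_cast<double>(cycles);
}

// Fraction of L3 accesses in the interval that hit in the cache.
double l3_hit_ratio_between(const CounterSnapshot& before,
                            const CounterSnapshot& after) {
    assert(after.l3_hits >= before.l3_hits);
    assert(after.l3_misses >= before.l3_misses);
    const std::uint64_t hits = after.l3_hits - before.l3_hits;
    const std::uint64_t misses = after.l3_misses - before.l3_misses;
    if (hits + misses == 0) return 0.0;  // no L3 accesses in the interval
    return static_cast<double>(hits) / static_cast<double>(hits + misses);
}
```

Because every metric is a function of a (before, after) pair, the same two state objects can feed any number of metrics without re-running the measured code.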

Example 1

• Compare traversal/searching in the STL list vs. the STL vector (4-byte records)
• C++ code to measure:

std::find( ds.begin(), ds.end(), ds.size());
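A self-contained version of this experiment might look as follows. It is a sketch under assumptions the slide does not state: the container is filled with the values 0 to n-1, so searching for `ds.size()` never succeeds and forces a full traversal, and wall-clock time from `std::chrono` stands in for the performance counters. The helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Build a container holding the 4-byte records 0, 1, ..., n-1.
template <typename Container>
Container make_sequence(std::size_t n) {
    Container ds(n);
    std::iota(ds.begin(), ds.end(), std::uint32_t{0});
    return ds;
}

// Time one unsuccessful search, i.e. a full traversal, in microseconds.
// `found` reports the search result so the search cannot be optimized away.
template <typename Container>
double time_full_traversal_us(const Container& ds, bool& found) {
    const auto target = static_cast<std::uint32_t>(ds.size());
    const auto start = std::chrono::steady_clock::now();
    const auto it = std::find(ds.begin(), ds.end(), target);
    const auto stop = std::chrono::steady_clock::now();
    found = (it != ds.end());
    return std::chrono::duration<double, std::micro>(stop - start).count();
}
```

Calling `time_full_traversal_us` on a `std::vector<std::uint32_t>` and on a `std::list<std::uint32_t>` of the same size gives the two timings to compare. Wrapping the same call between two counter snapshots would additionally show where the time goes, for example through cache hit ratios and bytes read from memory.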

Get CPU performance insights in real time

Intel® Performance Counter Monitor* (Linux*/Windows*)

Easily collect CPU performance data

* The name might be changed in the future

Linux* KDE* plug-in

Visualize CPU performance in real time

Advanced Examples NEW!

• Software reads data from PMUs in online fashion

Self-tuning software !!

Example 2: “CPU resource”-aware scheduling

• Problem (a simplified one):
– schedule 1000 compute-intensive and 1000 memory-bandwidth-intensive jobs on a single core
– jobs are equal in size
– unknown background activity exists
• Goal: minimize total completion time

CPU Monitoring Unaware Scheduler

[Figure: timeline (axis: time) showing memory-bandwidth-intensive background activity alongside rows labeled “compute intensive jobs” (11) and “memory-bandwidth intensive jobs” (11)]

CPU Monitoring Aware Scheduler

[Figure: timeline (axis: time) showing memory-bandwidth-intensive background activity alongside rows labeled “compute intensive jobs” (12) and “memory-bandwidth intensive jobs” (13)]

In an experiment with 2000 jobs we measured a 16% faster completion time*

* Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
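One possible decision rule for a monitoring-aware scheduler is sketched below. The slides do not specify the policy that was actually measured, so everything here is an assumption: the job queues, the `mem_bw_utilization` input (a fraction that would be derived from the uncore memory controller counters), and the 0.5 threshold are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>

enum class JobKind { Compute, Memory, None };

// Hypothetical queues of equally sized pending jobs.
struct JobQueues {
    std::size_t compute_jobs;
    std::size_t memory_jobs;
};

// Pick the next job given the currently measured memory bandwidth
// utilization (0.0 = idle memory system). The threshold is an assumption.
JobKind pick_next_job(JobQueues& q, double mem_bw_utilization,
                      double busy_threshold = 0.5) {
    assert(mem_bw_utilization >= 0.0);
    const bool memory_busy = mem_bw_utilization >= busy_threshold;
    // Memory system busy (e.g. background activity): prefer compute work.
    if (memory_busy && q.compute_jobs > 0) {
        --q.compute_jobs;
        return JobKind::Compute;
    }
    // Memory system has headroom, or no compute job is left.
    if (q.memory_jobs > 0) {
        --q.memory_jobs;
        return JobKind::Memory;
    }
    // Only compute jobs remain.
    if (q.compute_jobs > 0) {
        --q.compute_jobs;
        return JobKind::Compute;
    }
    return JobKind::None;
}
```

The idea is to run memory-bandwidth-intensive jobs when the background activity leaves bandwidth free and compute-intensive jobs when it does not, instead of alternating blindly.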

Advanced Use-Cases I

• Extend the problem (to be closer to reality):
– Schedule onto all Hyper-Threaded cores in the system
– The remaining capacities are not known a priori because the exact resource utilization of the jobs is not predictable

• Is there room for another job on this HT core?
– Should it be a compute-intensive or rather a memory-intensive job?
• CPU performance monitoring can provide more insight and help to answer these questions

Advanced Use-Cases II

• Depending on the remaining resource capacities, choose the best algorithm to compute the result
– memory-intensive or
– compute-intensive
• Choose between implementations
– single-threaded or
– multithreaded (all cores) or
– with limited threading
– and so on…
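Choosing between single-threaded, limited, and fully multithreaded implementations could be driven by per-core measurements, as in this sketch. The input vector of per-core utilization values (fractions that would be derived from the core counters) and the 0.25 idle threshold are hypothetical; the slides do not prescribe a concrete rule.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Decide how many worker threads to use from measured per-core utilization.
// A core counts as available when its utilization is below idle_threshold.
std::size_t choose_thread_count(const std::vector<double>& core_utilization,
                                double idle_threshold = 0.25) {
    assert(idle_threshold >= 0.0 && idle_threshold <= 1.0);
    const auto idle_cores = std::count_if(
        core_utilization.begin(), core_utilization.end(),
        [idle_threshold](double u) { return u < idle_threshold; });
    // Always run at least one thread, even on a fully loaded system.
    return std::max<std::size_t>(1, static_cast<std::size_t>(idle_cores));
}
```

A result of 1 selects the single-threaded implementation, a result equal to the number of cores selects the fully multithreaded one, and anything in between corresponds to limited threading.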

Conclusions and Takeaways

• Current OS CPU utilization meters are not suited for modern hardware

• Modern processor PMUs provide metrics to get deep insight into processor performance and resource utilization

• Processor performance counters are heavily used in established performance tools like Intel® VTune™ Performance Analyzer

• New, advanced use cases for PMUs in dynamic online optimization are possible: a new kind of intelligent, CPU-monitoring-aware software
