Processor Performance Counter Monitoring

Dr. Roman Dementiev
Senior Application Engineer
Software and Services Group
14 July 2010

Software & Services Group

Legal Disclaimer

Intel may make changes to specifications and product descriptions at any time, without notice.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, visit Intel Performance Benchmark Limitations.

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.intel.com/products/processor_number for details.

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations.
Virtualization technology-enabled BIOS and VMM applications are currently in development.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

Lead-free: 45nm product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). Some EU RoHS exemptions for lead may apply to other components used in the product package.

Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below 900 PPM bromine and 900 PPM chlorine.

Intel, Intel Xeon, Intel Core microarchitecture, and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

© 2009 Standard Performance Evaluation Corporation (SPEC) logo is reprinted with permission

Agenda
• CPU Utilization Monitoring
• Performance Monitoring Units (PMU) in Processors
• Offline analysis with PMU: Intel® VTune™ Performance Analyzer
• Online Dynamic Processor Monitoring NEW!

Operating System CPU Utilization Meter
• Best-known meter, exists on almost any OS
  – Shows how long the OS was in the idle/sleep loop
  – Worked well with the CPUs of the 1980s
• But OS CPU meters ignore
  – memory access stalls
  – synchronisation/locking
  – CPU I/O
  – Simultaneous multithreading (SMT)
  – Intel® Hyper-Threading
  – etc.
• How do I find out what keeps the processor busy? Or is my software just wasting compute cycles?
Existing OS CPU meters cannot predict the capacity of modern hardware

CPU Utilization Meter in Hardware?
• Modern CPU systems are very complex and consist of many units/resources that influence computation speed
[Figure: system diagram with the levels SYSTEM, SOCKET (CPU), and CORE]

Performance Monitoring Units (PMUs)
• Intel® processors have Performance Monitoring Units (PMUs) that can be programmed to count many performance-related events
  – One PMU per logical core (elapsed cycles, L1 and L2 cache events, TLB events, processed instructions; there are hundreds of events)
  – One PMU in the uncore (L3 cache, memory controller, Intel® QPI events)

Programming PMUs
• Programmed by reading/writing Model Specific Registers
• Much of the hardware and many of the events are platform specific
• The core PMU is enumerated in CPUID leaf 0AH:
  – Number of fully programmable counters (4 per logical core); each counter is assigned to count a certain event
  – Number of fixed-function counters (3 per logical core): core clocks counter, reference clock counter, instruction counter
• Some uncore and core programmable counters can be programmed only with certain types of events
• Other tricky restrictions apply; the restrictions are documented in the event list

Processor Performance Counters
Publicly documented on intel.com
• David Levinthal, “Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors”
  http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
• Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2
  http://www.intel.com/products/processor/manuals/
• Intel® Xeon® Processor 7500 Series Uncore Programming Guide
  http://www.intel.com/Assets/en_US/PDF/designguide/323535.pdf
• Peggy Irelan and Shihjong Kuo, “Performance Monitoring Unit Sharing Guide”
  http://software.intel.com/file/20476
Intel® Hyper-Threading Technology-specific:
• Drysdale, Gillespie, Valles, “Performance Insights to Intel® Hyper-Threading Technology”
  http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/
• Gillespie, Drysdale, “Intel® Hyper-Threading Technology: Analysis of the HT Effects on a Server Transactional Workload”
  http://software.intel.com/en-us/articles/intel-hyper-threading-technology-analysis-of-the-ht-effects-on-a-server-transactional-workload/

PMU Sampling Mode: The Statistical Method of Finding Hotspots
• A sampling collector (like VTune™ Performance Analyzer or Intel® Performance Tuning Utility)
  – PMU periodically interrupts the processor
    • Triggered by the occurrence of a certain number of events
  – Collects the execution context
    • Execution address in memory (CS:IP)
    • Operating system process and thread ID
    • Executable module loaded at that address
  – If you have symbols for the module, post-processing can identify the function or method at the memory address.
  – Line numbers from the symbol file can direct you to the relevant line of source code.

Introducing Intel® VTune™ Performance Analyzer
• Helps identify and characterize performance issues by:
  – Collecting performance data from the system running your application.
  – Organizing and displaying the data in a variety of interactive views, from system-wide down to source code or processor instruction perspective.
  – Identifying potential performance issues and suggesting improvements.
  – Providing application profiling information
  – Providing a tuning assistant and an extensive help system
• Besides sampling analysis with the PMU, it can also produce a call graph (not covered here)

Just a few things you can do with processor performance events
• Check if your software is NUMA-optimized (local/remote memory accesses)
• Cache-local or not
• Memory-bandwidth bound or not
• Branchy or not (branch mispredictions)
• Has “bad” long-latency instructions on the critical path
• Has performance bugs in multithreaded programs (false sharing, …)
• Exploits instruction-level parallelism well or not
• See also the article “Using Intel® VTune™ Performance Analyzer to Optimize Software for the Intel® Core™ i7 Processor Family”
  http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-to-optimize-software-for-the-intelr-coretm-i7-processor-family/

DEMO
• Intel® VTune™ Performance Analyzer in action!

Offline Analysis: VTune™ Performance Analyzer Sampling Collector
Select Event: Clock ticks, L2/L3 cache misses, branch mispredictions, etc.

Offline Analysis: Intel® VTune™ Performance Analyzer Sampling Collector
Hotspot view of one module for all OS processes and threads grouped by function (or method).

Sampling Source View
Displays Source Code Annotated with Performance Data

PMU Counting Mode
• No interrupts generated
• Application reads (periodically) the number of occurred events from the PMU counters
• Very small overhead
• Advanced online use cases are possible: see the next slides

Online Performance Counter Monitoring: Access Intel® CPU Counters* in Your Program
Terminology:
• System consists of several sockets (=CPUs)
• A socket has a number of (logical) cores
Usage pattern
1. Save counter state for {core,socket,system} into state object 1
2. Run user code or experiment
3. Save counter state for {core,socket,system} into state object 2
4. Using state objects 1 and 2, compute performance/utilization metrics
Caution: the OS may schedule different user threads on the same core (context switches), so per-core counts can include work from other threads
Access not only core counters (clock ticks, L2 cache misses, etc.) but also (NEW!) uncore counters* (Intel® memory controllers, Intel® QPI, etc.)
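
The four-step usage pattern above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the API of any Intel tool: CounterState, counter_delta, ipc, relative_frequency and measure are hypothetical names, the snapshot holds only the three fixed-function core counters (instruction counter, core clocks counter, reference clock counter), and a 48-bit counter width is assumed for wraparound handling (the real width is platform specific). Reading real counters needs privileged access to Model Specific Registers, so the sketch takes the reader as a callable and feeds it simulated values.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class CounterState:
    """Hypothetical snapshot of one core's three fixed-function counters."""
    instructions: int  # instruction counter
    core_cycles: int   # core clocks counter
    ref_cycles: int    # reference clock counter


def counter_delta(before: int, after: int, width_bits: int = 48) -> int:
    """Events between two reads, tolerating a single counter wraparound.

    width_bits=48 is an assumption; the real width is platform specific.
    """
    return (after - before) % (1 << width_bits)


def ipc(s1: CounterState, s2: CounterState) -> float:
    """Instructions per core clock tick between two snapshots."""
    cycles = counter_delta(s1.core_cycles, s2.core_cycles)
    instructions = counter_delta(s1.instructions, s2.instructions)
    return instructions / cycles if cycles else 0.0


def relative_frequency(s1: CounterState, s2: CounterState) -> float:
    """Core clock ticks per reference clock tick between two snapshots."""
    ref = counter_delta(s1.ref_cycles, s2.ref_cycles)
    core = counter_delta(s1.core_cycles, s2.core_cycles)
    return core / ref if ref else 0.0


def measure(
    read_state: Callable[[], CounterState],
    workload: Callable[[], None],
) -> Tuple[CounterState, CounterState]:
    """Steps 1 to 3 of the usage pattern: snapshot, run, snapshot."""
    before = read_state()
    workload()
    after = read_state()
    return before, after


# Simulated counter source (made-up values) so the sketch runs without hardware.
_simulated = iter([
    CounterState(instructions=1_000, core_cycles=2_000, ref_cycles=2_000),
    CounterState(instructions=4_000, core_cycles=4_000, ref_cycles=3_000),
])

state1, state2 = measure(lambda: next(_simulated), lambda: None)

# Step 4: derive metrics from the two state objects.
print(f"IPC: {ipc(state1, state2):.2f}")
print(f"core/reference clock ratio: {relative_frequency(state1, state2):.2f}")
```

Keeping the snapshots as immutable state objects and computing every metric from a pair of them means the measured code itself is never instrumented. The deltas are per core, not per thread, so the caution about context switches applies to any metric derived this way.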
