Optimizing processor performance for Wintel applications: a case study

Demand Technology Software
1020 Eighth Avenue South, Suite 6, Naples, FL 34102
phone: (941) 261-8945 fax: (941) 261-5456
e-mail: markf@demandtech.com
http://www.demandtech.com

1998 Demand Technology, Inc. Application tuning case study

■ The target application is a C program written to perform interval performance data collection for Windows NT
  ♦ It is important that this program perform well because “If you are not part of the solution, you are part of the problem.”
  ♦ Performance analysts use our product, and they are very demanding customers

1998 Demand Technology, Inc. Wintel application tuning: processor optimization 2 Application tuning case study

■ At a point in development where the code was reasonably mature and stable, I subjected it to performance analysis using several tools:
  ♦ Visual C++ version 5
  ♦ Rational Visual Quantify execution profiler
  ♦ vTune version 2.5 optimization tool

Windows NT on Intel hardware

■ In order to use vTune effectively, it helps to understand how Intel processor hardware works
  ♦ extensive Intel processor documentation ships on the product CD
■ Target environment:
  ♦ NT 4.0
  ♦ Intel Pentium and hardware

NT performance monitoring

■ Performance SeNTry™ collection agent

initialization;

loop until cycle end = TRUE;
    Win32 API calls to retrieve performance data;
    calculate;
    write data to file;
end loop;

Win32 performance monitoring API

■ Familiar and well-documented interface
  ♦ The only programmatic way to enumerate the Processes running on an NT system
  ♦ NT Performance data is structured as
    ■ Objects (records)
    ■ Counters (fields)
  ♦ Collection agents are associated with Objects
    ■ NT base Objects, including kernel Objects
    ■ extended Objects require a Perflib dll

Data Collection sets

■ Define proper subsets of a Master Collection set
  ♦ The Master Collection set defines all known Objects and Counters
  ♦ Some Objects are instanced: there can be multiple occurrences of instanced Objects
  ♦ Two Parent:Child relationships are defined
    ■ Process is the parent of Thread
    ■ Physical Disk is the parent of Logical Disk

Data Collection sets

■ Performance considerations
  ♦ Data collection is performed one Object at a time
    ■ This was necessary due to a bug in the Win32 collection services
    ■ An n:1 correspondence between Objects and their associated collection routines
  ♦ With the exception of Child Objects
    ■ They are collected at the same time as the Parent Objects
  ♦ There are many instances of some Objects
    ■ Process and Thread

Data Collection sets

■ Performance considerations
  ♦ There are compelling reasons why data collection should be done at frequent intervals
    ■ identified by Buzen and Shum, 1996
  ♦ Performance data for processes that terminate before the end of the interval is lost
  ♦ One collection interval is used for both Accumulator Counters (e.g., processor time) and Instantaneous Counters (e.g., processor Queue length)
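The difference between the two counter types can be made concrete with a small sketch. It is illustrative only: the function names and the choice of a common tick unit are assumptions, not part of the collection agent. An Accumulator Counter yields an exact interval average from just two samples, while an Instantaneous Counter yields only as many observations as there are samples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Percent busy over an interval, from two samples of an accumulating
 * busy-time counter and the matching clock readings. All four values are
 * assumed to be in the same tick unit (hypothetical). */
double accumulator_percent_busy(uint64_t busy_prev, uint64_t busy_curr,
                                uint64_t clock_prev, uint64_t clock_curr)
{
    assert(clock_curr > clock_prev);   /* the interval must be non-empty */
    assert(busy_curr >= busy_prev);    /* accumulators only count upward */
    return 100.0 * (double)(busy_curr - busy_prev)
                 / (double)(clock_curr - clock_prev);
}

/* Mean of n point-in-time observations (e.g., queue length).
 * With one sample per interval, n is 1 and this is a single snapshot. */
double instantaneous_mean(const uint32_t *samples, size_t n)
{
    double sum = 0.0;
    assert(samples != NULL && n > 0);
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}
```

Lengthening the interval does not degrade the accumulator result, but it leaves the instantaneous value resting on a single observation, which is why some Objects might be worth sampling more often.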

Data Collection sets

■ Performance considerations
  ♦ Ideally, collection should be performed at least once per minute;
  ♦ possibly, some Objects could be collected even more frequently in order to accumulate samples of Instantaneous Counter values
  ♦ Can our code handle it?

Goals of the tuning exercise

■ Profile our code execution path so that we can understand it better
  ♦ Profilers eliminate a lot of idle speculation about what your code is doing
■ Better understand the Win32 services and their interaction with our code
  ♦ We cannot change these services, but perhaps we can interact with them in better ways

Goals of the tuning exercise

■ Evaluate code optimization strategies
  ♦ optimizing Compiler options
    ■ Pentium and Pentium Pro specific optimizations
  ♦ In-line assembler
  ♦ Code restructuring
  ♦ etc.
■ Feed results forward into the development process

VC++ code profiler

■ Built-in compiler option
■ Times program functions during run time
  ♦ Must run the application under the debugger
■ Creates a text report showing:
  ♦ Function time
  ♦ Function+Child Function time
  ♦ Hit Count
■ Example: DefaultCollectionSet once per second

VC++ code profiler output

Module Statistics for dmperfss.exe
----------------------------------
    Time in module: 283541.261 millisecond
    Percent of time in module: 100.0%
    Functions in module: 155
    Hits in module: 11616795
    Module function coverage: 72.3%

        Func          Func+Child           Hit
        Time   %         Time      %      Count  Function
---------------------------------------------------------
  248146.507  87.5   248146.507  87.5        249 _WaitOnEvent (dmwrdata.obj)
    8795.822   3.1     8795.822   3.1     393329 _WriteDataToFile (dmwrdata.
    4413.518   1.6     4413.518   1.6       2750 _GetPerfDataFromRegistry (dm
    3281.442   1.2     8153.656   2.9     170615 _FormatWriteThisObjectCount
    3268.991   1.2    12737.758   4.5      96912 _FindPreviousObjectInstanceC
    2951.455   1.0     2951.455   1.0    3330628 _NextCounterDef (dmwrdata.ob

VC++ profiler: Observations

■ Our program is “sleeping” 87.5% of the time!
■ The profiler can only look at your program’s code
  ♦ If your function is spending all its time making Win32 API calls or calling other dlls, they are not visible
■ Parent-child relationships among modules are not readily apparent
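Because the wait time dominates the report, each function's share of the time the program was actually working is easier to judge once the wait is subtracted. The sketch below is not part of the profiler or of our code. Its helper names are made up for illustration, and it only redoes the arithmetic on figures taken from the report.

```c
#include <assert.h>

/* Share of the whole run, as the profiler reports it. */
double percent_of(double part_ms, double whole_ms)
{
    assert(whole_ms > 0.0);
    return 100.0 * part_ms / whole_ms;
}

/* Share of the time the program was NOT waiting: the wait time is
 * removed from the denominator before taking the percentage. */
double percent_of_active(double func_ms, double total_ms, double wait_ms)
{
    double active_ms = total_ms - wait_ms;
    assert(active_ms > 0.0);
    return 100.0 * func_ms / active_ms;
}
```

With the report's figures, percent_of(248146.507, 283541.261) reproduces the 87.5% spent in _WaitOnEvent. Subtracting that wait leaves about 35,395 ms of activity. _WriteDataToFile, at 8795.822 ms, is 3.1% of the run but roughly a quarter of that active time.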

Rational Visual Quantify

■ Add-on product
  ♦ Visual Studio “integration”
■ Select profiling at the level of the function call or the line
■ Adds instrumentation to each module during the runtime session
  ♦ Includes all shareable and relocatable exes and dlls called by your program!

Rational Visual Quantify

■ Reporting
  ♦ graphic view of your program’s critical execution path
    ■ breaks out dlls and some system services
  ♦ parent-child relationships among modules are explicit
  ♦ convenient navigation between views
■ Performs analysis of ∆ between two execution runs


Rational Visual Quantify


Rational VQ: Observations

■ Added instrumentation affects absolute function time values observed
  ♦ We only spent 32% of our time “Sleeping”
  ♦ relative timing relationships between functions appear unaffected
■ App is very intuitive and easy to use
  ♦ e.g., relationships between function calls
■ Ability to trace module execution through 3rd party functions can be very useful!

Intel vTune

■ Standalone execution profiler
■ Relies on system-wide sampling
  ♦ maps the location of the Program Counter to the module in memory
  ♦ catches every program, including the OS
■ Optionally, can also be used to report on the Pentium/Pentium Pro performance metrics during program execution
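The idea behind sampling can be sketched in a few lines. This is a conceptual illustration, not how vTune is implemented: the Module structure, the function names, and the linear search are all assumptions. Each sampled Program Counter value is matched against the address range of every loaded module, and the per-module hit counts form the profile.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char   *name;  /* module name (illustrative) */
    uintptr_t     base;  /* load address of the module */
    size_t        size;  /* size of the module image in bytes */
    unsigned long hits;  /* number of samples that landed in this module */
} Module;

/* Return the index of the module whose range contains pc, or -1. */
int find_module(const Module *mods, size_t n_mods, uintptr_t pc)
{
    for (size_t i = 0; i < n_mods; i++) {
        if (pc >= mods[i].base && pc - mods[i].base < mods[i].size)
            return (int)i;
    }
    return -1;
}

/* Attribute each sampled pc to a module; return the number of samples
 * that fell outside every known module. */
unsigned long tally_samples(Module *mods, size_t n_mods,
                            const uintptr_t *pcs, size_t n_pcs)
{
    unsigned long unknown = 0;
    assert(mods != NULL || n_mods == 0);
    for (size_t i = 0; i < n_pcs; i++) {
        int m = find_module(mods, n_mods, pcs[i]);
        if (m >= 0)
            mods[m].hits++;
        else
            unknown++;
    }
    return unknown;
}
```

Because the samples are taken system-wide, a mostly idle machine produces a profile dominated by whatever runs when nothing else does, and no instrumentation is added to the program being measured.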


Intel vTune

■ High percentage of samples showed NT running the Idle Thread!
■ Switched to Master Collection set once per second to generate more activity
  ♦ Rational VQ overhead was too high to perform a comparable test
  ♦ Result: very different profile of activity


Intel vTune

■ Hotspot analysis showed two functions accounted for > 70% of the activity inside our process address space
  ♦ NextInstanceDef
  ♦ IsPreviousAndParentSameInstance
■ vTune analyzes assembler code to assist you in taking advantage of the superscalar features of the P5

Intel processor performance overview

■ Complex Instruction set (CISC)
  ♦ Maintain upward compatibility with original 8-bit 8080 instruction set
■ With improvements in semiconductor fabrication, add
  ♦ pipelining, TLB, cache, branch prediction
    ■ 486
  ♦ elements of RISC superscalar processors
    ■ Pentium, Pentium Pro

Intel processor evolution

Processor    Year  Clock Speed  Bus Width  Addressable  Transistors
                   (MHz)        (bits)     Memory
8080         1974  2            8          64K          6,000
8086         1978  5-10         16         1 MB         29,000
8088         1979  5-8          8          1 MB         29,000
80286        1982  8-12         16         16 MB        134,000
386 DX       1985  16-33        32         4 GB         275,000
486 DX       1989  25-50        32         4 GB         1,200,000
Pentium      1993  60-233       32         4 GB         3,100,000
Pentium Pro  1995  150-200      64         4 GB         5,500,000
Pentium II   1997  233-333      64         4 GB         7,500,000

Intel processor evolution

Processor    Highlights
8080         1 chip microprocessor
8086         10X performance of 8080
8088         8 bit version of 8086
80286        Virtual Memory
386 DX       32 bit Registers
486 DX       L1 Cache; pipelined
Pentium      dual integer pipeline
Pentium Pro  microarchitecture
Pentium II   Dual bus; MMX

Intel x86 pipeline

■ 5-stage pipeline introduced with 486
  ♦ Integrated 8K data and instruction cache
  ♦ Prefetch
  ♦ Instruction Decode 1
  ♦ Instruction Decode 2 (address calculation)
  ♦ Execute
  ♦ Write Back

PF D1 D2 EX WB

Intel x86 pipeline expectations

Instruction 1  PF D1 D2 EX WB
Instruction 2     PF D1 D2 EX WB
Instruction 3        PF D1 D2 EX WB
Instruction 4           PF D1 D2 EX WB
Instruction 5              PF D1 D2 EX WB
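The expectation behind the five-instruction diagram reduces to simple arithmetic. The sketch below assumes an ideal pipeline in which every stage takes one clock and nothing ever stalls; the function names are illustrative. The first instruction needs one clock per stage to fill the pipeline, and each later instruction completes one clock after its predecessor.

```c
#include <assert.h>

/* Clocks to complete n instructions when stages overlap perfectly. */
unsigned long pipelined_cycles(unsigned stages, unsigned long n)
{
    assert(stages > 0);
    if (n == 0)
        return 0;
    return (unsigned long)stages + (n - 1);
}

/* Clocks for the same work with no overlap at all. */
unsigned long unpipelined_cycles(unsigned stages, unsigned long n)
{
    assert(stages > 0);
    return (unsigned long)stages * n;
}

/* Average clocks per instruction; approaches 1.0 as n grows. */
double ideal_cpi(unsigned stages, unsigned long n)
{
    assert(n > 0);
    return (double)pipelined_cycles(stages, n) / (double)n;
}
```

For the five instructions shown, the ideal pipeline finishes in 9 clocks rather than 25, and over a long instruction stream the ideal CPI approaches 1.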

Intel x86 pipeline performance

■ Execution time per instruction ≈ clock cycle time * cycles per instruction (CPI)
■ Intel instruction set complexity reduces the effectiveness of pipelining
  ♦ “Integer” instruction cycle times range from 1-9
  ♦ rep instruction prefixes have 4 clock startup overhead
  ♦ 32-bit address far call takes 22 clock cycles

Intel 586 dual integer pipeline

■ Improved processor performance (CPI) because certain instruction pairs can execute in parallel

        u pipe: D1 D2 EX WB
    PF
        v pipe: D1 D2 EX WB

Intel 586 dual integer pipeline

■ Rules for instruction pairing are arcane
  ♦ One-cycle instructions can usually be paired
    ■ a RISC subset within a CISC
  ♦ Instructions with both an immediate operand and an address displacement cannot be paired
  ♦ If there is an explicit Register dependency between instructions, there is no pairing
    ■ x86 only contains 8 General Purpose Registers
■ RISC processors depend on compilers that understand how to exploit them!
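A deliberately simplified model shows how a Register dependency defeats pairing. It is a sketch under stated assumptions, not the actual Pentium rule set: each instruction is reduced to a hypothetical record holding the registers it reads and writes and a flag for "simple one-cycle instruction", and every issue slot is assumed to cost exactly one clock.

```c
#include <assert.h>
#include <stddef.h>

/* Register bit masks (only four of the eight GP registers are needed here). */
enum { EAX = 1 << 0, ECX = 1 << 1, EDX = 1 << 2, EBX = 1 << 3 };

typedef struct {
    unsigned reads;   /* registers the instruction reads  */
    unsigned writes;  /* registers the instruction writes */
    int      simple;  /* nonzero for a one-cycle "RISC-like" instruction */
} Instr;

/* Simplified test: both instructions must be simple, and the second may
 * neither read nor write a register that the first one writes. */
int can_pair(const Instr *u, const Instr *v)
{
    if (!u->simple || !v->simple)
        return 0;
    if (u->writes & (v->reads | v->writes))
        return 0;
    return 1;
}

/* Count issue clocks for a straight-line sequence: two instructions share
 * a clock when they pair, otherwise one issues alone. */
size_t count_issue_cycles(const Instr *code, size_t n)
{
    size_t cycles = 0, i = 0;
    assert(code != NULL || n == 0);
    while (i < n) {
        if (i + 1 < n && can_pair(&code[i], &code[i + 1]))
            i += 2;
        else
            i += 1;
        cycles++;
    }
    return cycles;
}
```

In this model, four independent register loads issue in two clocks, while a chain in which each instruction consumes the previous result issues one instruction per clock, no matter how simple the instructions are. With only 8 General Purpose Registers, such chains are hard for a compiler to avoid.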

Intel 586 dual integer pipeline

■ Intel only manufactures hardware, so…
  ♦ Intel introduces processor measurements so that you can collect actual CPI performance statistics
  ♦ Intel introduces vTune to assist developers in analyzing software developed to run on its hardware

vTune code analysis


VC++ code generation options

■ Try different “G” optimization switches...
  ♦ G5 → P5 optimization
    ■ some improvement, but the assembler code generated in the hot spot was unchanged.

VC++ code generation options

■ C language code:

PERF_INSTANCE_DEFINITION *
NextInstanceDef ( PERF_INSTANCE_DEFINITION *pInstance )
{
    PERF_COUNTER_BLOCK *pCtrBlk;

    pCtrBlk = (PERF_COUNTER_BLOCK *)
        ((PBYTE)pInstance + pInstance->ByteLength);
    return (PERF_INSTANCE_DEFINITION *)
        ((PBYTE)pInstance + pInstance->ByteLength + pCtrBlk->ByteLength);
}

VC++ code generation options

■ Generates tight Assembler code

00408D40  mov  ecx,dword ptr [esp+4]     ; ecx = pInstance
00408D44  mov  edx,dword ptr [ecx]       ; edx = pInstance->ByteLength
00408D46  mov  eax,dword ptr [ecx+edx]   ; eax = pCtrBlk->ByteLength
00408D49  add  eax,ecx                   ; eax += pInstance
00408D4B  add  eax,edx                   ; eax += pInstance->ByteLength
00408D4D  ret
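What the function does is easier to see on a small buffer. The sketch below is a portable illustration, not the product code: InstanceDef and CounterBlock are simplified stand-ins that model only the leading ByteLength field of the real Win32 structures, and bytes_spanned is a hypothetical helper added for the example. Each instance definition is followed by its counter block, and the two lengths together give the offset of the next instance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the Win32 structures: only ByteLength is modeled. */
typedef struct { uint32_t ByteLength; } InstanceDef;
typedef struct { uint32_t ByteLength; } CounterBlock;

/* Same pointer arithmetic as NextInstanceDef: skip the instance
 * definition, then skip the counter block that follows it. */
const InstanceDef *next_instance_def(const InstanceDef *inst)
{
    const uint8_t *p = (const uint8_t *)inst;
    const CounterBlock *blk = (const CounterBlock *)(p + inst->ByteLength);
    return (const InstanceDef *)(p + inst->ByteLength + blk->ByteLength);
}

/* Walk `count` instances and return how many bytes they occupy. */
size_t bytes_spanned(const InstanceDef *first, size_t count)
{
    const InstanceDef *cur = first;
    assert(first != NULL);
    for (size_t i = 0; i < count; i++)
        cur = next_instance_def(cur);
    return (size_t)((const uint8_t *)cur - (const uint8_t *)first);
}
```

Every step of the walk needs two dependent loads, and the function is called once per instance. That is consistent with it showing up as a hot spot when a Collection set includes Objects with many instances.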

Intel 686 microarchitecture

■ Increased parallelism
■ Instructions are translated into RISC micro-instructions
■ Pool of 40 GP pseudo-Registers
■ Micro-instructions can be executed out of sequence
■ CPI ≈ rate at which instructions are retired

Intel 686 microarchitecture

■ Performance is not nearly so dependent on actual code generation, since the processor has the capability to unwind instructions and execute them out of order.
■ My vTune test runs were made on a P6!
  ♦ vTune does not analyze code running on a P6 - there should be little or no need to
  ♦ Note: internally, the Pentium II is a P6

Epilogue

■ Rational VQ and Intel vTune tests gave very different execution profiles
  ♦ Could we isolate the Collection set dependency?
  ♦ Yes, we returned to VQ using a Collection set that included Thread Objects and duplicated the vTune results
■ VQ helped us zero in on execution path issues
  ♦ but there was considerable measurement overhead
■ P6 probably makes vTune obsolete

Where to get more information

■ Windows NT Workstation 4.0 Resource Kit
■ Microsoft Developer Network CD
■ Intel vTune documentation (most is also available from Intel’s Web site)
■ Computer Architecture: A Quantitative Approach, Hennessy and Patterson
■ Pentium (Pro) Processor System Architecture, Mindshare, Inc.
■ Inner Loops, Booth
■ The Indispensable Pentium Book, Messmer
