Optimizing Processor Performance for Wintel Applications: a Case Study
Demand Technology Software
1020 Eighth Avenue South, Suite 6, Naples, FL 34102
phone: (941) 261-8945
fax: (941) 261-5456
e-mail: markf@demandtech.com
http://www.demandtech.com
1998 Demand Technology, Inc.
Wintel application tuning: processor optimization

Application tuning case study
■ The target application is a C program written to perform interval performance data collection for Windows NT
  ♦ It is important that this program perform well because “If you are not part of the solution, you are part of the problem.”
  ♦ Performance Analysts use our product, and they are very demanding customers

Application tuning case study
■ At a point in development where the code was reasonably mature and stable, I subjected it to performance analysis using several tools.
  ♦ Microsoft Visual C++ version 5
  ♦ Rational Visual Quantify execution profiler
  ♦ Intel vTune version 2.5 optimization tool

Windows NT on Intel hardware
■ In order to use vTune effectively, it helps to understand how Intel processor hardware works
  ♦ extensive Intel processor documentation ships on the product CD
■ Target environment:
  ♦ Microsoft Windows NT 4.0
  ♦ Intel Pentium and Pentium Pro hardware

NT performance monitoring
■ Performance SeNTry™ collection agent
    initialization
    loop until cycle end = TRUE;
        Win32 API calls to retrieve performance data;
        calculate;
        Write data to file;
    end loop;
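The collection agent loop above can be sketched in portable C. This is only an illustration of the retrieve / calculate / write shape, not the product's code: `CannedSource`, `next_raw_value` and `run_collection_cycle` are hypothetical names, and the canned array stands in for the Win32 API calls so the sketch stays self-contained.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the Win32 retrieval step: replays canned
 * cumulative counter values instead of querying the operating system. */
typedef struct {
    const unsigned long long *values;
    size_t count;
    size_t next;
} CannedSource;

/* Retrieve the next raw value; returns false once the cycle has ended. */
bool next_raw_value(CannedSource *src, unsigned long long *raw)
{
    if (src->next >= src->count)
        return false;
    *raw = src->values[src->next++];
    return true;
}

/* initialization; loop until cycle end: retrieve, calculate, write.
 * Returns the number of interval records written. */
size_t run_collection_cycle(CannedSource *src, FILE *out)
{
    unsigned long long previous, current;
    size_t written = 0;

    /* initialization: take a baseline so the first interval has a delta */
    if (!next_raw_value(src, &previous))
        return 0;

    while (next_raw_value(src, &current)) {              /* retrieve  */
        unsigned long long delta = current - previous;   /* calculate */
        fprintf(out, "%zu,%llu\n", written, delta);      /* write     */
        previous = current;
        written++;
    }
    return written;
}
```

With three canned readings the loop writes two interval records, because the first reading only establishes the baseline. A real agent would also wait for the next interval boundary inside the loop; that wait is left out here.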

Win32 performance monitoring API
■ Familiar and well-documented interface
  ♦ The only programmatic way to enumerate the Processes running on an NT system
  ♦ NT Performance data is structured as
    ■ Objects (records)
    ■ Counters (fields)
  ♦ Collection agents are associated with Objects
    ■ NT base Objects, including kernel Objects
    ■ extended Objects require a Perflib dll

Data Collection sets
■ Define proper subsets of a Master Collection set
  ♦ Defines all known Objects and Counters
  ♦ Some Objects are instanced: there can be multiple occurrences of instanced Objects
  ♦ Two Parent:Child relationships defined
    ■ Process is the parent of Thread
    ■ Physical Disk is the parent of Logical Disk

Data Collection sets
■ Performance considerations
  ♦ Data collection is performed one Object at a time
    ■ This was necessary due to a bug in the Win32 collection services
    ■ An n:1 correspondence between Objects and their associated collection routines
  ♦ With the exception of Child Objects
    ■ They are collected at the same time as the Parent Objects
  ♦ There are many instances of some Objects
    ■ Process and Thread

Data Collection sets
■ Performance considerations
  ♦ There are compelling reasons why data collection should be done at frequent intervals
    ■ identified by Buzen and Shum, 1996
  ♦ Performance data for processes that terminate before the end of the interval is lost
  ♦ one collection interval used for both Accumulator Counters (processor time) and Instantaneous Counters (e.g., processor Queue length)
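The difference between the two counter kinds can be made concrete with a small C sketch. The names `CounterKind`, `interval_value` and `mean_of_samples` are hypothetical, and the sketch assumes the accumulator and the elapsed time are expressed in the same units (for example, both in 100 ns ticks for processor time). An Accumulator Counter is meaningful only as a difference over the interval, while an Instantaneous Counter is a snapshot, so only more frequent sampling gives a representative picture of it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter kinds, named after the slide's terminology. */
typedef enum { COUNTER_ACCUMULATOR, COUNTER_INSTANTANEOUS } CounterKind;

/* Value to report for one interval.
 * prev, curr: raw readings at the start and end of the interval
 * elapsed:    interval length, assumed to be in the accumulator's units */
double interval_value(CounterKind kind, uint64_t prev, uint64_t curr,
                      uint64_t elapsed)
{
    if (kind == COUNTER_INSTANTANEOUS)
        return (double)curr;        /* snapshot: only the latest reading */
    if (elapsed == 0)
        return 0.0;                 /* avoid dividing by zero */
    /* accumulator: the change over the interval, as a percentage */
    return 100.0 * (double)(curr - prev) / (double)elapsed;
}

/* Average of several instantaneous samples taken within one interval. */
double mean_of_samples(const uint64_t *samples, size_t n)
{
    double sum = 0.0;
    if (n == 0)
        return 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}
```

With a single shared interval, a queue length is reported from one reading per interval; averaging several samples is what collecting some Objects more frequently would make possible.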

Data Collection sets
■ Performance considerations
  ♦ Ideally, collection should be performed at least once per minute
  ♦ Possibly, some Objects could be collected even more frequently in order to accumulate samples of Instantaneous Counter values
  ♦ Can our code handle it?

Goals of the tuning exercise
■ Profile our code execution path so that we can understand it better
  ♦ Profilers eliminate a lot of idle speculation about what your code is doing
■ Better understand the Win32 services and their interaction with our code
  ♦ We cannot change these services, but perhaps we can interact with them in better ways

Goals of the tuning exercise
■ Evaluate code optimization strategies
  ♦ optimizing Compiler options
    ■ Pentium and Pentium Pro specific optimizations
  ♦ In-line assembler
  ♦ Code restructuring
  ♦ etc.
■ Feed results forward into the development process

VC++ code profiler
■ Built-in compiler option
■ Times program functions during run time
  ♦ Must run the application under the debugger
■ Creates a text report showing:
  ♦ Function time
  ♦ Function+Child Function time
  ♦ Hit Count
■ Example: DefaultCollectionSet once per second
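The relationship between the Function time and Function+Child Function time columns can be shown with a short C sketch. `FuncNode` and `func_plus_child_ms` are hypothetical names and the call tree is a simplified model, not the profiler's internal format: Function time is what a function spends in its own code, and Function+Child time adds everything spent in the functions it calls.

```c
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical call-tree node; not the profiler's internal format. */
typedef struct FuncNode {
    const char *name;
    double self_ms;                                 /* "Function time" */
    const struct FuncNode *children[MAX_CHILDREN];  /* functions it calls */
    size_t child_count;
} FuncNode;

/* "Function+Child" time: own time plus the time of everything beneath it. */
double func_plus_child_ms(const FuncNode *node)
{
    double total = node->self_ms;
    for (size_t i = 0; i < node->child_count; i++)
        total += func_plus_child_ms(node->children[i]);
    return total;
}
```

For a leaf function the two columns are equal, which is why a function that only waits or only makes calls the profiler cannot see reports the same value in both. The sketch assumes a plain tree; recursion or a callee shared by several parents would need per-call-path accounting.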

VC++ code profiler output

Module Statistics for dmperfss.exe
----------------------------------
    Time in module: 283541.261 millisecond
    Percent of time in module: 100.0%
    Functions in module: 155
    Hits in module: 11616795
    Module function coverage: 72.3%

        Func          Func+Child           Hit
        Time   %         Time      %      Count  Function
---------------------------------------------------------
  248146.507  87.5   248146.507  87.5       249 _WaitOnEvent (dmwrdata.obj)
    8795.822   3.1     8795.822   3.1    393329 _WriteDataToFile (dmwrdata.
    4413.518   1.6     4413.518   1.6      2750 _GetPerfDataFromRegistry (dm
    3281.442   1.2     8153.656   2.9    170615 _FormatWriteThisObjectCount
    3268.991   1.2    12737.758   4.5     96912 _FindPreviousObjectInstanceC
    2951.455   1.0     2951.455   1.0   3330628 _NextCounterDef (dmwrdata.ob

VC++ profiler: Observations
■ Our program is “sleeping” 87.5% of the time!
■ Can only look at your program’s code
  ♦ If your function is spending all its time making Win32 API calls or calling other dlls, they are not visible
■ Parent-child relationships among modules are not readily apparent

Rational Visual Quantify
■ Add-on product
  ♦ Visual Studio “integration”
■ Select profiling at the level of the function call or the line
■ Adds instrumentation to each module during the runtime session
  ♦ Includes all shareable and relocatable exes and dlls called by your program!

Rational Visual Quantify
■ Reporting
  ♦ graphic view of your program’s critical execution path
    ■ breaks out dlls and some system services
  ♦ parent-child relationships among modules are explicit
  ♦ convenient navigation between views
■ Performs analysis of ∆ between two execution runs
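The idea of analyzing the ∆ between two execution runs can be illustrated with a few lines of C. This is not how Visual Quantify computes or presents its comparison; `ProfileEntry`, `time_in_run` and `run_delta_ms` are hypothetical names for a minimal sketch that matches functions by name and subtracts their times.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-function record from one profiling run. */
typedef struct {
    const char *function;
    double time_ms;
} ProfileEntry;

/* Time recorded for a function in a run; 0.0 if it does not appear. */
double time_in_run(const ProfileEntry *run, size_t n, const char *function)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(run[i].function, function) == 0)
            return run[i].time_ms;
    return 0.0;
}

/* Change from the baseline run to the tuned run (negative means faster). */
double run_delta_ms(const ProfileEntry *baseline, size_t nb,
                    const ProfileEntry *tuned, size_t nt,
                    const char *function)
{
    return time_in_run(tuned, nt, function)
         - time_in_run(baseline, nb, function);
}
```

A function that disappears from the tuned run shows a delta equal to minus its baseline time, which is the desired result when a call has been eliminated altogether.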

Rational VQ: Observations
■ Added instrumentation affects absolute function time values observed
  ♦ We only spent 32% of our time “Sleeping”
  ♦ relative timing relationships between functions appear unaffected
■ App is very intuitive and easy to use
  ♦ e.g., relationships between function calls
■ Ability to trace module execution through 3rd party functions can be very useful!

Intel vTune
■ Standalone execution profiler
■ Relies on system-wide sampling
  ♦ maps the location of the Program Counter to the module in memory
  ♦ catches every program, including the OS
■ Optionally, can also be used to report on the Pentium/Pentium Pro performance metrics during program execution

Intel vTune
■ High percentage of samples showed NT running the Idle Thread!
■ Switched to Master Collection set once per second to generate more activity
  ♦ Rational VQ overhead was too high to perform a comparable test
  ♦ Result: very different profile of activity

Intel vTune
■ Hotspot analysis showed two functions accounted for > 70% of the activity inside our process address space
  ♦ NextInstanceDef
  ♦ IsPreviousAndParentSameInstance
■ vTune analyzes x86 assembler code to assist you in taking advantage of the superscalar features of the P5
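The slides do not show the source of NextInstanceDef, so the following C sketch is only a guess at the general pattern behind such a function. It is modeled loosely on the documented layout of NT performance data, in which each instance definition and the counter block that follows it begin with their own byte length. The names `record_length`, `next_instance` and `nth_instance` are hypothetical, and the records are reduced to their length prefixes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a record's 32-bit length prefix without assuming alignment. */
static uint32_t record_length(const unsigned char *record)
{
    uint32_t length;
    memcpy(&length, record, sizeof length);
    return length;
}

/* Step to the next instance: skip the instance definition, then skip
 * the counter block that follows it. */
const unsigned char *next_instance(const unsigned char *instance)
{
    const unsigned char *counter_block = instance + record_length(instance);
    return counter_block + record_length(counter_block);
}

/* Reach the n-th instance by walking from the first one. Because the
 * records are variable length, there is no way to index directly. */
const unsigned char *nth_instance(const unsigned char *first, size_t n)
{
    while (n-- > 0)
        first = next_instance(first);
    return first;
}
```

Each step is cheap, but a lookup that walks the chain costs time proportional to its position in it. With many Process and Thread instances collected every second, a routine of this kind would be called very often, which is one plausible way for it to surface as a hotspot.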

Intel processor performance overview
■ Complex Instruction set (CISC)
  ♦ Maintain upward compatibility with original 8-bit 8080 instruction set
■ With improvements in semiconductor fabrication, add
  ♦ pipelining, TLB, cache, branch prediction
    ■ 486
  ♦ elements of RISC superscalar processors
    ■ Pentium, Pentium Pro

Intel processor evolution

Processor     Year   Clock Speed   Bus Width   Addressable   Transistors
                     (MHz)         (bits)      Memory
8080          1974   2             8           64K               6,000
8086          1978   5-10          16          1 MB             29,000
8088          1979   5-8           8           1 MB             29,000
80286         1982   8-12          16          16 MB           134,000
386 DX        1985   16-33         32          4 GB            275,000
486 DX        1989   25-50         32          4 GB          1,200,000
Pentium       1993   60-233        32          4 GB          3,100,000
Pentium Pro   1995   150-200       64          4 GB          5,500,000
Pentium II    1997   233-333       64          4 GB          7,500,000

Intel processor evolution

Processor     Highlights
8080          1 chip microprocessor
8086          10X performance 8080
8088          8 bit version of 8086
80286         Virtual