ROBUST SOFTWARE DEVELOPMENT
BUG PREVENTION AND ISOLATION
Erika Dignam and Ross Cunniff
04 April 2016
April 4-7, 2016 | Silicon Valley

ABOUT US
Ross Cunniff
• Senior Software Engineer and NVIDIA SPEC representative
• 15-year NVIDIA employee
• Over 30 years of computer engineering experience
Erika Dignam
• Technical Program Manager and Bug Triager
• Studied computer arts
• At NVIDIA for 9 years

4/25/2016

STRUCTURE
Bug types | Triage and Tools | Recap
Process Details | Bookkeeping
Prevention and Benchmarking

BUG TYPES
• Crash or TDR
• Corruption
• Performance
• SLI Scaling

TOOLS AND TRIAGE
Traces – All bug types
What is a trace? Intercepts calls between application and driver | Records to a file
[Diagram: APP, apitrace, NV Driver, file.trace]
• Apitrace (DX and OpenGL) - http://apitrace.github.io/
• Pass along .trace file – Replay, performance info, and dump API stream
• Simple to use - copy <API>.dll to executable location
• Caveats - Long reproductions mean large files | Tracing tools don’t always capture | Some apps are not tracing friendly out of the box

TOOLS AND TRIAGE
Traces
More Tracing tools
• GLIntercept (OpenGL) - https://github.com/dtrebilco/glintercept
• Useful for error states and other tracing, a little older than apitrace
• Copy opengl32.dll and gliConfig.ini to executable folder location
• Swapping in the DebugContext.ini config file can give very helpful information, for example on issues with SLI Scaling
EXAMPLE:
• OpenGL: Performance(Medium) 131234: SLI performance warning: SLI AFR copy and synchronization for texture mipmaps (42)

TOOLS AND TRIAGE
Crashes/TDR
Dump files
• Mini dump - Always helpful; right-click the process in Task Manager or Process Explorer and select “Dump to File”
• Full dump - Better, but larger
• https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx
TDR – Timeout Detection and Recovery
• Increase the TDR delay; what are the results then?
• https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx

TOOLS AND TRIAGE
CPU Profilers - Performance
Intel VTune
• In-depth perf analysis, finer tuned control, filters noise | Needs a license, not free
• https://software.intel.com/en-us/intel-vtune-amplifier-xe
AMD CodeAnalyst
• Simple, free, runs on both Intel and AMD CPUs | Less robust than VTune, no longer supported
• http://developer.amd.com/tools-and-sdks/archive/amd-codeanalyst-performance-analyzer/
App bound? Driver bound? GPU bound? Performance paths taken

TOOLS AND TRIAGE
Performance/Resources
Process Explorer
• Free quick overview tool - Check loaded .dlls, can see load on resources, memory leaks, GPU or CPU bound
• https://technet.microsoft.com/en-us/sysinternals/processexplorer.aspx
GPUView
• Free Windows tool included with the Windows Performance Toolkit (WPT)
• https://graphics.stanford.edu/~mdfisher/GPUView.html
• https://developer.nvidia.com/content/are-you-running-out-video-memory-detecting-video-memory-overcommitment-using-gpuview

PROCESS EXPLORER
[Screenshot]

TOOLS AND TRIAGE
Tools
gDEBugger
• http://www.gremedy.com/
• Free OpenGL debugging tool
• Useful for data gathering, good for tracking state changes, dynamically look at stream
• EXAMPLE:
• Polygon count information from models
• A performance bug was root-caused to one mode of the model sending significantly more polygons into the OpenGL pipeline.
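The polygon-count comparison in the gDEBugger example can also be approximated offline. The sketch below is not gDEBugger's or apitrace's API. It assumes a hypothetical list of (primitive mode, vertex count) pairs, for instance copied by hand from a dumped API stream, and applies the standard OpenGL primitive-assembly rules to estimate how many primitives each mode of a model submits. The mode names and vertex counts in the usage section are made up for illustration.

```python
# Sketch only: the draw-call lists are hypothetical inputs, not tool output.

def primitives_in_draw(mode, vertex_count):
    """Return how many primitives OpenGL assembles for one draw call."""
    if mode == "GL_TRIANGLES":
        return vertex_count // 3
    if mode in ("GL_TRIANGLE_STRIP", "GL_TRIANGLE_FAN"):
        # The first two vertices start the strip/fan; each later vertex adds one triangle.
        return max(vertex_count - 2, 0)
    if mode == "GL_QUADS":  # legacy (compatibility profile) primitive
        return vertex_count // 4
    raise ValueError(f"unhandled primitive mode: {mode}")


def total_primitives(draw_calls):
    """Sum primitives over an iterable of (mode, vertex_count) pairs."""
    return sum(primitives_in_draw(mode, count) for mode, count in draw_calls)


if __name__ == "__main__":
    # Hypothetical draw calls for two display modes of the same model.
    draws_by_mode = {
        "mode_a": [("GL_TRIANGLES", 3000), ("GL_TRIANGLE_STRIP", 502)],
        "mode_b": [("GL_TRIANGLES", 300000), ("GL_TRIANGLE_STRIP", 502)],
    }
    for name, draws in draws_by_mode.items():
        print(name, total_primitives(draws))
```

A large gap between two modes of the same model is a hint worth attaching to the bug report, alongside the trace itself.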

NVIDIA TOOLS AND LOGS
NVIDIA OpenGL Driver Error codes
External Swak = Swiss Army Knife
• NVIDIA tool used to capture detailed system information
• Only available under NDA, on the partners site
WSAppNotifier.exe – Profiles
• For application profile problems, tells you which profiles are running/applied
• You may have to launch the app twice
• NDA only, on partner site

WSAPPNOTIFIER.EXE
[Screenshot]

TRIAGE/DEBUGGING
Profiles – Things to Try
Changing Global Profiles
• Workstation App - Dynamic Streaming | Turns off some optimized driver paths
• 3D App – Game Development | Simulates a GeForce
• SLI Aware Application | SLI performance testing
• Threaded optimization = OFF | In Profile settings
Notebooks
• Try setting NVIDIA GPU to default | In profiles or SBIOS if available

RECAP
What tools for what bugs
Crash or TDR
• TDR Delay RegKeys | Collect dump files | Trace | GPUView
Corruption
• Trace | Changing profiles
Performance
• Changing profiles | apitrace | VTune/CodeAnalyst
SLI Scaling
• Debug Context from GLIntercept

TRIAGE/DEBUGGING
Vulkan
https://www.khronos.org/vulkan/
• New API that puts the application developer in control; the app developer manages GPU memory and resources
Built-in Validation Layer – API violations
SDK - https://vulkan.lunarg.com/signin | Need account
Demos
• https://github.com/SaschaWillems/Vulkan | https://github.com/McNopper/Vulkan
RenderDoc | Graphics Debugger
https://github.com/baldurk/renderdoc

TRIAGE/DEBUGGING
Vulkan
Vulkan Talks
• S6818 – Vulkan and NVIDIA: The Essentials
• S6138 – GPU Driven Rendering in Vulkan and OpenGL
• S6133 – VKCPP: A C++ Layer on Top of Vulkan
• Three Hangouts, Monday and Tuesday afternoons
Resources
• https://github.com/KhronosGroup/Khronosdotorg/blob/master/api/vulkan/resources.md

BUG PROCESS
Normal External Bug Flow
• External Bug -> QA -> Triage -> Engineering
Accounts to file bugs
• partners.nvidia.com – Needs NDA
• developer.nvidia.com/join
• Access to early release drivers and NVIDIA tools, report bugs!

BUG PROCESS
Overview
NVBUGS
Start by filing as a software issue
Important to have basic reproduction steps
• OS, driver, card, application and version if applicable, system information, frequency
• Severity and impact for you
• Type - Performance, Crash, Corruption, TDR
Regression information is very helpful if it can be provided

TOOLS AND TRIAGE
Overview
Simple app/license
• A trace would be great, no license/app/model needed
• Avoids delays, very useful when a third party has a repro others can’t get
• If not possible, then models/scenes/app/license/demo will be needed – Time sink
What to attach to bugs
• Logs, traces, performance snapshots, dump files, videos, event logs
• System information via externSwak (NVTOOL)

WHAT HAPPENS TO YOUR BUG
Fixes -> Driver | Branches
ODE = Optimized Driver for Enterprise
• Long lived branch
• Multiple releases or dot version per branch
• For production use and certification
QNF = Quadro New Feature
• Short lived branch
• One release per branch
• Release driver for testing new features and fixes
WHQL = Windows Hardware Quality Labs Testing and Signed

PREVENTION
What NVIDIA does
ATP and QA
• We have QA teams with application experts around the world testing applications, GPUs, OSs, and drivers
• ATP is our automated test harness for further testing to cover more configurations
DVS
• Driver Validation System. Automated and run with every single code change. 10 million images/tests per day
German Test Lab and Global Test Lab
• 24/7 automated testing of professional applications and features

PREVENTION
Best Process
We want benchmarks and test suites!
• Early detection of bugs and issues
• Early detection of performance regressions
• Get involved in industry standard benchmarks, for example SPEC
Over to Ross to discuss Performance Benchmark creation!
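
As an illustration of "early detection of performance regressions", a test suite can compare benchmark scores from a new build against a known-good baseline. The sketch below is a minimal example under stated assumptions: the function names, the use of the median over several runs, and the 5% default tolerance are all choices made here for illustration, and it does not describe how DVS or ATP work internally.

```python
from statistics import median

# Sketch only: names and the default tolerance are assumptions for illustration.

def relative_change(baseline_scores, candidate_scores):
    """Relative change of the candidate median versus the baseline median.

    Scores are "higher is better" (for example frames per second).
    A result of -0.10 means the candidate is 10% slower.
    """
    if not baseline_scores or not candidate_scores:
        raise ValueError("need at least one score per build")
    base = median(baseline_scores)
    cand = median(candidate_scores)
    return (cand - base) / base


def is_regression(baseline_scores, candidate_scores, tolerance=0.05):
    """Flag a regression when the candidate is slower by more than `tolerance`."""
    return relative_change(baseline_scores, candidate_scores) < -tolerance


if __name__ == "__main__":
    # Hypothetical scores from three runs of each build.
    baseline = [100.0, 101.0, 99.0]
    candidate = [90.0, 91.0, 89.0]
    print("regression" if is_regression(baseline, candidate) else "ok")
```

Using the median of several runs keeps a single noisy run from triggering a false alarm; the tolerance should be tuned to the run-to-run variation actually observed for the benchmark.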

PERFORMANCE BENCHMARKING
A key to high-quality user experience

“WHEN YOU CANNOT MEASURE IT…
…your knowledge is of a meagre and unsatisfactory kind” – Lord Kelvin
Anything a computer can do, a human can do. Given enough time…
Computers are accelerators.
Without good performance, user experience is bad.
Benchmarking is the technique to ensure repeatable performance

WHAT MAKES A BENCHMARK?
Originally a surveying mark which provided a repeatable reference for placing a leveling rod.
Key attributes:
#1: repeatable
#2: accurate
#3: reportable

UNITS ARE NOT BENCHMARKS
Many common units exist: MIPS, FLOPS, FPS, LPM, …
Just because you can run a test and get units out does not make your test a benchmark
Quiz: if your test returns a result of 60 FPS, what might you be measuring? What about 30, 20, 15, … FPS?

REPEATABILITY
First principle: make sure the same operations are benchmarked on all configs
Most benchmarks exhibit some randomness in performance
The causes are many; some examples:
• Non-deterministic operating system process / thread scheduler
• Disk I/O – variable times to reach a sector with rotational media; variable wear leveling for solid state media
• Build-to-build variation due to cache layout changes
• Virus scan cycles
Rule of thumb: a variation of up to 5% is generally acceptable (if higher, use multiple runs and rely on regression toward the mean)

ACCURACY
“Do these numbers reflect reality?”
Always verify assumptions. Do you expect your benchmark to be GPU limited? Then verify on GPUs with different performance levels.
Faster is not always better – ensure work is actually being done that reflects end-user experience
A good benchmark has a means to verify correct operation
Make sure the key portion of your benchmark runs long enough that you can actually measure its performance, not virtual memory subsystem latency or other irrelevant metrics

NOTES ON TUNING
If you are not measuring properly, you might not be able to make improvements
60Hz example – sync-to-vblank (default on NVIDIA)
Bottleneck shift. A graphics benchmark may start CPU/API-limited, then after tuning move to being limited by GPU vertex processing. Or even change to being limited by pixel-processing as window sizes change or as workloads