April 4-7, 2016 | Silicon Valley

ROBUST SOFTWARE DEVELOPMENT
BUG PREVENTION AND ISOLATION
Erika Dignam and Ross Cunniff
04 April 2016

ABOUT US

Ross Cunniff
Senior Software Engineer and NVIDIA SPEC representative. 15-year NVIDIA employee. Over 30 years of computer engineering experience.

Erika Dignam
Technical Program Manager and Bug Triager. Studied computer arts. At NVIDIA for 9 years.

STRUCTURE
Bug types | Triage and Tools | Recap
Process Details | Bookkeeping
Prevention and Benchmarking

BUG TYPES

Crash or TDR

Corruption

Performance

SLI Scaling

TOOLS AND TRIAGE
Traces – All bug types

What is a trace? Intercepts calls between application and driver | Records to a file

[Diagram: APP, apitrace, NV Driver; apitrace records to file.trace]

• Apitrace (DX and OpenGL) - http://apitrace.github.io/
• Pass along .trace file – Replay, performance info, and dump API stream
• Simple to use - copy .dll to executable location
• Caveats - Long reproductions mean large files | Tracing tools don’t always capture | Some apps are not tracing-friendly out of the box


TOOLS AND TRIAGE
Traces

More tracing tools
• GLIntercept (OpenGL) - https://github.com/dtrebilco/glintercept
• Useful for error states and other tracing; a little older than apitrace
• Copy OpenGL32.dll and gliConfig.ini to the executable folder location
• Swapping in the DebugContext.ini config file can give very helpful information, for example about issues with SLI scaling

EXAMPLE:

• OpenGL: Performance(Medium) 131234: SLI performance warning: SLI AFR copy and synchronization for texture mipmaps (42)


TOOLS AND TRIAGE
Crashes/TDR

Dump files

• Mini dump - Always helpful. Right-click the process in Task Manager or Process Explorer and select “Dump to File”

• Full dump - Better, but larger

• https://msdn.microsoft.com/en-us/library/windows/desktop/bb787181(v=vs.85).aspx

TDR – Timeout Detection and Recovery

• Increase the TDR delay. What are the results then?

• https://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx
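
As a rough sketch of "increase the TDR delay": the Microsoft page above documents a `TdrDelay` REG_DWORD value (in seconds) under the `GraphicsDrivers` registry key. The helper below only builds the text of a .reg file. The function name and the 10 second value are illustrative assumptions, not NVIDIA guidance. A reboot is typically needed before a changed value takes effect.

```python
def tdr_delay_reg(seconds: int) -> str:
    """Return .reg file text that sets TdrDelay to `seconds` (hypothetical helper)."""
    if not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must be a positive 32-bit value")
    # Registry key documented by Microsoft for TDR settings.
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{key}]\r\n"
        f'"TdrDelay"=dword:{seconds:08x}\r\n'  # DWORD values are written as 8 hex digits
    )


if __name__ == "__main__":
    # Print the file content; save it as e.g. tdr_delay.reg and import it by hand.
    print(tdr_delay_reg(10))
```

If the crash goes away with a longer delay, the workload is probably just slow rather than hung, which is useful information to put in the bug report.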

TOOLS AND TRIAGE
CPU Profilers - Performance

VTune
• In-depth analysis, finer tuned control, filters noise | Needs a license, not free
• https://software.intel.com/en-us/intel-vtune-amplifier-xe

AMD CodeAnalyst
• Simple, free, runs on both Intel and AMD CPUs | Less robust than VTune, no longer supported
• http://developer.amd.com/tools-and-sdks/archive/amd-codeanalyst-performance-analyzer/

App bound? Driver bound? GPU bound? Performance paths taken

TOOLS AND TRIAGE
Performance/Resources

Process Explorer
• Free quick overview tool - Check loaded .dlls, can see load on resources, memory leaks, GPU or CPU bound
• https://technet.microsoft.com/en-us/sysinternals/processexplorer.aspx

GPUView
• Free Windows tool included with the Windows Performance Toolkit (WPT)
• https://graphics.stanford.edu/~mdfisher/GPUView.html
• https://developer.nvidia.com/content/are-you-running-out-video-memory-detecting-video-memory-overcommitment-using-gpuview

PROCESS EXPLORER

TOOLS AND TRIAGE
Tools

gDEBugger
• http://www.gremedy.com/
• Free OpenGL debugging tool
• Useful for data gathering, good for tracking state changes, dynamically look at stream
• EXAMPLE:

• Polygon count information from models

• The performance bug was root-caused to one mode of the model sending significantly more polygons into the OpenGL pipeline.


NVIDIA TOOLS AND LOGS

NVIDIA OpenGL Driver Error codes

External Swak = Swiss Army Knife
• NVIDIA tool used to capture detailed system information
• Only available under NDA, on the partners site

WSAppNotifier.exe – Profiles
• For application profile problems, tells you which profiles are running/applied
• You may have to launch the app twice
• NDA only, on the partners site


WSAPPNOTIFIER.EXE

TRIAGE/DEBUGGING
Profiles – Things to Try

Changing Global Profiles

App - Dynamic Streaming | Turns off some optimized driver paths

• 3D App – Game Development | Simulates a GeForce

• SLI Aware Application | SLI performance testing

• Threaded optimization = OFF | In Profile settings

Notebooks

• Try setting NVIDIA GPU to default | In profiles or SBIOS if available

RECAP
What tools for what bugs

Crash or TDR
• TDR Delay RegKeys | Collect dump files | Trace | GPUView

Corruption
• Trace | Changing profiles

Performance
• Changing profiles | apitrace | VTune/CodeAnalyst

SLI Scaling
• Debug Context from GLIntercept

TRIAGE/DEBUGGING
Vulkan

https://www.khronos.org/vulkan/
• New API that puts the application developer in control; the app developer manages GPU memory and resources

Built-in Validation Layer – API violations

SDK - https://vulkan.lunarg.com/signin | Need account

Demos
• https://github.com/SaschaWillems/Vulkan | https://github.com/McNopper/Vulkan

RenderDoc | Graphics Debugger
https://github.com/baldurk/renderdoc


TRIAGE/DEBUGGING
Vulkan

Vulkan Talks

• S6818 – Vulkan and NVIDIA: The Essentials

• S6138 – GPU Driven Rendering in Vulkan and OpenGL

• S6133 – VKCPP: A C++ Layer on Top of Vulkan

• Three Hangouts, Monday and Tuesday afternoons

Resources

• https://github.com/KhronosGroup/Khronosdotorg/blob/master/api/vulkan/resources.md

BUG PROCESS

Normal External Bug Flow
• External Bug -> QA -> Triage -> Engineering

Accounts to file bugs
• partners.nvidia.com – Needs NDA
• developer.nvidia.com/join
• Access to early release drivers and NVIDIA tools, report bugs!

BUG PROCESS
Overview

NVBUGS
Start by filing as a software issue
Important to have basic reproduction steps
• OS, driver, card, application and version if applicable, system information, frequency
• Severity and impact for you
• Type - Performance, Crash, Corruption, TDR
Regression information is very helpful if it can be provided


TOOLS AND TRIAGE
Overview

Simple app/license
• A trace would be great, no license/app/model needed
• Avoids delays, very useful when a third party has a repro others can’t get
• If not possible, then models/scenes/app/license/demo will be needed – Time sink

What to attach to bugs
• Logs, traces, performance snapshots, dump files, videos, event logs
• System information via externSwak (NVTOOL)

WHAT HAPPENS TO YOUR BUG
Fixes -> Driver | Branches

ODE = Optimized Driver for Enterprise
• Long-lived branch
• Multiple releases or dot versions per branch
• For production use and certification

QNF = Quadro New Feature
• Short-lived branch
• One release per branch
• Release driver for testing new features and fixes

WHQL = Windows Hardware Quality Labs Testing and Signed

PREVENTION
What NVIDIA does

ATP and QA
• We have QA teams with application experts around the world testing applications, GPUs, OSs, and drivers
• ATP is our automated test harness for further testing to cover more configurations

DVS
• Driver Validation System. Automated and run with every single code change. 10 million images/tests per day

German Test Lab and Global Test Lab
• 24/7 automated testing of professional applications and features

PREVENTION
Best Process

We want benchmarks and test suites!

• Early detection of bugs and issues

• Early detection of performance regressions

• Get involved in industry-standard benchmarks, for example SPEC

Over to Ross to discuss Performance Benchmark creation!

PERFORMANCE BENCHMARKING
A key to high-quality user experience

“WHEN YOU CANNOT MEASURE IT…
…your knowledge is of a meagre and unsatisfactory kind” – Lord Kelvin

Anything a computer can do, a human can do. Given enough time…
Computers are accelerators. Without good performance, user experience is bad.
Benchmarking is the technique to ensure repeatable performance

WHAT MAKES A BENCHMARK?

Originally a surveying mark which provided a repeatable reference for placing a leveling rod.
Key attributes:
#1: repeatable
#2: accurate
#3: reportable

UNITS ARE NOT BENCHMARKS

Many common units exist: MIPS, FLOPS, FPS, LPM, …
Just because you can run a test and get units out does not make your test a benchmark
Quiz: if your test returns a result of 60 FPS, what might you be measuring? What about 30, 20, 15, … FPS?
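
One plausible answer to the quiz is that the test is measuring the display refresh rate rather than the application. The sketch below assumes a simple model in which every frame waits for the next vertical blank on a 60 Hz display. The function name and the model itself are illustrative assumptions, not a description of any particular driver.

```python
import math


def vsync_fps(frame_time_ms: float, refresh_hz: float = 60.0) -> float:
    """FPS reported when each frame must wait for the next vertical blank (simplified model)."""
    refresh_ms = 1000.0 / refresh_hz
    # A frame occupies a whole number of refresh intervals, at least one.
    # The small epsilon guards against float noise at exact multiples.
    intervals = max(1, math.ceil(frame_time_ms / refresh_ms - 1e-9))
    return refresh_hz / intervals


if __name__ == "__main__":
    for ms in (5.0, 17.0, 34.0, 51.0):
        print(f"{ms:5.1f} ms of work per frame -> {vsync_fps(ms):.0f} FPS")
```

Under this model a frame that takes 5 ms and one that takes 16 ms both report 60 FPS, so the result says nothing about how much headroom the application has.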

REPEATABILITY

First principle: make sure the same operations are benchmarked on all configs
Most benchmarks exhibit some randomness in performance
The causes are many; some examples:

Non-deterministic process / thread scheduler

Disk I/O – variable times to reach a sector with rotational media; variable wear leveling for solid state media

Build-to-build variation due to cache layout changes

Virus scan cycles

Rule of thumb: a variation of up to 5% is generally acceptable (if higher, use multiple runs and rely on regression toward the mean)
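
A minimal sketch of checking the 5% rule of thumb over several runs. It assumes "variation" means the coefficient of variation (sample standard deviation divided by the mean). The slides do not say which measure is intended, so treat that choice and the function names as assumptions.

```python
import statistics


def relative_variation(samples):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.fmean(samples)


def is_repeatable(samples, threshold=0.05):
    """True when run-to-run variation is within the threshold (5% by default)."""
    return relative_variation(samples) <= threshold


if __name__ == "__main__":
    runs_fps = [100.0, 101.0, 99.0, 100.5]  # made-up scores from four runs
    print(f"variation: {relative_variation(runs_fps):.2%}")
    print(f"repeatable: {is_repeatable(runs_fps)}")
    # When variation is too high, report the mean of many runs instead of one run.
    print(f"mean score: {statistics.fmean(runs_fps):.2f}")
```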

ACCURACY

“Do these numbers reflect reality?”
Always verify assumptions. Do you expect your benchmark to be GPU limited? Then verify on GPUs with different performance levels.
Faster is not always better – ensure work is actually being done that reflects end-user experience
A good benchmark has a means to verify correct operation
Make sure the key portion of your benchmark runs long enough that you can actually measure its performance, not virtual memory subsystem latency or other irrelevant metrics
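
The last two points can be sketched as a timing loop that repeats the workload until a minimum wall-clock time has passed and then checks the result. The names `measure`, `min_seconds`, and `expected` are hypothetical, and a real graphics benchmark would verify rendered output rather than a return value.

```python
import time


def measure(workload, min_seconds=0.2, expected=None):
    """Run `workload` repeatedly for at least `min_seconds`; return iterations per second."""
    iterations = 0
    start = time.perf_counter()
    while True:
        result = workload()
        iterations += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    # Verify correct operation: a fast but wrong run is not a valid score.
    if expected is not None and result != expected:
        raise RuntimeError(f"workload returned {result!r}, expected {expected!r}")
    return iterations / elapsed


if __name__ == "__main__":
    rate = measure(lambda: sum(range(1000)), expected=499500)
    print(f"{rate:.0f} iterations per second")
```

Only the final result is checked so that verification stays outside the timed region as much as possible.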

NOTES ON TUNING

If you are not measuring properly, you might not be able to make improvements
60Hz example – sync-to-vblank (default on NVIDIA)
Bottleneck shift. A graphics benchmark may start CPU/API-limited, then after tuning move to being limited by GPU vertex processing. Or even change to being limited by pixel processing as window sizes change or as workloads shift
Constantly re-evaluate benchmark assumptions when tuning

REPORTABILITY

Your benchmark should yield a metric – FPS, LPM, etc. – that is easily collected for further processing
Output in standard formats – CSV, JSON, XML – many tools to format and compare
If your benchmark is repeatable, accurate, and has good reports, you should be able to track performance over multiple builds / revisions of your application
You will also be able to track performance over other changing variables: OS, CPU, GPU driver, memory size, …
Important: select a reference score, and keep it constant if at all possible – avoid normalization of deviance
If weighting multiple subtests, consider relative importance of subtest to your user community. Use the geometric mean where appropriate.
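
A small sketch of a composite score reported as JSON. It assumes each subtest yields a higher-is-better metric, normalizes each one against a fixed reference score, and combines the ratios with a weighted geometric mean. The subtest names, weights, and numbers are made up for illustration.

```python
import json
import math


def composite_score(results, reference, weights):
    """Weighted geometric mean of per-subtest ratios (measured / reference)."""
    total_weight = sum(weights.values())
    log_sum = sum(
        weights[name] * math.log(results[name] / reference[name])
        for name in weights
    )
    return math.exp(log_sum / total_weight)


def report(results, reference, weights):
    """Serialize subtest results and the composite score as JSON text."""
    payload = {
        "subtests": results,
        "composite": round(composite_score(results, reference, weights), 4),
    }
    return json.dumps(payload, indent=2, sort_keys=True)


if __name__ == "__main__":
    reference = {"wireframe": 60.0, "shaded": 60.0}  # fixed reference scores
    results = {"wireframe": 120.0, "shaded": 30.0}   # made-up measurements (FPS)
    weights = {"wireframe": 50, "shaded": 50}         # percentages summing to 100
    print(report(results, reference, weights))
```

With the geometric mean, a subtest that doubles and one that halves cancel out, which an arithmetic mean of ratios would not do.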


EXAMPLE BENCHMARKS – SPEC APC

Clockwise from right:
• 3dsmax 2015
• PTC Creo 3
• Maya 2012
• SNX 8.5
• Solidworks 2015

EXAMPLE BENCHMARKS - SPEC VIEWPERF 12

Clockwise from right: Catia-04, Creo-01, Maya-04, Medical-01, Showcase-01, SNX-02, SW-03

SNX-02 RESULTS SNAPSHOT
Generated automatically from XML produced by viewset

SNX-02 DETAILS

[Figure panels: Test 1, Test 2, Test 5, Test 6, Test 8, Test 10]

MORE SNX-02 DETAILS

Note varying weights – sum of all is 100%

MORE INFORMATION

SPEC benchmarking group – http://www.spec.org
SPEC Graphics and Workstation Performance Group (GWPG): http://www.spec.org/gwpg/publish/gpcfaqs.html
Contribute to SPEC GWPG: http://www.spec.org/gwpg/publish/develop_bench.html

“SPEC's Graphics and Workstation Performance Group (SPEC/GWPG) is seeking ISVs, software user groups, publication editors and testing lab directors to help develop and maintain standardized benchmarks based on professional graphics and workstation applications. Organizations or individuals can submit existing benchmarks for consideration by a project group or help the group develop an entirely new benchmark.”

BENCHMARK CALL TO ACTION

Benchmark what matters to your users!
Create good benchmarks
Help us help you! – share your benchmark with NVIDIA and we will put it in our driver regression automation suite to prevent performance bugs
Consider sharing application benchmarks with SPEC

CALL TO ACTION

Help us help you!
Create good unit tests and benchmarks
Make use of available software development and analysis tools
Be systematic in development and testing
Share your unit tests and benchmarks with us (especially if there is a problem)
Be clear and concise in your bug reports

CONTACT US

Ross Cunniff – [email protected]

Erika Dignam – [email protected]


QUESTIONS?

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join

TRIAGE/DEBUGGING
Things to Look At

Looking at dumps

• Use WinDbg (in the Windows SDK) and load a dump file

• Check the call stack, see who’s there

Performance

• Check where time is spent in your perf logs
