Debugging Memory Leaks TotalView Overview Hybrid and Accelerated Applications

• What do we see – NVIDIA Tesla GP-GPU computational accelerators – Intel Xeon Phi Coprocessors – Complex memory hierarchies (numa, device vs host, etc) – Custom languages such as CUDA and OpenCL – Directive based programming such as OpenACC and OpenMP – Core and thread counts going up • A lot of complexity to deal with if you want performance – C or Fortran with MPI starts to look “simple” – Everything is Multiple Languages / Parallel Paradigms – Up to 4 “kinds” of parallelism (cluster, thread, heterogeneous, vector) – Data movement and load balancing

© 2014 Rogue Wave Software, Inc. All Rights Reserved 3 How does Rogue Wave help?

TotalView

• Troubleshooting and analysis tool

– Visibility Into

– Control Over

• Scalability

• Usability

• Support for HPC platforms and languages

© 2014 Rogue Wave Software, Inc. All Rights Reserved What is TotalView®?

Application Analysis and Tool: Code Confidently • Debug and Analyse C/C++ and Fortran on Linux™, Unix or Mac OS X • Laptops to supercomputers • Makes developing, maintaining, and supporting critical apps easier and less risky

Major Features • Easy to learn graphical user interface with data visualization • Parallel Debugging – MPI, Pthreads, OpenMP™, GA, UPC – CUDA™, OpenACC®, and Intel® Xeon Phi™ coprocessor • Low tool overhead resource usage • Includes a Remote Display Client which frees you to work from anywhere • Memory Debugging with MemoryScape™ • Deterministic Replay Capability Included on Linux/x86-64 • Non-interactive Batch Debugging with TVScript and the CLI • TTF & C++View to transform user defined objects

© 2014 Rogue Wave Software, Inc. All Rights Reserved TotalView for the NVIDIA ® GPU Accelerator

• NVIDIA Tesla K40 • NVIDIA CUDA 5.0 and 5.5 new in 8.13 – With limited support for dynamic parallelism • Cray CCE OpenACC • Features and capabilities include – Support for MPI based clusters and multi-card configurations – Flexible Display and Navigation on the CUDA device • Physical (device, SM, Warp, Lane) • Logical (Grid, Block) tuples – CUDA device window reveals what is running where – Support for types and separate memory address spaces – Leverages CUDA memcheck

© 2014 Rogue Wave Software, Inc. All Rights Reserved HRL Laboratories

• “In the first full day of using TotalView, we were quickly able to solve the bug that had us stumped for weeks. “ Kirill Minkovich, HRL Laboratories

• Neural Network Simulation in the Defense Industry

• Need to adopt new technology and achieve massive scaling on a deadline

• Project went from failing to over-achieving

© 2014 Rogue Wave Software, Inc. All Rights Reserved 7 TotalView for the Intel® Xeon Phi™ coprocessor

Supports All Major Intel Xeon Phi Coprocessor Configurations • Native Mode – With or without MPI • Offload Directives – Incremental adoption, similar to GPU • Symmetric Mode - New in 8.13 – Host and Coprocessor • Multi-device, Multi-node • Clusters

User Interface • MPI Debugging Features – Process Control, View Across, Shared Breakpoints • Heterogeneous Debugging – Debug Both Xeon and Intel Xeon Phi Processes Memory Debugging - New in 8.13 • Both native and symmetric mode

© 2014 Rogue Wave Software, Inc. All Rights Reserved Beacon Project

• 210 Tflops, 11,520 Xeon Phi Coprocessor Cores • #1 on the Dec 2012 Green 500

• 9 scientific teams optimizing apps including: MHD, Plasma, Cosmology, Chemistry, QCD, Bio- informatics…

• Porting & optimization – Hybrid MPI + OpenMP – Many more threads than previous paradigms

• Subtle issues might present themselves – And did

© 2014 Rogue Wave Software, Inc. All Rights Reserved 9 Debugging on Beacon with TotalView

• OpenMP Hybridization of Boltzman BGK – Correctness issues came up with the OpenMP code – Troubleshooting with TotalView • Native mode debugging on the Xeon Phi • Thread level examination of the OpenMP region • Comparison of data between threads • … Clarified otherwise puzzling results – Developers were able to resolve the correctness issue – Ultimately obtained performance gains on the Xeon Phi • And correct results

© 2014 Rogue Wave Software, Inc. All Rights Reserved 10 What Is MemoryScape®?

• Runtime Memory Analysis : Eliminate Memory Errors – Detects memory leaks before they are a problem – Explore heap memory usage with powerful analytical tools – Use for validation as part of a quality software development process

• Major Features – Included in TotalView, or Standalone – Detects • Malloc API misuse • Memory leaks • Buffer overflows – Supports • Now includes Intel® Xeon Phi™ • C, C++, Fortran • Linux, Unix, and Mac OS X • MPI, pthreads, OMP, and remote apps – Low runtime overhead – Easy to use • Works with vendor libraries • No recompilation or instrumentation

© 2014 Rogue Wave Software, Inc. All Rights Reserved Dassault Systems

• “MemoryScape enabled us to identify memory • “MemoryScapeissues, and by enabled using usits to scripting … proactively interface, fix potential we were able to automate the evaluation process. Now, the errorssystem prior automatically to release.” uncovers any hidden latent errors– Nick Monyatovsky in our code, Software with Engineer every build, allowing our developers to proactively fix potential errors prior to • Computerrelease.” Aided Engineering ISV for Aero/Auto/Industry – Nick Monyatovsky, Software Engineer at SIMULIA

• Struggling with intermittent errors

• Continuous Integration – Better Product Quality

© 2014 Rogue Wave Software, Inc. All Rights Reserved 12 Deterministic Replay Debugging

• Reverse Debugging: Radically simplify your debugging

– Captures and Deterministically Replays • Not just “checkpoint and restart” – Eliminate the Restart Cycle and Hard-to-Reproduce Bugs – Step Back and Forward by Function, Line, or Instruction

• Specifications – A feature included in TotalView on Linux x86 and x86-64 • No recompilation or instrumentation • Explore data and state in the past just like in a live process, including C++View transformations – Replay on Demand: enable it when you want it – Supports MPI on Ethernet, Infiniband, Cray XE Gemini – Supports Pthreads, and OpenMP

© 2014 Rogue Wave Software, Inc. All Rights Reserved Cambridge Study

• Survey conducted by the Judge Business School at Cambridge University concluded that Reverse allow users, on average, to spend 13% less of their programming time debugging. – Programming was 50% of total work week on average – Debugging was 50% of programming time without reverse debugging – Debugging was 37% of programming time with reverse debugging – That frees up 130 hours (>3 work weeks, 6.5% total time) per developer per year for design and new feature development • The survey looked at total value (salaries & overhead) of debugging as a task and they determined that this savings could, across the whole world economy, be work $41 billion in increased productivity. – The productivity improvement should be worth $2,500 per developer per year (salary only) or $5,000 per year with overhead. • http://www.roguewave.com/company/news-events/press-releases/2013/university-of-cambridge- reverse-debugging-study.aspx

© 2014 Rogue Wave Software, Inc. All Rights Reserved 14 Memory Debugging What is a Memory Bug?

• A Memory Bug is a mistake in the management of heap memory

• Failure to check for error conditions

• Leaking: Failure to free memory

• Dangling references: Failure to clear pointers

• Writing to memory not allocated

• Overrunning array bounds

16 Why Are Memory Bugs Hard to Find?

What is a Memory Bug?

• Memory problems can lurk • For a given scale or platform or problem, they may not be fatal • Libraries could be source of problem • The fallout can occur at any subsequent memory access through a pointer • The mistake is rarely fatal in and of itself • The mistake and fallout can be widely separated

• Potentially 'racy' • Memory allocation pattern non-local • Even the fallout is not always fatal. It can result in data corruption which may or may not result in a subsequent crash

• May be caused by or cause of a 'classic' bug 1 7 The Agent and Interposition

Process

User Code and Libraries

TotalView

Malloc API

18 The Agent and Interposition

Process

User Code and Libraries

Heap Interposition TotalView Allocation Deallocation Table Agent (HIA) ‏ Table

Malloc API

19 TotalView HIA Technology

• Advantages of TotalView HIA Technology

• Use it with your existing builds • No or Binary Instrumentation • Programs run nearly full speed • Low performance overhead • Low memory overhead • Efficient memory usage • Support wide range of platforms and

20 Memory Debugger Features

• Automatically detect allocation problems • View the heap • Leak detection • Block painting • Memory Hoarding • detection • Deallocation/reallocation notification • Memory Corruption Detection - Guard Blocks • Memory Comparisons between processes • Collaboration features

21 Enabling Memory Debugging Memory Event Notification

22 Memory Event Details Window

23 Heap Graphical View

24 Leak Detection

• Leak Detection • Based on Conservative Garbage Collection

• Can be performed at any point in runtime

• Helps localize leaks in time

• Multiple Reports

• Backtrace Report

• Source Code Structure

• Graphically Memory Location

25 Dangling Pointer Detection

26 Memory Corruption Report

27 Memory Comparisons

• “Diff” live processes

• Compare processes across cluster • Compare with baseline

• See changes between point A and point B

• Compare with saved session

• Provides memory usage change from last run

28 Memory Usage Statistics

29 Memory Reports

• Multiple Reports • Memory Statistics • Interactive Graphical Display • Source Code Display • Backtrace Display • Allow the user to • Monitor Program Memory Usage • Discover Allocation Layout • Look for Inefficient Allocation • Look for Memory Leaks

30 MemoryScape

Multi-Process and Multi-Thread

• Memory debug many processes at the same time • MPI • Client-Server • Fork-Exec • Compare two runs • Remote applications • Mutli-threaded applications

31 Script Mode - MemScript

• Automation Support • MemoryScape lets users run tests and check programs for memory leaks without having to be in front of the program • Simple command line program called MemScript • Doesn’t start up the GUI • Can be run from within a script or test harness • The user defines • What configuration options are active • What thing to look for • Actions MemoryScape should take for each type of event that may occur

32 MemoryScape

Red Zones instant array bounds detection for Linux

• Red Zones provides: • Immediate detection of memory overruns. • Detection of access violations both before and after the bounds of allocated memory. • Detection of deallocated memory accesses. • Red Zones events • MemoryScape will stop your programs execution and raise an event alerting you to the illegal access. You will be able to see exactly where your code over stepped the bounds.

• Note: Not available on the Blue Gene/Q

33 MemoryScape

Configuring Guard Blocks

Guard allocated memory When selected, the Memory Debugger writes guard blocks before and after a memory block that your program allocates

Pre-Guard and Post-Guard Size: Sets the size in bytes of the block that the Memory Debugger places immediately before and after the memory block that your program allocates

Pattern: Indicates the pattern that the Memory Debugger writes into guard blocks. The default values are 0x77777777 and 0x99999999 3 4 Memory Corruption Detection (Guard Blocks)

35 TVScript tvscript and memscript

• A straightforward language for unattended and/or batch debugging with TotalView and/or MemoryScape • Usable whenever jobs need to be submitted or batched • Can be used for automation • A more powerful version of printf, no recompilation necessary between runs • Schedule automated debug runs with cron jobs • Expand its capabilities using TCL

37 Output

• All of the following information is provided by default for each print • Process id • Thread id • Rank • Timestamp • Event/Action description • A single output file is written containing all of the information regardless of the number of processes/threads being debugged

38 Sample Output

• Simple interface to create an action point -create_actionpoint ”#85=>print foreign_addr” • Sample output with all information !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! ! Print ! ! Process: ! ./TVscript_demo (Debugger Process ID: 5, System ID: [email protected]) ! Thread: ! Debugger ID: 5.1, System ID: 3077191888 ! Rank: ! 0 ! Time Stamp: ! 05-14-2012 17:11:24 ! Triggered from event: ! actionpoint ! Results: ! err_detail = { ! intervals = 0x0000000a (10) ! almost_pi = 3.1424259850011 ! delta = 0.000833243988525023 ! } ! !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

39 Events

• General • any_event • Source code debugging events • actionpoint • error • Memory events (just a few, all are listed in Chapter 4 of TotalView Reference Guide) • any_memory_event • free_not_allocated • guard_corruption • rz_overrun, rz_underrun, rz_use_after_free

40 Actions

• Source code • display_backtrace [-level num] [numlevels] [options] • print [-slice {exp}] {variable | exp} • Memory • check_guard_blocks • list_allocations • list_leaks • save_html_heap_status_source_view • save_memory_debugging_file • save_text_heap_status_source_view

41 Command syntax

• General syntax • tvscript [options] [filename] –a [program_args] • MPI Options • -mpi starter starter comes from Parallel tab dropdown • -starter_args “args for starter program” • -nodes • -np or –procs or –tasks

42 Command syntax

• Action options • -create_actionpoint “src_expr[=>action1[,action2] …]” • Repeat on command line for each actionpoint • -event_action “event_action_list” • event1=action1,event2=action2 or event1=>action1,action2 • Can repeat on command line for multiple actions • General options • -display_specifiers “display_specifiers_list” • -maxruntime “hh:mm:ss” • -script_file scriptFile • -script_log_filename logFilename • -script_summary_log_filename summaryLogFilename

43 Command syntax

• Memory debugging options • -memory_debugging (must use for debugging memory) • -mem_detect_leaks • -mem_detect_use_after_free • -mem_guard_blocks • -mem_hoard_freed_memory • -mem_hoard_low_memory_threshold nnnn • -mem_paint_all • -mem_paint_on_alloc • -mem_paint_on_dealloc

44 Command syntax

• Memory debugging red zone options • -mem_red_zone_overruns • -mem_red_zones_size_ranges min:max[,min:max]… • Ranges can be • min:max • min: • :max • fixed • -mem_red_zones_underruns

45 Script Files

• Instead of putting everything on the command line, you can also write and use script files • Script files can also include TCL • Logging functions • tvscript_log msg – logs msg to the log file • tvscript_slog msg – logs msg to the summary log file • Property functions • tvscript_get_process_property process_id property • tvscript_get_thread_property thread_id property

46 Roadmap What is new in 8.13

• TotalView 8.13 – CUDA 5.0 and 5.5 – MemoryScape for Xeon Phi Native and Symmetric – Xeon Phi Symmetric – Mac OS X Mavericks – Early Access: Scalability Improvements

Next Release

Plans subject to change without notice

• TotalView 8.14 – CUDA 6.0 Support – Early Access Support for Save/Load of ReplayEngine recordings – GCC C++ Type Transform Support for Unordered Collections • Unordered Map • Unordered Set • Unordered Multimap • Unordered Multiset – Platform updates

© 2014 Rogue Wave Software, Inc. All Rights Reserved N+1 Release

Plans subject to change without notice

• TotalView 8.15 – CUDA 6.5 Support – Co-array Fortran support – OpenACC support in PGI – Scalability and Performance at Scale Improvements – Platform updates

© 2014 Rogue Wave Software, Inc. All Rights Reserved