Debugging Memory Leaks
TotalView Overview
Hybrid and Accelerated Applications
• What do we see
  – NVIDIA Tesla GP-GPU computational accelerators
  – Intel Xeon Phi Coprocessors
  – Complex memory hierarchies (NUMA, device vs host, etc.)
  – Custom languages such as CUDA and OpenCL
  – Directive-based programming such as OpenACC and OpenMP
  – Core and thread counts going up
• A lot of complexity to deal with if you want performance
  – C or Fortran with MPI starts to look “simple”
  – Everything is Multiple Languages / Parallel Paradigms
  – Up to 4 “kinds” of parallelism (cluster, thread, heterogeneous, vector)
  – Data movement and load balancing
© 2014 Rogue Wave Software, Inc. All Rights Reserved
How does Rogue Wave help?
TotalView debugger
• Troubleshooting and analysis tool
– Visibility Into
– Control Over
• Scalability
• Usability
• Support for HPC platforms and languages
What is TotalView®?
Application Analysis and Debugging Tool: Code Confidently
• Debug and analyze C/C++ and Fortran on Linux™, Unix or Mac OS X
• Laptops to supercomputers
• Makes developing, maintaining, and supporting critical apps easier and less risky
Major Features
• Easy to learn graphical user interface with data visualization
• Parallel Debugging
  – MPI, Pthreads, OpenMP™, GA, UPC
  – CUDA™, OpenACC®, and Intel® Xeon Phi™ coprocessor
• Low tool overhead and resource usage
• Includes a Remote Display Client which frees you to work from anywhere
• Memory Debugging with MemoryScape™
• Deterministic Replay Capability Included on Linux/x86-64
• Non-interactive Batch Debugging with TVScript and the CLI
• TTF & C++View to transform user defined objects
TotalView for the NVIDIA® GPU Accelerator
• NVIDIA Tesla K40
• NVIDIA CUDA 5.0 and 5.5 (new in 8.13)
  – With limited support for dynamic parallelism
• Cray CCE OpenACC
• Features and capabilities include
  – Support for MPI based clusters and multi-card configurations
  – Flexible Display and Navigation on the CUDA device
    • Physical (device, SM, Warp, Lane)
    • Logical (Grid, Block) tuples
  – CUDA device window reveals what is running where
  – Support for types and separate memory address spaces
  – Leverages CUDA memcheck
HRL Laboratories
• “In the first full day of using TotalView, we were quickly able to solve the bug that had us stumped for weeks.” – Kirill Minkovich, HRL Laboratories
• Neural Network Simulation in the Defense Industry
• Need to adopt new technology and achieve massive scaling on a deadline
• Project went from failing to over-achieving
TotalView for the Intel® Xeon Phi™ coprocessor
Supports All Major Intel Xeon Phi Coprocessor Configurations
• Native Mode
  – With or without MPI
• Offload Directives
  – Incremental adoption, similar to GPU
• Symmetric Mode (new in 8.13)
  – Host and Coprocessor
• Multi-device, Multi-node
• Clusters
User Interface
• MPI Debugging Features
  – Process Control, View Across, Shared Breakpoints
• Heterogeneous Debugging
  – Debug Both Xeon and Intel Xeon Phi Processes
Memory Debugging (new in 8.13)
• Both native and symmetric mode
Beacon Project
• 210 Tflops, 11,520 Xeon Phi Coprocessor Cores
• #1 on the Dec 2012 Green 500
• 9 scientific teams optimizing apps including: MHD, Plasma, Cosmology, Chemistry, QCD, Bioinformatics…
• Porting & optimization
  – Hybrid MPI + OpenMP
  – Many more threads than previous paradigms
• Subtle issues might present themselves
  – And did
Debugging on Beacon with TotalView
• OpenMP Hybridization of Boltzmann BGK
  – Correctness issues came up with the OpenMP code
  – Troubleshooting with TotalView
    • Native mode debugging on the Xeon Phi
    • Thread level examination of the OpenMP region
    • Comparison of data between threads
    • … Clarified otherwise puzzling results
  – Developers were able to resolve the correctness issue
  – Ultimately obtained performance gains on the Xeon Phi
    • And correct results
What Is MemoryScape®?
• Runtime Memory Analysis: Eliminate Memory Errors
  – Detects memory leaks before they are a problem
  – Explore heap memory usage with powerful analytical tools
  – Use for validation as part of a quality software development process
• Major Features
  – Included in TotalView, or Standalone
  – Detects
    • Malloc API misuse
    • Memory leaks
    • Buffer overflows
  – Supports
    • Now includes Intel® Xeon Phi™
    • C, C++, Fortran
    • Linux, Unix, and Mac OS X
    • MPI, pthreads, OMP, and remote apps
  – Low runtime overhead
  – Easy to use
    • Works with vendor libraries
    • No recompilation or instrumentation
Dassault Systèmes
• “MemoryScape enabled us to identify memory issues, and by using its scripting interface, we were able to automate the evaluation process. Now, the system automatically uncovers any hidden latent errors in our code, with every build, allowing our developers to proactively fix potential errors prior to release.” – Nick Monyatovsky, Software Engineer at SIMULIA
• Computer Aided Engineering ISV for Aero/Auto/Industry
• Struggling with intermittent errors
• Continuous Integration – Better Product Quality
Deterministic Replay Debugging
• Reverse Debugging: Radically simplify your debugging
  – Captures and Deterministically Replays Execution
    • Not just “checkpoint and restart”
  – Eliminate the Restart Cycle and Hard-to-Reproduce Bugs
  – Step Back and Forward by Function, Line, or Instruction
• Specifications
  – A feature included in TotalView on Linux x86 and x86-64
    • No recompilation or instrumentation
    • Explore data and state in the past just like in a live process, including C++View transformations
  – Replay on Demand: enable it when you want it
  – Supports MPI on Ethernet, InfiniBand, Cray XE Gemini
  – Supports Pthreads and OpenMP
Cambridge Study
• A survey conducted by the Judge Business School at Cambridge University concluded that reverse debuggers allow users, on average, to spend 13% less of their programming time debugging.
  – Programming was 50% of the total work week on average
  – Debugging was 50% of programming time without reverse debugging
  – Debugging was 37% of programming time with reverse debugging
  – That frees up 130 hours (>3 work weeks, 6.5% of total time) per developer per year for design and new feature development
• The survey looked at the total value (salaries & overhead) of debugging as a task and determined that these savings could, across the whole world economy, be worth $41 billion in increased productivity.
  – The productivity improvement should be worth $2,500 per developer per year (salary only) or $5,000 per year with overhead.
• http://www.roguewave.com/company/news-events/press-releases/2013/university-of-cambridge-reverse-debugging-study.aspx
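The percentages and the 130-hour figure quoted from the survey fit together as sketched below. The 2,000-hour working year is an assumption (it is not stated on the slide); it is simply the value for which the quoted 130 hours comes out.

```latex
\begin{align*}
\text{debugging share of total time, without reverse debugging} &= 0.50 \times 0.50 = 0.25 \\
\text{debugging share of total time, with reverse debugging}    &= 0.50 \times 0.37 = 0.185 \\
\text{share of total time freed}                                &= 0.25 - 0.185 = 0.065 \;(6.5\%) \\
\text{hours freed per developer per year (assumed 2{,}000 h)}   &= 0.065 \times 2000 = 130
\end{align*}
```

At 40 hours per week, 130 hours is 3.25 work weeks, which matches the ">3 work weeks" on the slide.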
Memory Debugging
What is a Memory Bug?
• A Memory Bug is a mistake in the management of heap memory
• Failure to check for error conditions
• Leaking: Failure to free memory
• Dangling references: Failure to clear pointers
• Writing to memory not allocated
• Overrunning array bounds
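As a minimal sketch of the bug classes listed above, the C helpers below show the defensive counterpart of each one: checking the allocation result, refusing out-of-bounds writes, freeing the block, and clearing the pointer afterwards. The helper names (`make_buffer`, `safe_store`, `release_buffer`) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a zeroed int buffer.
   Returns NULL on failure, so the caller must test the result
   (the "failure to check for error conditions" case). */
static int *make_buffer(size_t count)
{
    return calloc(count, sizeof(int));
}

/* Hypothetical helper: store a value only when the index is in bounds
   (the "overrunning array bounds" / "writing to memory not allocated" cases).
   Returns 0 on success, -1 if the write was refused. */
static int safe_store(int *buf, size_t count, size_t index, int value)
{
    if (buf == NULL || index >= count)
        return -1;
    buf[index] = value;
    return 0;
}

/* Hypothetical helper: free the block and clear the caller's pointer,
   which avoids both the leak and the dangling reference. */
static void release_buffer(int **buf)
{
    assert(buf != NULL);
    free(*buf);
    *buf = NULL;
}
```

A caller would typically pair `make_buffer` with exactly one `release_buffer(&ptr)` and route every write through `safe_store`; omitting any of these steps reintroduces the corresponding bug from the list.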
Why Are Memory Bugs Hard to Find?
• Memory problems can lurk
• For a given scale or platform or problem, they may not be fatal
• Libraries could be source of problem
• The fallout can occur at any subsequent memory access through a pointer
• The mistake is rarely fatal in and of itself
• The mistake and fallout can be widely separated
• Potentially 'racy'
• Memory allocation pattern non-local
• Even the fallout is not always fatal. It can result in data corruption which may or may not result in a subsequent crash
• May be caused by or cause of a 'classic' bug
The Agent and Interposition
[Diagram labels: Process; User Code and Libraries; Malloc API; TotalView]
The Agent and Interposition
[Diagram labels: Process; User Code and Libraries; Heap Interposition Agent (HIA); Malloc API; TotalView; Allocation Table; Deallocation Table]
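The idea in the diagram, an agent sitting between user code and the malloc API and recording each call in a table, can be sketched as follows. This is an illustration under stated assumptions, not how the HIA is implemented: a real interposition agent intercepts `malloc` and `free` themselves, whereas this sketch uses explicit wrappers, and the names `tracked_malloc`, `tracked_free`, `live_bytes`, and the fixed-size table are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_BLOCKS 128   /* arbitrary table capacity for this sketch */

struct block_record {
    void  *addr;   /* address returned by malloc */
    size_t size;   /* requested size in bytes */
    int    live;   /* 1 while allocated, 0 once freed */
};

static struct block_record table[MAX_BLOCKS];
static size_t table_used = 0;

/* Forward to malloc and record the block in the allocation table.
   Blocks allocated after the table fills up are not tracked. */
static void *tracked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL && table_used < MAX_BLOCKS) {
        table[table_used].addr = p;
        table[table_used].size = size;
        table[table_used].live = 1;
        table_used++;
    }
    return p;
}

/* Free only pointers the table knows as live.
   Returns 0 on success, -1 for an unknown or already freed pointer,
   which is where an agent could raise a "free not allocated" event. */
static int tracked_free(void *p)
{
    for (size_t i = 0; i < table_used; i++) {
        if (table[i].live && table[i].addr == p) {
            table[i].live = 0;
            free(p);
            return 0;
        }
    }
    return -1;
}

/* Sum of the sizes of all blocks that are still live. */
static size_t live_bytes(void)
{
    size_t total = 0;
    for (size_t i = 0; i < table_used; i++)
        if (table[i].live)
            total += table[i].size;
    return total;
}
```

Because the bookkeeping lives entirely in the wrapper layer, the user code needs no changes beyond which allocator entry points it reaches, which is the property the slides attribute to interposition.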
TotalView HIA Technology
• Advantages of TotalView HIA Technology
• Use it with your existing builds
• No Source Code or Binary Instrumentation
• Programs run nearly full speed
• Low performance overhead
• Low memory overhead
• Efficient memory usage
• Supports a wide range of platforms and compilers
Memory Debugger Features
• Automatically detect allocation problems
• View the heap
• Leak detection
• Block painting
• Memory Hoarding
• Dangling pointer detection
• Deallocation/reallocation notification
• Memory Corruption Detection - Guard Blocks
• Memory Comparisons between processes
• Collaboration features
Enabling Memory Debugging
Memory Event Notification
Memory Event Details Window
Heap Graphical View
Leak Detection
• Leak Detection
• Based on Conservative Garbage Collection
• Can be performed at any point in runtime
• Helps localize leaks in time
• Multiple Reports
• Backtrace Report
• Source Code Structure
• Graphical Memory Location
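Leak detection "based on conservative garbage collection" can be pictured with the sketch below: every word in a set of root regions is treated as a possible pointer, blocks that such words point into are marked reachable, their contents are scanned in turn, and whatever stays unmarked is reported as a leak. The structure and function names (`heap_block`, `scan_region`, `count_leaks`) are hypothetical, and the sketch assumes pointers and `uintptr_t` have the same size and representation, as on mainstream platforms. It is not a description of MemoryScape internals.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct heap_block {
    const void *addr;       /* start of the block */
    size_t      size;       /* size in bytes */
    int         reachable;  /* set during the scan */
};

/* Return the index of the block that contains value, or -1 if none does. */
static int find_block(const struct heap_block *blocks, size_t n, uintptr_t value)
{
    for (size_t i = 0; i < n; i++) {
        uintptr_t start = (uintptr_t)blocks[i].addr;
        if (value >= start && value - start < blocks[i].size)
            return (int)i;
    }
    return -1;
}

/* Treat every aligned word in the region as a potential pointer.
   Newly reached blocks are marked and then scanned themselves;
   recursion depth is bounded by the number of blocks. */
static void scan_region(struct heap_block *blocks, size_t n,
                        const void *region, size_t bytes)
{
    const unsigned char *p = region;
    for (size_t off = 0; off + sizeof(uintptr_t) <= bytes; off += sizeof(uintptr_t)) {
        uintptr_t word;
        memcpy(&word, p + off, sizeof word);
        int hit = find_block(blocks, n, word);
        if (hit >= 0 && !blocks[hit].reachable) {
            blocks[hit].reachable = 1;
            scan_region(blocks, n, blocks[hit].addr, blocks[hit].size);
        }
    }
}

/* Mark everything reachable from the roots and count what is left over. */
static size_t count_leaks(struct heap_block *blocks, size_t n,
                          const void *roots, size_t root_bytes)
{
    assert(blocks != NULL);
    size_t leaks = 0;
    for (size_t i = 0; i < n; i++)
        blocks[i].reachable = 0;
    scan_region(blocks, n, roots, root_bytes);
    for (size_t i = 0; i < n; i++)
        if (!blocks[i].reachable)
            leaks++;
    return leaks;
}
```

The scan is "conservative" because an integer that happens to look like a block address keeps that block marked as reachable, so the approach can miss a leak but does not report a block that is still referenced. Since it only reads memory, it can be run at any point during execution, consistent with the bullets above.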
Dangling Pointer Detection
Memory Corruption Report
Memory Comparisons
• “Diff” live processes
• Compare processes across cluster
• Compare with baseline
• See changes between point A and point B
• Compare with saved session
• Provides memory usage change from last run
Memory Usage Statistics
Memory Reports
• Multiple Reports
  • Memory Statistics
  • Interactive Graphical Display
  • Source Code Display
  • Backtrace Display
• Allow the user to
  • Monitor Program Memory Usage
  • Discover Allocation Layout
  • Look for Inefficient Allocation
  • Look for Memory Leaks
MemoryScape
Multi-Process and Multi-Thread
• Memory debug many processes at the same time
  • MPI
  • Client-Server
  • Fork-Exec
  • Compare two runs
• Remote applications
• Multi-threaded applications
Script Mode - MemScript
• Automation Support
  • MemoryScape lets users run tests and check programs for memory leaks without having to be in front of the program
  • Simple command line program called MemScript
    • Doesn’t start up the GUI
    • Can be run from within a script or test harness
• The user defines
  • What configuration options are active
  • What things to look for
  • Actions MemoryScape should take for each type of event that may occur
MemoryScape
Red Zones: instant array bounds detection for Linux
• Red Zones provide:
  • Immediate detection of memory overruns.
  • Detection of access violations both before and after the bounds of allocated memory.
  • Detection of deallocated memory accesses.
• Red Zones events
  • MemoryScape will stop your program's execution and raise an event alerting you to the illegal access. You will be able to see exactly where your code overstepped the bounds.
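One common way to get immediate detection of an overrun is to place the block so that it ends exactly at an inaccessible page, so the first byte written past the end faults at once. The Linux-oriented sketch below illustrates that general technique with `mmap` and `mprotect`. It is an assumption offered for illustration, not a description of how MemoryScape implements Red Zones, and the names `rz_block`, `rz_alloc`, and `rz_free` are hypothetical.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict standard modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct rz_block {
    void  *base;     /* start of the whole mapping */
    size_t map_len;  /* mapped bytes, including the protected page */
    void  *user;     /* pointer handed to the program */
};

/* Map enough pages for size bytes plus one trailing page, make the trailing
   page inaccessible, and place the user block so it ends at that page.
   Returns 0 on success, -1 on failure. The user pointer is only as aligned
   as size allows, which is a known trade-off of this layout. */
static int rz_alloc(struct rz_block *blk, size_t size)
{
    long page_l = sysconf(_SC_PAGESIZE);
    if (blk == NULL || page_l <= 0 || size == 0)
        return -1;

    size_t page = (size_t)page_l;
    size_t data_pages = (size + page - 1) / page;
    size_t map_len = (data_pages + 1) * page;

    void *base = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    unsigned char *red_zone = (unsigned char *)base + data_pages * page;
    if (mprotect(red_zone, page, PROT_NONE) != 0) {
        munmap(base, map_len);
        return -1;
    }

    blk->base = base;
    blk->map_len = map_len;
    blk->user = red_zone - size;   /* block ends where the red zone begins */
    return 0;
}

/* Unmap the block together with its red zone page. */
static void rz_free(struct rz_block *blk)
{
    munmap(blk->base, blk->map_len);
    blk->base = NULL;
    blk->user = NULL;
    blk->map_len = 0;
}
```

With this layout, a write to `user[size]` touches the protected page and the process receives a fault at the offending instruction, which is the point at which a debugger can stop and show where the code overstepped the bounds. Detecting underruns would need the mirror-image layout, with the protected page placed before the block.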
• Note: Not available on the Blue Gene/Q
MemoryScape
Configuring Guard Blocks
Guard allocated memory: When selected, the Memory Debugger writes guard blocks before and after a memory block that your program allocates.
Pre-Guard and Post-Guard Size: Sets the size in bytes of the block that the Memory Debugger places immediately before and after the memory block that your program allocates.
Pattern: Indicates the pattern that the Memory Debugger writes into guard blocks. The default values are 0x77777777 and 0x99999999.
Memory Corruption Detection (Guard Blocks)
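The guard block mechanism can be sketched in a few lines of C: the allocator reserves extra bytes on both sides of the user block, fills them with known patterns, and a later check compares them against those patterns. The sketch assumes 0x77777777 is the pre-guard pattern and 0x99999999 the post-guard pattern (the slide lists the two defaults without saying which is which), uses an arbitrary 16-byte guard size, and the names `guarded_alloc`, `check_guards`, and `guarded_free` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_WORDS 4                                 /* arbitrary guard size: 16 bytes */
#define GUARD_BYTES (GUARD_WORDS * sizeof(uint32_t))

static const uint32_t PRE_PATTERN  = 0x77777777u;    /* assumed pre-guard pattern */
static const uint32_t POST_PATTERN = 0x99999999u;    /* assumed post-guard pattern */

/* Allocate size bytes surrounded by a pre-guard and a post-guard.
   Returns the user pointer, or NULL if the allocation fails. */
static void *guarded_alloc(size_t size)
{
    unsigned char *raw = malloc(GUARD_BYTES + size + GUARD_BYTES);
    if (raw == NULL)
        return NULL;
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        memcpy(raw + i * sizeof(uint32_t), &PRE_PATTERN, sizeof(uint32_t));
        memcpy(raw + GUARD_BYTES + size + i * sizeof(uint32_t),
               &POST_PATTERN, sizeof(uint32_t));
    }
    return raw + GUARD_BYTES;
}

/* Compare both guards against their patterns.
   Returns 0 if intact, 1 if the pre-guard is damaged,
   2 if the post-guard is damaged, 3 if both are. */
static int check_guards(const void *user, size_t size)
{
    const unsigned char *pre = (const unsigned char *)user - GUARD_BYTES;
    const unsigned char *post = (const unsigned char *)user + size;
    int result = 0;
    for (size_t i = 0; i < GUARD_WORDS; i++) {
        if (memcmp(pre + i * sizeof(uint32_t), &PRE_PATTERN, sizeof(uint32_t)) != 0)
            result |= 1;
        if (memcmp(post + i * sizeof(uint32_t), &POST_PATTERN, sizeof(uint32_t)) != 0)
            result |= 2;
    }
    return result;
}

/* Release a block obtained from guarded_alloc. */
static void guarded_free(void *user)
{
    if (user != NULL)
        free((unsigned char *)user - GUARD_BYTES);
}
```

Unlike a page-protection scheme, guard blocks do not stop the program at the moment of the bad write; the damage is found only when the guards are checked, for example on deallocation or on an explicit check.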
TVScript
tvscript and memscript
• A straightforward language for unattended and/or batch debugging with TotalView and/or MemoryScape
• Usable whenever jobs need to be submitted or batched
• Can be used for automation
• A more powerful version of printf, no recompilation necessary between runs
• Schedule automated debug runs with cron jobs
• Expand its capabilities using TCL
Output
• All of the following information is provided by default for each print
  • Process id
  • Thread id
  • Rank
  • Timestamp
  • Event/Action description
• A single output file is written containing all of the information regardless of the number of processes/threads being debugged
Sample Output
• Simple interface to create an action point
    -create_actionpoint "#85=>print foreign_addr"
• Sample output with all information
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ! Print
    !
    ! Process:
    !   ./TVscript_demo (Debugger Process ID: 5, System ID: [email protected])
    ! Thread:
    !   Debugger ID: 5.1, System ID: 3077191888
    ! Rank:
    !   0
    ! Time Stamp:
    !   05-14-2012 17:11:24
    ! Triggered from event:
    !   actionpoint
    ! Results:
    !   err_detail = {
    !     intervals = 0x0000000a (10)
    !     almost_pi = 3.1424259850011
    !     delta = 0.000833243988525023
    !   }
    !
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Events
• General
  • any_event
• Source code debugging events
  • actionpoint
  • error
• Memory events (just a few, all are listed in Chapter 4 of TotalView Reference Guide)
  • any_memory_event
  • free_not_allocated
  • guard_corruption
  • rz_overrun, rz_underrun, rz_use_after_free
Actions
• Source code
  • display_backtrace [-level num] [numlevels] [options]
  • print [-slice {exp}] {variable | exp}
• Memory
  • check_guard_blocks
  • list_allocations
  • list_leaks
  • save_html_heap_status_source_view
  • save_memory_debugging_file
  • save_text_heap_status_source_view
Command syntax
• General syntax
  • tvscript [options] [filename] -a [program_args]
• MPI Options
  • -mpi starter (starter comes from the Parallel tab dropdown)
  • -starter_args "args for starter program"
  • -nodes
  • -np or -procs or -tasks
Command syntax
• Action options
  • -create_actionpoint "src_expr[=>action1[,action2] …]"
    • Repeat on command line for each actionpoint
  • -event_action "event_action_list"
    • event1=action1,event2=action2 or event1=>action1,action2
    • Can repeat on command line for multiple actions
• General options
  • -display_specifiers "display_specifiers_list"
  • -maxruntime "hh:mm:ss"
  • -script_file scriptFile
  • -script_log_filename logFilename
  • -script_summary_log_filename summaryLogFilename
Command syntax
• Memory debugging options
  • -memory_debugging (must use for debugging memory)
  • -mem_detect_leaks
  • -mem_detect_use_after_free
  • -mem_guard_blocks
  • -mem_hoard_freed_memory
  • -mem_hoard_low_memory_threshold nnnn
  • -mem_paint_all
  • -mem_paint_on_alloc
  • -mem_paint_on_dealloc
Command syntax
• Memory debugging red zone options
  • -mem_red_zones_overruns
  • -mem_red_zones_size_ranges min:max[,min:max]…
    • Ranges can be
      • min:max
      • min:
      • :max
      • fixed
  • -mem_red_zones_underruns
Script Files
• Instead of putting everything on the command line, you can also write and use script files
• Script files can also include TCL
• Logging functions
  • tvscript_log msg – logs msg to the log file
  • tvscript_slog msg – logs msg to the summary log file
• Property functions
  • tvscript_get_process_property process_id property
  • tvscript_get_thread_property thread_id property
Roadmap
What is new in 8.13
• TotalView 8.13
  – CUDA 5.0 and 5.5
  – MemoryScape for Xeon Phi Native and Symmetric
  – Xeon Phi Symmetric
  – Mac OS X Mavericks
  – Early Access: Scalability Improvements
Next Release
Plans subject to change without notice
• TotalView 8.14
  – CUDA 6.0 Support
  – Early Access Support for Save/Load of ReplayEngine recordings
  – GCC C++ Type Transform Support for Unordered Collections
    • Unordered Map
    • Unordered Set
    • Unordered Multimap
    • Unordered Multiset
  – Platform updates
N+1 Release
Plans subject to change without notice
• TotalView 8.15
  – CUDA 6.5 Support
  – Co-array Fortran support
  – OpenACC support in PGI
  – Scalability and Performance at Scale Improvements
  – Platform updates