Porting Brain Sim Legacy Code to GPU-Enabled
Total Page:16
File Type:pdf, Size:1020Kb
Porting Legacy Code of a Large, Complex Brain Simulation to a GPU-Enabled Cluster Chris Gottbrath, Principal Product Manager March 21st, 2013 Agenda • Who are Rogue Wave Software • HRL CNES Neural Network Simulation • Other Case Studies • Overview of TotalView for CUDA • Latest Updates and Futures • Conclusion | Copyright © 2013 Rogue Wave Software | All Rights Reserved Rogue Wave Software • Pioneers in C++/object-oriented development • Leading the way in cross- platform, parallel development • History – Founded: 1989 • Customers – Acquired by Battery Ventures: 2007 – Acquired: – 3,000+ in 36 countries • Visual Numerics: 2009 – Financial services, telecoms, oil • TotalView Technologies: 2010 and gas, government and • Acumem: 2010 aerospace, research and • ILOG Visualization Tech 2012 academic | Copyright © 2013 Rogue Wave Software | All Rights Reserved Leveraging Leading Edge Analytics Challenges Rogue Wave Solutions • Understanding the data is difficult • Build or extend the application using • Knowing when and how to apply IMSL numerical libraries mathematical and statistical methods – Cross-platform, multi-platform requires an advanced skill set – Already optimized for parallel – Which method for particular data sets deployment and situations • IMSL C Numerical Library – Must go through several rounds of • IMSL C# Numerical Library for .NET prototyping and testing Applications • Developing the algorithms is even • JMSL™ Numerical Library for Java™ more difficult Applications – Making sure they are correct • IMSL Fortran Numerical Library – Making sure they are correct when they execute in parallel environments 3 | Copyright © 2013 Rogue Wave Software, Inc. | All Rights Reserved IMSL Numerical Library Methods Mathematics Statistics • Basic Types • Basic Statistics • Linear Algebra • Time Series & Forecasting • Eigensystems • Nonparametric Tests • Interpolation & Approximation • Correlation & Covariance • Quadrature • Data Mining • Differential Equations • Regression • Nonlinear Equations • Analysis of Variance • Optimization • Transforms • Special Functions • Goodness of Fit • Finance & Bond Calculations • Distribution Functions • Genetic Algorithm • Random Number Generation • Neural Networks | Copyright © 2013 Rogue Wave Software | All Rights Reserved IMSL • IMSL C and Fortran libraries both support CUDA – BLAS Linear Algebra routines – Uses CUDA where appropriate • Just a link time change .. No code changes whatsoever! – Contact Rogue Wave for more details. | Copyright © 2013 Rogue Wave Software | All Rights Reserved IMSL Fortran GPU Benchmarks What is TotalView? Application Analysis and Debugging Tool: Code Confidently • Debug and Analyze C/C++ and • Major Features Fortran on Linux, Unix or Mac OS – Easy to learn graphical user interface with data X visualization – Parallel Debugging • Debugs host and accelerator • MPI, Pthreads, OpenMP, GA, UPC together • CUDA and OpenACC – Includes a Remote Display Client freeing you to work from • Laptops to supercomputers (BG, anywhere Cray) – Memory Debugging with MemoryScape – Deterministic Replay Capability Included on Linux/x86-64 • Makes developing, maintaining – Non-interactive Batch Debugging with TVScript and the and supporting critical apps CLI easier and less risky – TTF & C++View to transform user defined objects | Copyright © 2013 Rogue Wave Software | All Rights Reserved What is ThreadSpotter? • Runtime Cache Performance Optimization • Features Tool: Tune into the Multi-Core Era – Supports Windows and Linux x86/x86-64 – Realize More of the Performance Offered by – Any compiled code Multi/Many-Core Chips – Runtime Analysis – Quickly Detects and Prioritizes Issues -- and • Low overhead then Provides Usable Advice! – Cache Modeling • Prioritizes Issues • Identifies Problem Lines of Code – Provides Advice • Explanations • Examples • Detailed statistics (if desired) Case Study: HRL • Who: Center for Neural and Emergent Systems (CNES) at HRL • Project: Scaling up neural network technology. • Technology Chosen: Using a CUDA accelerated cluster to model the brain. • Challenge: New technologies, ambitious goals, and short timelines put everyone under pressure. They were unable to resolve bugs fast enough to meet time line • Solution: TotalView which radically shortened bug diagnosis and allowed them to meet their project goals. | Copyright © 2013 Rogue Wave Software | All Rights Reserved The Project • Cortical Simulation Project – Biomorphic Networks – Modeling the brain’s structure – Large numbers of neurons – Massive numbers of connections – Lots of short connections, fewer longer range connections • Lots of parallelism • 200 Node cluster – NVIDIA GPU acceleration 10 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Not a trivial challenge at all “One of the tasks of the CNES team at HRL is simulating large scale networks in real time. The only possible way to simulate such massive networks without moving to a full-scale supercomputer is with a large GPU cluster. “ - Kirill Minkovitch, HRL 11 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Business Challenges • Ambitious Goals – Really “scaling out” the problem in terms of the size of the simulated network – Simulation has to generate results quickly – Prohibitively expensive with traditional processors • Cutting edge technology – First major foray into CUDA based GPU programming • Demanding project schedule – Productivity is key – Bug related delays have business impact 12 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Technical Challenges • Architecture – Mapping the problem effectively to the parallelism in the hardware – Balancing computation and communication • Tools – Tried with open source tools and failed – Needed a single polished approach for MPI and CUDA • Project schedules and goals at risk 13 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Finding the right balance • Communication vs Computation • Overlapping communication was critical to achieving results – Main computation occurs on the GPU device – The host processor is used to facilitate long range communications 14 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Expressing multiple levels of parallelism • The system being modeled is fundamentally parallel • Matching that parallelism to the hardware hierarchy • Minimizing communication overhead • Frequent short range connections handled on the device • Less frequent longer range communication over MPI 15 | Copyright © 2013 Rogue Wave Software | All Rights Reserved What did they do before TotalView? • Used multiple flavors of GDB, for different tasks • Struggled with configuration, managing multiple debuggers • Wished they could easily use printf in kernels – This issue has been resolved 16 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Not adequate at all • “For a short time we tried using existing tools, but setting up the environment, even before debugging began, became more and more daunting. So we looked at the time we were devoting to debugging, and realized if we were going to meet our release deadline, something had to change.” - Kirill Minkovitch, HRL 17 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Unified Accelerated Cluster Debugging Environment • Need to debug CUDA in context of MPI – Attach to all the MPI processes that contribute to a specific job – Control MPI processes and compare their behavior • Can’t always isolate CUDA issues to a specific instance – Failures may happen anywhere within the job • Using non-MPI aware CUDA tools required heroic set up in the cluster environment 18 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Radical step forward with TotalView “In the first full day of using TotalView, we were quickly able to solve the bug that had us stumped for weeks. With TotalView we were able to step into a specific thread, and then into specific CUDA kernels to identify what went wrong. We could resolve the bugs quickly, and focus our development effort on adding features.” - Kirill Minkovitch, HRL 19 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Aside: Vote for your favorite bug! • Nominations include: • Favorite can be most frequently encountered, hardest, most memorable – Failure to Launch – Failure to check return codes • We’re doing drawings daily for 15 – Error Code 30 dollars and once at the end of the – Abuse of Malloc show for 50 dollars – Pointer Arithmetic – Seg faults • A business card or entry card – Memory alignment problems • Make sure there is an email address so – Typos/People make mistakes we can contact you 20 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Impact of using the right tools • Not just: – Easier to fix bugs – More productive • Accomplished more – Hit critical goals – Added more features – Added more ambitious features like host-side multi-threading 21 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Success • “We noticed a dramatic drop in our development cycle – what used to take us more than two weeks to develop and fully test now takes less than one week. By scaling down the development cycle we were able to add more features, even going beyond the requirements of our release cycle. Most important, we were able to focus on the performance of our code, resulting in much better utilization of our existing hardware and allowing us to scale past 100 GPUs.” 22 | Copyright © 2013 Rogue Wave Software | All Rights Reserved Not a replacement for good coding style.. • “Although using TotalView is no replacement