NVIDIA GPUs in Earth System Modelling
Thomas Bradley, NVIDIA Corporation, 2011

Agenda: GPU Developments for CWO
- Motivation for GPUs in CWO
- Parallelisation considerations
- GPU technology roadmap

MOTIVATION FOR GPUS IN CWO

NVIDIA GPUs Power 3 of the Top 5 Supercomputers
- #2 Tianhe-1A: 7,168 Tesla GPUs, 2.5 PFLOPS
- #4 Nebulae: 4,650 Tesla GPUs, 1.2 PFLOPS
- #5 Tsubame 2.0: 4,224 Tesla GPUs, 1.194 PFLOPS
"We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU, this is a new innovation." (Premier Wen Jiabao, public comments acknowledging Tianhe-1A)

GPU Systems Are More Power Efficient (for HPL)
[Chart: Linpack performance and power draw for Tianhe-1A, Jaguar, Nebulae, Tsubame and an LBNL system, comparing GPU-CPU supercomputers with CPU-only supercomputers.]

Comparison with the Top Supercomputer, K (Japan)
- K Computer (custom SPARC processors): 8.1 PetaFLOPS, 68,500 CPUs, 672 racks, 10 Megawatts, $700 Million
- Tsubame (Intel CPUs + NVIDIA Tesla GPUs): 1.2 PetaFLOPS, 2K CPUs + 4K GPUs, 44 racks, 1.4 Megawatts, $40 Million
- Tsubame advantage: 2.3x better FLOPS per rack, 1.06x better FLOPS per watt, 2.6x better $ per FLOPS

Real Science on GPUs: ASUCA NWP on Tsubame
- Tsubame 2.0 (Tokyo Institute of Technology): 1.19 PFLOPS, 4,224 Tesla M2050 GPUs
- ASUCA run on 3,990 Tesla M2050s: 145.0 TFLOPS single precision, 76.1 TFLOPS double precision
- Simulation on Tsubame 2.0, the TiTech supercomputer

CWO Performance
- Physics only on the GPU: WRF, COSMO (1)
- Dynamics only on the GPU: COSMO (2), ICON, NIM (sample physics), CAM, HOMME, HIRLAM
- Full GPU approach: ASUCA, GEOS-5, GRAPE

NVIDIA Features GPUs at Conferences

Supercomputing 2010 | Nov 2010 | New Orleans, LA
- COSMO: GPU Considerations for Next Generation Weather Simulations (Thomas Schulthess, Swiss National Supercomputing Centre, CSCS)
- ASUCA: Full GPU Implementation of Weather Prediction Code on TSUBAME Supercomputer (Takayuki Aoki, GSIC of Tokyo Institute of Technology, TiTech)
- NIM: Using GPUs to Run Next-Generation Weather Models (Mark Govett, National Oceanic and Atmospheric Administration, NOAA)
- BoF: GPUs and Numerical Weather Prediction, organized by CSCS and NVIDIA. Featured organizations: TiTech (ASUCA), NASA (GEOS-5), NOAA (NIM), Cray, PGI

NVIDIA GPU Technology Conference | Sep 2010 | San Jose, CA
- ASUCA: Full GPU Implementation of Weather Prediction Code on TSUBAME Supercomputer (Takayuki Aoki, GSIC of Tokyo Institute of Technology, TiTech)
- NIM: Using GPUs to Run Next-Generation Weather Models (Mark Govett, National Oceanic and Atmospheric Administration, NOAA)
- MITgcm: Designing a Geoscience Accelerator Library Accessible from High Level Languages (Chris Hill, Massachusetts Institute of Technology, MIT)

PARALLELISATION CONSIDERATIONS

GPU Considerations for CWO Codes
- Initial efforts mostly put the implicit linear solver on the GPU
- If the linear solver is ~50% of the profile time, only a 2x overall speed-up is possible (see the sketch after this list)
- More of the application must be moved to the GPU for additional benefit
- Explicit schemes: no linear algebra, no solver, just operations on a stencil
- Most codes are already parallel and scale across multiple CPU cores
- Multi-core CPUs can contribute to parallel matrix assembly and other tasks
- Most codes use a domain-decomposition parallel method, which fits the GPU model very well and preserves the costly MPI investment
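That 2x bound is simply Amdahl's law: if a fraction p of the runtime is accelerated by a factor s, the overall speed-up is 1 / ((1 - p) + p / s). A minimal sketch of the arithmetic (the 50% solver share is the figure quoted above; the helper name is mine):

    #include <stdio.h>

    /* Amdahl's law: overall speed-up when a fraction p of the runtime
     * is accelerated by a factor s and the rest stays on the CPU. */
    static double amdahl_speedup(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        /* Solver is ~50% of the profile: even an infinitely fast GPU
         * solver cannot deliver more than a 2x overall speed-up. */
        printf("solver 10x faster:  %.2fx overall\n", amdahl_speedup(0.5, 10.0));
        printf("solver 100x faster: %.2fx overall\n", amdahl_speedup(0.5, 100.0));
        /* Porting 90% of the application changes the picture. */
        printf("90%% ported, 10x faster: %.2fx overall\n", amdahl_speedup(0.9, 10.0));
        return 0;
    }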
Options for Parallel Programming of GPUs
- Applications: MATLAB, Mathematica, LabVIEW
- Libraries: FFT, BLAS, SPARSE, RNG, IMSL, CUSP, etc.
- Directives: PGI Accelerator, HMPP, Cray, F2C-Acc
- Wrappers: PyCUDA, CUDA.NET, jCUDA
- Languages: CUDA C/C++, PGI CUDA Fortran, GPU.net
- APIs: CUDA C/C++, OpenCL

Most Implementations Focus on the Dynamical Core
[Diagram: the dynamics portion of the application code is parallelised with CUDA and runs on the GPU; the rest of the code stays on the CPU.]

Sparse Iterative Solvers for Dynamics
- Built on sparse matrix-vector multiply (SpMV) and BLAS1 operations, which are memory-bound
- The GPU can deliver good SpMV performance: roughly 10 to 20 GFLOPS for unstructured matrices in double precision
- The best sparse matrix data structure on the GPU differs from the one on the CPU; explore for your specific case
- A massively parallel preconditioner is key
- Lectures: Jon Cohen at the IMA Workshop, "Thinking parallel: sparse iterative solvers with CUDA"; Nathan Bell (4 parts) at PASI, "Iterative methods for sparse linear systems on GPU"

Typical Sparse Matrix Formats
[Figure: example sparsity patterns of a structured matrix and an unstructured matrix.]

Hybrid Sparse Matrix Format for GPUs
- ELL handles the typical entries (see the kernel sketch after this list)
- COO handles the exceptional entries, implemented with a segmented reduction
- There is some overhead in converting the matrix format, which can be hidden if the solver runs O(100) iterations
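To illustrate why the ELL part maps so well to the GPU, here is a minimal sketch of an ELL SpMV kernel in CUDA C (a simplified take on the approach implemented in the CUSP library mentioned above, not its actual code; the array layout, names and the use of -1 as a padding index are my assumptions, and the COO pass with its segmented reduction is omitted):

    // One thread per row: y = A*x for the ELL (regular) part of a HYB matrix.
    // ell_cols[] and ell_vals[] are stored column-major with pitch num_rows;
    // padded slots carry a column index of -1.
    __global__ void spmv_ell(int num_rows, int max_cols_per_row,
                             const int *ell_cols, const double *ell_vals,
                             const double *x, double *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= num_rows)
            return;

        double sum = 0.0;
        for (int i = 0; i < max_cols_per_row; ++i) {
            int col = ell_cols[i * num_rows + row];   // coalesced across the warp
            if (col >= 0)
                sum += ell_vals[i * num_rows + row] * x[col];
        }
        y[row] = sum;   // entries that overflowed ELL are added by a separate COO pass
    }

A launch such as spmv_ell<<<(num_rows + 255) / 256, 256>>>(...) gives one thread per row; the column-major layout is what keeps the per-iteration loads coalesced, which matters because SpMV is memory-bound.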
SpMV Performance for Unstructured Matrices
[Chart: double-precision SpMV throughput, 0 to 18 GFLOP/s, for a set of unstructured matrices in the COO, CSR (scalar), CSR (vector) and HYB formats. FLOPS = 2*nz/t; BW = (2*sizeof(double) + sizeof(int))/t.]

GPU TECHNOLOGY

Soul of NVIDIA's GPU Roadmap
- Increase performance per watt
- Make parallel programming easier
- Run more of the application on the GPU

Tesla CUDA GPU Roadmap
[Chart: double-precision GFLOPS per watt for the Tesla line: T10 (2008), Fermi (2010), Kepler (2012), Maxwell (2014).]

NVIDIA Announced "Project Denver", Jan 2011
"It's true folks, NVIDIA's building a CPU! Madness! The future just got a lot more exciting."
http://www.engadget.com/2011/01/05/nvidia-announces-project-denver-arm-cpu-for-the-desktop/

Tegra Roadmap
[Chart: projected relative performance, on a log scale from 1x to 100x, for Tegra 2 (2010), Kal-El (2011), Wayne (2012), Logan (2013) and Stark (2014), with a Core 2 Duo and a 5x marker shown for reference.]

CUDA 4.0: Big Leap in Usability
[Chart: ease of use versus performance across CUDA releases. CUDA 1.0: program GPUs using C, libraries. CUDA 2.0: debugging, profiling, double precision. CUDA 3.0: Fermi, new libraries, big performance boost. CUDA 4.0: parallel programming made easy.]

NVIDIA GPUDirect: Eliminating CPU Overhead
Accelerated communication with network and storage devices (supported since CUDA 3.1):
- Direct access to CUDA memory for third-party devices
- Eliminates unnecessary memory copies and CPU overhead
- Supported by Mellanox and QLogic
Peer-to-peer communication between GPUs (new in CUDA 4.0):
- Peer-to-peer memory access, transfers and synchronisation
- Less code, higher programmer productivity
http://developer.nvidia.com/gpudirect

MPI Integration of NVIDIA GPUDirect
MPI libraries with support for NVIDIA GPUDirect and Unified Virtual Addressing (UVA) enable:
- MPI transfer primitives that copy data directly to and from GPU memory
- An MPI library that can differentiate between device memory and host memory without any hints from the user
- Programmer productivity: less application code for data transfers

Code without MPI integration, staging through host buffers:

    /* Sender */
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

    /* Receiver */
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

Code with MPI integration, passing device pointers directly:

    /* Sender */
    MPI_Send(s_device, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

    /* Receiver */
    MPI_Recv(r_device, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);

Open MPI
- Transfer data directly to and from CUDA device memory via MPI calls
- Code is currently available in the Open MPI trunk (contributed by NVIDIA): http://www.open-mpi.org/nightly/trunk
- More details in the Open MPI FAQ:
  Features: http://www.open-mpi.org/faq/?category=running#mpi-cuda-support
  Build instructions: http://www.open-mpi.org/faq/?category=building#build-cuda

MVAPICH2-GPU
- Upcoming MVAPICH2 support for GPU-to-GPU communication, with device memory detection and overlap of CUDA copies with RDMA transfers
- Send/Recv, 4 MB messages: with GPUDirect, 45% improvement over Memcpy+Send and 24% over MemcpyAsync+Isend; without GPUDirect, 38% improvement over Memcpy+Send and 33% over MemcpyAsync+Isend
- One-sided Put: with GPUDirect, 45% improvement over Memcpy+Put; without GPUDirect, 39% improvement over Memcpy+Put; similar improvement for the Get operation
- Major improvement in programming productivity
Measurements from: H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", International Supercomputing Conference (ISC) 2011, Hamburg.
http://mvapich.cse.ohio-state.edu/

GPUDirect: Further Information
- http://developer.nvidia.com/gpudirect
- More details, including supported configurations
- Instructions and system design guidelines
- Also talk to NVIDIA Solution Architects

QUESTIONS
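For reference, a minimal end-to-end sketch of the CUDA-aware path shown in the MPI integration example, assuming an MPI build with GPUDirect/UVA support (for example an Open MPI build from the trunk referenced above) so that device pointers can be handed straight to MPI calls; the buffer name, message size and two-rank layout are my own choices:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank = 0, ndev = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ndev);
        if (ndev > 0)
            cudaSetDevice(rank % ndev);   /* one GPU per rank */

        const int n = 1 << 20;            /* 1M doubles per message */
        double *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(double));
        cudaMemset(d_buf, 0, n * sizeof(double));

        if (rank == 0) {
            /* CUDA-aware MPI: the device pointer goes straight to MPI_Send,
             * with no explicit staging copy through host memory. */
            MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Run with at least two ranks (for example, mpirun -np 2 ./a.out); with an MPI library that is not CUDA-aware, the same transfer would have to be staged through host buffers as in the "without MPI integration" snippet above.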