NVIDIA GPUs in Earth System Modelling
Thomas Bradley
© NVIDIA Corporation 2011

Agenda: GPU Developments for CWO
- Motivation for GPUs in CWO
- Parallelisation Considerations
- GPU Technology Roadmap
MOTIVATION FOR GPUS IN CWO
NVIDIA GPUs Power 3 of Top 5 Supercomputers

- #2: Tianhe-1A (7168 Tesla GPUs, 2.5 PFLOPS)
- #4: Nebulae (4650 Tesla GPUs, 1.2 PFLOPS)
- #5: Tsubame 2.0 (4224 Tesla GPUs, 1.194 PFLOPS)

"We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU; this is a new innovation."
Premier Wen Jiabao, public comments acknowledging Tianhe-1A
GPU Systems: More Power Efficient (for HPL)

[Chart: HPL performance (Gigaflops) and power (Megawatts) for Tianhe-1A, Jaguar, Nebulae, Tsubame and LBNL. The GPU-CPU supercomputers deliver more performance per watt than the CPU-only systems.]
Comparison with Top Supercomputer K in Japan

                 K Computer          Tsubame
Processors       Custom SPARC        Intel CPUs + NVIDIA Tesla
Performance      8.1 PetaFlops       1.2 PetaFlops
Compute units    68,500 CPUs         2K CPUs, 4K GPUs
Racks            672                 44 (2.3x better flops/rack)
Power            10 Megawatts        1.4 Megawatts (1.06x better flops/watt)
Cost             $700 Million        $40 Million (2.6x better $/flop)

Real Science on GPUs: ASUCA NWP on Tsubame
Tsubame 2.0, Tokyo Institute of Technology: 1.19 Petaflops, 4,224 Tesla M2050 GPUs

ASUCA run: 3990 Tesla M2050s, 145.0 TFLOPS single precision, 76.1 TFLOPS double precision

Simulation on Tsubame 2.0, TiTech Supercomputer
CWO Performance

- Physics only: WRF, COSMO (1)
- Dynamics only: COSMO (2), ICON, NIM (sample physics), CAM, HOMME, HIRLAM
- Full GPU approach: ASUCA, GEOS-5, GRAPES
NVIDIA Features GPUs at Conferences

Supercomputing 2010 | Nov 2010 | New Orleans, LA
COSMO: GPU Considerations for Next Generation Weather Simulations Thomas Schulthess, Swiss National Supercomputing Centre (CSCS)
ASUCA: Full GPU Implementation of Weather Prediction Code on TSUBAME Supercomputer Takayuki Aoki, GSIC of Tokyo Institute of Technology (TiTech)
NIM: Using GPUs to Run Next-Generation Weather Models Mark Govett, National Oceanic and Atmospheric Administration (NOAA)
BoF: GPUs and Numerical Weather Prediction (organized by CSCS and NVIDIA) Featured organizations: TiTech (ASUCA), NASA (GEOS-5), NOAA (NIM), Cray, PGI
NVIDIA GPU Technology Conference | Sep 2010 | San Jose, CA
ASUCA: Full GPU Implementation of Weather Prediction Code on TSUBAME Supercomputer Takayuki Aoki, GSIC of Tokyo Institute of Technology (TiTech)
NIM: Using GPUs to Run Next-Generation Weather Models Mark Govett, National Oceanic and Atmospheric Administration (NOAA)
MITgcm: Designing a Geoscience Accelerator Library Accessible from High Level Languages Chris Hill, Massachusetts Institute of Technology (MIT)
PARALLELISATION CONSIDERATIONS
GPU Considerations for CWO Codes
- Initial efforts are mostly implicit linear solvers on the GPU
  - If the linear solver is ~50% of profile time, only a 2x speed-up is possible (Amdahl's law)
  - More of the application must be moved to the GPU for additional benefit
  - Explicit schemes: no linear algebra, no solver; operations on a stencil
- Most codes are parallel and scale across multiple CPU cores
  - Multi-core CPUs can contribute to parallel matrix assembly, among other tasks
- Most codes use a domain-decomposition parallel method
  - This fits the GPU model very well and preserves the costly MPI investment
Options for Parallel Programming of GPUs

Approach       Examples
Applications   MATLAB, Mathematica, LabVIEW
Libraries      FFT, BLAS, SPARSE, RNG, IMSL, CUSP, etc.
Directives     PGI Accelerator, HMPP, Cray, F2C-ACC
Wrappers       PyCUDA, CUDA.NET, jCUDA
Languages      CUDA C/C++, PGI CUDA Fortran, GPU.NET
APIs           CUDA C/C++, OpenCL
Most Implementations Focus on Dynamical Core

[Diagram: the application code is split into the dynamics and the rest of the code; the dynamics is parallelised with CUDA on the GPU, while the rest of the code remains on the CPU.]

Sparse Iterative Solvers for Dynamics
- Sparse matrix-vector multiply (SpMV) and BLAS1 operations
  - Memory-bound
- The GPU can deliver good SpMV performance
  - ~10-20 GFLOPS for unstructured matrices in double precision
- The best sparse matrix data structure on the GPU differs from the CPU
  - Explore for your specific case
- A massively parallel preconditioner is key

Lectures:
- Jon Cohen at IMA Workshop: "Thinking parallel: sparse iterative solvers with CUDA"
- Nathan Bell (4 parts) at PASI: "Iterative methods for sparse linear systems on GPU"
Typical Sparse Matrix Formats

[Figure: example sparsity patterns of a structured matrix and an unstructured matrix.]
Hybrid Sparse Matrix Format for GPUs

- ELL handles typical entries
- COO handles exceptional entries
  - Implemented with segmented reduction
- Some overhead in matrix format conversion; it can be hidden if the solver runs O(100) iterations
SpMV Performance for Unstructured Matrices

[Chart: double-precision SpMV throughput in GFLOP/s (0-18 scale) for the COO, CSR (scalar), CSR (vector) and HYB formats across a set of unstructured matrices.]

Flops = 2*nz/t, BW = (2*sizeof(double) + sizeof(int))*nz/t
GPU TECHNOLOGY
Soul of NVIDIA's GPU Roadmap

- Increase performance / watt
- Make parallel programming easier
- Run more of the application on the GPU
Tesla CUDA GPU Roadmap

[Chart: double-precision GFLOPS per watt, roughly 0-16: T10 (2008), Fermi (2010), Kepler (2012), Maxwell (2014).]
NVIDIA Announced "Project Denver", Jan 2011

"It's true folks, NVIDIA's building a CPU! Madness! The future just got a lot more exciting."
http://www.engadget.com/2011/01/05/nvidia-announces-project-denver-arm-cpu-for-the-desktop/
Tegra Roadmap

[Chart: relative performance on a logarithmic scale (1-100) by year: Tegra 2 (2010), Kal-El (2011, marked at 5x with a Core 2 Duo reference), Wayne (2012), Logan (2013), Stark (2014).]
CUDA 4.0: Big Leap in Usability

[Chart: ease of use vs performance across CUDA releases.]

- CUDA 1.0: program GPUs using C, libraries
- CUDA 2.0: debugging, profiling, double precision
- CUDA 3.0: Fermi support, new libraries, big performance boost
- CUDA 4.0: parallel programming made easy
NVIDIA GPUDirect™: Eliminating CPU Overhead

Accelerated communication with network and storage devices (supported since CUDA 3.1):
- Direct access to CUDA memory for 3rd-party devices
- Eliminates unnecessary memory copies and CPU overhead
- Supported by Mellanox and QLogic

Peer-to-peer communication between GPUs (new in CUDA 4.0):
- Peer-to-peer memory access, transfers and synchronisation
- Less code, higher programmer productivity
http://developer.nvidia.com/gpudirect

MPI Integration of NVIDIA GPUDirect™

- MPI libraries with support for NVIDIA GPUDirect and Unified Virtual Addressing (UVA) enable:
  - MPI transfer primitives that copy data directly to/from GPU memory
  - the MPI library to differentiate between device memory and host memory without any hints from the user
  - programmer productivity: less application code for data transfers
Code without MPI integration:

    At sender:
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

    At receiver:
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

Code with MPI integration:

    At sender:
    MPI_Send(s_device, size, …);

    At receiver:
    MPI_Recv(r_device, size, …);
Open MPI

- Transfer data directly to/from CUDA device memory via MPI calls
- Code is currently available in the Open MPI trunk (contributed by NVIDIA):
  - http://www.open-mpi.org/nightly/trunk
- More details in the Open MPI FAQ:
  - Features: http://www.open-mpi.org/faq/?category=running#mpi-cuda-support
  - Build instructions: http://www.open-mpi.org/faq/?category=building#build-cuda
MVAPICH2-GPU

- Upcoming MVAPICH2 support for GPU-GPU communication
- Memory detection and overlap of CUDA copy and RDMA transfer

Two-sided (Send/Recv):
- With GPUDirect: 45% improvement compared to Memcpy+Send (4MB); 24% improvement compared to MemcpyAsync+Isend (4MB)
- Without GPUDirect: 38% improvement compared to Memcpy+Send (4MB); 33% improvement compared to MemcpyAsync+Isend (4MB)

One-sided (Put/Get):
- With GPUDirect: 45% improvement compared to Memcpy+Put
- Without GPUDirect: 39% improvement compared with Memcpy+Put; similar improvement for Get

Major improvement in programming productivity.

Measurements from: H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", Int'l Supercomputing Conference 2011 (ISC), Hamburg. http://mvapich.cse.ohio-state.edu/

GPUDirect: Further Information
- http://developer.nvidia.com/gpudirect
  - More details including supported configurations
  - Instructions
  - System design guidelines
- Also talk to NVIDIA Solution Architects
QUESTIONS