GPUs in Earth System Modelling

Thomas Bradley

Agenda: GPU Developments for CWO

- Motivation for GPUs in CWO
- Parallelisation Considerations
- GPU Technology Roadmap

MOTIVATION FOR GPUS IN CWO

NVIDIA GPUs Power 3 of Top 5

#2 Tianhe-1A: 7,168 Tesla GPUs, 2.5 PFLOPS
#4 Nebulae: 4,650 Tesla GPUs, 1.2 PFLOPS
#5 Tsubame 2.0: 4,224 Tesla GPUs, 1.194 PFLOPS

“We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU; this is a new innovation.” (Premier Wen Jiabao, public comments acknowledging Tianhe-1A)

GPU Systems: More Power Efficient (for HPL)

[Chart: HPL performance and power for Tianhe-1A, Jaguar, Nebulae, Tsubame, and an LBNL system, comparing GPU-CPU machines with CPU-only machines]

Comparison with Top Supercomputer K in Japan

              K Computer (custom SPARC)   Tsubame 2.0 (CPUs + GPUs)   Tsubame advantage
Performance   8.1 PetaFlops               1.2 PetaFlops
Processors    68,500 CPUs                 2K CPUs, 4K GPUs
Racks         672                         44                          2.3x better per rack
Power         10 Megawatts                1.4 Megawatts               1.06x better flops/watt
Cost          $700 Million                $40 Million                 2.6x better $/flop

Real Science on GPUs: ASUCA NWP on Tsubame

Tsubame 2.0, Tokyo Institute of Technology: 1.19 Petaflops, 4,224 Tesla M2050 GPUs

ASUCA running on 3,990 Tesla M2050s: 145.0 TFLOPS single precision, 76.1 TFLOPS double precision

Simulation on Tsubame 2.0, the TiTech supercomputer

CWO Performance

- Physics only: WRF, COSMO (1)
- Dynamics only: COSMO (2), ICON, NIM (sample physics), CAM, HOMME, HIRLAM
- Full GPU approach: ASUCA, GEOS-5, GRAPE


NVIDIA Features GPUs at Conferences

Supercomputing 2010 | Nov 2010 | New Orleans, LA

COSMO: GPU Considerations for Next Generation Weather Simulations Thomas Schulthess, Swiss National Supercomputing Centre (CSCS)

ASUCA: Full GPU Implementation of Weather Prediction Code on TSUBAME Supercomputer Takayuki Aoki, GSIC of Tokyo Institute of Technology (TiTech)

NIM: Using GPUs to Run Next-Generation Weather Models Mark Govett, National Oceanic and Atmospheric Administration (NOAA)

BoF: GPUs and Numerical Weather Prediction (organized by CSCS and NVIDIA) Featured organizations: TiTech (ASUCA), NASA (GEOS-5), NOAA (NIM), Cray, PGI

NVIDIA GPU Technology Conference | Sep 2010 | San Jose, CA

ASUCA: Full GPU Implementation of Weather Prediction Code on TSUBAME Supercomputer Takayuki Aoki, GSIC of Tokyo Institute of Technology (TiTech)

NIM: Using GPUs to Run Next-Generation Weather Models Mark Govett, National Oceanic and Atmospheric Administration (NOAA)

MITgcm: Designing a Geoscience Accelerator Library Accessible from High Level Languages Chris Hill, Massachusetts Institute of Technology (MIT)


PARALLELISATION CONSIDERATIONS

GPU Considerations for CWO Codes

- Initial efforts mostly put implicit linear solvers on the GPU
  - If the linear solver is ~50% of profile time, only a 2x speed-up is possible (see the Amdahl's-law sketch after this list)
  - More of the application must be moved to the GPU for additional benefit
  - Explicit schemes: no linear algebra, no solver, just operations on a stencil

- Most codes are parallel and scale across multiple CPU cores
  - Multi-core CPUs can contribute to parallel matrix assembly, among other tasks

- Most codes use a domain-decomposition parallel method
  - This fits the GPU model very well and preserves the costly MPI investment
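A quick Amdahl's-law check of the 2x figure above (a sketch, with p the fraction of runtime spent in the solver and s the speed-up of the solver alone):

    overall speed-up S = 1 / ((1 - p) + p/s)

With p = 0.5, S approaches 1 / (1 - 0.5) = 2 even as s goes to infinity: an infinitely fast solver still leaves the other half of the runtime untouched.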

Options for Parallel Programming of GPUs

Approach       Examples
Applications   MATLAB, Mathematica, LabVIEW
Libraries      FFT, BLAS, SPARSE, RNG, IMSL, CUSP, etc.
Directives     PGI Accelerator, HMPP, Cray, F2C-Acc
Wrappers       PyCUDA, CUDA.NET, jCUDA
Languages      CUDA C/C++, PGI CUDA Fortran, GPU.net
APIs           CUDA C/C++, OpenCL
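As an illustration of the Languages/APIs rows, here is a minimal CUDA C sketch (kernel and variable names are illustrative, not from the talk) that scales a field array on the GPU; a directive-based or CUDA Fortran version would express the same loop:

    #include <cuda_runtime.h>

    // Scale every grid point of a field by a constant: one thread per point.
    __global__ void scaleField(double *field, double alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            field[i] *= alpha;
    }

    int main()
    {
        const int n = 1 << 20;
        double *d_field;
        cudaMalloc(&d_field, n * sizeof(double));
        cudaMemset(d_field, 0, n * sizeof(double));

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scaleField<<<blocks, threads>>>(d_field, 0.5, n);
        cudaDeviceSynchronize();

        cudaFree(d_field);
        return 0;
    }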

Most Implementations Focus on the Dynamical Core

[Diagram: the application code is split; the dynamics is parallelised with CUDA on the GPU, while the rest of the code stays on the CPU]

Sparse Iterative Solvers for Dynamics

- Sparse matrix-vector multiply (SpMV) and BLAS1 operations are memory-bound
- The GPU can deliver good SpMV performance: ~10-20 GFLOPS for unstructured matrices in double precision (a rough bandwidth estimate follows this list)
- The best sparse-matrix data structure on the GPU differs from the one on the CPU; explore options for your specific case
- A massively parallel preconditioner is key. Lectures: Jon Cohen at the IMA Workshop, "Thinking parallel: sparse iterative solvers with CUDA"; Nathan Bell (4 parts) at PASI, "Iterative methods for sparse linear systems on GPU"
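A rough back-of-the-envelope check on the 10-20 GFLOPS figure, assuming double-precision CSR SpMV (an 8-byte value, a 4-byte column index, and an 8-byte read of x per non-zero) and, as an assumption, roughly 140 GB/s of sustained memory bandwidth on a Fermi-class Tesla:

    flops per non-zero  = 2 (one multiply, one add)
    bytes per non-zero  ~ 8 + 4 + 8 = 20
    attainable rate     ~ 140 GB/s * (2 flops / 20 bytes) = 14 GFLOPS

so the kernel is limited by memory traffic, not arithmetic.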

Typical Sparse Matrix Formats

[Figure: example sparsity patterns, structured vs. unstructured]

Hybrid Sparse Matrix Format for GPUs

- ELL handles the typical entries
- COO handles the exceptional entries, implemented with a segmented reduction

- There is some overhead in converting to the matrix format, but it can be hidden if the solver runs O(100) iterations
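A sketch of how a HYB matrix might be laid out on the device (field names are hypothetical; the actual CUSP data structures differ in detail): the ELL part stores a fixed number of entries per row in padded, column-major arrays, and the COO part holds the overflow as (row, column, value) triplets:

    // Hypothetical HYB (ELL + COO) device layout for an n-row matrix.
    struct HybMatrixDevice {
        int     num_rows;

        // ELL part: ell_width entries per row, padded, stored column-major so
        // that consecutive threads (one per row) read consecutive addresses.
        int     ell_width;
        int    *ell_cols;   // num_rows * ell_width column indices, padded with -1
        double *ell_vals;   // num_rows * ell_width values, padded with 0.0

        // COO part: the exceptional entries that overflow the ELL width.
        int     coo_nnz;
        int    *coo_rows;
        int    *coo_cols;
        double *coo_vals;
    };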

SpMV Performance for Unstructured Matrices

[Chart: double-precision SpMV throughput in GFLOP/s (0-18) on unstructured matrices for the COO, CSR (scalar), CSR (vector), and HYB formats]

Metrics: Flops = 2*nz/t, BW = (2*sizeof(double) + sizeof(int))*nz/t
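For reference, a minimal CUDA sketch of the scalar-CSR variant measured above (one thread per row; names are illustrative). The vector-CSR kernel assigns a warp per row, and the HYB kernel combines an ELL sweep with a segmented reduction over the COO entries:

    // Scalar CSR SpMV: y = A * x, one thread per matrix row.
    __global__ void spmvCsrScalar(int num_rows,
                                  const int    *row_ptr,
                                  const int    *col_idx,
                                  const double *vals,
                                  const double *x,
                                  double       *y)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < num_rows) {
            double sum = 0.0;
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                sum += vals[j] * x[col_idx[j]];
            y[row] = sum;
        }
    }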

GPU TECHNOLOGY

Soul of NVIDIA’s GPU Roadmap

Increase Performance / Watt

Make Parallel Programming Easier

Run more of the Application on the GPU

Tesla CUDA GPU Roadmap

[Chart: Tesla CUDA GPU roadmap, double-precision GFLOPS per watt versus year: T10 (2008), Fermi (2010), Kepler (2012), Maxwell (2014)]

NVIDIA Announced “Project Denver” Jan 2011

"It's true folks, NVIDIA's building a CPU! Madness! The future just got a lot more exciting." http://www.engadget.com/2011/01/05/nvidia-announces-project-denver-arm-cpu-for-the-desktop/

Roadmap

[Chart: Tegra roadmap, relative performance versus year on a log scale: TEGRA 2 (2010), KAL-EL (2011, ~5x), WAYNE (2012), LOGAN (2013), STARK (2014), with Core 2 Duo shown as a reference]

CUDA 4.0: Big Leap In Usability

[Chart: ease of use versus performance across CUDA releases]

- CUDA 1.0: program GPUs using C, libraries
- CUDA 2.0: debugging, profiling, double precision
- CUDA 3.0: Fermi, new libraries, big performance boost
- CUDA 4.0: parallel programming made easy

NVIDIA GPUDirect™: Eliminating CPU Overhead

Accelerated communication with network and storage devices (supported since CUDA 3.1):
- Direct access to CUDA memory for 3rd-party devices
- Eliminates unnecessary memory copies and CPU overhead
- Supported by Mellanox and QLogic

Peer-to-peer communication between GPUs (new in CUDA 4.0):
- Peer-to-peer memory access, transfers and synchronization
- Less code, higher programmer productivity
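A minimal sketch of the CUDA 4.0 peer-to-peer path (the runtime calls are standard; buffer names are illustrative):

    #include <cuda_runtime.h>

    // Copy a buffer that lives on GPU 0 to a buffer on GPU 1, using direct
    // peer access over PCIe when the hardware supports it.
    void copyBetweenGpus(void *dst_on_gpu1, const void *src_on_gpu0, size_t bytes)
    {
        int can_access = 0;
        cudaDeviceCanAccessPeer(&can_access, 1, 0);   // can device 1 access device 0?

        if (can_access) {
            cudaSetDevice(1);
            cudaDeviceEnablePeerAccess(0, 0);         // enable access to device 0
        }

        // With peer access enabled the copy bypasses host memory entirely;
        // otherwise the runtime stages it through the host transparently.
        cudaMemcpyPeer(dst_on_gpu1, 1, src_on_gpu0, 0, bytes);
    }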

http://developer.nvidia.com/gpudirect

MPI Integration of NVIDIA GPUDirect™

- MPI libraries with support for NVIDIA GPUDirect and Unified Virtual Addressing (UVA) enable:
  - MPI transfer primitives that copy data directly to/from GPU memory
  - An MPI library that can differentiate between device memory and host memory without any hints from the user
  - Programmer productivity: less application code for data transfers

Code without MPI integration:

    At sender:
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

    At receiver:
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

Code with MPI integration:

    At sender:
    MPI_Send(s_device, size, …);

    At receiver:
    MPI_Recv(r_device, size, …);

Open MPI

- Transfer data directly to/from CUDA device memory via MPI calls
- The code is currently available in the Open MPI trunk (contributed by NVIDIA): http://www.open-mpi.org/nightly/trunk
- More details in the Open MPI FAQ:
  - Features: http://www.open-mpi.org/faq/?category=running#mpi-cuda-support
  - Build instructions: http://www.open-mpi.org/faq/?category=building#build-cuda

MVAPICH2-GPU

- Upcoming MVAPICH2 support for GPU-to-GPU communication
- Memory detection and overlap of CUDA copies with RDMA transfers

- With GPUDirect: 45% improvement compared to Memcpy+Send (4 MB); 24% improvement compared to MemcpyAsync+Isend (4 MB); 45% improvement compared to Memcpy+Put
- Without GPUDirect: 38% improvement compared to Memcpy+Send (4 MB); 33% improvement compared to MemcpyAsync+Isend (4 MB); 39% improvement compared to Memcpy+Put, with similar improvement for the Get operation
- Major improvement in programming

Measurements from: H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", Int'l Supercomputing Conference 2011 (ISC), Hamburg. http://mvapich.cse.ohio-state.edu/

GPUDirect: Further Information

- http://developer.nvidia.com/gpudirect
- More details, including supported configurations, instructions, and system design guidelines
- Also talk to NVIDIA Solution Architects

QUESTIONS
