
GPU-Based Computing

Dairsie Latimer, Petapath, UK

About Petapath

Petapath was founded in 2008 to focus on delivering innovative hardware and software solutions into the high performance computing (HPC) markets

Partnered with HP and SGI to deliver two Petascale prototype systems as part of the PRACE WP8 programme. These systems are testbeds for new ideas in the usability and efficiency of large installations

Active in exploiting emerging standards for acceleration technologies; a member of the Khronos Group, sitting on the OpenCL working group

We also provide consulting expertise for companies wishing to explore the advantages offered by heterogeneous systems

What is Heterogeneous or GPU Computing?

Heterogeneous computing: computing with CPU + GPU, connected over the PCIe bus.

Low Latency or High Throughput?

CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution.
GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.
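To make the contrast concrete, here is a minimal CUDA sketch (my illustration, not from the original slides) of the throughput-oriented, data-parallel style the GPU favours: one lightweight thread per element, with memory latency hidden by keeping many threads in flight.

    // One thread per array element; thousands of resident threads hide memory latency.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            y[i] = a * x[i] + y[i];                      // independent per-element work
    }

    // Host launch: enough blocks of 256 threads to cover all n elements.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);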

NVIDIA GPU Computing Ecosystem

[Diagram: the ecosystem links ISVs, CUDA training companies, CUDA development specialists, TPP/OEMs, VARs and hardware architects around the CUDA SDK & tools and the GPU hardware architecture, connecting customer application requirements through development and deployment to NVIDIA hardware solutions.]

Science is Desperate for Throughput

[Chart: sustained supercomputer performance in gigaflops, 1982-2012, on a trajectory from gigaflops through 1 Petaflop towards 1 Exaflop. Annotation: one simulation ran for 8 months to simulate 2 nanoseconds.]

Power Crisis in Supercomputing

[Chart: household power equivalent of leading systems, 1982-2020: a Gigaflop machine draws the power of a block, a Teraflop a neighborhood, a Petaflop a town, and an Exaflop a city.]

Enter the GPU

NVIDIA GPU Product Families: GeForce® for entertainment, Quadro® for design & creation, and Tesla™ for high-performance computing.

NEXT-GENERATION GPU ARCHITECTURE — ‘FERMI’

Introducing the ‘Fermi’ Tesla Architecture
The soul of a supercomputer in the body of a GPU

3 billion transistors
Up to 2× the cores (the C2050 has 448)
Up to 8× the peak double-precision performance
ECC on all memories
L1 and L2 caches
Improved memory bandwidth (GDDR5)
Up to 1 Terabyte of GPU memory
Concurrent kernels
Hardware support for C++
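Several of these capabilities can be checked from host code by querying the device. A minimal sketch, assuming a CUDA 3.x-era or later runtime (the ECCEnabled and concurrentKernels fields appeared around that time):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                 // query device 0

        printf("Device:              %s\n", prop.name);
        printf("Compute capability:  %d.%d\n", prop.major, prop.minor);
        printf("Multiprocessors:     %d\n", prop.multiProcessorCount);
        printf("Global memory (MB):  %lu\n", (unsigned long)(prop.totalGlobalMem >> 20));
        printf("ECC enabled:         %d\n", prop.ECCEnabled);
        printf("Concurrent kernels:  %d\n", prop.concurrentKernels);
        return 0;
    }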

Design Goal of Fermi

Expand the performance sweet spot of the GPU; bring more users and more applications to the GPU.
[Diagram: the sweet spot expanding from data-parallel work on large data sets towards instruction-parallel work with many decisions.]

Streaming Multiprocessor Architecture

32 CUDA cores per SM (512 total)

8× the peak double-precision floating-point performance (50% of peak single precision)

Dual warp scheduler

64 KB of on-chip RAM per SM for shared memory and L1 cache (configurable split)
16 load/store units and 4 special function units per SM
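The shared-memory/L1 split is chosen per kernel from host code. A minimal sketch, assuming a Fermi-class device and a CUDA 3.x or later runtime; myKernel is a hypothetical kernel:

    #include <cuda_runtime.h>

    // Hypothetical kernel that leans on on-chip shared memory.
    __global__ void myKernel(float *data)
    {
        __shared__ float tile[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[i];
        __syncthreads();
        data[i] = tile[threadIdx.x] * 2.0f;    // stand-in for real work on the tile
    }

    int main(void)
    {
        float *d_data;
        cudaMalloc(&d_data, 1024 * 256 * sizeof(float));

        // Request the 48 KB shared / 16 KB L1 split for this kernel;
        // cudaFuncCachePreferL1 requests the opposite split.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

        myKernel<<<1024, 256>>>(d_data);
        cudaFree(d_data);
        return 0;
    }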

© NVIDIA Corporation 2010 CUDA Core Architecture

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Fused multiply-add (FMA) instruction for both single and double precision

New integer ALU optimized for 64-bit and extended-precision operations
[Diagram: each CUDA core pairs an FP unit with an INT unit; the SM adds 16 load/store units and 4 special function units.]
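As an illustration of the fused multiply-add behaviour (my sketch, not from the slides): the device functions fmaf() and fma() map to single FMA instructions, so the product is not rounded before the add.

    __global__ void fma_demo(const float *x, const float *y,
                             const float *z, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = fmaf(x[i], y[i], z[i]);   // (x*y)+z with a single rounding step
    }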

Cached

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

L1 Cache per SM (32 cores) Improves bandwidth and reduces latency

Unified L2 Cache (768 KB) Fast, coherent data sharing across all cores in the GPU

[Diagram: the Parallel DataCache™ memory hierarchy (shared memory/L1 per SM, unified L2, DRAM), with the GigaThread scheduler.]

Larger, Faster, Resilient Memory Interface

GDDR5 memory interface: 2× the signaling speed of GDDR3

Up to 1 Terabyte of memory attached to the GPU; operate on larger data sets (3 GB and 6 GB cards)

ECC protection for GDDR5 DRAM

All major internal memories are ECC protected: register file, L1 cache, L2 cache

GigaThread Hardware Thread Scheduler

GigaThread Transfer Engine

Dual DMA engines: simultaneous CPU→GPU and GPU→CPU data transfer, fully overlapped with CPU and GPU processing time

[Activity snapshot: kernels 0-3 running while the two streaming data transfer (SDT) engines, SDT0 and SDT1, move data concurrently.]
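A minimal sketch of how this overlap is expressed in application code (my example, not from the slides): copies issued with cudaMemcpyAsync from pinned host memory into a stream use a DMA engine and can overlap with work in other streams and on the CPU.

    #include <cuda_runtime.h>

    #define N (1 << 20)

    __global__ void scale(float *x, int n)               // trivial illustrative kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void)
    {
        float *h_buf, *d_buf;
        cudaMallocHost(&h_buf, N * sizeof(float));       // pinned host memory: needed for async copies
        cudaMalloc(&d_buf, N * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copy up, compute, copy back, all queued in one stream; the copies use a
        // DMA engine and overlap with whatever other streams and the CPU are doing.
        cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice, stream);
        scale<<<(N + 255) / 256, 256, 0, stream>>>(d_buf, N);
        cudaMemcpyAsync(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);                   // CPU is free to work until this point
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }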

Enhanced Software Support

Many new features in CUDA Toolkit 3.0 (to be released on Friday)

Including early support for the Fermi architecture: native 64-bit GPU support, multiple copy engine support, ECC reporting, concurrent kernel execution, and Fermi hardware debugging support in cuda-gdb
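Concurrent kernel execution uses the same stream mechanism as the asynchronous copies above; a fragment (kernelA, kernelB and their launch configurations are hypothetical):

    cudaStream_t sA, sB;
    cudaStreamCreate(&sA);
    cudaStreamCreate(&sB);

    // On Fermi, independent kernels launched into different streams may run
    // concurrently when neither kernel fills the whole GPU on its own.
    kernelA<<<gridA, blockA, 0, sA>>>(d_a);
    kernelB<<<gridB, blockB, 0, sB>>>(d_b);

    cudaStreamSynchronize(sA);
    cudaStreamSynchronize(sB);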

Enhanced Software Support

OpenCL 1.0 support: a first-class language citizen in the CUDA architecture. Supports the ICD (so interoperability between vendors is a possibility). Profiling support is available; debug support is coming to Parallel Nsight (NEXUS) soon.
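Because vendors register their implementations with the ICD loader, a host program can enumerate whichever OpenCL platforms are installed and choose between them at run time. A minimal C sketch, assuming OpenCL 1.0 headers and an ICD-enabled loader:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint count = 0;
        clGetPlatformIDs(8, platforms, &count);   // the ICD loader reports every registered vendor

        for (cl_uint i = 0; i < count; ++i) {
            char name[256];
            clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
            printf("Platform %u: %s\n", (unsigned)i, name);
        }
        return 0;
    }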

gDebugger CL from graphicREMEDY: a third-party OpenCL profiler/debugger/memory checker

The software tools ecosystem is starting to grow, driven by the existence of OpenCL

“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is "expected to be 10-times more powerful than today's fastest supercomputer."

Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 PFlops…

…we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”

September 30 2009

PRODUCTS

Tesla GPU Computing Products: 10 Series

SuperMicro 1U GPU SuperServer: 2 Tesla GPUs; single precision 1.87 Teraflops; double precision 156 Gigaflops; memory 8 GB (4 GB / GPU)
Tesla S1070 1U System: 4 Tesla GPUs; single precision 4.14 Teraflops; double precision 346 Gigaflops; memory 16 GB (4 GB / GPU)
Tesla C1060 Computing Board: 1 Tesla GPU; single precision 933 Gigaflops; double precision 78 Gigaflops; memory 4 GB
Tesla Personal Supercomputer: 4 Tesla GPUs; single precision 3.7 Teraflops; double precision 312 Gigaflops; memory 16 GB (4 GB / GPU)

Tesla GPU Computing Products: 20 Series

Tesla S2050 1U System: 4 Tesla GPUs; double precision 2.1 – 2.5 Teraflops; memory 12 GB (3 GB / GPU)
Tesla S2070 1U System: 4 Tesla GPUs; double precision 2.1 – 2.5 Teraflops; memory 24 GB (6 GB / GPU)
Tesla C2050 Computing Board: 1 Tesla GPU; double precision 500+ Gigaflops; memory 3 GB
Tesla C2070 Computing Board: 1 Tesla GPU; double precision 500+ Gigaflops; memory 6 GB

HETEROGENEOUS CLUSTERS

Data Centers: Space and Energy Limited

Traditional data center cluster: quad-core CPUs, 8 cores per server; 1000's of cores means 1000's of servers, and 2× the performance requires 2× the number of servers.

Heterogeneous data center cluster: 10,000's of cores in 100's of servers.

Augment/replace host servers

Cluster Deployment

There are now a number of GPU-aware cluster management systems: ActiveEon ProActive Parallel Suite® Version 4.2, Platform Cluster Manager and HPC Workgroup, and Streamline Computing GPU Environment (SCGE)

• Not just installation aids (i.e. putting the driver and toolkits in the right place); they are now starting to provide GPU discovery and job steering

NVIDIA and Mellanox: better interoperability between Mellanox InfiniBand adapters and NVIDIA Tesla GPUs. Can provide as much as a 30% performance improvement by eliminating unnecessary data movement in a multi-node heterogeneous application.
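The gain comes from letting the InfiniBand stack and CUDA share the same pinned host buffer instead of staging data through an extra copy. A rough sketch of the application-side pattern (my illustration; d_result, N and peer_rank are hypothetical, and the actual optimisation lives in the NVIDIA/Mellanox drivers):

    float *buf;
    cudaHostAlloc(&buf, N * sizeof(float), cudaHostAllocDefault);  // pinned, DMA-able host memory

    // The GPU copies results into the pinned buffer...
    cudaMemcpy(buf, d_result, N * sizeof(float), cudaMemcpyDeviceToHost);

    // ...and MPI over InfiniBand sends from the very same buffer,
    // avoiding an intermediate host-to-host staging copy.
    MPI_Send(buf, N, MPI_FLOAT, peer_rank, 0, MPI_COMM_WORLD);

    cudaFreeHost(buf);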

Cluster Deployment

A number of cluster and distributed debug tools now support CUDA and NVIDIA Tesla

Allinea® DDT for NVIDIA CUDA: extends the well-known Distributed Debugging Tool (DDT) with CUDA support

TotalView® debugger (part of an Early Experience Program): extends TotalView with CUDA support; intentions to support OpenCL have also been announced

Both are based on the Parallel Nsight (NEXUS) debugging API

NVIDIA RealityServer 3.0

A platform for running 3D web applications

Consists of a Tesla RS GPU-based server cluster running RealityServer software from mental images

Deployed in a number of different sizes, from 2 to 100's of 1U servers

iray®: interactive photorealistic rendering technology. Streams interactive 3D applications to any web-connected device. Designers and architects can now share and visualize complex 3D models under different lighting and environmental conditions.

PROJECTS

Distributed Computing Projects

Traditional (non-commercial) distributed computing projects have been making use of GPUs for some time. They typically have 1,000's to 10,000's of contributors; Folding@Home has access to 6.5 PFLOPS of compute, of which ~95% comes from GPUs or PS3s.

Many are bioinformatics, molecular dynamics and quantum chemistry codes, which represent the current sweet-spot applications.

Ubiquity of GPUs in home systems helps

Distributed Computing Projects

Folding@Home: directed by Prof. Vijay Pande at Stanford University (http://folding.stanford.edu/)

The most recent GPU3 core is based on OpenMM 1.0 (https://simtk.org/home/openmm). The OpenMM library provides tools for molecular modeling and simulation; it can be hooked into any MM application, allowing that code to do molecular modeling with minimal extra effort. OpenMM has a strong emphasis on providing not just a consistent API but much greater performance.

The current NVIDIA target is via CUDA Toolkit 2.3

OpenMM 1.0 also provides beta support for OpenCL

OpenCL is the long-term convergence software platform

Distributed Computing Projects

Berkeley Open Infrastructure for Network Computing (BOINC) project (http://boinc.berkeley.edu/). The platform infrastructure originally evolved from SETI@home.

Many projects use BOINC and several of these have heterogeneous compute implementations (http://boinc.berkeley.edu/wiki/GPU_computing). Examples include: GPUGRID.net, SETI@home, Milkyway@home (an IEEE 754 double-precision-capable GPU is required), AQUA@home, Lattice, and Collatz Conjecture.

Distributed Computing Projects

GPUGRID.net: Dr. Gianni De Fabritiis, Research Group of Biomedical Informatics, University Pompeu Fabra-IMIM, Barcelona

Uses GPUs to deliver high-performance all-atom biomolecular simulation of proteins using ACEMD (http://multiscalelab.org/acemd). ACEMD is a production biomolecular dynamics code specially optimized to run on graphics processing units (GPUs) from NVIDIA. It reads CHARMM/NAMD and AMBER input files with a simple and powerful configuration interface.

A commercial implementation of ACEMD is available from Acellera Ltd (http://www.acellera.com/acemd/). What makes this particularly interesting is that it is implemented using OpenCL.

Distributed Computing Projects

Projects have had to use brute-force methods to deal with robustness: run the same work unit (WU) with multiple users and compare the results.

Running on purpose-designed heterogeneous grids with ECC means that some of the paranoia can be relaxed (soft errors or WU corruption can at least be detected), resulting in better throughput on these systems.

But this does result in divergence between consumer and HPC devices, which should be compensated for by HPC-class devices being about 4× faster.

Tesla Bio Workbench: Accelerating New Science

January, 2010

http://www.nvidia.com/bio_workbench

Introducing Tesla Bio WorkBench

[Diagram: the Tesla Bio Workbench brings together GPU-accelerated applications (e.g. TeraChem, LAMMPS, GPU-AutoDock, MUMmerGPU), community resources (downloads, documentation, technical papers, benchmarks & configurations, discussion forums) and hardware (Tesla Personal Supercomputer, Tesla GPU clusters).]

Tesla Bio Workbench Applications

Molecular dynamics and quantum chemistry: AMBER (MD), ACEMD (MD), GROMACS (MD), GROMOS (MD), LAMMPS (MD), NAMD (MD), TeraChem (QC), VMD (visualization of MD & QC)
Docking: GPU AutoDock
Sequence analysis: CUDASW++ (Smith-Waterman), MUMmerGPU, GPU-HMMER, CUDA-MEME motif discovery

Recommended Hardware Configurations

Tesla Personal Supercomputer: up to 4 Tesla C1060s per workstation, 4 GB main memory per GPU
Tesla GPU clusters: Tesla S1070 1U (4 GPUs per 1U) or integrated CPU-GPU server (2 GPUs + 2 CPUs per 1U)

Specifics at http://www.nvidia.com/bio_workbench

Molecular Dynamics and Quantum Chemistry Applications

AMBER (MD), ACEMD (MD), GROMACS (MD), HOOMD (MD), LAMMPS (MD), NAMD (MD), TeraChem (QC), VMD (Viz. MD & QC)

Typical speed-ups of 3-8× on a single Tesla C1060 vs a modern 1U server; some (compute-bound) applications show 20-100× speed-ups

Usage of the TeraGrid National Supercomputing Grid

[Chart: these application classes account for around half of the cycles used on TeraGrid.]

Summary

‘Fermi’ debuts HPC/enterprise features, particularly ECC and high-performance double precision

Software development environments are now more mature. A significant software ecosystem is starting to emerge: broadening availability of development tools, libraries and applications, and heterogeneous (GPU)-aware cluster management systems

Driven by economics, open standards and improving programming methodologies, heterogeneous computing is gradually changing the long-held perception that it is just an ‘exotic’ niche technology

Questions?

Supporting Slides

AMBER Molecular Dynamics

Roadmap: Alpha (now), Beta release (Q1 2010), Beta 2 release (Q2 2010). Features: Generalized Born, PME (Particle Mesh Ewald), multi-GPU + MPI support.

[Chart: Generalized Born simulations, showing 7× and 8.6× speed-ups. Implicit-solvent GB results: 1 Tesla GPU is 8× faster than 2 quad-core CPUs.]

More Info http://www.nvidia.com/object/amber_on_tesla.html Data courtesy of San Diego Supercomputing Center

GROMACS Molecular Dynamics

Roadmap: Beta (now), Beta 2 release (Q2 2010). Features: Particle Mesh Ewald (PME), implicit solvent GB, arbitrary forms of non-bonded interactions, multi-GPU + MPI support.

[Chart: GROMACS on a Tesla GPU vs CPU, comparing reaction-field cutoffs (22×) and Particle Mesh Ewald (3.5×, 5.2×). PME results: 1 Tesla GPU is 3.5x-4.7x faster than the CPU.]

More Info http://www.nvidia.com/object/gromacs_on_tesla.html Data courtesy of Stockholm Center for Biomembrane Research

HOOMD Blue Molecular Dynamics

Written bottom-up for CUDA GPUs; modeled after LAMMPS; supports multiple GPUs

1 Tesla GPU outperforms 32 CPUs running LAMMPS

More Info http://www.nvidia.com/object/hoomd_on_tesla.html Data courtesy of University of Michigan

LAMMPS: Molecular Dynamics on a GPU Cluster

Available as a beta on CUDA: cut-off based non-bonded terms and PME-based electrostatics

Preliminary results: 5× speed-up, 2 GPUs outperform 24 CPUs. Multiple GPU + MPI support enabled.

More Info http://www.nvidia.com/object/lammps_on_tesla.html Data courtesy of Scott Hampton & Pratul K. Agarwal, Oak Ridge National Laboratory

NAMD: Scaling Molecular Dynamics on a GPU Cluster

Feature complete on CUDA: available in NAMD 2.7 Beta 2. Full electrostatics with PME, multiple time-stepping, 1-4 exclusions. 4 GPUs = 16 CPUs; a 4-GPU Tesla Personal Supercomputer outperforms 8 CPU servers.

Scales to a GPU cluster

More Info http://www.nvidia.com/object/namd_on_tesla.html

Data courtesy of Theoretical and Computational Biophysics Group, UIUC

TeraChem: Quantum Chemistry Package for GPUs

Roadmap: Beta (now) with HF and Kohn-Sham DFT and multiple-GPU support; full release with MPI support in Q1 2010

The first QC software written from the ground up for GPUs

4 Tesla GPUs outperform 256 quad-core CPUs

More Info http://www.nvidia.com/object/terachem_on_tesla.html

VMD: Acceleration using CUDA GPUs

Several CUDA applications in VMD 1.8.7: molecular orbital display, Coulomb-based ion placement, implicit ligand sampling. Speed-ups: 20x-100x.

Multiple GPU support enabled

More Info http://www.nvidia.com/object/vmd_on_tesla.html Images and data courtesy of Beckman Institute for Advanced Science and Technology, UIUC

GPU-HMMER: Protein Sequence Alignment

Protein sequence alignment using profile HMMs. Available now; supports multiple GPUs.

Speed-ups range from 60-100x faster than the CPU

Download http://www.mpihmmer.org/releases.htm

MUMmerGPU: Genome Sequence Alignment

High-throughput pair-wise local sequence alignment Designed for large sequences

Drop-in replacement for “mummer” component in MUMmer software

Speed-ups of 3.5x to 3.75x. Download: http://mummergpu.sourceforge.net
