GPU Based Cloud Computing
Dairsie Latimer, Petapath, UK

© NVIDIA Corporation 2010

About Petapath

- Founded in 2008 to focus on delivering innovative hardware and software solutions into the high performance computing (HPC) markets
- Partnered with HP and SGI to deliver two Petascale prototype systems as part of the PRACE WP8 programme
- The system is a testbed for new ideas in usability, scalability and efficiency of large computer installations
- Active in exploiting emerging standards for acceleration technologies; members of the Khronos Group, sitting on the OpenCL working committee
- We also provide consulting expertise for companies wishing to explore the advantages offered by heterogeneous systems

What is Heterogeneous or GPU Computing?

[Diagram: x86 CPU connected to a GPU over the PCIe bus]

Computing with CPU + GPU: Heterogeneous Computing

Low Latency or High Throughput?

CPU
- Optimised for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution

GPU
- Optimised for data-parallel, throughput computation
- Architecture tolerant of memory latency
- More transistors dedicated to computation

NVIDIA GPU Computing Ecosystem

[Diagram labels: ISV, CUDA Training Company, CUDA Development Specialist, TPP / OEM, Hardware Architect, VAR, GPU Architecture, CUDA SDK & Tools, Customer Application, Customer Requirements, NVIDIA Hardware Solutions, Hardware Architecture, Deployment]

Science is Desperate for Throughput

[Chart: gigaflops required (log scale from 1 to 1,000,000,000) against year (1982, 1997, 2003, 2006, 2010, 2012). 1,000,000 gigaflops = 1 petaflop; 1,000,000,000 gigaflops = 1 exaflop. Annotation: "Ran for 8 months to simulate 2 nanoseconds".]

Power Crisis in Supercomputing

[Chart: household power equivalent against year (1982, 1996, 2008, 2020)]

- Gigaflop: Block
- Teraflop: Neighborhood
- Petaflop: Town
- Exaflop: City

Enter the GPU

NVIDIA GPU Product Families
- GeForce®: Entertainment
- Tesla™: High-Performance Computing
- Quadro®: Design & Creation

NEXT-GENERATION GPU ARCHITECTURE: 'FERMI'

Introducing the 'Fermi' Tesla Architecture

The Soul of a Supercomputer in the body of a GPU

- 3 billion transistors
- Up to 2× the cores (C2050 has 448)
- Up to 8× the peak DP performance
- ECC on all memories
- L1 and L2 caches
- Improved memory bandwidth (GDDR5)
- Up to 1 Terabyte of GPU memory
- Concurrent kernels
- Hardware support for C++

Design Goal of Fermi

- Expand performance sweet spot of the GPU
- Bring more users, more applications to the GPU

[Diagram: spectrum from Instruction Parallel (Many Decisions) to Data Parallel (Large Data Sets)]

Streaming Multiprocessor Architecture

- 32 CUDA cores per SM (512 total)
- 8× peak double precision floating point performance
    - 50% of peak single precision
- Dual Thread Scheduler
- 64 KB of RAM for shared memory and L1 cache (configurable)

[Diagram labels: Load/Store Units × 16, Special Func Units × 4]

CUDA Core Architecture

- New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
- Fused multiply-add (FMA) instruction for both single and double precision
- New integer ALU optimized for 64-bit and extended precision operations

[Diagram labels: FP Unit, INT Unit, Load/Store Units × 16, Special Func Units × 4]

Cached Memory Hierarchy

- First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
- L1 Cache per SM (32 cores)
    - Improves bandwidth and reduces latency
- Unified L2 Cache (768 KB)
    - Fast, coherent data sharing across all cores in the GPU

[Diagram: Parallel DataCache™ Memory Hierarchy]

Larger, Faster, Resilient Memory Interface

- GDDR5 memory interface
    - 2× signaling speed of GDDR3
- Up to 1 Terabyte of memory attached to GPU
    - Operate on larger data sets (3 and 6 GB cards)
- ECC protection for GDDR5 DRAM
- All major internal memories are ECC protected
    - Register file, L1 cache, L2 cache

GigaThread Hardware Thread Scheduler

GigaThread Streaming Data Transfer Engine

- Dual DMA engines
- Simultaneous CPU→GPU and GPU→CPU data transfer
- Fully overlapped with CPU and GPU processing time
- Activity snapshot:

[Diagram: Kernel 0 to Kernel 3 on a timeline, each overlapped with transfers on the two streaming data transfer engines, SDT0 and SDT1]

Enhanced Software Support

- Many new features in CUDA Toolkit 3.0
    - To be released on Friday
- Including early support for the Fermi architecture:
    - Native 64-bit GPU support
    - Multiple Copy Engine support
    - ECC reporting
    - Concurrent Kernel Execution
    - Fermi HW debugging support in cuda-gdb

Enhanced Software Support

- OpenCL 1.0 Support
    - First class language citizen in CUDA Architecture
    - Supports ICD (so interoperability between vendors is a possibility)
    - Profiling support available
    - Debug support coming to Parallel Nsight (NEXUS) soon
- gDebugger CL from graphicREMEDY
    - Third party OpenCL profiler/debugger/memory checker
- Software Tools Ecosystem is starting to grow
    - Given boost by existence of OpenCL

"Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is 'expected to be 10-times more powerful than today's fastest supercomputer.' Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 PFlops… …we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range."
September 30 2009

NVIDIA TESLA PRODUCTS

Tesla GPU Computing Products: 10 Series

Product                          GPUs           Single precision   Double precision   Memory
SuperMicro 1U GPU SuperServer    2 Tesla GPUs   1.87 Teraflops     156 Gigaflops      8 GB (4 GB / GPU)
Tesla S1070 1U System            4 Tesla GPUs   4.14 Teraflops     346 Gigaflops      16 GB (4 GB / GPU)
Tesla C1060 Computing Board      1 Tesla GPU    933 Gigaflops      78 Gigaflops       4 GB
Tesla Personal Supercomputer     4 Tesla GPUs   3.7 Teraflops      312 Gigaflops      16 GB (4 GB / GPU)

Tesla GPU Computing Products: 20 Series

Product                          GPUs           Double precision       Memory
Tesla S2050 1U System            4 Tesla GPUs   2.1 – 2.5 Teraflops    12 GB (3 GB / GPU)
Tesla S2070 1U System            4 Tesla GPUs   2.1 – 2.5 Teraflops    24 GB (6 GB / GPU)
Tesla C2050 Computing Board      1 Tesla GPU    500+ Gigaflops         3 GB
Tesla C2070 Computing Board      1 Tesla GPU    500+ Gigaflops         6 GB

HETEROGENEOUS CLUSTERS

Data Centers: Space and Energy Limited

Traditional Data Center Cluster
- Quad-core CPU, 8 cores per server
- 1000's of cores, 1000's of servers
- 2x performance requires 2x number of servers

Heterogeneous Data Center Cluster
- 10,000's of cores, 100's of servers
- Augment/replace host servers

Cluster Deployment

- Now a number of GPU aware Cluster Management Systems
    - ActiveEon ProActive Parallel Suite® Version 4.2
    - Platform Cluster Manager and HPC Workgroup
    - Streamline Computing GPU Environment (SCGE)
- Not just installation aids
    - i.e. putting the driver and toolkits in the right place
    - now starting to provide GPU node discovery and job steering
- NVIDIA and Mellanox
    - Better interoperability between Mellanox InfiniBand adapters and NVIDIA Tesla GPUs
    - Can provide as much as a 30% performance improvement by eliminating unnecessary data movement in a multi-node heterogeneous application

Cluster Deployment

- A number of cluster and distributed debug tools now support CUDA and NVIDIA Tesla
- Allinea® DDT for NVIDIA CUDA
    - Extends the well-known Distributed Debugging Tool (DDT) with CUDA support
- TotalView® debugger (part of an Early Experience Program)
    - Extended with CUDA support; the vendor has also announced its intention to support OpenCL
- Both based on the Parallel Nsight (NEXUS) Debugging API

NVIDIA Reality Server 3.0

- Cloud computing platform for running 3D web applications
- Consists of a Tesla RS GPU-based server cluster running RealityServer software from mental images
- Deployed in a number of different sizes
    - From 2 to 100's of 1U servers
- iray® - Interactive Photorealistic Rendering Technology
    - Streams interactive 3D applications to any web connected device
    - Designers and architects can now share and visualize complex 3D models under different lighting and environmental conditions

DISTRIBUTED COMPUTING PROJECTS

Distributed Computing Projects

- Traditional (non-commercial) distributed computing projects have been making use of GPUs for some time
- Typically have 1,000's to 10,000's of contributors
    - Folding@Home has access to 6.5 PFLOPS of compute
    - Of which ~95% comes from GPUs or PS3s
- Many are bio-informatics, molecular dynamics and quantum chemistry codes
    - Represent the current sweet spot applications
- Ubiquity of GPUs in home systems helps

Distributed Computing Projects

- Folding@Home
    - Directed by Prof. Vijay Pande at Stanford University (http://folding.stanford.edu/)
    - Most recent GPU3 Core based on OpenMM 1.0 (https://simtk.org/home/openmm)
- OpenMM library provides tools for molecular modeling simulation
    - Can be hooked into any MM application, allowing that code to do molecular modeling with minimal extra effort
    - OpenMM has a strong emphasis on hardware acceleration providing not just a consistent API, but much