Revolutionizing High Performance Computing / NVIDIA® Tesla™

Table of Contents

GPUs Are Revolutionizing Computing
GPU Architecture and Developer Tools
Accelerate Your Code with OpenACC Directives
GPU Acceleration for Science and Research
GPU Acceleration for MATLAB Users
NVIDIA GPUs for Designers and Engineers
GPUs Accelerate CFD and Structural Mechanics
MSC Nastran 2012, CST and Weather Codes
Tesla™ Bio Workbench: AMBER on GPUs
GPU-Accelerated Bio-informatics
GPUs Accelerate Molecular Dynamics and Medical Imaging
GPU Computing Case Studies: Oil and Gas
GPU Computing Case Studies: Finance and Supercomputing
Tesla GPU Computing Solutions
GPU Computing Solutions: Tesla C
GPU Computing Solutions: Tesla Modules

The need for computation in the high performance computing (HPC) industry is increasing, as large and complex computational problems become commonplace across many industry segments. Traditional CPU technology, however, is no longer capable of scaling in performance sufficiently to address this demand.

The parallel processing capability of the GPU allows it to divide complex computing tasks into thousands of smaller tasks that can be run concurrently. This ability is enabling computational scientists and researchers to address some of the world’s most challenging computational problems up to several orders of magnitude faster.


The World's Top 5 Supercomputers

[Chart: the world's top five supercomputers – "K" Computer (8,190 TFLOPS), Tianhe-1A (2,570 TFLOPS), Jaguar (1,760 TFLOPS), Nebulae (1,270 TFLOPS) and Tsubame 2.0 (1,190 TFLOPS) – compared by cost effectiveness (TFLOPS per $1 million of investment) and power consumption in megawatts. Supercomputers built with NVIDIA GPUs are distinguished from CPU-only systems. All data is taken from the Top500 list and media announcements open to the public.]

Revolutionizing High Performance Computing

The high performance computing (HPC) industry's need for computation is increasing, as large and complex computational problems become commonplace across many industry segments. Traditional CPU technology, however, is no longer capable of scaling in performance sufficiently to address this demand.

The parallel processing capability of the GPU allows it to divide complex computing tasks into thousands of smaller tasks that can be run concurrently. This ability is enabling computational scientists and researchers to address some of the world's most challenging computational problems up to several orders of magnitude faster.

This advancement represents a dramatic shift in HPC. In addition to dramatic improvements in speed, GPUs also consume less power than conventional CPU-only clusters. GPUs deliver performance increases of 10x to 100x to solve problems in minutes instead of hours, while outpacing the performance of traditional computing with x86-based CPUs alone.

"We not only created the world's fastest supercomputer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU; this is a new innovation."
Premier Wen Jiabao, People's Republic of China

GPU Supercomputing – Green HPC

GPUs significantly increase overall system efficiency as measured by performance per watt. "Top500" supercomputers based on heterogeneous architectures are, on average, almost three times more power-efficient than non-heterogeneous systems. This is also reflected in the Green500 list, the icon of eco-friendly supercomputing.

"The rise of GPU supercomputers on the Green500 signifies that heterogeneous systems, built with both GPUs and CPUs, deliver the highest performance and unprecedented energy efficiency," said Wu-chun Feng, founder of the Green500 and associate professor of Computer Science at Virginia Tech.

WHY GPU COMPUTING?

With the ever-increasing demand for more computing performance, systems based on CPUs alone can no longer keep up. CPU-only systems can only get faster by adding thousands of individual cores; this method consumes too much power and makes supercomputers very expensive. A different strategy is parallel computing, and the HPC industry is moving toward a hybrid computing model, where GPUs and CPUs work together to perform general-purpose computing tasks.

As parallel processors, GPUs excel at tackling large amounts of similar data because the problem can be split into hundreds or thousands of pieces and calculated simultaneously. As sequential processors, CPUs are not designed for this type of computation, but they are adept at more serial tasks such as running operating systems and organizing data. NVIDIA believes in applying the most relevant processor to the specific task in hand: GPUs deliver superior performance in parallel computation.

From climate modeling to advances in medical tomography, NVIDIA® Tesla™ GPUs are enabling a wide variety of segments in science and industry to progress in ways that were previously impractical, or even impossible, due to technological limitations.

Figure 1: Co-processing refers to the use of an accelerator, such as a GPU, to offload the CPU to increase computational efficiency.

Figure 2: The new computing model includes both a multi-core CPU (multiple cores) and a GPU (hundreds of cores).

[Chart: core comparison between a CPU and a GPU – double-precision gigaflops per second, 2007 to 2012, CPU vs GPU.]
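To make the data-parallel model concrete, the short CUDA C sketch below (an illustrative example, not taken from the brochure) splits a loop over a large array across thousands of GPU threads, each handling one element:

#include <cuda_runtime.h>

// Each thread scales one element: the loop over n elements is split
// across thousands of concurrent GPU threads.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;                        // one million elements
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));     // placeholder input data

    // Launch enough 256-thread blocks to cover all n elements.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}

The same pattern, one thread per data element, is what lets the GPU apply hundreds of cores to a single problem at once.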


GPU Architecture and Developer Tools

On the next-generation CUDA parallel computing architecture, codenamed "Fermi": "I believe history will record Fermi as a significant milestone."
Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley; co-author of "Computer Architecture: A Quantitative Approach"

CUDA DEVELOPER ECOSYSTEM

In just a few years, an entire software ecosystem has developed around the CUDA architecture – from more than 400 universities worldwide teaching the CUDA programming model, to a wide range of libraries, compilers, and middleware that help users optimize applications for GPUs. This rich ecosystem has led to faster discovery and simulation in a wide range of fields including mathematics, life sciences, and manufacturing.

[Diagram: the CUDA ecosystem – research & education, mathematical libraries and packages, integrated development environments, languages & APIs, consultants, training & certification, tools & partners, and all major platforms.]

CUDA PARALLEL COMPUTING ARCHITECTURE

CUDA® is NVIDIA's parallel computing architecture and enables applications to run large, parallel workloads on NVIDIA GPUs. Applications that leverage the CUDA architecture can be developed in a variety of languages and APIs, including C, C++, Fortran, OpenCL, and DirectCompute. The CUDA architecture contains hundreds of cores capable of running many thousands of parallel threads, while the CUDA programming model lets programmers focus on parallelizing their algorithms and not the mechanics of the language.

The current generation CUDA architecture, codenamed "Fermi", is the most advanced GPU computing architecture ever built. With over three billion transistors, Fermi is making GPU and CPU co-processing pervasive by addressing the full spectrum of computing applications. With support for C++, GPUs based on the Fermi architecture make parallel processing easier and accelerate performance on a wider array of applications than ever before. Just a few applications that can experience significant performance benefits include ray tracing, finite element analysis, high-precision scientific computing, sparse linear algebra, sorting, and search algorithms.

[Diagram: GPU computing applications sit on top of libraries and middleware, language solutions (C, C++, Fortran, Java, Python), and device-level APIs (OpenCL, DirectCompute), all built on the NVIDIA GPU and the CUDA parallel computing architecture, a combination of hardware and software.]

Locate a CUDA teaching center near you: research.nvidia.com/content/cuda-courses-map
Request a CUDA education course at: www.parallel-compute.com

DEVELOPER TOOLS

NVIDIA Parallel Nsight: Development Environment for Visual Studio
NVIDIA Parallel Nsight software is the industry's first development environment for massively parallel computing integrated into Microsoft Visual Studio, the world's most popular development environment for Windows-based applications and services. It integrates CPU and GPU development, allowing developers to create optimal GPU-accelerated applications. Parallel Nsight supports Microsoft Visual Studio 2010, advanced debugging and analysis capabilities, as well as Tesla 20-series GPUs.

Request hands-on Parallel Nsight training: www.parallel-compute.com
For more information, visit: nvidia.com/object/nsight.html

GPU-ACCELERATED LIBRARIES

Take advantage of the massively parallel computing power of the GPU by using the GPU-accelerated versions of your existing libraries. Some examples include:

EM Photonics CULA
A GPU-accelerated linear algebra (LA) library that dramatically improves the performance of sophisticated mathematics.

MAGMA
A collection of open-source, next-generation linear algebra libraries for heterogeneous GPU-based architectures supporting interfaces to current LA packages and standards (e.g. LAPACK and BLAS).

RogueWave IMSL
A comprehensive set of mathematical and statistical functions for Fortran applications to take advantage of GPU acceleration.

NVIDIA Math Libraries
A collection of GPU-accelerated libraries, including FFT, BLAS, sparse matrix operations, RNG, performance primitives for image/signal processing, and the Thrust C++ library of high-performance templated algorithms, that all deliver significant speedups when compared to CPU-only libraries. These highly optimized libraries are free of charge in the NVIDIA® CUDA® Toolkit available at
www.nvidia.com/getcuda
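As an illustration of this drop-in library approach, the sketch below (illustrative only; the signal length and contents are placeholders) uses the cuFFT API from the CUDA Toolkit to run a 1D complex-to-complex FFT entirely on the GPU:

#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 4096;
    cufftComplex *d_signal;
    cudaMalloc(&d_signal, n * sizeof(cufftComplex));
    cudaMemset(d_signal, 0, n * sizeof(cufftComplex));   // placeholder input signal

    // Create a 1D complex-to-complex plan and run the forward transform in place.
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_signal);
    return 0;
}

Building with nvcc and linking against the cuFFT library (for example with -lcufft) is assumed; no hand-written FFT kernel is needed.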


Accelerate Your Code Easily with OpenACC Directives: Get 2x Speed-up in 4 Weeks or Less

Accelerate your code with directives and tap into the hundreds of computing cores in GPUs. With directives, you simply insert compiler hints into your code and the compiler will automatically map compute-intensive portions of your code to the GPU.

OpenACC is:
• Easy: simply insert hints in your codebase
• Open: run the single codebase on either the CPU or GPU
• Powerful: tap into the power of GPUs within hours

The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators. The directives and programming model defined in this document allow programmers to create high-level host+accelerator programs without the need to explicitly initialize the accelerator, manage data or program transfers between the host and accelerator, or initiate accelerator startup and shutdown.

Coming in Q1 2012, OpenACC is supported by industry parallel computing tool leaders: Cray, CAPS, and The Portland Group (PGI).

"I have written micromagnetic codes (written in Fortran 90) to study the properties of two and three dimensional magnetic systems. The directives approach enabled me to port my existing code with ease to perform my computations on the GPU, which resulted in a significant speedup (more than 20 times) of the computation."
Professor M. Amin Kayali, University of Houston

By starting with a free, 30-day trial of PGI directives today, you are working on the technology that is the foundation of the OpenACC directives standard. For a free 30-day trial of the directives approach, register now: www.nvidia.eu/openacc

DIRECTIVE-BASED SOLUTIONS

Directives allow you to quickly add GPU acceleration to existing code; directive-based compilers are available from CAPS (HMPP) and The Portland Group (the PGI compiler). The code sample below shows the approach: a simple pi calculation (the classic directives example) is off-loaded to the GPU with a single directive.

#include <stdio.h>
#define N 1000000

int main()
{
    double pi = 0.0;
    long i;

    /* One directive asks the compiler to map the loop to the GPU. */
    #pragma acc region
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);   /* midpoint of the i-th interval */
        pi += 4.0 / (1.0 + t * t);            /* numerical integration of 4/(1+t^2) */
    }

    printf("pi = %f\n", pi / N);
    return 0;
}


GPU Acceleration for Science and Research

GPU speed-ups reported across science and research:
• Astrophysics (RIKEN): 100x
• Gene Sequencing (University of Maryland): 30x
• Molecular Dynamics (University of Illinois at Urbana-Champaign): 36x
• MATLAB Computing (AccelerEyes): 50x
• Weather Modeling (Tokyo Institute of Technology): 80x
• Medical Imaging (University of Utah): 146x
• Financial Simulation (Oxford University): 149x

"NVIDIA's Tesla Fermi card includes almost everything that scientists ever wanted."
CT magazine (Germany)


GPU Acceleration for MATLAB Users
MATLAB Acceleration on Tesla GPUs

NVIDIA and MathWorks have collaborated to deliver the power of GPU computing for MATLAB users. Available through the latest release of MATLAB 2010b, NVIDIA GPU acceleration enables faster results for users of the Parallel Computing Toolbox and MATLAB Distributed Computing Server. MATLAB supports NVIDIA® CUDA™-enabled GPUs with compute capability version 1.3 or higher, such as Tesla™ 10-series and 20-series GPUs. MATLAB CUDA support provides the base for GPU-accelerated MATLAB operations and lets you integrate your existing CUDA kernels into MATLAB applications.

[Image courtesy of the Maryland CPU-GPU Cluster team.]

The latest release of Parallel Computing Toolbox and MATLAB Distributed Computing Server takes advantage of the CUDA parallel computing architecture to provide users the ability to:
• Manipulate data on NVIDIA GPUs
• Perform GPU-accelerated MATLAB operations
• Integrate users' own CUDA kernels into MATLAB applications
• Compute across multiple NVIDIA GPUs by running multiple MATLAB workers with Parallel Computing Toolbox on the desktop and MATLAB Distributed Computing Server on a compute cluster

[Chart: MATLAB performance with Tesla – potential field extrapolation on a magnetogram (NX = 994 px, NY = 484 px), GPU time with Jacket on NVIDIA Tesla C2050 (6 GPUs) vs CPU time with MATLAB PCT on AMD Athlon 64 X2 5200 (6 cores), up to 43x faster.]
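As a sketch of the "integrate your own CUDA kernels" capability listed above (the kernel name and file here are hypothetical), a simple element-wise CUDA kernel such as the following can be compiled to PTX with nvcc and then loaded from MATLAB through the Parallel Computing Toolbox:

// times2.cu: each thread doubles one element of the input array.
// Compile to PTX for use from MATLAB with: nvcc -ptx times2.cu
extern "C" __global__ void times2(double *out, const double *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0 * in[i];
}

In MATLAB, the resulting PTX file would typically be wrapped with the Parallel Computing Toolbox's CUDAKernel object and evaluated on gpuArray data, so the kernel runs on the Tesla GPU while the rest of the script stays in ordinary MATLAB.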

GPU Computing in MATLAB with AccelerEyes Jacket

Jacket includes many key features to deliver results on full applications:
• Over 500 functions, including math, signal processing, image processing, and statistics
• Specialized FOR loops to run many iterations simultaneously
• An optimized runtime to optimize memory bandwidth and kernel configurations
• Integrate users' own CUDA kernels into MATLAB via the Jacket SDK
• Compute across multiple NVIDIA GPUs via Jacket MGL and HPC tools (server products)

With Jacket programming, the MATLAB community can enjoy GPU acceleration with an easy, high-level interface.

[Chart: relative execution speed, point-in-polygon demo, compared to a single-core CPU baseline (Core 2 Quad Q6600 2.4 GHz, 6 GB RAM, Windows 7 64-bit, Tesla C1060, single-precision operations) for problem sizes 1024 to 65536: single-core CPU, single-core CPU + Tesla C1060, quad-core CPU, quad-core CPU + Tesla C1060, up to 22x. Demo: http://www.mathworks.com/products/distriben/demos.html?file=/products/demos/distribtb/MapDemo/MapDemo.html]

TESLA BENEFITS
Highest Computational Performance
• High-speed double precision operations
• Large dedicated memory
• High-speed bi-directional PCIe communication
• NVIDIA GPUDirect™ with InfiniBand
Most Reliable
• ECC memory
• Rigorous stress testing
Best Supported
• Professional support network
• OEM system integration
• Long-term product lifecycle
• 3 year warranty
• Cluster & system management tools (server products)
• Windows remote desktop support

RECOMMENDED TESLA & QUADRO CONFIGURATIONS
High-End Workstation: two Tesla C2070/C2075 GPUs, Quadro 2000, two quad-core CPUs, 24 GB system memory
Mid-Range Workstation: Tesla C2070/C2075 GPU, Quadro 400, quad-core CPU, 12 GB system memory
Entry Workstation: Tesla C2070/C2075 GPU, Quadro 400, single quad-core CPU, 6 GB system memory


NVIDIA GPUs for Designers and Engineers

"A new feature in ANSYS Mechanical leverages graphics processing units to significantly lower solution times for large analysis problem sizes."
Jeff Beisheim, Senior Software Developer, ANSYS, Inc.

Speed Up ANSYS Simulations with a GPU

How much more could you accomplish if simulation times could be reduced from one day to just a few hours? As an engineer, you depend on ANSYS Mechanical to design high quality products efficiently. To get the most out of ANSYS Mechanical 14.0, simply upgrade your Quadro GPU or add a Tesla GPU to your workstation, or configure a server with Tesla GPUs, and instantly unlock the highest levels of ANSYS simulation performance.

With ANSYS® Mechanical™ 14.0 and NVIDIA® Professional GPUs, you can:
• Improve product quality with 2x more design simulations
• Accelerate time-to-market by reducing engineering cycles
• Develop high fidelity models with practical solution times

ACTIVATE A GPU WITH ANSYS HPC PACK
To unlock the GPU feature in ANSYS Mechanical 14.0, you must have an ANSYS HPC Pack license, the same scheme also required for going parallel on more than 2 CPU cores. For academic license users, the GPU capability is included with the base ANSYS Academic license that provides access to ANSYS Mechanical, and no add-on Academic HPC licenses are required.

Future Directions
As GPU computing trends evolve, ANSYS will continue to enhance its offerings as necessary for a variety of simulation products. Certainly, performance improvements will continue as GPUs become computationally more powerful and extend their functionality to other areas of ANSYS software.

[Chart: relative performance of ANSYS Mechanical 14.0 (Direct Sparse solver) – ANSYS Base License on 2 cores (1.00) vs ANSYS HPC Pack on 8 cores vs ANSYS HPC Pack on 6 cores + GPU. NOTE: invest 38% more over the base license for more than 4x. Results from an HP Z800 workstation, 2x X5670 2.93 GHz, 48 GB memory, CentOS 5.4 x64; Tesla C2075, CUDA 4.0.17.]

Recommended Configurations
Workstation: Tesla C2075 + Quadro 2000, or Quadro 6000; 48 GB system memory
Server: 2x Tesla M2090, dual-socket CPU, 128 GB system memory

JOIN THE RENDERING REVOLUTION AND MAXIMIZE YOUR SPEED TO PHOTOREALISM WITH IRAY AND NVIDIA FERMI-CLASS GPUs

The iray workflow is further enhanced with massive acceleration from NVIDIA Graphics Processing Units (GPUs) based on the NVIDIA® CUDA™ architecture. The massive processing power and large frame buffers of the new Fermi-based Quadro and Tesla GPUs dramatically decrease time to photorealistic results. While iray produces identical images on either CPUs or GPUs, 3ds Max users will enjoy up to 6x faster results over a quad-core CPU² when using an NVIDIA® Quadro® 5000 or NVIDIA® Tesla™ C2050. Designers looking for the fastest iray results can further boost their speed by adding additional NVIDIA GPUs to their system.

Recommended Configurations
IRAY SPECIALIST: Quadro 5000 + Tesla C2050
> The visualization specialist wanting to increase the performance of iray
> Delivers up to 10x iray speed-up over a quad-core CPU
IRAY EXPERT: Quadro 5000 + (3) Tesla C2050
> The visualization expert needing the fastest possible iray results
> Delivers a blazing +20x iray speed-up over a quad-core CPU

[Chart: 3ds Max iray rendering benchmark³ – GPU speed-up over a quad-core CPU ranging from 6.69x to 12.47x across Tesla C2050, dual Quadro 5000, Quadro 5000 & Tesla C2050, and dual Tesla C2050 configurations, shown together with relative solution cost.]

¹ 3ds Max 2011 subscription content required
² 3ds Max 2011 64-bit on Win 7 64-bit with 8 GB of system memory using a Quadro 5000 or Tesla C2050 vs. an Intel® Q9300 quad-core processor
³ Benchmark testing conducted on an Intel Core 2 Quad (Q9300) @ 2.5 GHz with 8 GB system memory and Windows 7 64-bit. Average results recorded across several 3ds Max scenes.

Watch video*: Visualise as never before with NVIDIA Maximus – NVIDIA CEO Jen-Hsun Huang's keynote at SC11
* Use your phone, smartphone or tablet PC with QR reader software to read the QR code.

GPUs Accelerate CFD and Structural Mechanics

Fast Flow Simulations Directly within RTT DeltaGen

LBultra Plug-in for RTT DeltaGen
The flow simulation software LBultra works particularly fast on graphics processing units (GPUs). As a plug-in, it is tightly integrated into the high-end 3D visualization software RTT DeltaGen by RTT. As a consequence, flow simulation computing can proceed directly within RTT DeltaGen. Due to this coupling, the designer can do a direct flow simulation of their latest design draft, enabling a parallel check of the aerodynamic features of the vehicle's design draft.

Optimum Performance through Early Involvement of Aerodynamics into DeltaGen: Improved Design
First of all, a certain scenario is selected, such as analyzing a spoiler or an outside mirror. Next, various simulation parameters and boundary conditions such as flow rates or resolution levels are set, which also influence the calculation time and the result's accuracy. After determining the overall simulation, the geometry of the design is handed over to LBultra and the simulation is started. While the simulation is running, data is being visualized in real time in RTT DeltaGen. In addition, there is an opportunity of exporting simulation results. These results may then, for instance, be further analyzed in more detail by experts using highly-capable fluid mechanics specialist programs and tools.

[Chart: relative performance vs an Intel quad-core Xeon CPU – 20x faster with 1 Tesla C2070 GPU, 75x faster with 3 Tesla C2070 GPUs.]

SIMULIA Abaqus/Standard
As products get more complex, the task of innovating with confidence has become ever more difficult for product engineers. Engineers rely on Abaqus to understand the behavior of complex assemblies or of new materials. Recent performance studies conducted together with SIMULIA engineers demonstrated that Tesla C2070 GPU acceleration of an Intel Xeon X5680 (Westmere) CPU is greater than 2.6x with 2 cores and greater than 2x with 4 cores.

Tesla and Quadro GPU computing products are designed to deliver the highest computational performance with the most reliable numerical accuracy, and are available and supported by the world's leading professional system manufacturers.

[Chart: Abaqus/Standard 6.11 results, s4b engine model, time in seconds (lower is better) – Xeon X5680 3.33 GHz Westmere (dual socket) alone vs. with a Tesla C2070. Adding the GPU delivers roughly 2.6x at 2 cores, 2.0x at 4 cores, 1.7x at 6 cores, and 1.5x at 8 cores.]

Recommended Configurations
Workstation: Tesla C2070, dual-socket quad-core CPU, 48 GB system memory
Server: 2x Tesla M2090, dual-socket quad-core CPU, 128 GB system memory


MSC Nastran 2012, CST and Weather Codes

MSC Nastran 2012
5x performance boost with a single GPU over a single core; more than 1.5x with 2 GPUs over 8 cores.
• The Nastran direct equation solver is GPU accelerated:
  – Real, symmetric sparse direct factorization
  – Handles very large fronts with minimal use of pinned host memory
  – Impacts SOL101, SOL103 and SOL400, which are dominated by MSCLDL factorization times
  – More of Nastran (SOL108, SOL111) will be moved to the GPU in stages
• Support of multi-GPU, for both Linux and Windows:
  – With DMP > 1, multiple fronts are factorized concurrently on multiple GPUs; 1 GPU per matrix domain
  – Supported NVIDIA GPUs include Tesla 20-series and Quadro 6000
  – CUDA 4.0 and above

[Chart: SOL101, 3.4M DOF, total and solver times on fs0* – 1 core, 1 GPU + 1 core, 2 GPUs + 2 cores (DMP=2), 4 cores (SMP) and 8 cores (DMP=2); GPU runs show speed-ups of roughly 1.6x to 5.6x. *fs0 = NVIDIA PSG cluster node: Linux, 96 GB memory, 2.2 TB SATA 5-way striped RAID, Tesla C2050, Nehalem 2.27 GHz.]

NVIDIA GPUs ACCELERATE WEATHER PREDICTION

ASUCA (Weather Modeling): Japan's Terascale Weather Simulation
Regional weather forecasting demands fast simulation over fine-grained grids, resulting in extremely memory-bottlenecked computation. ASUCA is the first high-resolution weather prediction model ported fully to CUDA. ASUCA is a next-generation, production weather code developed by the Japan Meteorological Agency, similar to WRF in the underlying physics (non-hydrostatic model).

[Chart: ASUCA performance in TFLOPS vs number of GPUs / CPU cores – GPU (single precision), GPU (double precision) and CPU (double precision).]

NOAA NIM: Exploring GPU Computing to Speed Up and Refine Weather Forecasting
The Earth System Research Lab in the National Oceanic & Atmospheric Administration (NOAA) of the United States has developed a next-generation global model to more accurately and efficiently forecast weather. The Non-hydrostatic Icosahedral Model (NIM) uses the icosahedral horizontal grid and is designed to run on thousands of processors, including Tesla GPUs. The NIM dynamics package has been ported over to CUDA for a single-GPU implementation. NOAA is actively working on the code to run parallel jobs on multiple GPUs.

The benchmark below shows a ~20x improvement in computation time by using GPUs compared to a 1-core Nehalem Intel CPU. Due to the significant computational speedup with GPUs, inter-process communication now becomes the bottleneck in simulation.

[Chart: 20x faster simulations with GPUs – input/output time, CPU compute time, GPU compute time and inter-process communication for CPU-only vs GPU runs. Source: http://www.esrl.noaa.gov/research/events/espc/8sep2010/Govett_ESRL_GPU.pdf]

Design Superior Products with CST Microwave Studio
What can product engineers achieve if a single simulation run-time is reduced from 48 hours to 3 hours? CST Microwave Studio is one of the most widely used electromagnetic simulation packages, and some of the largest customers in the world today are leveraging GPUs to introduce their products to market faster and with more confidence in the fidelity of the product design.

[Chart: relative performance vs CPU – 9x faster with a Tesla GPU (2x CPU + 1x C2070) and 23x faster with 4 Tesla GPUs (2x CPU + 4x C2070). Benchmark: BGA models with 2M to 128M mesh cells, CST MWS transient solver; CPU: 2x Intel Xeon X5620; single-GPU run only on the 50M mesh model.]

Recommended Tesla Configurations
Workstation: 4x Tesla C2070, dual-socket quad-core CPU, 48 GB system memory
Server: 4x Tesla M2090, dual-socket quad-core CPU, 48 GB system memory


Tesla™ Bio Workbench: AMBER on GPUs

[Chart: 50% faster with GPUs – JAC NVE benchmark, relative performance scale (normalized ns/day): 192 quad-core CPUs, simulation run on the Kraken supercomputer (46 ns/day), vs 2 Intel Xeon quad-core CPUs and 4 Tesla M2090 GPUs (69 ns/day).]

[Chart: 16 CPUs at 0.36 ns/day and 7,737 kJ vs 16 CPUs + 24 GPUs at 3.44 ns/day and 1,142 kJ – 10x faster with 4x energy savings. "Molecular Dynamics Simulation of a Biomolecule with High Speed, Low Power and Accuracy Using GPU-Accelerated TSUBAME2.0 Supercomputer", Du, Udagawa, Endo, Sekijima.]

Enabling New Science: Tesla™ Bio Workbench
The NVIDIA Tesla Bio Workbench enables biophysicists and computational chemists to push the boundaries of life sciences research. It turns a standard PC into a "computational laboratory" capable of running complex bioscience codes, in fields such as drug discovery and DNA sequencing, more than 10-20 times faster through the use of NVIDIA Tesla GPUs. It consists of bioscience applications; a community site for downloading, discussing, and viewing the results of these applications; and GPU-based platforms.

Applications that are accelerated on GPUs include:
• Molecular Dynamics & Quantum Chemistry: AMBER, GROMACS, GAMESS, HOOMD, LAMMPS, NAMD, TeraChem (Quantum Chemistry), VMD
• Bio Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDASW++ (Smith-Waterman), GPU-HMMER, MUMmerGPU
For more information, visit: www.nvidia.com/bio_workbench

Leverage Supercomputer-Like Performance for Your AMBER Research with Tesla GPUs
Researchers today are solving the world's most challenging and important problems. From cancer research to drugs for AIDS, computational research is bottlenecked by simulation cycles per day. More simulations mean faster time to discovery. To tackle these difficult challenges, researchers frequently rely on national supercomputers for computer simulations of their models.

GPUs offer every researcher supercomputer-like performance in their own office. Complex molecular simulations that had been only possible using supercomputing resources can now be run on an individual workstation, optimizing the scientific workflow and accelerating the pace of research. These simulations can also be scaled up to GPU-based clusters of servers to simulate large molecules and systems that would have otherwise required a supercomputer. Benchmarks have shown four Tesla M2090 GPUs significantly outperforming the existing world record on CPU-only supercomputers.

RECOMMENDED HARDWARE CONFIGURATION
Workstation: 4x Tesla C2070, dual-socket quad-core CPU, 24 GB system memory
Server: 8x Tesla M2090, dual-socket quad-core CPU, 128 GB system memory
Cluster: up to 8x Tesla M2090s per node, dual-socket quad-core CPU per node, 128 GB system memory per node

Watch video*: The computational science behind Shampoo
TRY THE TESLA SIMCLUSTER: a ready-to-use solution for science and research. www.nvidia.eu/cluster
* Use your phone, smartphone or tablet PC with QR reader software to read the QR code.


GPU-Accelerated Bio-informatics

GROMACS
GROMACS is a molecular dynamics package designed primarily for simulation of biochemical molecules like proteins, lipids, and nucleic acids that have a lot of complicated bonded interactions. The CUDA port of GROMACS enabling GPU acceleration supports Particle-Mesh-Ewald (PME), arbitrary forms of non-bonded interactions, and implicit solvent Generalized Born methods.
[Chart: GROMACS performance (ns/day) for Impl 1nm, Impl Inf and PME runs – Tesla C2050 + 2x CPU vs 2x CPU. Benchmark system details: CPU: dual-socket Intel Xeon E5430, 8 threads.]

CUDA-BLASTP
CUDA-BLASTP is designed to accelerate NCBI BLAST for scanning protein sequence databases by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs. CUDA-BLASTP also has a utility to convert FASTA format databases into files readable by CUDA-BLASTP.
CUDA-BLASTP running on a workstation with two Tesla C1060 GPUs is 10x faster than NCBI BLAST (2.2.22) running on an Intel i7-920 CPU. This cuts compute time from minutes on CPUs to seconds using GPUs.
[Chart: CUDA-BLASTP vs NCBI BLASTP speedups for query sequence lengths 127, 254, 517, 1054 and 2026 – NCBI BLASTP with 1 and 4 threads on an Intel i7-920 CPU vs CUDA-BLASTP on 1 and 2 Tesla C1060 GPUs. Data courtesy of Nanyang Technological University, Singapore.]

GPU-HMMER
GPU-HMMER is bioinformatics software that does protein sequence alignment using profile HMMs by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs. GPU-HMMER is 60-100x faster than HMMER (2.0).
GPU-HMMER accelerates the hmmsearch tool using GPUs and gets speed-ups ranging from 60-100x. GPU-HMMER can take advantage of multiple Tesla GPUs in a workstation to reduce searches from hours on a CPU to minutes using a GPU.
[Chart: GPU-HMMER speedups (60-100x) vs HMM size (number of states: 789, 1431, 2295) for 1, 2 and 3 Tesla C1060 GPUs vs an AMD Opteron 2.2 GHz CPU. Data courtesy of Nanyang Technological University, Singapore.]

CUDA-MEME
CUDA-MEME is a motif discovery software based on MEME (version 3.5.4). It accelerates MEME by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs. It supports the OOPS and ZOOPS models in MEME.
CUDA-MEME running on one Tesla C1060 GPU is up to 23x faster than MEME running on an x86 CPU. This cuts compute time from hours on CPUs to minutes using GPUs. The chart data are for the OOPS (one occurrence per sequence) model for 4 datasets.
[Chart: CUDA-MEME on 1 Tesla C1060 GPU vs MEME on an AMD Opteron 2376 2.3 GHz CPU – speedups for the Mini-drosoph (4 sequences), HS_100 (100 sequences), HS_200 (200 sequences) and HS_400 (400 sequences) datasets. Data courtesy of Nanyang Technological University, Singapore.]

CUDA-EC
CUDA-EC is a fast parallel sequence error correction tool for short reads. It corrects sequencing errors in high-throughput short-read (HTSR) data and accelerates HTSR by taking advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs. Error correction is a preprocessing step for many DNA fragment assembly tools and is very useful for the new high-throughput sequencing machines.
CUDA-EC running on one Tesla C1060 GPU is up to 20x faster than Euler-SR running on an x86 CPU. This cuts compute time from minutes on CPUs to seconds using GPUs. The chart data are error rates of 1%, 2%, and 3%, denoted by A1, A2, A3 and so on, for 5 different reference genomes.
[Chart: CUDA-EC on a Tesla C1060 vs Euler-SR on an AMD Opteron 2.2 GHz CPU – speedups for reference genomes S. cer 5 (0.58 M), S. cer 7 (1.1 M), H. inf (1.9 M), E. col (4.7 M) and S. aures (4.7 M), read length 35, error rates 1%, 2%, 3%. Data courtesy of Nanyang Technological University, Singapore.]

RECOMMENDED HARDWARE CONFIGURATION
Workstation: 1x Tesla C2070, dual-socket quad-core CPU, 12 GB system memory


GPUs Accelerate Molecular Dynamics and Medical Imaging

NAMD
The team at the University of Illinois at Urbana-Champaign (UIUC) has been enabling CUDA acceleration on NAMD since 2007, and the results are simply stunning. NAMD users are experiencing tremendous speed-ups in their research using Tesla GPUs. The benchmark below shows that 4 GPU server nodes out-perform 16 CPU server nodes. It also shows that GPUs scale out better than CPUs as more nodes are added.
Scientists and researchers equipped with powerful GPU accelerators have reached new discoveries which were impossible to find before. See how other computational researchers are experiencing supercomputer-like performance in a small cluster, and take your research to new heights.
[Chart: NAMD 2.8 benchmark (ns/day) vs number of compute nodes (2 to 16), GPU+CPU vs CPU only. NAMD 2.8 B1, STMV benchmark; a node is dual-socket, quad-core X5650 with 2 Tesla M2070 GPUs. NAMD courtesy of the Theoretical and Computational Biophysics Group, UIUC.]

RECOMMENDED HARDWARE CONFIGURATION
Workstation: 4x Tesla C2070, dual-socket quad-core CPU, 24 GB system memory
Server: 2x-4x Tesla M2090 per node, dual-socket quad-core CPU per node, 128 GB system memory per node

NVIDIA'S TECHNOLOGY SET TO SPAWN CORONARY CAT SCANNER OF THE FUTURE

THE CHALLENGE
Many deaths from heart failure are caused by problems in the coronary arteries and by pathologies that are visible, to varying degrees, using traditional methodologies. The study of blood flows involves reconstructing the precise geometry of the network of arteries that surrounds the heart, which can be mapped using images deriving from high-resolution computerised tomography. To map these sections, a machine of similar design to that of a CAT scanner, but with much higher resolution, is being used. The higher the number of sections mapped, the higher the fidelity of the model.

THE SOLUTION
The practice is incredibly complicated. First comes the need to define what is coronary artery and what is not, within images that are far from easy to interpret. This done, specially developed software is used to interpret the coronary map and then define the blood flow, thus putting the ambitious project into practice. This obviously takes an enormous toll in terms of processing power, given that the result needs to be produced in real time. The interpolation of the images and subsequent creation of the map and the fluid dynamics model requires processing power of between 10 and 20 teraflops, which is possible thanks to the high power of NVIDIA Tesla GPUs. GPUs have already been successfully applied to supercomputing, but now they are being used for medical applications too, because they offer higher analytical capacity than traditional diagnostic systems.

THE IMPACT
The system is still at the testing stage and it will be some time before it is widely used in national health service hospitals. However, the developments that it may spawn in the medical/diagnostic arena, thanks to the processing power of the GPUs, are not hard to imagine. During the testing phase, a serious anomaly was detected in a member of the research staff, who showed no outward signs of ill health but had volunteered to take part in the initial experiments out of pure curiosity. A plaque of cholesterol was on the verge of entirely obstructing one of the coronary arteries, but the problem was detected thanks to the 3D view of the anomalous blood flow, and an emergency operation prevented the worst from happening, thus pointing the way to extremely promising scenarios in the field of preventive medicine. This use of 3D graphics in diagnostic applications will save many human lives in the future, considering that, in the experimentation stage alone, it saved the life of a member of the research project team.

LAMMPS
LAMMPS is a classical molecular dynamics package written to run well on parallel machines and is maintained and distributed by Sandia National Laboratories in the USA. It is a free, open-source code. LAMMPS has potentials for soft materials (biomolecules, polymers), solid-state materials (metals, semiconductors) and coarse-grained or mesoscopic systems. The CUDA version of LAMMPS is accelerated by moving the force calculations to the GPU.
[Chart: LAMMPS speedup vs dual-socket CPUs for 3 to 96 GPUs, up to roughly 300x. Each node has 2 CPUs and 3 GPUs; CPU is an Intel six-core Xeon; speedup measured with Tesla M2070 GPUs vs without GPUs. Source: http://users.nccs.gov/~wb8/gpu/kid_single.htm]
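As a toy illustration of what "moving the force calculations to the GPU" means in practice (a generic all-pairs sketch, not LAMMPS's actual kernels, which use neighbor lists and physical potentials), each CUDA thread below accumulates a simple inverse-square pair force for one particle:

#include <cuda_runtime.h>

// Toy all-pairs force kernel: one thread per particle accumulates an
// inverse-square force from every other particle.
__global__ void pair_forces(const float3 *pos, float3 *force, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 f = make_float3(0.0f, 0.0f, 0.0f);
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float dz = pos[j].z - pos[i].z;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;   // softening avoids divide-by-zero
        float inv_r3 = rsqrtf(r2) / r2;                   // approximately 1 / r^3
        f.x += dx * inv_r3;
        f.y += dy * inv_r3;
        f.z += dz * inv_r3;
    }
    force[i] = f;
}

int main(void)
{
    const int n = 1024;
    float3 *d_pos, *d_force;
    cudaMalloc(&d_pos, n * sizeof(float3));
    cudaMalloc(&d_force, n * sizeof(float3));
    cudaMemset(d_pos, 0, n * sizeof(float3));             // placeholder coordinates

    pair_forces<<<(n + 127) / 128, 128>>>(d_pos, d_force, n);
    cudaDeviceSynchronize();

    cudaFree(d_pos);
    cudaFree(d_force);
    return 0;
}

Because every particle's force can be computed independently, the work maps naturally onto thousands of GPU threads, which is why production molecular dynamics codes see such large speed-ups from this offload.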


GPU Computing Case Studies: Oil and Gas

[Images: before and after – GPU-enhanced voxel rendering providing improved visualization of karsts in the Barnett Shale.]

OIL & GAS: Paradigm VoxelGeo
Redefining interpretation with NVIDIA Tesla GPUs

THE CHALLENGE
Every energy company has a goal to make the best of their large drilling budgets. Drilling wells in the right place, and with accurate knowledge of the associated risks, really impacts the bottom line. When improved accuracy leads to more pay zone and less uncertainty about the production process, the ROI improves significantly. Paradigm software aims at giving customers as much information as possible out of an existing seismic dataset, where some contractors may be advising the customer to shoot new data for better resolution. Tangible evidence for this is the ability to drill more productive feet in a well (GeoSteering, accurate models, fine-tuned depth conversion, better characterization leading to drilling in sweet spots). HSE is a critical topic where nothing can be left out: avoiding overpressure zones and identifying fracture areas that could affect drilling are all factors that should be taken into consideration, carefully processed and visualized with maximum detail. Working efficiently to reduce project turnaround without compromising quality has a definite impact on customer performance. It also implies that the data volumes are growing, and with the users' need to see more detail comes increasingly compute-intensive data processing.

THE SOLUTION
The only adequate solution would be a re-write of the most compute-intensive parts of the rendering engine using CUDA programming on NVIDIA GPUs to harness the compute power of massively parallel processors. The GPU-enabled solution should work effectively with both Windows and Linux. A minimal code-porting effort would provide a sizeable improvement in the overall efficiency, shortening the overall project time as a result.

THE IMPACT
GPU rendering delivers 10x to 40x acceleration, increasing the massive data crunching efficiency by an order of magnitude. Users get a much more realistic rendering in 3D, taking voxel visualization to the next level. Porting the key processes to the NVIDIA GPU architecture has brought about a new rendering process with significantly improved spatial perception and real-time computation of key attributes. Further integration of the CUDA-enabled code into the interpretation suite has produced a notable improvement in display quality, making it easier to perceive depth and volume. As a result, VoxelGeo remains the reference voxel product and the product of choice for industry leaders.

ffA: Changing the Way 3D Seismic Interpretation Is Carried Out

THE CHALLENGE
The data produced by seismic imaging surveys is crucial for identifying the presence of hydrocarbon reserves and understanding how to extract them. Today, geoscientists must process increasing amounts of data as dwindling reserves require them to pinpoint smaller, more complex reservoirs with greater speed and accuracy.

THE SOLUTION
UK-based company ffA provides world-leading 3D seismic analysis software and services to the global oil and gas industry. The sophisticated tools providing a greater understanding of complex 3D geology are extremely compute-intensive. With the recently released CUDA-enabled 3D seismic analysis application, ffA users routinely achieve over an order of magnitude speed-up compared with performance on high-end multi-core CPUs.

THE IMPACT
NVIDIA CUDA is allowing ffA to provide scalable high-performance computation for seismic data on hardware platforms equipped with one or more NVIDIA Quadro FX and NVIDIA Tesla GPUs, significantly increasing the amount of data that can be analyzed in a given timeframe and allowing customers to improve subsurface understanding and reduce risk in oil and gas exploration and exploitation.
[Chart: Streamlining with GPUs for better results – relative performance, quad-core CPU vs Tesla GPU + CPU. The latest benchmarking results using Tesla GPUs have produced performance improvements of up to 37x versus a high-end dual quad-core CPU.]

Watch video*: ffA and NVIDIA GPUs Revolutionize the Search for New Oil and Gas Fields
* Use your phone, smartphone or tablet PC with QR reader software to read the QR code.


GPU Computing Case Studies: Finance and Supercomputing

[Photo © P. Stroppa / CEA]

GPU COMPUTING IN FINANCE – CASE STUDIES
Driving the HPC revolution on Europe's largest hybrid research clusters

Bloomberg: GPUs increase accuracy and reduce processing time for bond pricing
Bloomberg implemented an NVIDIA Tesla GPU computing solution in their datacenter. By porting their application to run on the NVIDIA CUDA parallel processing architecture, Bloomberg received dramatic improvements across the board. As Bloomberg customers make crucial buying and selling decisions, they now have access to the best and most current pricing information, giving them a serious competitive trading advantage in a market where timing is everything.
[Comparison: 48 GPUs vs 2,000 CPUs – 42x lower space; $144K vs $4 million – 28x lower cost; $31K per year vs $1.2 million per year – 38x lower power cost.]

BSC-CNS triples its calculation capacity
The Barcelona Supercomputing Center now has a new cluster with graphical accelerators which will be used to consolidate its research in programming models, tool development and application porting. The new BSC-CNS cluster is a special-purpose supercomputer that offers optimum peak performance for specific applications or programs. Its advantage, compared to general-purpose machines, is its speed and low energy consumption. Mateo Valero, BSC-CNS Director, says, "We currently have some of the world's best programming models so we are in an optimal position to provide some specific applications the most efficient usage of the new system, offering a significant increase in their performance."
About BSC: the mission of the Barcelona Supercomputing Center is to research, develop and manage information technology in order to facilitate scientific progress. With this objective, special dedication has been given to research areas such as Computer Sciences, Life Sciences, Earth Sciences and Computational Applications in Science and Engineering.

NVIDIA Tesla GPUs used by J.P. Morgan to run risk calculations in minutes, not hours
THE CHALLENGE
Risk management is a huge and increasingly costly focus for the financial services industry. A cornerstone of J.P. Morgan's cost-reduction plan to cut the cost of risk calculation involves accelerating its risk library. It is imperative to reduce the total cost of ownership of J.P. Morgan's risk-management platform and create a leap forward in the speed with which client requests can be serviced.
THE SOLUTION
J.P. Morgan's Equity Derivatives Group added NVIDIA® Tesla M2070 GPUs to its datacenters. More than half the equity derivative-focused risk computations run by the bank have been moved to running on hybrid GPU/CPU-based systems. NVIDIA Tesla GPUs were deployed in multiple data centers across the bank's global offices. J.P. Morgan was able to seamlessly share the GPUs between tens of global applications.
THE IMPACT
Utilizing GPUs has accelerated application performance by 40x and delivered over 80 percent savings, enabling greener data centers that deliver higher performance for the same power. For J.P. Morgan, this is game-changing technology, enabling the bank to calculate risk across a range of products in a matter of minutes rather than overnight. Tesla GPUs give J.P. Morgan a significant market advantage.

GPU extension of the CURIE supercomputer at GENCI
About GENCI: GENCI, Grand Equipement National de Calcul Intensif, is owned 49% by the French State, 20% by CEA, 20% by CNRS, 10% by French universities and 1% by INRIA. GENCI's role is to set the strategic direction and to make France's most important investments in High-Performance Computing (HPC), in particular as the custodian of France's commitment to PRACE.
As part of its CURIE supercomputer that is currently being installed by Bull, GENCI has acquired 16 bullx blade chassis, housing a total of 144 B505 accelerator blades, i.e. 288 NVIDIA Tesla M2090 GPUs (147,456 GPU cores). The 1.6-petaflops CURIE system will be entirely dedicated to the use of the European PRACE partnership (the Partnership for Advanced Computing in Europe).


Tesla GPU Computing Solutions

Tesla Data Center Products
Available from OEMs and certified resellers, Tesla GPU computing products are designed to supercharge your computing cluster.

Tesla Workstation Products
Designed to deliver cluster-level performance on a workstation, the NVIDIA Tesla GPU Computing Processors fuel the transition to parallel computing while making personal supercomputing possible, right at your desk.

TESLA GPU COMPUTING SOLUTIONS
The Tesla 20-series GPU computing solutions are designed from the ground up for high-performance computing and are based on NVIDIA's latest CUDA GPU architecture, code named "Fermi". It delivers many "must have" features for HPC, including ECC memory for uncompromised accuracy and scalability, C++ support, and 7x double precision performance compared to CPUs. Compared to typical quad-core CPUs, Tesla 20-series GPU computing products can deliver equivalent performance at 1/10th the cost and 1/20th the power consumption.

• Superior Performance: highest double precision floating point performance and large on-board memory to support large HPC data sets
• Highly Reliable: uncompromised data reliability through ECC protection and stress testing for zero error tolerance
• Designed for HPC

Tesla M2070/M2075/M2090 GPU Computing Modules enable the use of GPUs and CPUs together in an individual server node or blade form factor.
Tesla C2070/C2075 GPU Computing Processors deliver the power of a cluster in the form factor of a workstation.

For more information on Tesla GPU computing products and applications, visit www.nvidia.eu/tesla.

Highest Performance, Highest Efficiency
Workstations powered by Tesla GPUs outperform conventional CPU-only solutions in life science applications. GPU-CPU server solutions deliver up to 8x higher Linpack performance.

[Charts: workstation speedups for MIDG (discontinuous Galerkin solvers for PDEs), AMBER molecular dynamics (mixed precision), double-precision FFT (cuFFT 3.1), OpenEye ROCS virtual drug screening, and radix sort (CUDA SDK), Intel Xeon X5550 CPU vs Tesla C2050; and server comparisons of Linpack performance (Gflops), performance per dollar (Gflops/$K) and performance per watt (Gflops/kW) for a CPU 1U server vs a GPU-CPU 1U server. CPU 1U server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55 kW. GPU-CPU 1U server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1.0 kW.]


GPU Computing Solutions: Tesla C

Tesla C-Class GPU Computing Processors

NVIDIA CUDA Technology Unlocks the Power of Tesla's Many Cores
The CUDA C programming environment simplifies many-core programming and enhances performance by offloading computationally-intensive activities from the CPU to the GPU. It enables developers to utilise NVIDIA GPUs to solve the most complex computation-intensive challenges such as protein docking, molecular dynamics, financial analysis, fluid dynamics, structural analysis and many others.

NVIDIA CUDA Unlocks the Power of GPU Parallel Computing
The CUDA parallel computing architecture enables developers to utilise C or FORTRAN programming with NVIDIA GPUs to run the most complex computationally intensive applications. CUDA is easy to learn and has become widely adopted by thousands of application developers worldwide to accelerate the most performance-demanding applications.

PETASCALE COMPUTING WITH TERAFLOP PROCESSORS
The NVIDIA Tesla computing card enables the transition to energy-efficient parallel computing power by bringing the performance of a small cluster to a workstation. With hundreds of processor cores and a standard C compiler that simplifies application development, Tesla cards scale to solve the world's most important computing challenges more quickly and accurately.

MULTI-TERAFLOPS SYSTEMS: The Power Inside Personal Supercomputers
Your Own Supercomputer, Accessible to Everyone
Get nearly 4 teraflops of compute capability and the ability to perform computations 250 times faster than a multi-CPU core PC or workstation. Available from OEMs and resellers worldwide, the Tesla Personal Supercomputer operates quietly and plugs into a standard power strip so you can take advantage of cluster-level performance anytime you want, right from your desk. The NVIDIA® Tesla™ Personal Supercomputer is based on the revolutionary NVIDIA® CUDA™ parallel computing architecture and powered by up to thousands of parallel processing cores.

TESLA PERSONAL SUPERCOMPUTER
• Cluster Performance on Your Desktop: the performance of a cluster in a desktop system. Four Tesla GPU computing processors deliver nearly 4 teraflops of performance.
• Designed for Office Use: plugs into a standard office power socket and quiet enough for use at your desk.
• Massively Parallel Many-Core GPU Architecture: 448 parallel processor cores per GPU that can execute thousands of concurrent threads.
• High-Speed Memory per GPU: dedicated compute memory enables larger datasets to be stored locally for each processor, minimizing data movement around the system.
• IEEE 754 Floating Point Precision: provides results that are consistent across platforms and meet industry standards.

Tesla C2070 / Tesla C2075* specifications:
Architecture: Tesla 20-series GPU
Number of Cores: 448
Caches: 64 KB L1 cache + shared memory per 32 cores, 768 KB L2 cache
FP Peak Performance: 1.03 TFlops (single precision), 515 GFlops (double precision)
FP Application Efficiency (Tesla C1060 reference): 1.5 - 2 (single precision), 3 - 4 (double precision)
GPU Memory: 6 GB GDDR5 ECC; 5.25 GB with ECC on
Memory Bandwidth: 144 GB/s ECC off, 115 GB/s ECC on
Video Output: DVI-I
System I/O: PCIe x16 Gen2 (bi-directional async. transfer)
Positioning: best price/performance solutions for double precision codes and when ECC memory is required
Power Consumption: 225W (C2070), 200W (C2075)

* Extended ECC reporting and dynamic power scaling capability available.


GPU Computing Solutions: Tesla Modules

Tesla M2070 / Tesla M2075* / Tesla M2090 specifications:
Architecture: Tesla 20-series GPU
Number of Cores: 448 (M2070/M2075), 512 (M2090)
Caches: 64 KB L1 cache + shared memory per 32 cores, 768 KB L2 cache
FP Peak Performance: 1030 GFlops single / 515 GFlops double (M2070/M2075); 1331 GFlops single / 665 GFlops double (M2090)
FP Application Efficiency (Tesla C1060 reference): 1.5 - 2 (single precision), 3 - 4 (double precision)
GPU Memory: 6 GB GDDR5 ECC; 5.25 GB with ECC on
Memory Bandwidth: 144 GB/s ECC off, 115 GB/s ECC on
Positioning: best price/performance solutions for double precision codes and when ECC memory is required
Power Consumption: 225W (M2070), 200W (M2075), 225W (M2090)

* Extended ECC reporting and dynamic power scaling capability available.

TESLA M-CLASS GPU COMPUTING MODULES: DESIGNED FOR DATACENTER INTEGRATION

Tesla M-class GPU Computing Modules enable the seamless integration of GPU computing with host systems for high-performance computing and large data center, scale-out deployments.

The Tesla 20-series GPU Computing Processors are the first to deliver greater than 10x the double precision horsepower of a quad-core x86 CPU and the first GPUs to deliver ECC memory. Based on the Fermi architecture, these GPUs feature up to 665 gigaflops of double precision performance, 1 teraflops of single precision performance, ECC memory error protection, and L1 and L2 caches. The Tesla GPUs' high performance makes them ideal for seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, CAE, CFD, and data analytics.

Endorsed by the HPC OEMs
The Tesla Modules are only available in OEM systems that have been specifically designed and qualified along with NVIDIA engineers. They are designed for maximum reliability; their passive heatsink design eliminates moving parts and cables. Tesla Modules have been selected by the most important HPC vendors world-wide.

System Monitoring Features
Tesla Modules deliver all the standard benefits of GPU computing while enabling maximum reliability and tight integration with system monitoring and management tools. This gives data centre IT staff much greater choice in how they deploy GPUs, offering a wide variety of rack-mount and blade systems plus the remote monitoring and remote management capabilities they need.

TRY THE TESLA SIMCLUSTER
A ready-to-use solution for science and research: www.nvidia.eu/cluster

NVIDIA TESLA AND CUDA LINKS

News and Success Stories
• NVIDIA GPU Computing on Twitter: http://twitter.com/gpucomputing
• CUDA weekly newsletter: www.nvidia.com/object/cuda_week_in_review_newsletter.html
• News and articles: www.nvidia.co.uk/page/tesla-articles.html
• Tesla video on YouTube: www.youtube.com/nvidiatesla
• Success stories: www.nvidia.co.uk/page/tesla_testimonials.html

Development Tools
• CUDA Zone: www.nvidia.co.uk/cuda
• CUDA in Action: www.nvidia.co.uk/object/cuda_in_action.html
• CUDA Books: www.nvidia.co.uk/object/cuda_books.html
• CUDA Training & Consulting: www.nvidia.co.uk/page/cuda_consultants.html
• Order a CUDA course: www.parallel-compute.com
• Software Development Tools: www.nvidia.co.uk/object/tesla_software_uk.html

Hardware
• NVIDIA High Performance Computing: www.nvidia.eu/tesla
• Personal Supercomputers with Tesla: www.nvidia.co.uk/psc
• Tesla Data Center Solutions: www.nvidia.co.uk/page/preconfigured_clusters.html
• Tesla Products Technical Descriptions: www.nvidia.co.uk/page/tesla_product_literature.html

© 2011 NVIDIA Corporation. All rights reserved. NVIDIA, the NVIDIA logo, NVIDIA Tesla, CUDA, GigaThread, Parallel DataCache and Parallel NSight are trademarks and/or registered trademarks of NVIDIA Corporation. All company and product names are trademarks or registered trademarks of the respective owners with which they are associated. Features, pricing, availability, and specifications are all subject to change without notice.