
Intro to GPGPU

Dr. Chokchai (Box) Leangsuksun, PhD, Louisiana Tech University, Ruston, LA

CPU vs. GPU

• CPU
  – Fast caches
  – Branching adaptability
  – High performance
• GPU
  – Multiple ALUs
  – Fast onboard memory
  – High throughput on parallel tasks
  – Executes a program on each fragment/vertex

• CPUs are great for task parallelism
• GPUs are great for data parallelism


CPU vs. GPU - Hardware

• More transistors devoted to data processing

CUDA programming guide 3.1

CPU vs. GPU – Computation

CUDA programming guide 3.1


CPU vs. GPU – Memory Bandwidth

CUDA programming guide 3.1

What is GPGPU?

• General-purpose computation using a GPU in applications other than 3D graphics
  – GPU accelerates the critical path of the application
• Data-parallel algorithms leverage GPU attributes
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Why GPGPU?

• Large number of cores
  – 100-1000 cores in a single card
• Low cost
  – Less than $100-$1500
• Green: low power consumption
  – ~135 watts/card
  – 135 W vs. 30,000 W (300 watts * 100)
• One card can perform the work of > 100 desktops
  – $750 vs. $50,000 ($500 * 100)

Two major players


Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a HW device interface
  – In laptops, desktops, workstations, servers
• Tesla T10 1070: from 1-4 TFLOPS
• AMD/ATI 5970 x2: 3200 cores
• NVIDIA Tegra is an all-in-one (system-on-a-chip) architecture derived from the ARM family
• GPU parallelism is better than Moore's Law, more than doubling every year
• GPGPU is a GPU that allows the user to run both graphics and non-graphics applications

(Pictured: ATI 4850, GeForce 8800)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Requirements of a GPU system

• GPGPU is a GPU that allows the user to process both graphics and non-graphics applications.

• GPGPU-capable graphics card
• Power supply
• Cooling
• PCI-Express 16x

(Pictured: Tesla D870, GeForce 8800)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Examples of GPU devices

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

NVIDIA GeForce 8800 (G80)

• The eighth generation of NVIDIA's GeForce graphics cards
• High-performance CUDA-enabled GPGPU
• 128 cores
• Memory: 256-768 MB, or 1.5 GB in Tesla
• High-speed memory bandwidth (86.4 GB/s)
• Supports Scalable Link Interface (SLI)


NVIDIA GeForce GTX 295 (GT200)

• The tenth generation of NVIDIA's GeForce graphics cards
• The second generation of the CUDA architecture
• Dual-GPU card
• 480 cores (240 per GPU)
• 1242 MHz processor clock speed
• Memory: 1792 MB (896 MB per GPU)
• 223.8 GB/s memory bandwidth (2 memory interfaces)
• Supports Quad Scalable Link Interface (SLI)

NVIDIA GeForce GTX 480 (Fermi)

• The eleventh generation of NVIDIA's GeForce graphics cards
• The third generation of the CUDA architecture
• 480 cores
• 1401 MHz processor clock speed
• Memory: 1536 MB
• 177.4 GB/s memory bandwidth
• Supports 2-way/3-way Scalable Link Interface (SLI)


NVIDIA Tesla™

• Features
  – GPU computing for HPC
  – No display ports
  – Dedicated to computation
  – For massively multi-threaded computing
  – Supercomputing performance
  – Large memory capacity, up to 6 GB in Tesla M2070

NVIDIA Tesla Card >>

• Tesla 10:
  – C-Series (card) = 1 GPU with 1.5 GB
  – D-Series (deskside unit) = 2 GPUs
  – S-Series (1U server) = 4 GPUs
• Tesla 20 (Fermi architecture) = 1 GPU with 3 GB or 6 GB
• Note: 1 G80 GPU = 128 cores = ~500 GFLOPS
• 1 T10 = 240 cores = ~1 TFLOPS

<< NVIDIA G80


NVIDIA Fermi (Tesla series 20)

"I believe history will record Fermi as a significant milestone." – Dave Patterson

• 512 cores (16 SMs * 32 cores)
• 8X faster peak DP floating point calculation
• 520-630 GFLOPS DP
• 3 GB GDDR5 for Tesla 2050
• 6 GB GDDR5 for Tesla 2070
• ECC
• L1 and L2 caches
• Concurrent kernel execution (up to 16 kernels)
• IEEE 754-2008 and FMA (Fused Multiply-Add)

NVIDIA Fermi Architecture

NVIDIA's Fermi white paper


3rd Generation SM Architecture

• 32 cores, 16 load/store units, and 4 Special Function Units
• Configurable 64 KB memory: 16 KB shared memory and 48 KB L1 cache, or 48 KB shared memory and 16 KB L1 cache
• Dual warp scheduler
• Each CUDA core contains one floating point unit and one integer ALU, with double-precision (DP) support

• 8X faster in double-precision operations than GT200

NVIDIA's Fermi white paper

Memory Hierarchy

Each thread in a block can access the shared memory and the L1 cache; each block has access to the L2 cache and global memory.

NVIDIA's Fermi white paper


Dual Warp Scheduler >>

<< Concurrent Kernel Execution

NVIDIA's Fermi white paper

Fermi Products

            GTX460           GTX465     GTX470     GTX480     Tesla 2050        Tesla 2070
Cores       336              352        448        480        448               448
Clock       1350 MHz         1215 MHz   1215 MHz   1401 MHz   SP: 1.05 TFLOPS   SP: 1.05 TFLOPS
                                                              DP: 515 GFLOPS    DP: 515 GFLOPS
Memory      768 MB or 1 GB   1 GB       1280 MB    1.5 GB     3 GB              6 GB
Bandwidth   86.4 or 115.2    102.6      133.9      177.4      148               148      (GB/s)
Power       160 W            200 W      215 W      250 W      225 W             225 W
Price       $199-$249        $299       $349       $499       $2,499            $3,999

NVIDIA.com


CUDA Architecture Generations


Nvidia's Fermi white paper

Fermi vs. GT200

Each SM in the Fermi architecture can perform 16 FMA (Fused Multiply-Add) double-precision operations per clock cycle.

Nvidia's Fermi white paper


Nvidia's Fermi white paper

This slide is from the NVIDIA CUDA tutorial.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


ATI Stream (1)


ATI 4870


ATI 4870 X2


ATI Radeon™ HD 5870

Transistors: 2.15 billion (40 nm)
Stream Cores: 1600
Clock speed: 850 MHz
SP Compute Power: 2.72 TeraFLOPS
DP Compute Power: 544 GigaFLOPS
Memory Type: GDDR5, 4.8 Gbps
Memory Capacity: 1 GB
Memory Bandwidth: 153.6 GB/sec
Board Power: 188 W max / 27 W idle

AMD.com


ATI Radeon™ HD 5970

Transistors: 4.3 billion (40 nm)
Stream Cores: 3200 (2 GPUs)
Clock speed: 725 MHz
SP Compute Power: 4.64 TeraFLOPS
DP Compute Power: 928 GigaFLOPS
Memory Type: GDDR5, 4.0 Gbps
Memory Capacity: 2-4 GB
Memory Bandwidth: 256.0 GB/sec
Board Power: 294 W max / 51 W idle

AMD.com

Architecture of ATI Radeon 4000 series


This slide is from an ATI presentation.


What about Intel?

Intel Larrabee

• A hybrid between a multi-core CPU and a GPU
• Its coherent cache hierarchy and x86 architecture compatibility are CPU-like
• Its wide SIMD vector units and texture sampling hardware are GPU-like


Months after ISC'09, Intel canceled the Larrabee project. At ISC'10 they announced a new project, code-named "Knights Ferry", using an architecture similar to Larrabee called MIC.

Intel Knights Ferry (MIC Architecture)

• 22 nm technology
• 32 cores at 1.2 GHz (MIC is up to 50 cores)
• 128 threads, at 4 threads/core
• 8 MB shared coherent cache
• 1-2 GB GDDR5
• Intel HPC tools

This slide's information is from the ISC'10 Skaugen keynote.


MIC Architecture (Many Integrated Core)

This slide's information is from the ISC'10 Skaugen keynote.

Intel Knights Ferry vs. NVIDIA Fermi

                                Intel MIC        NVIDIA Fermi
MIMD Parallelism                32               32 (28)
SIMD Parallelism                16               16
Instruction-Level Parallelism   2                1
Thread Granularity              coarse           fine
Multithreading                  4                24
Clock                           1.2 GHz          1.1 GHz
L1 cache/processor              32 KB            64 KB
L2 cache/processor              256 KB           24 KB
Programming model               posix threads    CUDA kernels
                                yes              no
Memory shared with host         no               no
Hardware parallelism support    no               yes
Mature tools                    yes              yes

This information is from the article "Compilers and More: Knights Ferry versus Fermi" by Michael Wolfe, The Portland Group, Inc.


Introduction to OpenCL

Toward a new approach in computing

Introduction to OpenCL

• OpenCL stands for Open Computing Language.
• It came from a consortium effort including Apple, NVIDIA, AMD, etc.
• It is managed by the Khronos Group, which was also responsible for OpenGL.
• It took 6 months to come up with the specification.


OpenCL

1. Royalty-free.
2. Supports both task-parallel and data-parallel programming modes.
3. Works with vendor-agnostic GPGPUs,
4. including multi-core CPUs.
5. Works on Cell processors.
6. Supports handhelds and mobile devices.
7. Based on the C language (C99).


OpenCL Platform Model

Basic OpenCL program Structure

1. OpenCL Kernel

2. Host program containing:
   a. Device context
   b. Command queue
   c. Memory objects
   d. OpenCL program
   e. Kernel memory arguments
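A minimal host-side sketch of how these components map onto the OpenCL C API is shown below; the kernel name my_kernel, the buffer size n, and the omission of all error checking are illustrative assumptions, not part of the original slides.

#include <CL/cl.h>

/* Sketch: one device, one buffer, one kernel named "my_kernel" (hypothetical). */
void run_opencl_kernel(const char *src, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* a. device context */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    /* b. command queue */
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);
    /* c. memory objects */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, NULL);
    /* d. OpenCL program, built from the kernel source string */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    /* e. kernel and its memory arguments */
    cl_kernel kernel = clCreateKernel(prog, "my_kernel", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    size_t global_size = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clFinish(queue);
}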


CPUs+GPU platforms


Performance of GPGPU

Note: a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS.


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


CUDA

• "Compute Unified Device Architecture"
• General-purpose programming model
  – User kicks off batches of threads on the GPU
  – GPU = dedicated super-threaded, massively data-parallel co-processor
• Targeted software stack
  – Compute-oriented drivers, language, and tools
• Driver for loading computation programs into the GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute (graphics-free API)
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

An Example of Physical Reality Behind CUDA CPU (host) GPU w/ local DRAM (device)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Parallel Computing on a GPU

• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• Programmable in C with CUDA tools
• Multithreaded SIMD model uses application and thread parallelism

(Pictured: GeForce 8800, Tesla D870, Tesla S870)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

GeForce 8800

16 highly threaded SMs, >128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to CPU host

(Block diagram: Host feeds the Input Assembler and Thread Execution Manager; parallel data caches with texture units and load/store units connect to a shared Global Memory)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Introduction to CUDA programming

These materials are excerpted from David Kirk/NVIDIA and Wen-mei W. Hwu's, and Christian Trefftz / Greg Wolffe's, SC08 GPU tutorials.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Data-parallel Programming

• Think of the GPU as a massively-threaded co-processor
• Write "kernel" functions that execute on the device, processing multiple data elements in parallel

• Keep it busy! → massive threading
• Keep your data close! → local memory


Pixel / Thread Processing


Steps for CUDA Programming

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (call a __global__ function)
5. Copy data from device memory (retrieve results)
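A minimal CUDA sketch of these five steps is shown below; the incrementKernel example, the array size N, and the block size of 256 are illustrative assumptions, and error checking is omitted.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void incrementKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int N = 1024;
    size_t size = N * sizeof(float);
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    cudaSetDevice(0);                                            // 1. device initialization
    float *d_data;
    cudaMalloc((void**)&d_data, size);                           // 2. device memory allocation
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);    // 3. copy data to device memory
    incrementKernel<<<(N + 255) / 256, 256>>>(d_data, N);        // 4. execute kernel
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);    // 5. copy results back (retrieve)
    cudaFree(d_data);

    printf("h_data[0] = %f\n", h_data[0]);                       // expect 1.0
    return 0;
}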


Initially:

(Figure: the Host's Memory holds array; the GPU Card's Memory is still empty)


Allocate Memory in the GPU card

(Figure: array in the Host's Memory; array_d allocated in the GPU Card's Memory)


Copy content from the host’s memory to the GPU card memory

(Figure: contents of array copied from the Host's Memory to array_d in the GPU Card's Memory)


Execute code on the GPU

(Figure: kernel code runs on the GPU multiprocessors, operating on array_d in the GPU Card's Memory while array stays in the Host's Memory)


Copy results back to the host memory

(Figure: results copied from array_d in the GPU Card's Memory back to array in the Host's Memory)


Steps for CUDA Programming

1. Device initialization
2. Device memory allocation
3. Copy data to device memory
4. Execute kernel (call a __global__ function)
5. Copy data from device memory (retrieve results)


Hello World

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Hello World

// Kernel definition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    vecAdd<<<1, N>>>(A, B, C);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
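The main() above passes A, B, and C straight to the kernel, so they must already be device pointers; a hedged sketch of the missing host-side setup (N, h_A, h_B, and h_C are illustrative names not on the slide) might look like this:

#define N 256

int main()
{
    size_t size = N * sizeof(float);
    float h_A[N], h_B[N], h_C[N];                 // host copies of the data
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    float *A, *B, *C;                             // device pointers passed to the kernel
    cudaMalloc((void**)&A, size);
    cudaMalloc((void**)&B, size);
    cudaMalloc((void**)&C, size);
    cudaMemcpy(A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B, h_B, size, cudaMemcpyHostToDevice);

    vecAdd<<<1, N>>>(A, B, C);                    // one block of N threads, as on the slide

    cudaMemcpy(h_C, C, size, cudaMemcpyDeviceToHost);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}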


Extended C

• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {

    __shared__ float region[M];
    ...
    region[threadIdx] = image[i];
    __syncthreads()
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes)

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Initialize Device calls

• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device's properties.

• Note: cudaSetDevice() must be called before any __global__ function call; otherwise device 0 is automatically selected.


CUDA Language concept

• CUDA Programming Model • CUDA Memory Model

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Some Terminology

• device = GPU = set of multiprocessors
• multiprocessor = set of processors & shared memory
• kernel = GPU program
• grid = array of thread blocks that execute a kernel
• thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Thread Batching: Grids and Blocks

• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate

(Figure: the Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks such as Block (1, 1), and each block is a 2D array of threads such as Thread (2, 1))

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign. Courtesy: NVIDIA


What are those blockIds and threadIds?

• blockIdx.x is a built-in variable in CUDA that returns the block ID in the x axis of the block that is executing this block of code.
• threadIdx.x is another built-in variable that returns the thread ID in the x axis of the thread that is being executed by this stream processor.
• Example code in the kernel:
    x = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    block_d[x] = blockIdx.x;
    thread_d[x] = threadIdx.x;
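Wrapped into a complete (hypothetical) kernel, those example lines look like this; block_d and thread_d are assumed to be device arrays with one slot per thread:

#define BLOCK_SIZE 4

__global__ void whoAmI(int *block_d, int *thread_d)
{
    int x = blockIdx.x * BLOCK_SIZE + threadIdx.x;   // global index of this thread's element
    block_d[x]  = blockIdx.x;                        // which block handled element x
    thread_d[x] = threadIdx.x;                       // which thread within that block
}

// e.g. whoAmI<<<2, BLOCK_SIZE>>>(block_d, thread_d); gives the two-block layout pictured on the next slide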


In the GPU:

(Figure: the array elements are mapped onto the processing elements as threads 0-3 of Block 0 and threads 0-3 of Block 1)


CUDA Device Memory Model Overview

• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories

(Figure: the device grid contains blocks, each with shared memory, per-thread registers, and per-thread local memory; the host connects to global, constant, and texture memory)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Global, Constant, and Texture Memories (Long Latency Accesses)

• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by the host
  – Contents visible to all threads

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign. Courtesy: NVIDIA


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

CUDA Device Memory Allocation

• cudaMalloc()
  – Allocates object in the device Global Memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of the allocated object
• cudaFree()
  – Frees object from the device Global Memory
    • Parameter: pointer to the freed object

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
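A short sketch of the two calls, reusing the WIDTH x WIDTH float matrix that appears later in this deck (Md and size are placeholder names):

float *Md;
int size = WIDTH * WIDTH * sizeof(float);   // assumes WIDTH is defined, as in the matrix example
cudaMalloc((void**)&Md, size);              // parameters: address of the pointer, size in bytes
cudaFree(Md);                               // parameter: pointer to the object being freed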


CUDA Host-Device Data Transfer

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to source
    • Pointer to destination
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous in CUDA

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
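A short sketch of the call, continuing the Md example above (M and P stand for host arrays; in the actual API the argument order is destination, source, byte count, transfer type):

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(P, Md, size, cudaMemcpyDeviceToHost);   // device -> host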

CUDA Function Declarations

                                 Executed on the:   Only callable from the:
__device__ float DeviceFunc()    device             device
__global__ void  KernelFunc()    device             host
__host__   float HostFunc()      host               host

• __global__ defines a kernel function
  – Must return void

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
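A small illustrative sketch of the three qualifiers (the function names and bodies are made up for this example):

__device__ float DeviceFunc(float x) { return x * x; }      // runs on the device, callable only from device code

__global__ void KernelFunc(float *out)                       // runs on the device, launched from the host
{
    out[threadIdx.x] = DeviceFunc((float)threadIdx.x);
}

__host__ float HostFunc(float x) { return x + 1.0f; }        // ordinary CPU function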


Language Extensions: Variable Type Qualifiers

                                           Memory     Scope    Lifetime
__device__ __local__    int LocalVar;      local      thread   thread
__device__ __shared__   int SharedVar;     shared     block    block
__device__              int GlobalVar;     global     grid     application
__device__ __constant__ int ConstantVar;   constant   grid     application

• __device__ is optional when used with __local__, __shared__, or __constant__

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
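A sketch of how the qualifiers might appear in practice (names and sizes are illustrative; the kernel assumes a launch with 64 threads per block):

__constant__ float coeff[16];        // constant memory: read-only to the grid, set by the host
__device__ int globalCounter;        // global memory: lives for the whole application

__global__ void demoKernel(float *out)
{
    __shared__ float tile[64];       // shared memory: one copy per block, lives as long as the block
    int i = threadIdx.x;             // automatic variable: per-thread register/local storage
    tile[i] = coeff[i % 16];
    __syncthreads();
    out[i] = tile[i];
}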

Access Times

• Register – dedicated HW – single cycle
• Shared Memory – dedicated HW – single cycle
• Local Memory – DRAM, no cache – *slow*
• Global Memory – DRAM, no cache – *slow*
• Constant Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


CUDA function calls restrictions

• __device__ functions cannot have their address taken
• For functions executed on the device:
  – No recursion
  – No static variable declarations inside the function
  – No variable number of arguments

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
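A brief sketch of the kind of device function these rules apply to (names are illustrative); the helper is called from a kernel and avoids recursion, static locals, and variable argument lists:

__device__ float square(float x) { return x * x; }   // its address cannot be taken on the host

__global__ void applySquare(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}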

Calling a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

__global__ void KernelFunc(...);
dim3 DimGrid(100, 50);             // 5000 thread blocks
dim3 DimBlock(4, 8, 8);            // 256 threads per block
size_t SharedMemBytes = 64;        // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synch needed for blocking

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
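Since the launch returns immediately, a host thread that needs the results must synchronize explicitly; a minimal sketch (cudaThreadSynchronize() is the call from the CUDA toolkits of this era; later toolkits renamed it cudaDeviceSynchronize()):

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);   // asynchronous launch
cudaThreadSynchronize();                                    // blocks the host until the kernel finishes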


Resources on line

• http://www.ddj.com/hpc-high-performance-computing/207200659
• http://www.nvidia.com/object/cuda_home.html#
• http://www.nvidia.com/object/cuda_learn.html


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Demo

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


1. Initialize Device

• cudaSetDevice(device) selects the device associated with the host thread.
• cudaGetDeviceCount(&devicecount) gets the number of devices.
• cudaGetDeviceProperties(&deviceProp, device) retrieves the device's properties.

• Note: cudaSetDevice() must be called before any __global__ function call; otherwise device 0 is automatically selected.

Example: Device Initialization

int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
cudaDeviceProp deviceProp;

for (device = 0; device < deviceCount; device++) {
    …
    cudaGetDeviceProperties(&deviceProp, device);
    …
}


A Simple Running Example: Matrix Multiplication

• A straightforward matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  – Leave shared memory usage until later
  – Local, register usage
  – Thread ID usage
  – Memory data transfer API between host and device

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Programming Model: Square Matrix Multiplication Example

• P = M * N of size WIDTH x WIDTH
• Without tiling:
  – One thread handles one element of P
  – M and N are loaded WIDTH times from global memory

(Figure: matrices M, N, and P, each WIDTH x WIDTH)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Step 1: Matrix Data Transfers

// Allocate the device memory where we will copy M to
Matrix Md;
Md.width = WIDTH;
Md.height = WIDTH;
Md.pitch = WIDTH;
int size = WIDTH * WIDTH * sizeof(float);
cudaMalloc((void**)&Md.elements, size);

// Copy M from the host to the device
cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);

// Read M from the device to the host into P
cudaMemcpy(P.elements, Md.elements, size, cudaMemcpyDeviceToHost);
...
// Free device memory
cudaFree(Md.elements);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 2: Matrix Multiplication
A Simple Host Code in C

// Matrix multiplication on the (CPU) host in double precision
// for simplicity, we will assume that all dimensions are equal
void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
{
    for (int i = 0; i < M.height; ++i)
        for (int j = 0; j < N.width; ++j) {
            double sum = 0;
            for (int k = 0; k < M.width; ++k) {
                double a = M.elements[i * M.width + k];
                double b = N.elements[k * N.width + j];
                sum += a * b;
            }
            P.elements[i * N.width + j] = sum;
        }
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Multiply Using One Thread Block

• One block of threads computes matrix P
  – Each thread computes one element of P
• Each thread
  – Loads a row of matrix M
  – Loads a column of matrix N
  – Performs one multiply and one addition for each pair of M and N elements
  – Compute to off-chip memory access ratio close to 1:1 (not very high)
• Size of matrix limited by the number of threads allowed in a thread block

(Figure: Grid 1 contains Block 1; Thread (2, 2) computes one element of P from a row of M and a column of N)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 3: Matrix Multiplication Host-side Main Program Code

int main(void)
{
    // Allocate and initialize the matrices
    Matrix M = AllocateMatrix(WIDTH, WIDTH, 1);
    Matrix N = AllocateMatrix(WIDTH, WIDTH, 1);
    Matrix P = AllocateMatrix(WIDTH, WIDTH, 0);

    // M * N on the device
    MatrixMulOnDevice(M, N, P);

    // Free matrices
    FreeMatrix(M);
    FreeMatrix(N);
    FreeMatrix(P);
    return 0;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

See the demo code

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 3: Matrix Multiplication Host-side code

// Matrix multiplication on the device
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
    // Load M and N to the device
    Matrix Md = AllocateDeviceMatrix(M);
    CopyToDeviceMatrix(Md, M);
    Matrix Nd = AllocateDeviceMatrix(N);
    CopyToDeviceMatrix(Nd, N);

    // Allocate P on the device
    Matrix Pd = AllocateDeviceMatrix(P);
    CopyToDeviceMatrix(Pd, P); // Clear memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Step 3: Matrix Multiplication Host-side Code (cont.)

    // Setup the execution configuration
    dim3 dimBlock(WIDTH, WIDTH);
    dim3 dimGrid(1, 1);

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);

    // Read P from the device
    CopyFromDeviceMatrix(P, Pd);

    // Free device matrices
    FreeDeviceMatrix(Md);
    FreeDeviceMatrix(Nd);
    FreeDeviceMatrix(Pd);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 4: Matrix Multiplication Device-side Kernel Function

// Matrix multiplication kernel – thread specification
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
    // 2D Thread ID
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Step 4: Matrix Multiplication Device-Side Kernel Function (cont.)

    for (int k = 0; k < M.width; ++k) {
        float Melement = M.elements[ty * M.pitch + k];
        float Nelement = N.elements[k * N.pitch + tx];
        Pvalue += Melement * Nelement;
    }

    // Write the matrix to device memory;
    // each thread writes one element
    P.elements[ty * P.pitch + tx] = Pvalue;
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 5: Some Loose Ends

// Allocate a device matrix of same size as M.
Matrix AllocateDeviceMatrix(const Matrix M)
{
    Matrix Mdevice = M;
    int size = M.width * M.height * sizeof(float);
    cudaMalloc((void**)&Mdevice.elements, size);
    return Mdevice;
}

// Free a device matrix.
void FreeDeviceMatrix(Matrix M)
{
    cudaFree(M.elements);
}

void FreeMatrix(Matrix M)
{
    free(M.elements);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Step 5: Some Loose Ends (cont.)

// Copy a host matrix to a device matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
{
    int size = Mhost.width * Mhost.height * sizeof(float);
    cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice);
}

// Copy a device matrix to a host matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
{
    int size = Mdevice.width * Mdevice.height * sizeof(float);
    cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign


Step 6: Handling Arbitrary Sized Square Matrices

• Have each 2D thread block compute a (BLOCK_WIDTH)² sub-matrix (tile) of the result matrix
  – Each block has (BLOCK_WIDTH)² threads
• Generate a 2D grid of (WIDTH/BLOCK_WIDTH)² blocks

Note: you still need to put a loop around the kernel call for cases where WIDTH is greater than the max grid size!

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
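A hedged sketch (not from the original slides) of how the kernel indexing changes once a 2D grid of blocks is used; it reuses the Matrix struct from the earlier steps and adds a bounds check for widths that are not a multiple of BLOCK_WIDTH:

__global__ void MatrixMulKernel2D(Matrix M, Matrix N, Matrix P, int Width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global row of the P element
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global column of the P element
    if (row < Width && col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += M.elements[row * M.pitch + k] * N.elements[k * N.pitch + col];
        P.elements[row * P.pitch + col] = Pvalue;
    }
}

// Example launch:
//   dim3 dimBlock(BLOCK_WIDTH, BLOCK_WIDTH);
//   dim3 dimGrid((WIDTH + BLOCK_WIDTH - 1) / BLOCK_WIDTH,
//                (WIDTH + BLOCK_WIDTH - 1) / BLOCK_WIDTH);
//   MatrixMulKernel2D<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WIDTH);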
