Background on GPGPU Programming
• Development of powerful 3D graphics chips
  – Consumer demand was accelerated by immersive first-person games such as Doom and Quake
  – The Microsoft DirectX 8.0 standard required hardware to support both programmable vertex shading and programmable pixel shading
  – This meant developers had some control over the computations performed on a GPU
  – The GeForce 3 series, released in 2001, were examples of such chips

Early Attempts at GPGPU Programming
• Pixel shaders calculated a color value for every pixel on screen
  – Used the (x,y) position, input colors, textures, etc.
  – Input colors and textures were controlled by the programmer
  – Any data value could be used as input and, after some computation, the final data value could be used for any purpose, not just as the final pixel color
• Restrictions that would prevent widespread use
  – Only a handful of input color and texture values could be used; there were restrictions on where results could be read from and written to in memory
  – There was no reasonable way to debug code
  – The programmer would need to learn OpenGL or DirectX

NVIDIA CUDA Architecture
• Introduced in November 2006
  – First used on the GeForce 8800 GTX
  – Introduced many features to facilitate general-purpose (GP) programming
• Some Features
  – Introduced a unified pipeline that allowed every ALU to be used for GP programming
  – ALUs complied with IEEE requirements for single-precision floating-point arithmetic
  – Allowed arbitrary read and write access to memory, as well as access to a software-managed cache known as shared memory
  – Released a free compiler for CUDA C, a language based on the widely used C language

Selected Applications - 1
• Medical Imaging
  – Convert the data from a 15-minute ultrasound scan (about 35 GB) into a 3D image a doctor can manipulate
  – Takes about 20 minutes using two Tesla C1060 cards
• Computational Fluid Dynamics
  – Simulating the flow of air or fluid around rotors or blades requires a long time on a massive supercomputer
  – Using CUDA C this was reduced to near-real-time feedback
  – This new capability has greatly affected research techniques

Selected Applications - 2
• Environmental Science
  – Surfactants allow products to clean well but cause environmental damage; testing new products in the lab was slow and expensive
  – Researchers working with Procter & Gamble ran the Highly Optimized Object-oriented Many-particle Dynamics (HOOMD) simulator on two Tesla cards, with performance similar to an IBM BlueGene computer with 1024 CPUs
• The World's Fastest Supercomputer (as of 2010)
  – Tianhe-1A in China performs at 2.566 petaFLOPS
  – It has 14,336 Xeon X5670 front-end processors and 7,168 Tesla M2050 cards with 448 cores each

Development Environment
• Some prerequisites
  – A CUDA-enabled graphics processor
  – An NVIDIA device driver (comes with the card)
  – A CUDA development toolkit
  – A standard C compiler

CUDA-enabled GPUs

• In other words, all recent NVIDIA graphics and Tesla cards

CUDA Development Toolkit
• Go to the CUDA download page

Standard C Compiler
• Windows
  – Microsoft Visual Studio C compiler
  – If the full Visual Studio is not available, use the free Express edition
• Linux
  – use gcc, it is available on
• Mac OS X
  – Requires 10.5.7 or higher; the Apple Developer Connection has downloads for gcc and Xcode