CUDA and OpenCL
Development Interfaces for Multicore Programming

Dr. Jun Ni, Ph.D.
Associate Professor of Radiology, Biomedical Engineering, Mechanical Engineering, and Computer Science
The University of Iowa, Iowa City, Iowa, USA
Dec. 23, Harbin Engineering University

Outline

- CUDA, the NVIDIA-based programming environment for GPGPU
- OpenCL, the open-standard programming environment for multicore processor systems such as Cell/BE and GPUs

Introduction to CUDA

- CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA.
- CUDA is the computing engine in NVIDIA graphics processing units (GPUs), accessible to software developers through industry-standard programming languages.
- Programmers use "C for CUDA" (C with NVIDIA extensions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU (a minimal sketch appears at the end of this section).
- The CUDA architecture supports a range of computational interfaces: OpenCL, DirectCompute, and third-party wrappers for Python, Fortran, Java, and Matlab.
- The latest drivers all contain the necessary CUDA components. CUDA works with all NVIDIA GPUs from the G8X series onwards: GeForce, Quadro, and Tesla.
- CUDA gives developers access to the native instruction set and memory of the parallel computational elements in CUDA GPUs.
- Using CUDA, the latest NVIDIA GPUs effectively become open architectures like CPUs.
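To make the "C for CUDA" extensions concrete, here is a minimal sketch of a kernel and its launch. The function names and launch sizes are illustrative only, not taken from the slides; the d_-prefixed pointers are assumed to already reside in GPU memory (e.g. allocated with cudaMalloc).

    #include <cuda_runtime.h>

    // A kernel is ordinary C marked with the __global__ qualifier; each
    // GPU thread computes one element, located via built-in index variables.
    __global__ void add(int n, const float* a, const float* b, float* c)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // guard: the grid may overshoot n
            c[i] = a[i] + b[i];
    }

    // Host code launches the kernel with the <<<grid, block>>> syntax.
    void add_on_gpu(int n, const float* d_a, const float* d_b, float* d_c)
    {
        int block = 256;                      // threads per block
        int grid  = (n + block - 1) / block;  // enough blocks to cover n
        add<<<grid, block>>>(n, d_a, d_b, d_c);
    }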
- Unlike CPUs, GPUs have a parallel "many-core" architecture, each core capable of running thousands of threads simultaneously; if an application is suited to this kind of architecture, the GPU can offer large performance benefits.
- Multithreaded computing becomes more efficient within a multicore architecture.
- In the computer gaming industry, graphics cards are used not only for graphics rendering but also for game physics calculations (physical effects such as debris, smoke, fire, and fluids), as in the PhysX and Bullet engines.
- CUDA has also been used to accelerate non-graphical applications by an order of magnitude or more, in computational biology, cryptography, and other fields. Example: the BOINC distributed computing client.
- CUDA provides both a low-level API and a higher-level API.
- The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was added in version 2.0, which supersedes the beta released on 14 February 2008.

Advantages

CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:

- Scattered reads: code can read from arbitrary addresses in memory.
- Shared memory: CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups (see the sketch after this list).
- Faster downloads and readbacks to and from the GPU.
- Full support for integer and bitwise operations, including integer texture lookups.
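As a sketch of the shared-memory advantage: each block stages data from slow global memory into the fast on-chip region, synchronizes, and then works out of this user-managed cache. The kernel below is hypothetical, assuming a launch with 256 threads per block.

    // Each block of 256 threads sums its 256 inputs using shared memory.
    __global__ void block_sum(const float* in, float* out)
    {
        __shared__ float cache[256];   // lives in the fast on-chip region
        int tid = threadIdx.x;

        // Stage one element per thread from global memory.
        cache[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();               // wait until the whole tile is loaded

        // Tree reduction performed entirely in shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = cache[0];   // one partial sum per block
    }

Launched as block_sum<<<numBlocks, 256>>>(d_in, d_out), each input element is read from global memory exactly once; all intermediate traffic stays on chip.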
Limitations and Challenges

- CUDA uses a recursion-free, function-pointer-free subset of the C language, plus some simple extensions. However, a single process must run spread across multiple disjoint memory spaces, unlike other C-language runtime environments.
- Texture rendering is not supported.
- For double precision (only supported in newer GPUs such as the GTX 260) there are some deviations from the IEEE 754 standard: round-to-nearest-even is the only supported rounding mode for reciprocal, division, and square root.
- In single precision, denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest-even), and those are specified on a per-instruction basis rather than in a control word; the precision of division and square root is slightly lower than single precision.
- The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
- Threads should run in groups of at least 32 for best performance, with the total number of threads numbering in the thousands.
- Branches in the program code do not impact performance significantly, provided that each group of 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space-partitioning data structure during ray tracing), as sketched below.
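To illustrate the branching rule, here is a hypothetical kernel (not from the slides) contrasting a warp-uniform branch with a divergent one:

    __global__ void divergence_demo(float* data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Uniform branch: every thread in a 32-thread warp sees the same
        // blockIdx.x, so the whole warp takes one side -- little cost.
        if (blockIdx.x % 2 == 0)
            data[i] += 1.0f;

        // Divergent branch: threads within the same warp disagree, so the
        // hardware serializes the two paths, running them back to back
        // with part of the warp masked off each time.
        if (threadIdx.x % 2 == 0)
            data[i] *= 2.0f;
        else
            data[i] *= 0.5f;
    }

An inherently divergent task pays roughly the cost of every path its warps touch, which is why irregular traversals such as ray-tracing hierarchies map poorly onto this SIMD model.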
- Unlike OpenCL, CUDA-enabled GPUs are only available from NVIDIA (GeForce 8 series and above, Quadro, and Tesla); AMD/ATI only supports GPGPU on its high-end chips.

Example

C++ code that loads a texture from an image into an array on the GPU:

    cudaArray* cu_array;
    texture<float, 2> tex;

    // Allocate a CUDA array to hold the image.
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);

    // Copy image data into the CUDA array.
    cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float),
                      cudaMemcpyHostToDevice);

    // Bind the array to the texture reference.
    cudaBindTextureToArray(tex, cu_array);

    // Run the kernel on a grid of 16x16 thread blocks.
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
    kernel<<< gridDim, blockDim, 0 >>>(d_odata, width, height);
    cudaUnbindTexture(tex);

    __global__ void kernel(float* odata, int width, int height)
    {
        unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
        float c = tex2D(tex, x, y);   // read through the texture unit
        odata[y*width + x] = c;       // write the result to global memory
    }

Example in Python: the element-wise product of two arrays on the GPU (using PyCUDA):

    import numpy
    import pycuda.driver as drv
    import pycuda.autoinit
    from pycuda.compiler import SourceModule

    # Compile a small kernel at run time.
    mod = SourceModule("""
    __global__ void multiply_them(float *dest, float *a, float *b)
    {
      const int i = threadIdx.x;
      dest[i] = a[i] * b[i];
    }
    """)
    multiply_them = mod.get_function("multiply_them")

    a = numpy.random.randn(400).astype(numpy.float32)
    b = numpy.random.randn(400).astype(numpy.float32)
    dest = numpy.zeros_like(a)

    # drv.In/drv.Out handle the host-to-device and device-to-host copies.
    multiply_them(drv.Out(dest), drv.In(a), drv.In(b), block=(400, 1, 1))

    print(dest - a*b)   # should print (near-)zeros

Example in Python: matrix multiplication on the GPU (using pycublas):

    import numpy
    from pycublas import CUBLASMatrix

    A = CUBLASMatrix(numpy.mat([[1, 2, 3], [4, 5, 6]], numpy.float32))
    B = CUBLASMatrix(numpy.mat([[2, 3], [4, 5], [6, 7]], numpy.float32))
    C = A * B            # the product is computed by CUBLAS on the GPU
    print(C.np_mat())

Language Bindings

- Third-party wrappers expose CUDA from other languages, among them Python (PyCUDA, shown above), Fortran, Java, and Matlab.