GpuCV: A GPU-accelerated framework for image processing and Computer Vision

Y. Allusse, P. Horain Outline

Why accelerating Computer Vision and Image Processing? How can GPUs help? GpuCV description Results Future works Conclusion

page 2 6 mai 2010 Y. Allusse, P. Horain Motivation statement

Why accelerating Computer Vision and Image Processing? Processing large images & HD videos

Increasing data weight • Microscopy • Satellite images • Fine arts, printing • Video databases

 Images up to 100 GBytes !

page 4 6 mai 2010 Y. Allusse, P. Horain Real time computer vision

Interactive applications, security… • Biometry • Multimodal applications • 3D motion tracking P A B /

4

G

3D/2D registration E P M page 5 6 mai 2010 Y. Allusse, P. Horain Example application: Virtual conference s o m e d

m o c . D 3 g o l B y M / / : p t t h

page 6 Y. Allusse, P. Horain Example application: Virtual conference s o m e d

m o c . D 3 g o l B y M / / : p t t h

page 7 Y. Allusse, P. Horain How to accelerate image processing?

Available technologies for acceleration: • Extended instruction sets (ex.: Intel OpenCV / IPP) • Multiple CPU cores (ex.: IBM Cell) • Co-processor:

- FPGA (field-programmable gate array)

- GPU (graphical processing unit)

 GPUs available (almost) for free in consumer PCs

page 8 6 mai 2010 Y. Allusse, P. Horain Graphics Processing Units GPU: what?

Designed for advanced rendering • Multi-texturing effects. • Realistic lights and shadows effects. • Post processing visual effects. Originally in consumer PCs for gaming

Image rendering device :  Highly parallel processor  High bandwidth memory

page 10 6 mai 2010 Y. Allusse, P. Horain GPU history: The power race t a . z a r g u t . g c i . n o i s i v 4 u p g / / : p t t h

, n o i s i V 4 U P G

: e c r u o S

page 11 6 mai 2010 Y. Allusse, P. Horain GPU vs CPU: Specifications ]

Effective Nbr. of u Avg Price in € Millions of j a

Model processing Power in Watts processing r (12/01/2008)‏ transistors a d n

power in Gflops units i v o G

a g a N

:

e c r

NVIDIA u 300 336 110 754 112 o S

GeForce 8800 GT [

ATI 300 475 215 700 320 HD 2900 XT

Intel 169 17 65 291 2 Core 2 Duo E6700

AMD 168 19 125 227,4 2 64 x2 6000+

page 12 6 mai 2010 Y. Allusse, P. Horain GPU vs. CPU

Processing power ratio on price and power consumption

gFlops per €

4

Different purpose

2 Different architecture

Different efficiency 0

gFlops per Watts gFlops

NVIDIA GeForce 8800 GT ATI Radeon HD 2900 XT Intel Core 2 Duo E6700 AMD 64 x2 6000+

page 13 6 mai 2010 Y. Allusse, P. Horain GPU history: towards programming flexibility

Until 2000: fixed architecture (not programmable) 2000-01: Pixel and Vertex GeForce 3 and ATI R200. 2005: Geometry shaders GeForce 6800 and ATI Radeon X800. 2006: ATI CTM™ (for "Close To Metal"). ATI Radeon based GPUs. 2007: NVIDIA CUDA, ATI Stream SDK NVIDIA GeForce 8, AMD ATI FireStream. 2009 (Q3): OpenCL drivers & SDK available Apple Inc., AMD, IBM, Intel, Nvidia… page 14 6 mai 2010 Y. Allusse, P. Horain GPU pipeline

Shaders allow application to run their own code in the graphic pipeline

page 15 6 mai 2010 Y. Allusse, P. Horain CUDA thread batching

GPU processing library from NVIDIA. Subdivide processing tasks in thousands of threads using blocks. C Style programming Full memory access

page 16 6 mai 2010 Y. Allusse, P. Horain Cuda memory model

Threads share memory => Synchronization mechanism Cache memory really fast (Shared Memory, Registers, Local Memory)‏ Texture and constant Memory are fast but READ ONLY. Global Memory is slower but WRITABLE.

page 17 6 mai 2010 Y. Allusse, P. Horain GPU vs. CPU(2)‏

Benefits of GPU: • Processing power:

- increasing faster than CPU,

- cheaper than CPU,

- highly parallel, • Easily upgradable.

Benefits of CPU: • Flexible and general processing unit, • Stable programming languages.

page 18 6 mai 2010 Y. Allusse, P. Horain GPU programming challenges

Algorithms • Highly parallel • Coding limitations Dedicated APIs • OpenGL, shading languages, • Brook, CUDA, OpenCL… Development tools • Rapidly evolving APIs • Heterogeneous and scattered documentation  GPU complexity to be hidden for wide acceptance

page 19 6 mai 2010 Y. Allusse, P. Horain Open GPU libraries for image processing

• OpenVIDIA: GPU Computer Vision (since 2002) http://openvidia.sourceforge.net • NVPP: NVIDIA Performance Primitives http://www.nvidia.com/nvpp • GPU4Vision: research in GPU-based vision algorithms http://gpu4vision.icg.tugraz.at • …and many more http://www.nvidia.com/cuda...

 Collections of image processing primitives and algorithms  Basic framework

page 20 6 mai 2010 Y. Allusse, P. Horain GpuCV

Overview GpuCV in a few words:

Aim: Accelerate computer vision with GPUs Initiated in 2005‏ ~ 2.5 person∙years effort Publications in major international conferences: • ACM Multimedia 08 (Open Source competition) • ISCV08 (Int. Symposium on Visual Computing) • IEEE ICME06 (Int. Conf. on Multimedia & Expo) Open source: picoforge.int-evry.fr/projects/gpucv ~ 100 visitors worldwide daily

page 22 6 mai 2010 Y. Allusse, P. Horain GpuCV: Description

 A collection of GPU-accelerated operators for image processing and computer vision Initiated in 2005  A framework friendly to computer vision developers • Activation of low level GPU codelets: OpenGL & GLSL, NVIDIA CUDA (OpenCL on the way) • Data transfers between CPU and GPU memories • Compatible with OpenCV  applications migration • Self-adaptive to the hardware platform  spectrum of consumer PCs

page 23 05/05/2010 Y. Allusse, P. Horain GpuCV: main features

 Compatible on multiple OS such as MS Windows XP and LINUX (Mac OSX on the way).  Help developers to manage:  Hardware compatibilities  Data transfers between CPU and GPU memories  Activation of low level GPU codelets: OpenGL & GLSL, NVIDIA CUDA, OpenCL (on the way)

page 24 6 mai 2010 Y. Allusse, P. Horain GpuCV

Operators plugins & Integration with OpenCV

direction ou services GpuCV: Plugins

 Plugins are a set of operators sharing the same programming scheme or technologies:

 CPU: OpenMP / SSE / MMX …

 GPU: CUDA / OpenCL / OpenGL …

 They are in correspondance with their OpenCV counter-part:

 CXCORE / CV / HIGHGUI / CVAUX

page 26 06/05/2010 Y. Allusse, P. Horain GpuCV: Plugins

GpuCV = GPGPU framework + GPU-accelerated Computer Vision plugins OpenCV GPU-accelerated application GpuCV switch mechanism

3rd parts CUDA OpenGL plugins plugins plugins OpenCV library GpuCV framework

page 27 06/05/2010 Y. Allusse, P. Horain GpuCV: Integration with OpenCV

 A strict naming convention is used to distinguish operators:  CvAdd(): original OpenCV operator

 CvgAdd(): OpenGL accelerated operator

 CvgCudaAdd(): CUDA accelerated operator

 CvgswAdd(): switch operator  Operators definition is stricly based on OpenCV definitions  Function wrapper definitions in headers: #define cvAdd cvgswAdd

page 28 06/05/2010 Y. Allusse, P. Horain GpuCV: Integration with OpenCV

 To get an existing OpenCV application on GPU: – Replace OpenCV headers by GpuCV headers – Link with GpuCV switch libraries – Add an init function 'cvgswInit(,)'

 A few coding rules to respect

– NO direct access to buffer 'IplImage::imagedata' – Re-use images to avoid new allocations

page 29 06/05/2010 Y. Allusse, P. Horain GpuCV: Example of OpenCV application

#include … //other includes //size 3 cvMorphologyEx( imgSrc, imgTmp2, imgTmp, #include elemStrut3,CV_MOP_OPEN, 1); #include cvMorphologyEx( imgTmp2, imgDst, imgTmp, #include elemStrut3,CV_MOP_CLOSE, 1); int main(int argc, char **argv) //size 5 { cvMorphologyEx( imgDst, imgTmp2, imgTmp, //load source image elemStrut5,CV_MOP_OPEN, 1); IplImage * imgSrc = cvLoadImage(MY_IMAGE_PATH); cvMorphologyEx( imgTmp2, imgDst, imgTmp, //create structuring elements of size 3 and 5 elemStrut5,CV_MOP_CLOSE, 1); IplConvKernel* elemStrut3 = cvCreateStructuringElementEx(3,3, 1, 1, CV_SHAPE_RECT, NULL); //show results IplConvKernel* elemStrut5 = cvCreateStructuringElementEx(5,5, 1, cvNamedWindow(« ResultImage », 1); 1, CV_SHAPE_RECT, NULL); cvShowImage(« ResultImage », imgDst); unsigned int depth = imgSrc->depth; cvWaitKey(0); unsigned int depth = imgSrc->nChannels; cvDestroyWindow(« ResultImage »); CvSize size = cvGetSize(imgSrc); //release data //create temporary images to be processed cvReleaseStructuringElement(&elemStrut3); IplImage * imgDst = cvReleaseStructuringElement(&elemStrut5); cvCreateImage(size, depth, channels); cvReleaseImage(&imgSrc); IplImage * imgTmp = cvReleaseImage(&imgDst); cvCreateImage(size, depth, channels); cvReleaseImage(&imgTemp); IplImage * imgTmp2 = cvReleaseImage(&imgTemp2); cvCreateImage(size, depth, channels); }

page 30 May 6, 2010 Y. Allusse, P. Horain GpuCV: Example of OpenCV application

#include … //other includes //size 3 cvMorphologyEx( imgSrc, imgTmp2, imgTmp, elemStrut3,CV_MOP_OPEN, 1); #include cvMorphologyEx( imgTmp2, imgDst, imgTmp, #include elemStrut3,CV_MOP_CLOSE, 1); #include //size 5 int main(int argc, char **argv) cvMorphologyEx( imgDst, imgTmp2, imgTmp, { elemStrut5,CV_MOP_OPEN, 1); //set path to GpuCV data files cvMorphologyEx( imgTmp2, imgDst, imgTmp, cvgSetShaderPath(MY_GPUCV_DATA_PATH); elemStrut5,CV_MOP_CLOSE, 1); //init GpuCV switch cvgswInit(true, false); //Create OpenGL context, simple thread //show results cvNamedWindow(« ResultImage », 1); //load source image cvShowImage(« ResultImage », imgDst); IplImage * imgSrc = cvLoadImage(MY_IMAGE_PATH); //create structuring elements of size 3 and 5 cvWaitKey(0); IplConvKernel* elemStrut3 = cvCreateStructuringElementEx(3,3, 1, 1, cvDestroyWindow(« ResultImage »); CV_SHAPE_RECT, NULL); IplConvKernel* elemStrut5 = cvCreateStructuringElementEx(5,5, 1, 1, CV_SHAPE_RECT, NULL); //release data unsigned int depth = imgSrc->depth; cvReleaseStructuringElement(&elemStrut3); unsigned int depth = imgSrc->nChannels; cvReleaseStructuringElement(&elemStrut5); CvSize size = cvGetSize(imgSrc); cvReleaseImage(&imgSrc); //create temporary images to be processed cvReleaseImage(&imgDst); IplImage * imgDst = cvReleaseImage(&imgTemp); cvCreateImage(size, depth, channels); cvReleaseImage(&imgTemp2); IplImage * imgTmp = cvCreateImage(size, depth, channels); } IplImage * imgTmp2 = cvCreateImage(size, depth, channels);

page 31 May 6, 2010 Y. Allusse, P. Horain GpuCV: Example of OpenCV application

#include … //other includes //size 3 cvMorphologyEx( imgSrc, imgTmp2, imgTmp, elemStrut3,CV_MOP_OPEN,iterac); #include cvMorphologyEx( imgTmp2, imgDst, imgTmp, #include elemStrut3,CV_MOP_CLOSE,iterac); #include //size 5 int main(int argc, char **argv) cvMorphologyEx( imgDst, imgTmp2, imgTmp, { elemStrut5,CV_MOP_OPEN,iterac); //set path to GpuCV data files cvMorphologyEx( imgTmp2, imgDst, imgTmp, cvgSetShaderPath(MY_GPUCV_DATA_PATH); elemStrut5,CV_MOP_CLOSE,iterac); //init GpuCV switch cvgswInit(true, false); //Create OpenGL context, simple thread //show results cvNamedWindow(« ResultImage », 1); //load source image cvShowImage(« ResultImage », imgDst); IplImage * imgSrc = cvLoadImage(MY_IMAGE_PATH); //create structuring elements of size 3 and 5 cvWaitKey(0); IplConvKernel* elemStrut3 = cvCreateStructuringElementEx(3,3, 1, 1, cvDestroyWindow(« ResultImage »); CV_SHAPE_RECT, NULL); IplConvKernel* elemStrut5Only = cvCreateStructuringElementEx(5,5,a few line changes 1, 1, to get hybrid applications CV_SHAPE_RECT, NULL); //release data unsigned int depth = imgSrc->depth; cvReleaseStructuringElement(&elemStrut3); unsigned int depth = imgSrc->nChannels; cvReleaseStructuringElement(&elemStrut5); CvSize size = cvGetSize(imgSrc); cvReleaseImage(&imgSrc); //create temporary images to be processed cvReleaseImage(&imgDst); IplImage * imgDst = cvReleaseImage(&imgTemp); cvCreateImage(size, depth, channels); cvReleaseImage(&imgTemp2); IplImage * imgTmp = cvCreateImage(size, depth, channels); } IplImage * imgTmp2 = cvCreateImage(size, depth, channels);

page 32 May 6, 2010 Y. Allusse, P. Horain GpuCV

Framework: Data management GpuCV: Memory locations

OpenGL context PCI-Express bus Central memory Video memory (RAM)‏ (VRAM)‏

CUDA context

page 34 6 mai 2010 Y. Allusse, P. Horain GpuCV: Data management

 CPU or GPU processing requires data in central memory and/or in graphics memory: • Multiple data instances in different locations • 'Smart transfer' allows to select the best path and reduce transfer costs • Different behaviors for input and output images

page 35 6 mai 2010 Y. Allusse, P. Horain GpuCV: Data descriptors

DataDescriptor_* classes to describes lications Holds image properties: • Data size / format (number of channels, element type). • Pointer to allocated data memory. • Flag raised if data present.

Holds methods: • To copy/convert properties with other data descriptors. • To copy data to other data descriptors.

page 36 6 mai 2010 Y. Allusse, P. Horain GpuCV: Data container

Container that describes and stores data

 GpuCV supports transparent data synchronization page 37 6 mai 2010 Y. Allusse, P. Horain GpuCV: selecting the memory type

Choosing a data location is easy: cvgSetLocation(OpenCV_Image, DataTransferFlag);

With: • DestinationType: destination data descriptor class. • OpenCV_Image: pointer to OpenCV image/matrix. • DataTransferFlag: specify if we transfer data or only allocate memory.

page 38 6 mai 2010 Y. Allusse, P. Horain GpuCV

Operators performances

direction ou services GpuCV: plugins performances

 Several steps are shown: 1. Optional pre-transfer to processor memory 2. Processor activation 3. Processing 4. Optional post-transfer

 Operators performances are recorded

 Here are a few examples:

page 40 06/05/2010 Y. Allusse, P. Horain GpuCV: the transfer bottleneck

Addition between 2 images

40  Cliquez pour modifier les styles du texte du Loading Time 35 masque 30 Deuxième niveau 25 Read back time s Troisième niveau m n i 20 Quatrième niveau e m i

T 15 Cinquième niveau Processing time on GPU 10

5 Processing time on CPU 0 128² 256² 512² 1024² 2048² Image size in pixels

 GPU much slower with transfer, much faster without transfer!!

page 41 06/05/2010 Y. Allusse, P. Horain GpuCV: fast for compute intensive operators

Image Morphological closing (Erode + Dilate)

180

160 Loading Time

140

120 Read back time s m 100 n i e 80 m i

T Morpho closing 60 on GPU 40 Morpho closing 20 on CPU 0 64² 128² 256² 512² 1024² 2048² Image size in pixels

 GPU can be faster even with transfer!

page 42 06/05/2010 Y. Allusse, P. Horain GpuCV: the activation issue

Addition between 2 small images  1,2 Cliquez pour modifier les styles du texte du

masque Loading Time 1 Deuxième niveau 0,8 Troisième niveau s Read back time m Quatrième niveau n i 0,6 e

m Cinquième niveau i T 0,4 Processing time on GPU

0,2 Processing time on CPU

0 0 32² 64² 128² 256² Image size in pixels GPU implies a constant activation delay  not efficient on small images! page 43 06/05/2010 Y. Allusse, P. Horain GpuCV: Performance prediction issue

Algorithm Plateform

Implementation CPU type

GPU type + Parameters drivers Operator performances Image OS size/format

Images Motherboard current locations & memory

page 44 06/05/2010 Y. Allusse, P. Horain GpuCV: Performance prediction issue

 To many parameters

 Developers can not predict all the cases

 Application performances are not predictable

page 45 06/05/2010 Y. Allusse, P. Horain GpuCV

Dynamic switching implementation

direction ou services GpuCV: Dynamic implementation switching

 For each operator Choose between multiple implementations

 Approach, based on:

 General parameters  Operator parameters  Images locations On the fly benchmarking Switching to the best implementation Switching transparent to users & developers!

page 47 06/05/2010 Y. Allusse, P. Horain GpuCV: Dynamic implementation switching

 Additional features:  Benchmarking results can be saved and reloaded (XML format)

 Switching options can be loaded with profiles

 Auto/forced switching modes (for profiling, bugs...)

 Calculate the amount of time saved by using GpuCV

page 48 06/05/2010 Y. Allusse, P. Horain GpuCV: Auto-switching operators

CXCORE library (Operation on array): • Initialization: cvCreateImage, cvCreateMat, cvReleaseImage, cvReleaseMat, cvCloneImage, cvCloneMat, cvGetRawData, cvSetData. • Copying and Filling: cvCopy, cvSetZero • Transforms and Permutations: cvSplit, cvMerge • Arithmetic, Logic and Comparison: cvgAdd, cvAddS, cvConvertScale, cvDiv, cvMax, cvMaxS, cvMin, cvMinS, cvMul, cvSub, cvSubRS, cvSubS • Statistics: cvAvg, cvSum, cvMinMaxLoc. • Linear Algebra: cvScaleAdd, cvGEMM • Math Functions: cvPow • And more... page 49 06/05/2010 Y. Allusse, P. Horain GpuCV: Auto-switching operators

CV library • Image Processing:

- Sampling, Interpolation and Geometrical Transforms: cvResize

- Morphological Operations: cvDilate, cvErode, cvMorphologyEx, cvSobel, cvLaplace, cvDeriche,...

- Filters and Color Conversion: cvCvtColor, cvThreshold

- Histograms: cvQueryHistValue_*D

- And more...

page 50 6 mai 2010 Y. Allusse, P. Horain GpuCV achievements

Benchmarks Benchmarks

Ex. processing 2048 x 2048 images with NVIDIA GeForce GTX 280 & Intel Core2 Duo 2.2 GHz (online benchmark) ######�#a#g#e#_#s#t#r#e#a#m###

 https://picoforge.int-evry.fr/cgi-bin/twiki/view/Gpucv/Web/WebHomeBenchmarks page 52 6 mai 2010 Y. Allusse, P. Horain GpuCV: benchmarks

GpuCV 1.0 benchmarks

page 53 06/05/2010 Y. Allusse, P. Horain Demo

direction ou services Conclusion Summary

 Benefits of GPUs: • High processing power • Lower power/price ratio than CPU  Penalties: • Requires additional data transfer • Activation delay  not efficient for small images • GPU operators implementations depend on hardware compatibilities.

 GpuCV: A ready to use GPU-accelerated library for Computer Vision page 56 6 mai 2010 Y. Allusse, P. Horain The GpuCV framework

 GPU acceleration for image processing and Computer Vision operators

 Multi-platform library (MS Windows & Linux)

 Open source (CeCill-B license)

 Compatible with the popular OpenCV library

 Hides the GPU programming complexity  data synchronization  codelets (kernels) management (GLSL,CUDA)

 Adaptive to the hardware platform: integrated benchmarking and implementation switching page 57 6 mai 2010 Y. Allusse, P. Horain GpuCV available as open source

Home: http://picoforge.int-evry.fr/projects/gpucv Visitors from around the world:

page 58 6 mai 2010 Y. Allusse, P. Horain GpuCV: Spin-off projects

GpuCV-Valo (Institut Telecom Innovation program, 2010) Aims at establishing: • a stable version of the library, with OpenCL support • a user community • commercial opportunities for GpuCV valorisation

MediaGPU (ANR-09-CORD-025, 2010-2012) • Massive multimedia GPU-based processing • Partners: Institut Telecom, INRIA Bordeaux, HPC-Project, ATEME, Play All  http://picoforge.int-evry.fr/projects/mediagpu

page 59 May 6, 2010 Y. Allusse, P. Horain Thank you!

 http://picoforge.int-evry.fr/projects/gpucv  http://www-public.it-sudparis.eu/~horain/OffreCDD.html

Questions? GpuCV: GPUs presentation

 GPU models: NVIDIA GTX 2xx/4xx ATI/AMD Radeon 4xxx/5xxx  Programming interfaces: OpenGL + Shaders DirectX / DirectCompute NVIDIA CUDA ATI Stream & CTM (Close to metal) OpenCL

page 61 May 6, 2010 Y. Allusse, P. Horain GpuCV: GPU programming

 Coalescent memory access  Task subdivision and parallelization  Memory transfer bottlenecks  Synchronization between contexts: OpenGL ↔ CUDA ↔ OpenCL ↔ ...

=> After 2 days – every body knows how to write a Kernel! => What next?

page 62 May 6, 2010 Y. Allusse, P. Horain GpuCV: Creating hybride applications

 What does developer / user need? Update existing applications: Without breaking the code Without irreversible changes Easy to maintain Toolbox with ready-to-use operators Possibility to select implementation: On the fly. The best one in every cases

page 63 May 6, 2010 Y. Allusse, P. Horain