VisionCPP: A SYCL-based Computer Vision Framework for Heterogeneous and Embedded Architecture
Mehdi Goli Motivation
• Computer vision • Different areas • Medical imaging, film industry, industrial manufacturing, weather forecasting, etc. • Embedded systems • Automotive systems • Surveillance cameras • Challenge • Huge computational and communication demands • The stringent size, power and memory resource constraints • High efficiency and accuracy • Operations • Large size of data, Sequence of operations, Minimum operation time, Real-time operation • Potential suitable parallelism • Data & pipeline parallelism
codeplay.com 3 Existing Frameworks
• OpenCV • Run-time optimisation • Adding custom function is hard • Eg. Channel level optimisation on GPU • Embedded systems • Not a trivial task • OpenVX • Graph-based model • Limited number of built-in function • AMD has announced its implementation • No standard way of adding custom function • Every event has different way of adding custom function
codeplay.com 4 Requirements
• Compile time optimisation • Specially embedded systems • Easy to add custom function • Unified front-end for different backend • Cross-platform portability • Performance Portability
codeplay.com 5 SYCL
• Khronos group trademark • Royalty-free • Open standard • Aim • Cross-platform abstraction layer • Portability and efficiency • OpenCL-enabled devices • “Single-source” style • Offline compilation Model • Implementation • ComputeCPP (Codeplay) • TriSYCL (Open-source)
codeplay.com 6 VisionCPP
• High-level framework • Ease of use • Applications • Custom operations • Performance portability • Separation of concern • No modification in application computation • OpenCL-enabled devices • CPU
codeplay.com 7 VisionCPP Architecture
• VisionCPP Frontend Architecture
• Frontend Tree-based DSEL • Tree-based model • Backend Backend Architecture • Compile-time optimisation • Offline C++ compilation SYCL C++/OpenMP • Predicable memory size
• Target Physical Hardware • Desktop • Embedded systems OpenCL-enabled device CPU
codeplay.com 8 Tree Expression
lRGB 2 sRGB • Colour Conversion Application: lHSV out 2 • Standard RGB -> Linear RGB lRGB • Linear RGB -> Linear HSV lHSV 2 • Desaturation of S Channel Scale
• Linear HSV -> Linear RGB lRGB 2 • Linear RGB -> Standard RGB lHSV Coef
sRGB 2 lRGB
in
codeplay.com 9 Tree Expression SYCL C++ lRGB 0: #include
14: cv::imshow (“Display Image” , output); a in 14: cv::imshow (“Display Image” , output); 15: return 0; 15: return 0; 16: } 16: } codeplay.com 10 VisionCPP frontend DSEL
• Tree-based Structure Operation • Operation node Node • Leaf node
Leaf Node
codeplay.com 11 VisionCPP frontend DSEL
• Tree-based Structure Operation Node
Leaf Node
codeplay.com 12 VisionCPP frontend DSEL
• Tree-based Structure Operation Node
codeplay.com 13 Tree Expression(Functor)
lRGB 2 sRGB
0: struct lHSV2Scale { lHSV 2 lRGB A 1: lHSV operator()( lHSV input, float coef) {
lHSV 2 2: input.s() *= coef; Scale 3: return input; lRGB 2 lHSV Coef 4: };
sRGB 5: }; 2 lRGB
B
codeplay.com 14 Execution Policy: Fuse
lRGB 2 sRGB • One Kernel lHSV 2 lRGB • Apply the Functor operations A • In serial order
lHSV • Per pixel 2 Scale
lRGB 2 lHSV Coef
sRGB 2 lRGB
B
codeplay.com 15 Execution Policy: No Fuse
lRGB 2 sRGB TMP OUT
lHSV • Multiple Kernels 2 lRGB • One pre non-terminal Node A TMP • Device only temporary output lHSV OUT TMP 2 • on device global memory OUT Scale
lRGB 2 lHSV Coef
TMP OUT
sRGB 2 lRGB
B
codeplay.com 16 Execution Policy: Custom Fuse
lRGB 2 sRGB TMP OUT
lHSV • Arbitrary Fusion 2 lRGB • Any sub-tree A
lHSV 2 Scale
lRGB 2 lHSV Coef
TMP OUT
sRGB 2 lRGB
B
codeplay.com 17 Backend structure(For Fuse Policy) SYCL C++ 1: template
// sycl accessor for accessing data on device // output pinter for accessing data on host 2: auto outPtr = expr.out-> template get_accessor
// sycl parallel for for parallelisng execution across the range // rebuilding the tuple of input pointer on host 4: cgh.parallel_for
// rebuilding accessor tuple on the device // OpenMP directive for parallelising for loop 5: #pragma omp parallel for 5: auto tuple = make_tuple (acc) ; Parallel C++/OpenMP for 6: for(size_t i=0; i< rng.rows; i++) 7: for(size_t j=0; j< rng.cols; j++) // calling the eval function for each pixel // calling the eval function for each pixel 6: outPtr[itemID] = expr.eval ( itemID, tuple ); 8: outPtr[indx] = expr.eval (index (i , j), tuple ); 9: }; 7: }); 8: }
codeplay.com 18 Colour Conversion VisionCPPFuse VisionCPPCustomFuse 250 • Framework VisionCPPNoFuse
• VisionCPP 204.77 )
• Fuse 200 ms • Custom Fuse • NoFuse 150 • Platform 109.22
• Intel Time( Execution 100 Core i7-4790K 57.291 CPU 4.00GHz 49.254 50 26.297 3.579 7.811 13.228 0.968 2.096 3.561 13.51
0 512 1024 2048 4096 Image Size codeplay.com 19 Colour Conversion VisionCPPFuse OpenCV 70 • Framework 57.291 56.402
• OpenCV 60 )
• VisionCPP ms • Fuse 50
• Platform 40 • Intel
Core i7-4790K Time( Execution 30 CPU 4.00GHz 20 13.51 13.375
10 3.579 4.513 0.968 1.603
0 512 1024 2048 4096 Image Size codeplay.com 20 Colour Conversion VisionCPPFuse OpenCV
35 32.899 • Framework 30.544
• OpenCV ) 30 ms
• VisionCPP 25 • Fuse
• Platform 20 Execution Time( Execution • Oland PRO 15 [Radeon R7 240] 8.994 10 8.118
5 2.137 2.361 0.642 0.749
0 512 1024 2048 4096 Image Size
codeplay.com 21 Colour Conversion DataLocality = RGB2HSVHSV2RGB - (RGB2HSV + HSV2RGB) Image Size Read Write RGB2HSV for Time Time HSV2RGB Image Size for Data Locality VisionCPP (ms) (ms) Time (ms) VisionCPP Time(ms) • Framework 512x512 0.0886 512x512 0.1215 0.1253 0.1699 • VisionCPP 1024x1024 0.3334 1024x1024 0.4813 0.4859 0.6387 • Fuse 2048x2048 1.353 2048x2048 1.9135 1.9329 2.4655 • OpenCV 4096x4096 5.397 • Platform 4096x4096 8.3037 7.7967 10.3319 Image Size Read Write RGB2HSV HSV2RGB • Oland PRO for OpenCV Time Time Time Time [Radeon R7 240] (ms) (ms) (ms) (ms) • Tool 512x512 0.1246 0.1253 0.1338 0.1247 • CodeXL 1024x1024 0.4744 0.4853 0.4905 0.4816 2048x2048 1.8961 1.9286 2.0699 1.7486 4096x4096 7.6044 7.7913 8.2886 7.4403 codeplay.com 22 Conclusion
• The high-level algorithm • Applications • Easy to write • Domain-specific embedded language (DSEL) • Graph nodes • Easy to write • C++ functors • The execution model is separated from algorithm • Portable between different programming models and architectures. • SYCL on top of OpenCL on heterogeneous devices • The developer can control everything independently • Graphs, node implementations and execution model. • Comparable Performance
codeplay.com 23 Future work
• Histogram • Optimise Neighbour operations • Nested Convolution • Hierarchical parallelism • Pyramid • Performance portability • Embedded system
codeplay.com 24 @codeplaysoft [email protected]