Visioncpp: a SYCL-Based Computer Vision Framework for Heterogeneous and Embedded Architecture

VisionCPP: A SYCL-based Computer Vision Framework for Heterogeneous and Embedded Architecture Mehdi Goli Motivation • Computer vision • Different areas • Medical imaging, film industry, industrial manufacturing, weather forecasting, etc. • Embedded systems • Automotive systems • Surveillance cameras • Challenge • Huge computational and communication demands • The stringent size, power and memory resource constraints • High efficiency and accuracy • Operations • Large size of data, Sequence of operations, Minimum operation time, Real-time operation • Potential suitable parallelism • Data & pipeline parallelism codeplay.com 3 Existing Frameworks • OpenCV • Run-time optimisation • Adding custom function is hard • Eg. Channel level optimisation on GPU • Embedded systems • Not a trivial task • OpenVX • Graph-based model • Limited number of built-in function • AMD has announced its implementation • No standard way of adding custom function • Every event has different way of adding custom function codeplay.com 4 Requirements • Compile time optimisation • Specially embedded systems • Easy to add custom function • Unified front-end for different backend • Cross-platform portability • Performance Portability codeplay.com 5 SYCL • Khronos group trademark • Royalty-free • Open standard • Aim • Cross-platform abstraction layer • Portability and efficiency • OpenCL-enabled devices • “Single-source” style • Offline compilation Model • Implementation • ComputeCPP (Codeplay) • TriSYCL (Open-source) codeplay.com 6 VisionCPP • High-level framework • Ease of use • Applications • Custom operations • Performance portability • Separation of concern • No modification in application computation • OpenCL-enabled devices • CPU codeplay.com 7 VisionCPP Architecture • VisionCPP Frontend Architecture • Frontend Tree-based DSEL • Tree-based model • Backend Backend Architecture • Compile-time optimisation • Offline C++ compilation SYCL C++/OpenMP • Predicable memory size • Target Physical Hardware • Desktop • Embedded systems OpenCL-enabled device CPU codeplay.com 8 Tree Expression lRGB 2 sRGB • Colour Conversion Application: lHSV out 2 • Standard RGB -> Linear RGB lRGB • Linear RGB -> Linear HSV lHSV 2 • Desaturation of S Channel Scale • Linear HSV -> Linear RGB lRGB 2 • Linear RGB -> Standard RGB lHSV Coef sRGB 2 lRGB in codeplay.com 9 Tree Expression SYCL C++ lRGB 0: #include <visioncpp.hpp> g 2 0: #include <visioncpp.hpp> sRGB 1: int main() { 1: int main() { Queue 2: auto in= cv::imread(“input.jpg”); 2: auto in= cv::imread(“input.jpg”); lHSV No h out 2 Queue 3: auto q =get_queue<gpu_selector>(); f lRGB 3: 4: auto a = Node<sRGB, 512, 512,Image>(in.data)); 4: auto a = Node<sRGB, 512, 512,Host>(in.data)); 5: auto b = Node<sRGB2lRGB>(a); 5: auto b = Node<sRGB2lRGB>(a); Leaf Type lHSV Leaf Type e 2 6: auto c = Node<lRGB2lHSV>(b); Scale 6: auto c = Node<lRGB2lHSV>(b); 7: auto d = Node<Constant>(0.1); 7: auto d = Node<Constant>(0.1); 8: auto e = Node<lHSV2Scale>(c , d); lRGB c 2 8: auto e = Node<lHSV2Scale>(c , d); lHSV d Coef 9: auto f = Node<lHSV2lRGB>(e); 9: auto f = Node<lHSV2lRGB>(e); 10: auto g = Node<sRGB2lRGB>(f); 10: auto g = Node<sRGB2lRGB>(f); 11: auto h = execute<fuse> (g , q); sRGB 11: auto h = execute<fuse> (g ); No Queue 2 b lRGB Queue 12: auto ptr = h.get_data(); 12: auto ptr = h.get_data(); 13: auto output = cv::Mat(512 , 512 , CV_8UC3 , ptr.get()); 13: auto output = cv::Mat(512 , 512 , CV_8UC3 , ptr.get()); 14: cv::imshow (“Display Image” , output); a in 14: cv::imshow (“Display Image” , output); 15: return 0; 15: return 0; 16: } 16: } codeplay.com 10 VisionCPP frontend DSEL • Tree-based Structure Operation • Operation node Node • Leaf node Leaf Node codeplay.com 11 VisionCPP frontend DSEL • Tree-based Structure Operation Node<Functor>(child_node); • Operation node Node • Functors • operator()(T1 a); • operator()(T1 a, T2 b); • Leaf node Leaf Node codeplay.com 12 VisionCPP frontend DSEL • Tree-based Structure Operation Node<Functor>(child_node); • Operation node Node • Functors • operator()(T1 a); • operator()(T1 a, T2 b); Node<PixelType, Width, Height, MemoryModel >(ptr*); • Leaf node • SYCL Memory Model Leaf Node • Image (SYCL only) • Buffer (SYCL only) • Host (C++/OpenMP only) • Constant codeplay.com 13 Tree Expression(Functor) lRGB 2 sRGB 0: struct lHSV2Scale { lHSV 2 lRGB A 1: lHSV operator()( lHSV input, float coef) { lHSV 2 2: input.s() *= coef; Scale 3: return input; lRGB 2 lHSV Coef 4: }; sRGB 5: }; 2 lRGB B codeplay.com 14 Execution Policy: Fuse lRGB 2 sRGB • One Kernel lHSV 2 lRGB • Apply the Functor operations A • In serial order lHSV • Per pixel 2 Scale lRGB 2 lHSV Coef sRGB 2 lRGB B codeplay.com 15 Execution Policy: No Fuse lRGB 2 sRGB TMP OUT lHSV • Multiple Kernels 2 lRGB • One pre non-terminal Node A TMP • Device only temporary output lHSV OUT TMP 2 • on device global memory OUT Scale lRGB 2 lHSV Coef TMP OUT sRGB 2 lRGB B codeplay.com 16 Execution Policy: Custom Fuse lRGB 2 sRGB TMP OUT lHSV • Arbitrary Fusion 2 lRGB • Any sub-tree A lHSV 2 Scale lRGB 2 lHSV Coef TMP OUT sRGB 2 lRGB B codeplay.com 17 Backend structure(For Fuse Policy) SYCL C++ 1: template <typename Expr, typename… Acc> 1: template <typename Expr, typename... Acc> void sycl (handler& cgh, Expr expr, Acc… acc) { Accessor void cpp(Expr expr, Acc.. acc) { // sycl accessor for accessing data on device // output pinter for accessing data on host 2: auto outPtr = expr.out-> template get_accessor<write>(cgh) ; 2: auto outPtr = expr.out->get(); Pointer // sycl range representing valid range of accessing data // valid range for accessing data on host 3: auto rng = range < 2 > (Expr::Rows , Expr::Cols) ; 3: auto rng = range (Expr::Rows , Expr::Cols ); // sycl parallel for for parallelisng execution across the range // rebuilding the tuple of input pointer on host 4: cgh.parallel_for<Type>(rng), [=](item<2> itemID) { 4: auto tuple = make_tuple (acc) ; // rebuilding accessor tuple on the device // OpenMP directive for parallelising for loop 5: #pragma omp parallel for 5: auto tuple = make_tuple (acc) ; Parallel C++/OpenMP for 6: for(size_t i=0; i< rng.rows; i++) 7: for(size_t j=0; j< rng.cols; j++) // calling the eval function for each pixel // calling the eval function for each pixel 6: outPtr[itemID] = expr.eval ( itemID, tuple ); 8: outPtr[indx] = expr.eval (index (i , j), tuple ); 9: }; 7: }); 8: } codeplay.com 18 Colour Conversion VisionCPPFuse VisionCPPCustomFuse 250 • Framework VisionCPPNoFuse • VisionCPP 204.77 ) • Fuse 200 ms • Custom Fuse • NoFuse 150 • Platform 109.22 • Intel Time( Execution 100 Core i7-4790K 57.291 CPU 4.00GHz 49.254 50 26.297 3.579 7.811 13.228 0.968 2.096 3.561 13.51 0 512 1024 2048 4096 Image Size codeplay.com 19 Colour Conversion VisionCPPFuse OpenCV 70 • Framework 57.291 56.402 • OpenCV 60 ) • VisionCPP ms • Fuse 50 • Platform 40 • Intel Core i7-4790K Time( Execution 30 CPU 4.00GHz 20 13.51 13.375 10 3.579 4.513 0.968 1.603 0 512 1024 2048 4096 Image Size codeplay.com 20 Colour Conversion VisionCPPFuse OpenCV 35 32.899 • Framework 30.544 • OpenCV ) 30 ms • VisionCPP 25 • Fuse • Platform 20 Execution Time( Execution • Oland PRO 15 [Radeon R7 240] 8.994 10 8.118 5 2.137 2.361 0.642 0.749 0 512 1024 2048 4096 Image Size codeplay.com 21 Colour Conversion DataLocality = RGB2HSVHSV2RGB - (RGB2HSV + HSV2RGB) Image Size Read Write RGB2HSV for Time Time HSV2RGB Image Size for Data Locality VisionCPP (ms) (ms) Time (ms) VisionCPP Time(ms) • Framework 512x512 0.0886 512x512 0.1215 0.1253 0.1699 • VisionCPP 1024x1024 0.3334 1024x1024 0.4813 0.4859 0.6387 • Fuse 2048x2048 1.353 2048x2048 1.9135 1.9329 2.4655 • OpenCV 4096x4096 5.397 • Platform 4096x4096 8.3037 7.7967 10.3319 Image Size Read Write RGB2HSV HSV2RGB • Oland PRO for OpenCV Time Time Time Time [Radeon R7 240] (ms) (ms) (ms) (ms) • Tool 512x512 0.1246 0.1253 0.1338 0.1247 • CodeXL 1024x1024 0.4744 0.4853 0.4905 0.4816 2048x2048 1.8961 1.9286 2.0699 1.7486 4096x4096 7.6044 7.7913 8.2886 7.4403 codeplay.com 22 Conclusion • The high-level algorithm • Applications • Easy to write • Domain-specific embedded language (DSEL) • Graph nodes • Easy to write • C++ functors • The execution model is separated from algorithm • Portable between different programming models and architectures. • SYCL on top of OpenCL on heterogeneous devices • The developer can control everything independently • Graphs, node implementations and execution model. • Comparable Performance codeplay.com 23 Future work • Histogram • Optimise Neighbour operations • Nested Convolution • Hierarchical parallelism • Pyramid • Performance portability • Embedded system codeplay.com 24 @codeplaysoft [email protected].

Visioncpp: a SYCL-Based Computer Vision Framework for Heterogeneous and Embedded Architecture

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support