Visioncpp: a SYCL-Based Computer Vision Framework for Heterogeneous and Embedded Architecture

VisionCPP: A SYCL-based Computer Vision Framework for Heterogeneous and Embedded Architecture Mehdi Goli Motivation • Computer vision • Different areas • Medical imaging, film industry, industrial manufacturing, weather forecasting, etc. • Embedded systems • Automotive systems • Surveillance cameras • Challenge • Huge computational and communication demands • The stringent size, power and memory resource constraints • High efficiency and accuracy • Operations • Large size of data, Sequence of operations, Minimum operation time, Real-time operation • Potential suitable parallelism • Data & pipeline parallelism codeplay.com 3 Existing Frameworks • OpenCV • Run-time optimisation • Adding custom function is hard • Eg. Channel level optimisation on GPU • Embedded systems • Not a trivial task • OpenVX • Graph-based model • Limited number of built-in function • AMD has announced its implementation • No standard way of adding custom function • Every event has different way of adding custom function codeplay.com 4 Requirements • Compile time optimisation • Specially embedded systems • Easy to add custom function • Unified front-end for different backend • Cross-platform portability • Performance Portability codeplay.com 5 SYCL • Khronos group trademark • Royalty-free • Open standard • Aim • Cross-platform abstraction layer • Portability and efficiency • OpenCL-enabled devices • “Single-source” style • Offline compilation Model • Implementation • ComputeCPP (Codeplay) • TriSYCL (Open-source) codeplay.com 6 VisionCPP • High-level framework • Ease of use • Applications • Custom operations • Performance portability • Separation of concern • No modification in application computation • OpenCL-enabled devices • CPU codeplay.com 7 VisionCPP Architecture • VisionCPP Frontend Architecture • Frontend Tree-based DSEL • Tree-based model • Backend Backend Architecture • Compile-time optimisation • Offline C++ compilation SYCL C++/OpenMP • Predicable memory size • Target Physical Hardware • Desktop • Embedded systems OpenCL-enabled device CPU codeplay.com 8 Tree Expression lRGB 2 sRGB • Colour Conversion Application: lHSV out 2 • Standard RGB -> Linear RGB lRGB • Linear RGB -> Linear HSV lHSV 2 • Desaturation of S Channel Scale • Linear HSV -> Linear RGB lRGB 2 • Linear RGB -> Standard RGB lHSV Coef sRGB 2 lRGB in codeplay.com 9 Tree Expression SYCL C++ lRGB 0: #include <visioncpp.hpp> g 2 0: #include <visioncpp.hpp> sRGB 1: int main() { 1: int main() { Queue 2: auto in= cv::imread(“input.jpg”); 2: auto in= cv::imread(“input.jpg”); lHSV No h out 2 Queue 3: auto q =get_queue<gpu_selector>(); f lRGB 3: 4: auto a = Node<sRGB, 512, 512,Image>(in.data)); 4: auto a = Node<sRGB, 512, 512,Host>(in.data)); 5: auto b = Node<sRGB2lRGB>(a); 5: auto b = Node<sRGB2lRGB>(a); Leaf Type lHSV Leaf Type e 2 6: auto c = Node<lRGB2lHSV>(b); Scale 6: auto c = Node<lRGB2lHSV>(b); 7: auto d = Node<Constant>(0.1); 7: auto d = Node<Constant>(0.1); 8: auto e = Node<lHSV2Scale>(c , d); lRGB c 2 8: auto e = Node<lHSV2Scale>(c , d); lHSV d Coef 9: auto f = Node<lHSV2lRGB>(e); 9: auto f = Node<lHSV2lRGB>(e); 10: auto g = Node<sRGB2lRGB>(f); 10: auto g = Node<sRGB2lRGB>(f); 11: auto h = execute<fuse> (g , q); sRGB 11: auto h = execute<fuse> (g ); No Queue 2 b lRGB Queue 12: auto ptr = h.get_data(); 12: auto ptr = h.get_data(); 13: auto output = cv::Mat(512 , 512 , CV_8UC3 , ptr.get()); 13: auto output = cv::Mat(512 , 512 , CV_8UC3 , ptr.get()); 14: cv::imshow (“Display Image” , output); a in 14: cv::imshow (“Display Image” , output); 15: return 0; 15: return 0; 16: } 16: } codeplay.com 10 VisionCPP frontend DSEL • Tree-based Structure Operation • Operation node Node • Leaf node Leaf Node codeplay.com 11 VisionCPP frontend DSEL • Tree-based Structure Operation Node<Functor>(child_node); • Operation node Node • Functors • operator()(T1 a); • operator()(T1 a, T2 b); • Leaf node Leaf Node codeplay.com 12 VisionCPP frontend DSEL • Tree-based Structure Operation Node<Functor>(child_node); • Operation node Node • Functors • operator()(T1 a); • operator()(T1 a, T2 b); Node<PixelType, Width, Height, MemoryModel >(ptr*); • Leaf node • SYCL Memory Model Leaf Node • Image (SYCL only) • Buffer (SYCL only) • Host (C++/OpenMP only) • Constant codeplay.com 13 Tree Expression(Functor) lRGB 2 sRGB 0: struct lHSV2Scale { lHSV 2 lRGB A 1: lHSV operator()( lHSV input, float coef) { lHSV 2 2: input.s() *= coef; Scale 3: return input; lRGB 2 lHSV Coef 4: }; sRGB 5: }; 2 lRGB B codeplay.com 14 Execution Policy: Fuse lRGB 2 sRGB • One Kernel lHSV 2 lRGB • Apply the Functor operations A • In serial order lHSV • Per pixel 2 Scale lRGB 2 lHSV Coef sRGB 2 lRGB B codeplay.com 15 Execution Policy: No Fuse lRGB 2 sRGB TMP OUT lHSV • Multiple Kernels 2 lRGB • One pre non-terminal Node A TMP • Device only temporary output lHSV OUT TMP 2 • on device global memory OUT Scale lRGB 2 lHSV Coef TMP OUT sRGB 2 lRGB B codeplay.com 16 Execution Policy: Custom Fuse lRGB 2 sRGB TMP OUT lHSV • Arbitrary Fusion 2 lRGB • Any sub-tree A lHSV 2 Scale lRGB 2 lHSV Coef TMP OUT sRGB 2 lRGB B codeplay.com 17 Backend structure(For Fuse Policy) SYCL C++ 1: template <typename Expr, typename… Acc> 1: template <typename Expr, typename... Acc> void sycl (handler& cgh, Expr expr, Acc… acc) { Accessor void cpp(Expr expr, Acc.. acc) { // sycl accessor for accessing data on device // output pinter for accessing data on host 2: auto outPtr = expr.out-> template get_accessor<write>(cgh) ; 2: auto outPtr = expr.out->get(); Pointer // sycl range representing valid range of accessing data // valid range for accessing data on host 3: auto rng = range < 2 > (Expr::Rows , Expr::Cols) ; 3: auto rng = range (Expr::Rows , Expr::Cols ); // sycl parallel for for parallelisng execution across the range // rebuilding the tuple of input pointer on host 4: cgh.parallel_for<Type>(rng), [=](item<2> itemID) { 4: auto tuple = make_tuple (acc) ; // rebuilding accessor tuple on the device // OpenMP directive for parallelising for loop 5: #pragma omp parallel for 5: auto tuple = make_tuple (acc) ; Parallel C++/OpenMP for 6: for(size_t i=0; i< rng.rows; i++) 7: for(size_t j=0; j< rng.cols; j++) // calling the eval function for each pixel // calling the eval function for each pixel 6: outPtr[itemID] = expr.eval ( itemID, tuple ); 8: outPtr[indx] = expr.eval (index (i , j), tuple ); 9: }; 7: }); 8: } codeplay.com 18 Colour Conversion VisionCPPFuse VisionCPPCustomFuse 250 • Framework VisionCPPNoFuse • VisionCPP 204.77 ) • Fuse 200 ms • Custom Fuse • NoFuse 150 • Platform 109.22 • Intel Time( Execution 100 Core i7-4790K 57.291 CPU 4.00GHz 49.254 50 26.297 3.579 7.811 13.228 0.968 2.096 3.561 13.51 0 512 1024 2048 4096 Image Size codeplay.com 19 Colour Conversion VisionCPPFuse OpenCV 70 • Framework 57.291 56.402 • OpenCV 60 ) • VisionCPP ms • Fuse 50 • Platform 40 • Intel Core i7-4790K Time( Execution 30 CPU 4.00GHz 20 13.51 13.375 10 3.579 4.513 0.968 1.603 0 512 1024 2048 4096 Image Size codeplay.com 20 Colour Conversion VisionCPPFuse OpenCV 35 32.899 • Framework 30.544 • OpenCV ) 30 ms • VisionCPP 25 • Fuse • Platform 20 Execution Time( Execution • Oland PRO 15 [Radeon R7 240] 8.994 10 8.118 5 2.137 2.361 0.642 0.749 0 512 1024 2048 4096 Image Size codeplay.com 21 Colour Conversion DataLocality = RGB2HSVHSV2RGB - (RGB2HSV + HSV2RGB) Image Size Read Write RGB2HSV for Time Time HSV2RGB Image Size for Data Locality VisionCPP (ms) (ms) Time (ms) VisionCPP Time(ms) • Framework 512x512 0.0886 512x512 0.1215 0.1253 0.1699 • VisionCPP 1024x1024 0.3334 1024x1024 0.4813 0.4859 0.6387 • Fuse 2048x2048 1.353 2048x2048 1.9135 1.9329 2.4655 • OpenCV 4096x4096 5.397 • Platform 4096x4096 8.3037 7.7967 10.3319 Image Size Read Write RGB2HSV HSV2RGB • Oland PRO for OpenCV Time Time Time Time [Radeon R7 240] (ms) (ms) (ms) (ms) • Tool 512x512 0.1246 0.1253 0.1338 0.1247 • CodeXL 1024x1024 0.4744 0.4853 0.4905 0.4816 2048x2048 1.8961 1.9286 2.0699 1.7486 4096x4096 7.6044 7.7913 8.2886 7.4403 codeplay.com 22 Conclusion • The high-level algorithm • Applications • Easy to write • Domain-specific embedded language (DSEL) • Graph nodes • Easy to write • C++ functors • The execution model is separated from algorithm • Portable between different programming models and architectures. • SYCL on top of OpenCL on heterogeneous devices • The developer can control everything independently • Graphs, node implementations and execution model. • Comparable Performance codeplay.com 23 Future work • Histogram • Optimise Neighbour operations • Nested Convolution • Hierarchical parallelism • Pyramid • Performance portability • Embedded system codeplay.com 24 @codeplaysoft [email protected].

Load more