FastFlow: Combining Pattern-Level Abstraction and Efficiency in GPGPUs
Marco Aldinucci, Computer Science Department, University of Turin, Italy (PI of the CUDA Research Center at the University of Turin)
M. Torquati (University of Pisa, Italy), M. Drocco, G. Peretti Pezzi (University of Turin, Italy), C. Spampinato (Italy)

Date: 25 March 2014, San Jose, CA, USA. Presentation S4585

Outline

✤ Motivational example
✤ An effective (and quite universal) image/video denoiser
✤ Paradigmatic programming pattern for GPGPUs?
✤ On patterns for multicore and GPGPUs
✤ FastFlow
✤ Some performance results
✤ A demo

Salt & Pepper noise

[Figure: original image, Salt & Pepper noise at 30% and 70%, and the restored image at 70% noise]

✤ Electronic and signal noise
✤ Uniform distribution of "saturated" white/black pixels
✤ Measured as the percentage of affected vs. overall pixels
✤ Typically restored using statistic filters, e.g. median or adaptive median
✤ Not satisfactory for high levels of noise
✤ not only outliers are filtered (the image comes out smoothed)
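Since the median filter is the baseline restorer mentioned above, here is a minimal C++ sketch of a 3x3 median applied to one pixel (an illustration only, with hypothetical names, not code from the talk):

```cpp
#include <algorithm>
#include <array>
#include <vector>

// 3x3 median filter for one interior pixel of a grayscale image stored
// row-major in `img` with the given width. Border handling is left to
// the caller (a real filter would replicate or mirror the edges).
unsigned char median3x3(const std::vector<unsigned char>& img,
                        int width, int x, int y) {
    std::array<unsigned char, 9> win;
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            win[k++] = img[(y + dy) * width + (x + dx)];
    // the median of 9 values is the 5th element in sorted order
    std::nth_element(win.begin(), win.begin() + 4, win.end());
    return win[4];
}
```

A single saturated impulse surrounded by regular pixels is replaced by a neighbourhood value, which is why the median works at low noise levels but smooths detail at high ones.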

[Figure: 5% Salt & Pepper noise restored with a 3x3 kernel vs. a 7x7 kernel]

Gaussian noise

[Figure: original image and Gaussian noise with variance 10, 30, and 50]

✤ Poor illumination, temperature, circuit noise
✤ White additive noise in the frequency domain
✤ Measured with the mean and variance of the Gaussian distribution
✤ Affects all pixels, with an additive "white" value distributed as a Gaussian
✤ Typically restored using statistic filters: e.g. median, Gaussian smoothing
✤ More difficult to manage: the restored image comes out smoothed

Two-stage restoring

Noisy image → Detect → Noise map → Denoise → Restored image

Detection methods: progressive-switching/adaptive median, variational, neural/Bayesian networks, fuzzy, …, statistic

✤ Detection is decoupled from restoration
✤ Pixels considered not outliers are not altered by restoration
✤ False positives impair restoration quality

✤ Statistic detection + variational restoration
✤ High-quality, edge-preserving filtering
✤ Much more computationally demanding, not really viable without parallelism
✤ Matlab on a single 256x256 image with 50% noise requires dozens of minutes
✤ Stages can be pipelined

Variational denoising: an iterative optimisation problem

do
  foreach i,j
    if (noisyMap[i,j])
      let N = neighb8(img, i, j)
      let k in 0..255
      u = argmin F(k, N, noiseMap)
      img[i,j] = u
while (process not converged)

For each noisy pixel, try every possible colour k and choose u, the value that minimises F over neighb8(i,j); F(…) weights noisy and non-noisy pixels differently.
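The per-pixel argmin above can be sketched in plain C++. This is a toy version with hypothetical names: it keeps only the φ(t) = |t|^α neighbour term and drops the data-fidelity weighting of the real functional F:

```cpp
#include <cmath>
#include <cstdlib>
#include <vector>

// Toy per-pixel optimisation: try every grey level k in 0..255 and keep
// the one minimising a simplified functional, the sum of phi(k - n) over
// the 8-neighbour values `nbr`, with phi(t) = |t|^alpha, 1 < alpha <= 2.
// The data-fidelity term of the real F is omitted for brevity.
int denoisePixel(const std::vector<int>& nbr, double alpha = 1.3) {
    int best = 0;
    double bestF = 1e300;
    for (int k = 0; k <= 255; ++k) {
        double F = 0.0;
        for (int n : nbr) F += std::pow(std::abs(k - n), alpha);
        if (F < bestF) { bestF = F; best = k; }
    }
    return best;
}
```

Each pixel's search is independent of the others within one sweep, which is what makes the denoise stage a natural fit for a GPGPU.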

You can write it directly with C++ and CUDA, but what happens when splitting the work onto 2 GPGPUs?

Variational denoise: F(…) details (almost universal across different noise types)

argmin_u F(u) = α·R(u) + D(u, d), where R is the regularization term and D the data-fidelity term.

In the spatial domain:

F_{d|N}(u) = Σ_{(i,j)∈N} [ |u_{i,j} − d_{i,j}| + (β/2)·(S₁ + S₂) ]

S₁ = 2 · Σ_{(m,n)∈V_{i,j}∩N^c} φ(u_{i,j} − d_{m,n})
S₂ = Σ_{(m,n)∈V_{i,j}∩N} φ(u_{i,j} − u_{m,n})

φ(t) = |t|^α with 1 < α ≤ 2 for Salt & Pepper

✤ R. Chan, C. Ho, and M. Nikolova. Salt-and-pepper noise removal by median-type noise detectors and detail-preserving regularization. IEEE Trans. on Image Processing, vol. 14, 2005.
✤ M. Aldinucci et al. A parallel edge preserving algorithm for salt and pepper image denoising. In Intl. Conference on Image Processing Theory Tools and Applications (IPTA), 2012. IEEE.

Convergence can't be evaluated with a plain reduce (it involves three iterations, i.e. memory).

[Figure: PSNR vs. number of cycles (20-120) for flat, border, std and cluster regions of a 2048x2048 image with 90% noise: quality keeps improving with the iterations]

Termination test: the per-pixel residuals of consecutive iterations Img(k) and Img(k+1) are compared, and the process terminates when Σ|R(k) − R(k+1)| / Σ R(k) < ε, a reduce over the diffs of residuals. The test therefore spans three images (k−1, k, k+1) at a time.

M. Aldinucci, M. Drocco et al. IJHPCA, submitted.

Quality results
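The termination test can be illustrated with a small C++ sketch (hypothetical helper names; simplified to the residuals of two consecutive iterations):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Convergence test on the residuals of two consecutive iterations:
// reduce the absolute differences of per-pixel residuals R(k) and R(k+1)
// and stop when the change, relative to the reduce of R(k), drops below
// eps. Keeping residuals of several iterations in memory is exactly why
// a plain one-pass reduce does not suffice.
bool converged(const std::vector<double>& Rk,
               const std::vector<double>& Rk1, double eps) {
    double diff = 0.0, norm = 0.0;
    for (std::size_t i = 0; i < Rk.size(); ++i) {
        diff += std::fabs(Rk[i] - Rk1[i]);
        norm += std::fabs(Rk[i]);
    }
    return diff / norm < eps;
}
```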

Baboon standard test image, 1024x1024, original vs. restored:

✤ 10% impulsive noise: PSNR 43.29 dB, MAE 0.35
✤ 50% impulsive noise: PSNR 32.75 dB, MAE 2.67
✤ 90% impulsive noise: PSNR 23.4 dB, MAE 11.21

Patterns are a natural approach in GPGPUs

… and this has been well known for a long time. From "S05: High Performance Computing with CUDA, Data-parallel Algorithms & Data Structures", John Owens, UC Davis:

Think In Parallel / Data-Parallel Algorithms

The GPU is a data-parallel processor: thousands of parallel threads, thousands of data elements to process, all data processed by the same program (SPMD computation model), in contrast with task parallelism and ILP. Efficient algorithms require efficient building blocks, and that talk covered the data-parallel ones: map, gather & scatter, reduce, scan. Best results when you "Think Data Parallel": design your algorithm for data-parallelism, understand parallel algorithmic complexity and efficiency, use data-parallel algorithmic primitives as building blocks.

But these patterns are "standard" and "flat": are the data-parallel patterns expressive enough for the problems?

Rationale: patterns are there, but they are not simply map or reduce

✤ Detect-Denoise can be naturally pipelined
✤ Denoise is a (sort of) map with a stencil
✤ where, for x = <x1, x2, …, xn>, map f x = <f(x1), f(x2), …, f(xn)>
✤ It can be written as a map, but that is neither natural nor easy
✤ Convergence evaluation is a map across three iterations plus a reduce
✤ Even more complex to write as a MapReduce (if not impossible)
✤ Cholesky LU or C4.5 tree pruning with map, reduce or MapReduce?

stencilReduce

loop
  before(…)               // compute on host, possibly in parallel on CPU cores
  stencil(data[i], env)   // CUDA code
  reduce op data
  after(…)                // compute on host, possibly in parallel on CPU cores

(Unified Memory greatly simplifies the host part)

✤ a (low-level) powerful pattern
✤ it captures most of the interesting data-parallel GPGPU computations
✤ Subsumes: map, reduce, mapReduce
✤ Programmers do not need to write any line of host code to drive the GPGPU
✤ D2H/H2D, data feeding, synchronisations, block configurations, …
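A sequential C++ rendition of the stencilReduce control flow may help (a sketch only: illustrative names, a toy 3-point stencil, and none of the CUDA/host split of the real pattern):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sequential sketch of the stencilReduce control flow: repeatedly apply
// a 3-point averaging stencil to `data`, reducing the absolute updates
// into `delta`, and stop when the reduce says the iteration converged.
// The before()/after() host hooks of the real pattern are omitted.
double stencilReduce(std::vector<double>& data, double eps, int maxIter) {
    double delta = 0.0;
    for (int it = 0; it < maxIter; ++it) {
        std::vector<double> out(data);              // double buffering
        delta = 0.0;                                // reduce accumulator
        for (std::size_t i = 1; i + 1 < data.size(); ++i) {
            out[i] = (data[i-1] + data[i] + data[i+1]) / 3.0;  // stencil
            delta += std::fabs(out[i] - data[i]);              // reduce op
        }
        data.swap(out);
        if (delta < eps) break;                     // convergence check
    }
    return delta;
}
```

The stencil step over `data[i]` and the reduce over the updates are exactly the two phases the pattern runs on the GPGPU; map and reduce fall out as the degenerate cases.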

Low-level approaches = lots of freedom (threads, CUDA, OpenCL, MPI, …): you can design your algorithms as you want; as with a car, you can drive it where you want. Instead:

✤ Think in parallel & high-level
✤ Efficiency, portability, time-to-market
✤ High-level parallel patterns
✤ Describing collective behaviour
✤ Can be: expressive, efficient, compositional
✤ Targeting multicore, GPGPUs, distributed with a unifying vision
✤ On various CPUs/GPGPUs and OSes

FastFlow (FF)

[Layered design: High-level patterns (mapreduce, stencil, D&C, …) over Core patterns (pipeline, farm, feedback) over Building blocks (queues, ff_node, …), targeting multicore and many-core platforms and clusters of multicore + many-core via CUDA, OpenCL, TCP/IP, IB/OFED]

✤ C++ header-only library
✤ Portable wherever a C++ compiler exists
✤ Originally designed for high-frequency streaming
✤ Provides stream-oriented and data-parallel patterns
✤ compositional, efficient
✤ Accommodates diversity: if you need a different pattern, build it by extending a C++ class
✤ Multi-core, GPGPUs, distributed
✤ https://sourceforge.net/projects/mc-fastflow

FF building blocks: nodes and channels

✤ node (ff_node), mi-node (multi-input) and mo-node (multi-output), connected by named channels
✤ FF bound and unbound shmem FIFO channels: Single-Producer-Single-Consumer lock-free fence-free queues
✤ shmem channels communicate pointers in a message-passing style
✤ [Plot: latency in nanoseconds (roughly 5-45 ns) vs. buffer size (64, 1024, 8192) on a Xeon E7-4820 @2.0GHz Sandy Bridge, for same core/different context, different cores/same CPU, and different CPUs; MVAPICH ~190 ns]
✤ faster and more scalable than CAS/test-and-set implementations; threads are non-blocking (and can be suspended using a native protocol)
✤ Distributed zero-copy channel over 0MQ/TCP or native IB/OFED

M. Aldinucci, M. Danelutto, P. Kilpatrick and M. Meneghin. An Efficient Synchronisation Mechanism for Multi-Core Systems. Euro-Par 2012. LNCS.
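The flavour of the bound SPSC channel can be sketched with C++11 atomics (a minimal illustration, not FastFlow's actual implementation, which among other things avoids sharing head/tail indices and pads to cache lines):

```cpp
#include <atomic>
#include <cstddef>

// Minimal lock-free single-producer/single-consumer ring buffer.
// Messages are pointers, exchanged in a message-passing style as in
// FastFlow channels. Each index is written by exactly one side, so no
// CAS is needed: one acquire load and one release store per operation.
template <std::size_t N>
class SpscQueue {
    void* buf[N] = {};
    std::atomic<std::size_t> head{0};   // consumer-owned
    std::atomic<std::size_t> tail{0};   // producer-owned
public:
    bool push(void* p) {                // call from the producer only
        std::size_t t = tail.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire))
            return false;               // full
        buf[t] = p;
        tail.store(next, std::memory_order_release);
        return true;
    }
    bool pop(void*& p) {                // call from the consumer only
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;               // empty
        p = buf[h];
        head.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};
```

Passing pointers rather than payloads keeps the channel a few nanoseconds per message, which is what makes very fine-grained streaming viable.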

[Plots: speedup vs. number of cores (2-8) for a master-workers schema (M dispatching to workers W1 … Wn) with 0.5 μs and 5 μs per-task workloads, comparing Ideal, POSIX lock, CAS, and FastFlow]

Semantics of the node: dataflow activation

mynode is created as a standard C++ class extending ff_node:

class mynode: public ff_node {
public:
    int svc_init() {
        /* after constructor - running as a thread */
        return 0;
    }

    void * svc(void * task) {
        mytask_t * t = (mytask_t *) task;
        // do something on task
        std::cout << "mynode " << ff_node::get_my_id()
                  << " received task " << t->payload << "\n";
        return task;
    }

    void svc_end() {
        /* before destructor - running as a thread */
    }
};

After class construction, mynode is turned into a thread. Nodes are not tasks (as in TBB): they are executors. The svc_init() method is executed first, then the node enters an infinite loop:

1. get a task from the input channel (i.e. a pointer)
2. execute the svc() method
3. put a task on the output channel (i.e. a pointer)

svc() might output more tasks via the ff_send_out call (not shown). The node terminates on returning a NULL pointer; svc_end() is executed before class destruction, and the termination token is propagated to the next node.
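The activation loop just described can be sketched in plain C++ (illustrative names; std::queue stands in for the lock-free channels):

```cpp
#include <queue>

// Plain C++ sketch of the node's dataflow activation loop: pop a task
// pointer from the input channel, run svc() on it, push the result to
// the output channel; a NULL return terminates the node.
struct Node {
    virtual void* svc(void* task) = 0;
    virtual ~Node() {}
};

// Drive `n` over every pointer in `in`, forwarding results to `out`,
// until the input drains or svc() returns the termination token.
void runNode(Node& n, std::queue<void*>& in, std::queue<void*>& out) {
    while (!in.empty()) {
        void* t = in.front(); in.pop();
        void* r = n.svc(t);
        if (r == nullptr) break;   // node terminates
        out.push(r);
    }
}
```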

FF core patterns: pipe, farm, feedback

They are streaming networks, not task graphs:

✤ pipeline: S1 → S2 → … → Sn
✤ pipeline with feedback
✤ farm: an emitter E dispatches stream items to workers W1 … Wn; an optional collector C gathers the results
✤ farms with feedback, e.g. D&C and master-workers (a master M cycling over workers W1 … Wn)
✤ or any composition of them

A complete example, a simple farm:

#include <vector>
#include <iostream>
#include <ff/farm.hpp>

using namespace ff;

// generic worker
class Worker: public ff_node {
public:
    void * svc(void * task) {
        int * t = (int *)task;
        std::cout << "Worker " << ff_node::get_my_id()
                  << " received task " << *t << "\n";
        return task;
    }
    // I don't need the following functions for this test
    //int svc_init() { return 0; }
    //void svc_end() {}
};

// the gatherer filter
class Collector: public ff_node {
public:
    void * svc(void * task) {
        int * t = (int *)task;
        if (*t == -1) return NULL;
        return task;
    }
};

// the load-balancer filter
class Emitter: public ff_node {
public:
    Emitter(int max_task):ntask(max_task) {};
    void * svc(void *) {
        int * task = new int(ntask);
        --ntask;
        if (ntask<0) return NULL;
        return task;
    }
private:
    int ntask;
};

int main(int argc, char * argv[]) {
    if (argc<3) {
        std::cerr << "use: " << argv[0] << " nworkers streamlen\n";
        return -1;
    }
    int nworkers=atoi(argv[1]);
    int streamlen=atoi(argv[2]);
    if (!nworkers || !streamlen) {
        std::cerr << "Wrong parameters values\n";
        return -1;
    }

    ff_farm<> farm; // farm object

    Emitter E(streamlen);
    farm.add_emitter(&E);

    std::vector<ff_node *> w;
    for(int i=0;i<nworkers;++i) w.push_back(new Worker);
    farm.add_workers(w);

    Collector C;
    farm.add_collector(&C);

    if (farm.run_and_wait_end()<0) {
        error("running farm\n");
        return -1;
    }
    std::cerr << "DONE, time= " << farm.ffTime() << " (ms)\n";
    farm.ffStats(std::cerr);
    return 0;
}

GPGPUs

✤ Fill the stencilReduce methods with CUDA kernel code
✤ No CUDA-related host code needs to be written at all
✤ Possibly nest the stencilReduce into another pattern
✤ e.g. a farm (emitter E, workers W1 on GPU1 and W2 on GPU2, collector C) to use many GPGPUs
✤ the async copy engine is automatically used via CUDA streams
✤ Helpful to mix threading (or distributed) with GPGPUs
✤ If you already have your standard host+CUDA code, just copy-paste it into a svc() method

FF core patterns: rationale

✤ farm: process independent tasks (e.g. C++ objects) in parallel
✤ true dependencies are enforced only along the edges of the graph
✤ workers might synchronise (e.g. with locks/atomics); that synchronisation lives in the business code
✤ farm, pipeline and feedback (to build cyclic networks) are enough to write all the other patterns
✤ Think of the GPGPU design
✤ GPGPUs can be thought of as machines to compute a map, reduce, stencil, …
✤ … but in hardware they are built as a farm that dispatches independent blocks onto multiprocessors (+ global memory)

FF high-level patterns

✤ Proposed as code annotations
✤ Similarly to OpenMP, OpenACC, …
✤ used to generate a graph at the core-pattern level
✤ Examples
✤ parallel_for
✤ map, reduce, MapReduce, … (targeting GPGPUs)
✤ and as many as you want: developing a new pattern is just developing a new class
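What a parallel_for pattern boils down to can be sketched with plain C++11 threads (static chunking only; FastFlow's ParallelFor reuses its workers across calls and supports dynamic scheduling, so this is just an illustration):

```cpp
#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

// Split the iteration space [first, last) into one static chunk per
// thread and run the body in parallel. A sketch of the parallel_for
// pattern, not FastFlow's implementation.
void parallelFor(long first, long last, unsigned nthreads,
                 const std::function<void(long)>& body) {
    std::vector<std::thread> pool;
    long n = last - first;
    long chunk = (n + (long)nthreads - 1) / (long)nthreads; // ceil
    for (unsigned t = 0; t < nthreads; ++t) {
        long lo = first + (long)t * chunk;
        long hi = std::min(last, lo + chunk);
        if (lo >= hi) break;
        pool.emplace_back([lo, hi, &body] {
            for (long i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& th : pool) th.join();
}
```

As with the annotation-style pattern, the caller only states the iteration space and the body; everything about threads and partitioning stays inside the pattern.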

Example: map (derived from stencilReduce)

✤ 2 or more GPGPUs on the same platform
✤ nest a stencilReduce, map, reduce, … into a (host) farm with 2 workers
✤ (future: we need to understand how to use NVLINK)
✤ offload code onto distributed GPGPUs
✤ nest a stencilReduce, map, reduce, … into a (host) distributed farm with 2 workers
✤ data serialisation is up to the user; the framework just provides callbacks to do it
✤ In both cases
✤ be sure that the tasks are independent (otherwise you need another pattern)


FFMAPFUNC(mapF, unsigned int, in, return (in * 2););

class cudaTask: public baseCUDATask {
public:
    void setTask(void* t) {
        if (t) {
            cudaTask *t_ = (cudaTask *) t;
            setInPtr(t_->in);     // match of data pointer H2D
            setOutPtr(t_->in);    // match of data pointer D2H
            setSizeIn(inputsize); // needed without Unified Memory
        }
    }
    unsigned int *in, *out;
};

main () {
    …
    // put the input in task->in
    FFMAPCUDA(cudaTask, mapF) *myMap =
        new FFMAPCUDA(cudaTask, mapF)(*task);
    myMap->run_and_wait_end();
    // result is in task->out
    …
}

Notes on the map example above:

✤ the body of mapF is CUDA code: simple in this case, but any CUDA code is valid there, and it will be compiled with NVCC
✤ mapF is just a name used to distinguish this kernel; FFMAPFUNC is a macro
✤ For multicore we use C++11 lambdas. Theoretically possible to use lambdas for kernel code too? (maybe with Unified Memory)

GPGPU code: rationale

✤ Is it worth abstracting even more? In particular, over the CUDA code?
✤ We believe not
✤ Often not needed: CUDA code is often C++ code
✤ Access to thread_id is always needed
✤ Programmers would like to super-optimize their code
✤ using all CUDA features
✤ CUDA evolves too rapidly

"Demo": map, farm(map), …

"Demo": denoiser

Example: Qt-mandelbrot (from Qt samples)

Original (sequential):

    …
    const int Limit = 4;
    bool allBlack = true;
    if (restart) break;
    if (abort) return;

    for (int y = -halfHeight; y < halfHeight; ++y) {
        uint *scanLine =
            reinterpret_cast<uint *>(image.scanLine(y + halfHeight));
        double ay = centerY + (y * scaleFactor);
        …
    }

FastFlow (parallel):

    …
    const int Limit = 4;
    bool allBlack = true;
    if (restart) break;
    if (abort) return;

    pf_det.parallel_for(-halfHeight, halfHeight, 1, halfHeight,
        [&](const long y) {
            uint *scanLine =
                reinterpret_cast<uint *>(image.scanLine(y + halfHeight));
            double ay = centerY + (y * scaleFactor);
            …
        });

    private:
        ParallelFor pf_det;
        …

"Demo": mandelbrot

Performance (multicore)

[Plots: speedup (up to 32) vs. dataset for the Bowtie2 tool (DNA mapping) on lane2-CTL-qseq, SRR568427 and SRR534301, comparing Bt2, Bt2-int, Bt2FF, Bt2FF-pin, Bt2FF-pin+int; Smith-Waterman (SSE2) speed vs. protein length against OpenMP, Cilk, TBB; Cholesky LU against PLASMA]

The original Bowtie2 version uses pthreads + test&set spinlocks. The FF version differs by no more than 30 lines of code from the original's several thousand (including memory affinity management).

Performance (CPUs + GPGPUs)

[Plots: frames per second for restoring 768x512 video frames under S&P 10%, S&P 50%, Gauss V10 and Gauss V100 noise, on Nehalem 32 cores + 1 K20 (32 cores vs. 32 cores + K20) and SandyBridge 16 cores + 2 Tesla M2090 (16 cores, 16 cores + 1xM2090, 16 cores + 2xM2090)]

No difference w.r.t. hand-written CUDA code.

FastFlow: a framework for research

✤ Open-minded patterns
✤ A pattern is missing? Not happy with the implementation? Modify it by extending a class …
✤ Multicore, GPGPUs, distributed under the same theoretical umbrella. No compilers to modify (!)
✤ Non-expert programmers do not need to deal with synchronisations and data copies, just select patterns
✤ Productivity: portability, reduced development time, porting of legacy applications, …
✤ Comparable to or better than OpenMP and TBB on fine grain
✤ Comparable with OpenMP on data-parallel; really fast on streaming (especially very high-frequency)
✤ Entirely C++ (C++11), minimalistic design, solid enough to test new solutions
✤ E.g. the FastFlow lock-free parallel memory allocator extended with CUDA Unified Memory
✤ Main platform is Linux, but it works almost everywhere a C++ compiler exists
✤ MacOS, Win; x86, x86_64, Xeon Phi, PPC, Arm, Tilera, NVidia (CUDA and OpenCL); gcc, clang, icc, nvcc

Thanks

M. Aldinucci, G. Peretti, A. Secco, F. Tordini, M. Drocco, C. Misale (University of Turin)
M. Torquati, M. Danelutto (University of Pisa)

Funding: EU-FP7 - 3.5M€; IMPACT; Unito - 440K€; EU-FP7 - 3.7M€; EU-FP7 Network of Excellence

https://sourceforge.net/projects/mc-fastflow/