High-Performance Alternatives for Migrating Matlab Applications to GPU

Francisco Javier García Blas and J. Daniel García
@redenzor
Universidad Carlos III de Madrid, ARCOS Group
ADMINTECH 2017, February 7-9, 2017

ARCOS Group at UC3M
- Universidad Carlos III de Madrid.
- Founded in 1989.
- ARCOS Group:
  - High-performance computing and I/O.
  - Data distribution and analysis.
  - Real-time systems, maintenance and simulation.
  - Programming models for application improvement.

Myths about Matlab

Myth 1: Matlab is difficult

Google Trends
- Huge user community
- Used in multiple areas of engineering, research, ...

TIOBE index

Myth 2: Matlab is slow

Is Matlab slow?
- Matlab relies on the Intel Math Kernel Library (MKL)
- Intel MKL provides automatic offloading (AO), in a transparent and automatic way, to:
  - Multiprocessors
  - Intel Xeon Phi
- Partial support for GPU parallelization

Myth 3: Matlab is expensive

What do we love about Matlab?
- Fast and accurate prototyping
- Good support for developers
- Simple representation of algebraic operations
- Graphical interface (GUI) for debugging and development
- Matlab Simulink
- Programming based on master-slave (workers):
  - parfor: parallel loops
  - GPU

What don't we love about Matlab?
- Memory management
- Application deployment is highly dependent on Matlab
- Limited alternatives for efficient parallelization in shared memory

Alternatives to Matlab in C++ (some)
- Eigen
- Armadillo
- ArrayFire

What is Armadillo?
- Open-source library for C++
- Uses a syntax similar to Matlab's
- Based on generic programming and templates with C++11:
  - Generic algorithms (transform, for_each, reduce)
  - Lambdas
  - C++ STL containers
- Support for BLAS and LAPACK
- Provides basic types for mathematical representation:
  - Mat (2D)
  - Cube (3D)
- Support for acceleration by using GPUs
- SIMD is also included as a feature (e.g. SSE2)
- Column-major memory layout
http://arma.sourceforge.net/

Armadillo operators
- Operators over Mat, Col, Row and Cube:
  - +
  - -
  - /
  - * (matrix multiplication)
  - % (element-wise multiplication)
  - ==

Armadillo operators (code)

    // assumes: using namespace arma;
    mat A(5, 5, fill::randu);         // 5x5 matrix of uniform random values
    double x = A(1,2);                // element access (zero-based indices)
    mat B = A + A;
    mat C = A * B;                    // matrix multiplication
    mat D = A % B;                    // element-wise multiplication
    cx_mat X(A,B);                    // complex matrix from real and imaginary parts
    B.zeros();
    B.set_size(10,10);
    B.ones(5,6);                      // resize to 5x6 and fill with ones
    B.print("B:");
    mat::fixed<5,6> F;                // fixed-size matrix
    double aux_mem[24];
    mat H(&aux_mem[0], 4, 6, false);  // use auxiliary memory

Example: matrix multiplication

    #include <iostream>
    #include <armadillo>

    int main(int argc, char** argv)
    {
        using namespace std;
        using namespace arma;

        mat A = randu<mat>(5000,5000);
        mat B = randu<mat>(5000,5000);
        mat C = A * B;
        return 0;
    }

Example: Solver

    #include <iostream>
    #include <armadillo>

    int main()
    {
        arma::vec b;
        b << 2.0 << 5.0 << 2.0;

        // A is 3x2, so solve() returns the least-squares solution
        arma::mat A;
        A << 1.0 << 2.0 << arma::endr
          << 2.0 << 3.0 << arma::endr
          << 1.0 << 3.0 << arma::endr;

        std::cout << "Solution: " << std::endl;
        std::cout << solve(A,b) << std::endl;
        return 0;
    }

Build:

    g++ -o solver solver.cpp -larmadillo

Output:

    3.0000
    -0.3636

Example: Functional programming

    // Matlab: Idiff(Idiff>1) = 1;
    // Matlab: Idiff(Idiff<0) = 0;
    Idiff.elem( find(Idiff > 1.0) ).ones();
    Idiff.elem( find(Idiff < 0.0) ).zeros();

What does Armadillo not provide?
- The full set of Matlab libraries
  - We need to implement functions ourselves in some cases
- Auto-parallelization for memory arrays; loops are parallelized by hand, for example with OpenMP:

    #pragma omp parallel for
    for (int i = 0; i < inda.n_elem; ++i) {
        slicevf_GM.at(inda(i)) = ODF(ODF.n_rows - 1, i);
    }

http://arma.sourceforge.net/

Matlab vs Armadillo

MATLAB:

    for i = 1:Niter
        fODFi = fODF;
        Ratio = mBessel_ratio(n_order,Reblurred_S);
        RL_factor = KernelT * ( Signal .* (Ratio)) ./ (KernelT * (Reblurred) + eps);
        fODF = fODFi .* RL_factor;
        Reblurred = Kernel * fODF;
        Reblurred_S = (Signal .* Reblurred) ./ sigma2;
        sigma2_i = (1/N) * sum( (Signal.^2 + Reblurred.^2)/2 - (sigma2 .* Reblurred_S) .* Ratio, 1) ./ n_order;
        sigma2_i = min((1/10)^2, max(sigma2_i,(1/50)^2));
        sigma2 = repmat(sigma2_i,[N, 1]);
    end

Armadillo:

    for (auto i = 0; i < Niter; ++i) {
        fODFi = fODF;
        Ratio = mBessel_ratio<T>(n_order,Reblurred_S);
        RL_factor = KernelT * (Signal % Ratio) / ((KernelT * Reblurred) + std::numeric_limits<T>::epsilon());
        fODF = fODFi % RL_factor;
        Reblurred = Kernel * fODF;
        Reblurred_S = (Signal % Reblurred) / sigma2;
        sigma2_i = (1.0/N) * sum( (pow(Signal,2) + pow(Reblurred,2))/2 - (sigma2 % Reblurred_S) % Ratio, 0) / n_order;
        // Clamp every element to the range [(1/50)^2, (1/10)^2]
        sigma2_i.transform( [](T val) {
            return std::min<T>(std::pow<T>(1.0/10.0,2), std::max<T>(val, std::pow<T>(1.0/50.0,2)));
        } );
        sigma2 = repmat(sigma2_i, N, 1);
    }

How can I improve performance so far?
- Near-"magical" auto-parallelization by linking against a state-of-the-art BLAS library:
  - Intel MKL (CPU)
  - OpenBLAS (CPU)
  - Atlas (CPU)
  - Magma (GPU)
  - ...
- NVidia enables offloading of BLAS to the GPU:
  - cuBLAS: requires using its API (fine grain)
  - NVBLAS: automatic offloading (coarse grain)

Configuring NVBLAS

Configuration file (nvblas.conf):

    NVBLAS_LOGFILE nvblas.log
    NVBLAS_CPU_BLAS_LIB libmkl_rt.so
    #NVBLAS_CPU_BLAS_LIB libopenblas.so
    NVBLAS_GPU_LIST 0
    #NVBLAS_GPU_LIST ALL
    NVBLAS_TILE_DIM 2048
    #NVBLAS_GPU_DISABLED_SGEMM
    #NVBLAS_GPU_DISABLED_DGEMM
    #NVBLAS_GPU_DISABLED_CGEMM
    #NVBLAS_GPU_DISABLED_ZGEMM
    NVBLAS_CPU_RATIO_CGEMM 0.07

Running the application with NVBLAS preloaded:

    %> LD_PRELOAD=/usr/local/cuda-7.5/lib64/libnvblas.so ./miapplicacion

ArrayFire
- Device-aware programming model
- Based on the array class
- Limited to data represented as 1D/2D/3D
- Open source
- Vendor-neutral:
  - Nvidia (CUDA)
  - AMD (OpenCL)
  - CPU
- Multiple features (BLAS, machine learning, financial, etc.)
- Supports CMake
https://github.com/arrayfire/arrayfire

Basic examples

    // assumes: using namespace af;
    array A = array(seq(1,9), 3, 3);
    af_print(A);
    af_print(A(0));           // first element
    af_print(A(0,1));         // first row, second column
    af_print(A(end));         // last element
    af_print(A(-1));          // last element (as well)
    af_print(A(1,span));      // second row
    af_print(A.row(end));     // last row
    af_print(A.cols(1,end));  // all columns except the first

    float b_host[] = {0,1,2,3,4,5,6,7,8,9};
    array b(10, 1, b_host);
    af_print(b(seq(3)));
    af_print(b(seq(1,7)));
    af_print(b(seq(1,7,2)));
    af_print(b(seq(0,end,2)));

Example

    #include <arrayfire.h>
    #include <cmath>
    #include <cstdlib>
    #include <iostream>

    static af::array A;

    // Function to benchmark: one N-by-N matrix product
    static void fn()
    {
        af::array B = af::matmul(A,A);
        B.eval();  // force evaluation
    }

    int main(int argc, char ** argv)
    {
        double peak = 0;
        try {
            // Device index from the command line; default to device 0
            int device = argc > 1 ? atoi(argv[1]) : 0;
            af::setDevice(device);
            af::info();

            std::cout << "Benchmark N-by-N" << std::endl;
            for (auto n = 128; n <= 2048; n += 128) {
                std::cout << n << "x" << n << " ";
                A = af::constant(1, n, n);
                double time = af::timeit(fn);  // seconds
                double gflops = 2.0 * powf(n,3) / (time * 1e9);
                if (gflops > peak) peak = gflops;
                std::cout << gflops << "GF" << std::endl;
            }
        } catch (af::exception & e) {
            std::cout << e.what() << std::endl;
            throw;
        }
        std::cout << "## Max " << peak << " GFLOPS" << std::endl;
        return 0;
    }

Gfor-loop
- A gfor-loop runs the iterations of the loop concurrently (in parallel)
- Limited range size
- FFT applied to each volume slice:

    for (int i = 0; i < N; ++i)
        A(span,span,i) = fft2(A(span,span,i));  // Sequential

    gfor (seq i, N)
        A(span,span,i) = fft2(A(span,span,i));  // Parallel

ArrayFire + Armadillo
- Both share the same memory layout (column-major)
- It is possible to transfer data from Mat (Armadillo) to array (ArrayFire):

    af::array mat_gpu = af::array(rows, columns, mat_cpu.memptr());
    …
    mat_gpu.host(mat_cpu.memptr());

Use case: pHARDI
- Identification of nerve fibers to study the degree of connectivity between the different areas of the brain
- Performance: near real-time, for:
  - Operating room
  - Statistical research (data analytics)
http://www.bitbucket.com/fjblas/phardi

Motivation
- Single slice vs. whole brain volume (~100 slices)
- Main disadvantage:
  - Long computation times, due in part to the high number of voxels

Available intravoxel fiber reconstruction algorithms (CT = computation time; Matlab codes and C++ codes):
- Kernel-based methods:
  - Q-ball Imaging (QBI), Tuch et al. 2004. CT: 10 min
  - Diffusion Orientation Transform Revisited (DOTR), Canales-Rodríguez et al. 2010. CT: 30 min
  - Spherical Deconvolution of Multichannel DWMRI Data with Non-Gaussian Noise Models and Spatial Regularization (RUMBA), Canales-Rodríguez et al. 2015. CT: 3 hours
  - Generalized Q-sampling Imaging (GQI), Yeh, Fang-Cheng et al. 2010. CT: 30 min
  - Diffusion Spectrum Imaging (DSI), Wedeen VJ et al. 2005
- Bayesian estimation:
  - Bayesian Estimation of Diffusion Parameters Obtained using Sampling Techniques (BEDPOSTX), Behrens et al. 2003. CT: 8 hours
- Probabilistic tracking:
  - Probabilistic Tracking (PROBTRACKX), Behrens et al. 2007. CT: 7 hours

pHARDI
- Portable implementation for heterogeneous systems
- Totally migrated to C++
- High-performance solution
- Multi-device support
- 100x faster than other developments in the field (Bedpostx)

Evaluation (I)
- Intel Xeon E5-2630 v3:
  - 8 cores
  - 2.40 GHz
  - 128 GB RAM
- Ubuntu 14.04 x64
- CUDA version 7.5
- Compilers:
  - GCC 5.1
  - Flags -O3 and -DNDEBUG
- Nvidia Tesla K40
- GTX 680

Evaluation (II)

Hands-on
- Access to the lab machine:
  - Host: ssh urraca.arcos.inf.uc3m.es
  - User: admintech
  - Password: .admintech.2017.
- SLURM
- Scripts for deploying the examples:
  - ./launch_blas.sh cuda|cpu|opencl
  - ./launch_fft.sh cuda|cpu|opencl
  - ./launch_gc.sh cuda|cpu|opencl
  - ./launch_phardi cuda|cpu|mkl|opencl

Conclusions
- It is possible to deploy applications outside the Matlab environment
- Flexibility for development
- Matlab as a DSL..
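
The ArrayFire + Armadillo slide relies on both libraries storing matrices in column-major order, which is why a raw pointer can be handed from one to the other without reordering. The following is a minimal standard-C++ sketch of that layout only. It uses neither library, and `col_major_index` and `make_col_major` are hypothetical helpers introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: linear offset of element (row, col) in a
// column-major buffer with `rows` rows. Each column is contiguous.
inline std::size_t col_major_index(std::size_t row, std::size_t col,
                                   std::size_t rows) {
    assert(row < rows);
    return row + col * rows;
}

// Hypothetical helper: build a rows x cols column-major buffer whose
// element (r, c) holds 10*r + c, so the position of each value is
// easy to recognise when inspecting the raw memory.
inline std::vector<double> make_col_major(std::size_t rows, std::size_t cols) {
    std::vector<double> buf(rows * cols);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            buf[col_major_index(r, c, rows)] = 10.0 * r + c;
        }
    }
    return buf;
}
```

For a 3x2 matrix, `make_col_major(3, 2)` lays out the first column in slots 0 to 2 and the second column in slots 3 to 5. Two containers that agree on this convention can exchange the buffer as-is, with no transpose.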

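
The ArrayFire benchmark slide converts a measured time into GFLOPS as 2·n³ / (time · 10⁹), since an n-by-n matrix product performs roughly n³ multiplications and n³ additions. The sketch below isolates that arithmetic in standard C++. `matmul_gflops` and `time_seconds` are hypothetical names, and `time_seconds` is only a simple single-run stand-in, not a reimplementation of `af::timeit`.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: GFLOPS reached by an n-by-n matrix product that
// took `seconds`, counting about 2*n^3 floating-point operations.
inline double matmul_gflops(int n, double seconds) {
    assert(n > 0 && seconds > 0.0);
    const double flops = 2.0 * n * n * n;  // evaluated in double precision
    return flops / (seconds * 1e9);
}

// Hypothetical helper: wall-clock seconds taken by a single call of `fn`.
template <typename F>
double time_seconds(F&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

As a check of the formula, a 1000x1000 product that takes one second corresponds to `matmul_gflops(1000, 1.0)`, that is 2·10⁹ operations per second, or 2 GFLOPS.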