High-performance alternatives for migrating Matlab applications to GPU

Francisco Javier García Blas and J. Daniel García
@redenzor
Universidad Carlos III de Madrid
ARCOS Group
2017

ADMINTECH 2017 - February 7 to 9, 2017

ARCOS Group at UC3M

§ Universidad Carlos III de Madrid.
§ Founded in 1989.
§ ARCOS Group:
  § High Performance Computing and I/O.
  § Data distribution and analysis.
  § Real-time systems, maintenance and simulation.
  § Programming models for application improvement.

Myths about Matlab

Myth 1: Matlab is difficult

Google Trends

§ Huge user community
§ Used in multiple areas of engineering, research, …

TIOBE index

Myth 1: Matlab is difficult

Myth 2: Matlab is slow

Is Matlab slow?

§ Matlab relies on the Intel Math Kernel Library (MKL)
§ Intel MKL provides automatic offloading (AO), in a transparent and automatic way, to:
  § Multiprocessors
  § Intel Xeon Phi
§ Partial support for GPU parallelization

Myth 2: Matlab is slow

Myth 3: Matlab is expensive

Myth 3: Matlab is expensive

What do we love about Matlab?

§ Fast and accurate prototyping
§ Good support for developers
§ Simple representation of algebraic operations
§ Graphical interface (GUI) for debugging and development
§ Matlab Simulink
§ Programming based on a master-slave model (workers)
  § parfor: parallel loops
  § GPU

What don't we love about Matlab?

§ Memory management

§ Application deployment is highly dependent on Matlab

§ Limited alternatives for efficient parallelization in shared memory

Alternatives to Matlab in C++ (some)

§Eigen

§

§ArrayFire

What is Armadillo?

§ Open-source library for C++
§ Uses a syntax similar to Matlab
§ Based on generic programming and templates with C++11
  § Generic algorithms (transform, for_each, reduce)
  § Lambdas
  § C++ STL containers
§ Support for BLAS and LAPACK
§ Provides basic types for mathematical objects:
  § Mat (2D)
  § Cube (3D)
§ Support for acceleration by using GPUs
§ SIMD is also included as a feature (e.g. SSE2)
§ Column-major memory layout

http://arma.sourceforge.net/

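Armadillo operators

§ Operators over Mat, Col, Row and Cube
  § +
  § -
  § /
  § *
  § %
  § ==

    mat A(5, 5, fill::randu);   // 5x5 matrix with uniform random values
    double x = A(1,2);          // element access (indices start at 0)
    mat B = A + A;              // element-wise addition
    mat C = A * B;              // matrix multiplication
    mat D = A % B;              // element-wise multiplication
    cx_mat X(A,B);              // complex matrix from real and imaginary parts

    B.zeros();
    B.set_size(10,10);
    B.ones(5,6);
    B.print("B:");
    mat::fixed<5,6> F;          // fixed-size matrix
    double aux_mem[24];
    mat H(&aux_mem[0], 4, 6, false); // use auxiliary memory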

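Example:

    #include <iostream>
    #include <armadillo>

    int main (int argc, char** argv)
    {
        using namespace std;
        using namespace arma;

        // Two 5000x5000 matrices with uniform random values
        mat A = randu<mat>(5000,5000);
        mat B = randu<mat>(5000,5000);
        mat C = A * B;   // matrix product, dispatched to BLAS
        return 0;
    }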

Example: Solver

    #include <iostream>
    #include <armadillo>

    int main()
    {
        arma::vec b = { 2.0, 5.0, 2.0 };

        arma::mat A = { { 1.0, 2.0 },
                        { 2.0, 3.0 },
                        { 1.0, 3.0 } };

        std::cout << "Solution: " << std::endl;
        std::cout << arma::solve(A,b) << std::endl;

        return 0;
    }

Compile:

    g++ -o solver solver.cpp -larmadillo

Output:

    3.0000
    -0.3636
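The system above has three equations and two unknowns, so `solve` returns a least-squares solution. As a cross-check that needs no library, the sketch below forms the normal equations (A^T A) x = A^T b by hand and solves the resulting 2x2 system with Cramer's rule. It only illustrates where the values 3.0000 and -0.3636 come from: `least_squares_2` is a hypothetical helper, and this is not how Armadillo computes the result internally (it relies on LAPACK routines).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of Armadillo): least-squares solution of an
// over-determined system with N equations and 2 unknowns, obtained from the
// normal equations (A^T A) x = A^T b.
template <std::size_t N>
std::array<double, 2> least_squares_2(const std::array<std::array<double, 2>, N>& A,
                                      const std::array<double, N>& b)
{
    // Entries of the symmetric 2x2 matrix A^T A and of the vector A^T b.
    double s00 = 0.0, s01 = 0.0, s11 = 0.0, t0 = 0.0, t1 = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        s00 += A[i][0] * A[i][0];
        s01 += A[i][0] * A[i][1];
        s11 += A[i][1] * A[i][1];
        t0  += A[i][0] * b[i];
        t1  += A[i][1] * b[i];
    }

    // Cramer's rule on the 2x2 system; the columns of A must be independent.
    const double det = s00 * s11 - s01 * s01;
    assert(det != 0.0);

    std::array<double, 2> x{{ (s11 * t0 - s01 * t1) / det,
                              (s00 * t1 - s01 * t0) / det }};
    return x;
}
```

Called with the rows (1, 2), (2, 3), (1, 3) and the right-hand side (2, 5, 2), it gives 3 and -4/11 (about -0.3636), which matches the output listed for the Armadillo program.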

Example: Functional programming

    // Matlab:
    //   Idiff(Idiff>1) = 1;
    //   Idiff(Idiff<0) = 0;
    Idiff.elem( find(Idiff > 1.0) ).ones();
    Idiff.elem( find(Idiff < 0.0) ).zeros();
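The two `find`/`elem` calls above amount to clamping every element to the range [0, 1]. The same idea, written with only the C++ standard library, might look like the sketch below. `clamp_elements` is a hypothetical name, and a plain `std::vector<double>` stands in for the Armadillo matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: clamp every element of v to [lo, hi], in place.
void clamp_elements(std::vector<double>& v, double lo, double hi)
{
    assert(lo <= hi);
    // Element-wise transform with a lambda, in the same spirit as
    // Armadillo's .transform() member function.
    std::transform(v.begin(), v.end(), v.begin(),
                   [lo, hi](double x) { return std::min(hi, std::max(lo, x)); });
}
```

With `lo = 0.0` and `hi = 1.0` this reproduces the thresholding shown above: values below 0 become 0, values above 1 become 1, and everything else is left untouched.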

What does Armadillo not provide?

§ The full set of Matlab libraries
  § We need to implement functions ourselves in some cases
§ Auto-parallelization for memory arrays

    // Loops over arrays must be parallelized by hand, e.g. with OpenMP
    #pragma omp parallel for
    for (arma::uword i = 0; i < inda.n_elem; ++i) {
        slicevf_GM.at(inda(i)) = ODF(ODF.n_rows - 1, i);
    }

http://arma.sourceforge.net/
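The loop above parallelizes a scatter operation by hand, since Armadillo does not do it automatically. A self-contained sketch of the same pattern, using `std::vector` in place of the Armadillo types, is shown below. The names `scatter_values`, `dst`, `idx` and `src` are hypothetical stand-ins for `slicevf_GM`, `inda` and the last row of `ODF`. The sketch assumes the indices in `idx` are distinct, since otherwise concurrent writes to the same element would race. With GCC the pragma is enabled by `-fopenmp`; without that flag the loop simply runs sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: dst[idx[i]] = src[i] for every i, with the
// iterations distributed across threads when OpenMP is enabled.
void scatter_values(std::vector<double>& dst,
                    const std::vector<std::size_t>& idx,
                    const std::vector<double>& src)
{
    assert(idx.size() <= src.size());
    const long n = static_cast<long>(idx.size());

    // Each iteration writes a different element (indices assumed distinct),
    // so no synchronization is needed inside the loop.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        assert(idx[k] < dst.size());
        dst[idx[k]] = src[k];
    }
}
```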

Matlab vs Armadillo

MATLAB:

    for i = 1:Niter
        fODFi = fODF;
        Ratio = mBessel_ratio(n_order,Reblurred_S);
        RL_factor = KernelT * ( Signal .* (Ratio)) ./ (KernelT * (Reblurred)+ eps);
        fODF = fODFi .* RL_factor;
        Reblurred = Kernel * fODF;
        Reblurred_S = (Signal .* Reblurred) ./ sigma2;
        sigma2_i = (1/N) * sum( (Signal.^2 + Reblurred.^2)/2 - (sigma2 .* Reblurred_S) .* Ratio, 1)./n_order;
        sigma2_i = min((1/10)^2, max(sigma2_i,(1/50)^2));
        sigma2 = repmat(sigma2_i,[N, 1]);
    end

Armadillo:

    for (auto i = 0; i < Niter; ++i) {
        fODFi = fODF;
        Ratio = mBessel_ratio(n_order,Reblurred_S);
        RL_factor = KernelT * (Signal % Ratio) / ((KernelT * Reblurred) + std::numeric_limits<T>::epsilon());
        fODF = fODFi % RL_factor;
        Reblurred = Kernel * fODF;
        Reblurred_S = (Signal % Reblurred) / sigma2;
        sigma2_i = (1.0/N) * sum( (pow(Signal,2) + pow(Reblurred,2))/2 - (sigma2 % Reblurred_S) % Ratio, 0) / n_order;
        sigma2_i.transform( [](T val) {
            return std::min(std::pow(1.0/10.0,2), std::max(val, std::pow(1.0/50.0,2)));
        } );
        sigma2 = repmat(sigma2_i, N, 1);
    }

How can I improve performance?

§ Armadillo allows "quasi" magical auto-parallelization:
  § Using state-of-the-art BLAS libraries at the link stage:
    § Intel MKL (CPU)
    § OpenBLAS (CPU)
    § ATLAS (CPU)
    § MAGMA (GPU)
    § …

§ NVIDIA enables offloading of BLAS to the GPU
  § cuBLAS: API required (fine grain)
  § NVBLAS: automatic offloading (coarse grain)

Configuring NVBLAS

Configuration file (nvblas.conf):

    NVBLAS_LOGFILE nvblas.log

    NVBLAS_CPU_BLAS_LIB libmkl_rt.so
    #NVBLAS_CPU_BLAS_LIB libopenblas.so

    NVBLAS_GPU_LIST 0
    #NVBLAS_GPU_LIST ALL

    NVBLAS_TILE_DIM 2048

    #NVBLAS_GPU_DISABLED_SGEMM
    #NVBLAS_GPU_DISABLED_DGEMM
    #NVBLAS_GPU_DISABLED_CGEMM
    #NVBLAS_GPU_DISABLED_ZGEMM

    NVBLAS_CPU_RATIO_CGEMM 0.07

Running an application with NVBLAS preloaded:

    %> LD_PRELOAD=/usr/local/cuda-7.5/lib64/libnvblas.so ./miapplicacion

ArrayFire

§ Device-aware programming model
§ Based on the array class
§ Limited to data with up to four dimensions (1D/2D/3D/4D)
§ Open source
§ Vendor-neutral
  § Nvidia (CUDA)
  § AMD (OpenCL)
  § CPU
§ Multiple features (BLAS, machine learning, financial, etc.)
§ Supports CMake

https://github.com/arrayfire/arrayfire

Basic examples

    array A = array(seq(1,9), 3, 3);
    af_print(A);

    af_print(A(0));           // first element
    af_print(A(0,1));         // first row, second column

    af_print(A(end));         // last element
    af_print(A(-1));          // last element (as well)

    af_print(A(1,span));      // second row
    af_print(A.row(end));     // last row
    af_print(A.cols(1,end));  // all columns except the first

    float b_host[] = {0,1,2,3,4,5,6,7,8,9};
    array b(10, 1, b_host);
    af_print(b(seq(3)));
    af_print(b(seq(1,7)));
    af_print(b(seq(1,7,2)));
    af_print(b(seq(0,end,2)));

Example

    #include <arrayfire.h>
    #include <cmath>
    #include <cstdlib>
    #include <iostream>

    static af::array A;

    // Kernel to be timed: one N-by-N matrix product
    static void fn()
    {
        af::array B = af::matmul(A,A);
        B.eval();   // force evaluation
    }

    int main(int argc, char ** argv)
    {
        double peak = 0;
        try {
            // Device index from the command line, defaulting to 0
            int device = argc > 1 ? atoi(argv[1]) : 0;
            af::setDevice(device);
            af::info();

            std::cout << "Benchmark N-by-N" << std::endl;
            for (auto n = 128; n <= 2048; n += 128) {
                std::cout << n << "x" << n << " ";
                A = af::constant(1, n, n);
                double time = af::timeit(fn);
                double gflops = 2.0 * powf(n,3) / (time * 1e9);
                if (gflops > peak) peak = gflops;
                std::cout << gflops << " GF" << std::endl;
            }
        } catch (af::exception & e) {
            std::cout << e.what() << std::endl;
            throw;
        }
        std::cout << "## Max " << peak << " GFLOPS" << std::endl;
        return 0;
    }
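The benchmark converts a timing into GFLOPS by assuming that an N-by-N matrix product costs about 2·N³ floating-point operations (N multiplications and roughly N additions for each of the N² output elements). A small stand-alone sketch of that conversion follows. `matmul_gflops` is a hypothetical helper name, not an ArrayFire function.

```cpp
#include <cassert>

// Hypothetical helper: throughput in GFLOPS of an n-by-n matrix product
// that took `seconds` to run, assuming a cost of 2 * n^3 operations.
double matmul_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    // Computed in double precision to avoid integer overflow for large n.
    const double flops = 2.0 * n * static_cast<double>(n) * n;
    return flops / (seconds * 1e9);
}
```

For example, a 1000-by-1000 product that takes one second corresponds to 2 GFLOPS.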

Gfor-loop

§ gfor runs the iterations of the loop concurrently (in parallel)

§ Limited range size

§ Applying an FFT to each slice of a volume:

    // Sequential
    for (int i = 0; i < N; ++i)
        A(span,span,i) = fft2(A(span,span,i));

    // Parallel
    gfor (seq i, N)
        A(span,span,i) = fft2(A(span,span,i));

ArrayFire + Armadillo

§ Both share the same memory layout (column-major)
§ Possible to transfer data from Mat (Armadillo) to array (ArrayFire)

    // Host (Armadillo) to device (ArrayFire)
    af::array mat_gpu = af::array(rows, columns, mat_cpu.memptr());

    …

    // Device (ArrayFire) back to host (Armadillo)
    mat_gpu.host(mat_cpu.memptr());
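Because both libraries store matrices column by column, the raw buffer returned by `memptr()` can be handed over without reordering. The sketch below spells out what column-major means for a plain buffer: element (r, c) of a matrix with `n_rows` rows lives at offset `r + c * n_rows`. The helper names are hypothetical, and no Armadillo or ArrayFire calls are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: linear offset of element (r, c) in a column-major
// buffer holding a matrix with n_rows rows.
std::size_t col_major_offset(std::size_t r, std::size_t c, std::size_t n_rows)
{
    assert(r < n_rows);
    return r + c * n_rows;
}

// Hypothetical helper: copy a row-major buffer (C array convention) into
// column-major order (Matlab, Armadillo and ArrayFire convention).
std::vector<double> to_col_major(const std::vector<double>& row_major,
                                 std::size_t n_rows, std::size_t n_cols)
{
    assert(row_major.size() == n_rows * n_cols);
    std::vector<double> out(row_major.size());
    for (std::size_t r = 0; r < n_rows; ++r) {
        for (std::size_t c = 0; c < n_cols; ++c) {
            out[col_major_offset(r, c, n_rows)] = row_major[r * n_cols + c];
        }
    }
    return out;
}
```

A row-major C array, by contrast with an Armadillo matrix, would need a conversion like `to_col_major` before its buffer could be passed to either library.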

Use case: pHARDI

§ Identification of nerve fibers to study the degree of connectivity between the different areas of the brain
§ Performance: near real-time
  § Operating room
  § Statistical research (data analytics)

http://www.bitbucket.com/fjblas/phardi

Motivation

[Figure: single slice vs. whole brain volume (~100 slices)]

§ Main disadvantage
  § Long computation times, partly due to the high number of voxels.

§ Available intravoxel fiber reconstruction algorithms, with computation time (CT)
  § Kernel-based methods
    § Qball Imaging (QBI). CT: 10 min
    § Diffusion Orientation Transform Revisited (DOTR). CT: 30 min
    § Spherical Deconvolution of Multichannel DWMRI Data with Non-Gaussian Noise Models and Spatial Regularization (RUMBA). CT: 3 hours
    § Generalized Q-sampling Imaging (GQI). CT: 30 min
    § Diffusion Spectrum Imaging (DSI)
  § Bayesian estimation
    § Bayesian Estimation of Diffusion Parameters Obtained using Sampling Techniques (BEDPOSTX). CT: 8 hours

§ Probabilistic tracking
  § Probabilistic Tracking (PROBTRACKX). CT: 7 hours

Matlab codes and C++ codes: QBI (Tuch et al. 2004), DOTR (Canales-Rodríguez et al. 2010), RUMBA (Canales-Rodríguez et al. 2015), GQI (Yeh, Fang-Cheng et al. 2010), DSI (Wedeen VJ et al. 2005), BEDPOSTX (Behrens et al. 2003), PROBTRACKX (Behrens et al. 2007)

pHARDI

§ Portable implementation for heterogeneous systems
§ Fully migrated to C++
§ High-performance solution
§ Multi-device support
§ 100x faster than other developments in the field (Bedpostx)

Evaluation (I)

§ Intel Xeon E5-2630 v3
  § 8 cores
  § 2.40 GHz
  § 128 GB RAM
§ Ubuntu 14.04 x64
§ CUDA version 7.5
§ Compilers
  § GCC 5.1
  § Flags: -O3 and -DNDEBUG
§ Nvidia Tesla K40
§ GTX 680

Evaluation (II)

Hands-on

§ Access to the lab machine
  § Host: ssh urraca.arcos.inf.uc3m.es
  § User: admintech
  § Password: .admintech.2017.

§ SLURM

§ Scripts for launching the examples:
  § ./launch_blas.sh cuda|cpu|opencl
  § ./launch_fft.sh cuda|cpu|opencl
  § ./launch_gc.sh cuda|cpu|opencl
  § ./launch_phardi cuda|cpu|mkl|opencl

Conclusions

§It is possible to deploy applications out of the Matlab environment

§Flexibility for development

§Matlab as a DSL.