
Tecnura, ISSN: 0123-921X
Universidad Distrital Francisco José de Caldas, Bogotá, Colombia

Hernández B., Esteban; Montoya Gaviria, Gerardo de Jesús; Montenegro, Carlos Enrique. Parallel programming languages on heterogeneous architectures using OpenMPC, OmpSs, OpenACC and OpenMP. Tecnura, vol. 18, núm. 2, diciembre, 2014, pp. 160-170.

Available in: http://www.redalyc.org/articulo.oa?id=257059813014
Journal's homepage: http://revistas.udistrital.edu.co/ojs/index.php/Tecnura/issue/view/687
DOI: http://doi.org/10.14483/udistrital.jour.tecnura.2014.DSE1.a14

Reflexión Parallel programming languages on heterogeneous architectures using openmpc, ompss, openacc and openmp

Lenguajes para programación paralela en arquitecturas heterogéneas utilizando openmpc, ompss, openacc y openmp

Esteban Hernández B.*, Gerardo de Jesús Montoya Gaviria**, Carlos Enrique Montenegro***

Received: June 10th, 2014. Accepted: November 4th, 2014.

Citation / Para citar este artículo: Hernández, E., Montoya Gaviria, G. de J., & Montenegro, C. E. (2014). Parallel programming languages on heterogeneous architectures using OPENMPC, OMPSS, OPENACC and OPENMP. Revista Tecnura, 18 (Edición especial doctorado), 160-170. doi: 10.14483/udistrital.jour.tecnura.2014.DSE1.a14

ABSTRACT

In the field of parallel programming, a new big player has emerged in the last 10 years. GPUs have taken on a relevant importance in scientific computing because they offer high computational performance, low cost and simplicity of implementation. However, one of the most important challenges is the programming languages used for these devices; the effort of recoding algorithms originally designed for CPUs is a critical problem. In this paper we review three of the principal frameworks for programming CUDA devices and compare them with the new directives introduced in the OpenMP 4 standard, solving the Jacobi iterative method.

Keywords: CUDA, Jacobi method, OmpSs, OpenACC, OpenMP, parallel programming.

* Network Engineer with a Master's degree in Engineering and Free Software Construction, and minor degrees in applied mathematics and network software construction. Currently a doctoral student in Engineering at Universidad Distrital and principal architect at RUNT. His work focuses on parallel programming, high performance computing, computational numerical simulation and numerical weather forecasting. E-mail: [email protected]
** Engineer in Meteorology from the University of Leningrad, with a doctorate in physical-mathematical sciences from Moscow State University. Pioneer of meteorology in Colombia, in charge of the meteorology graduate program at Universidad Nacional de Colombia, researcher and director of more than 12 graduate theses in dynamic meteorology, numerical forecasting, air quality, and the efficient use of climate and weather models. He is currently a full professor in the Faculty of Geosciences at Universidad Nacional de Colombia. E-mail: [email protected]
*** Systems Engineer, with PhD and Master's degrees in Informatics, director of the research group GIIRA, focused on social network analysis, e-learning and data visualization. He is currently an associate professor in the Engineering Faculty at Universidad Distrital. E-mail: [email protected]


INTRODUCTION

Over the last 10 years, GPUs have become a principal element of the new approach to parallel programming; they have evolved from graphics-specific accelerators to general-purpose computing devices, and the present time is considered to be the era of GPUs (Nickolls & Dally, 2010). However, the main obstacle to large adoption by the community has been the lack of standards that allow programming the different existing hardware solutions in a unified form (Nickolls & Dally, 2010). The most important player in GPU solutions is NVIDIA®, with the CUDA® programming language and its own compiler (nvcc) (Hill & Marty, 2008), with thousands of installed solutions and a prominent presence in the Top500 supercomputer list, while portability remains the main problem. Some community projects and hardware alliances have proposed solutions to resolve this issue: OmpSs, OpenACC and OpenMPC have emerged as the most promising ones (Vetter, 2012), all using the OpenMP base model. In the last year, the OpenMP board released the version 4 application program interface (OpenMP, 2013), with support for external devices (including GPUs and vector processors). In this paper we compare four implementations of Jacobi's iterative method, to show the advantages and disadvantages of each framework.

METHODOLOGY

The frameworks used here work as extensions based on #pragma directives for the C language, offering the simplest way to program without complicated development or external elements. In the next sections we describe the frameworks and give some implementation examples. In the last part, we show the pure CUDA kernel implementation.

OmpSs

OmpSs (a programming model from the Barcelona Supercomputing Center based on OpenMP and StarSs) is a framework focused on the task decomposition paradigm for developing parallel applications on cluster environments with heterogeneous architectures. It provides a set of compiler directives that can be used to annotate a sequential code, and additional features have been added to support the use of accelerators like GPUs. OmpSs is based on StarSs, a task-based programming model: a serial application is annotated with directives that are translated by the compiler. With it, the same program that runs sequentially in a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime moves the data as needed between the different nodes and GPUs, minimizing the impact of communication by using affinity scheduling, caching, and by overlapping communication with the computational tasks.

OmpSs is based on the OpenMP programming model with modifications to its execution and memory model. It also provides some extensions for synchronization, data motion and heterogeneity support.

1) Execution model: OmpSs uses a thread-pool execution model instead of the traditional OpenMP fork-join model. The master thread starts the execution and all other threads cooperate executing the work it creates (whether it is from work sharing or task constructs). Therefore, there is no need for a parallel region. Nesting of constructs allows other threads to generate work as well (Figure 1).

2) Memory model: OmpSs assumes that multiple address spaces may exist. As such, shared data may reside in memory locations that are not directly accessible from some of the computational resources. Therefore, all parallel code can only safely access private data and shared data which has been marked explicitly with the extended syntax. This assumption is true even for SMP machines, as the implementation may reallocate shared data to improve memory accesses (e.g., NUMA).


3) Extensions:
Function tasks: OmpSs allows annotating function declarations or definitions with a task directive (Durán, Pérez, Ayguadé, Badia & Labarta, 2008). In this case, any call to the function creates a new task that will execute the function body. The data environment of the task is captured from the function arguments.
Dependency synchronization: OmpSs integrates the StarSs dependence support (Durán et al., 2008). It allows annotating tasks with three clauses: input, output and in/out. They allow expressing, respectively, that a given task depends on some data produced before, that it will produce some data, or both. The syntax in the clause allows specifying scalars, arrays, pointers and pointed data.

Figure 1. OmpSs execution model

Source: Barcelona Supercomputing Center, p. 11. http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf
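As an illustration of the function tasks and dependency clauses described above, the following minimal sketch (not taken from the paper) annotates a function as a task so that the second call waits for the data produced by the first. The function names, the vector size and the in/out clause spelling are assumptions; some OmpSs versions use input/output instead, and the array-section syntax a[0;N] denotes start and length.

/* Hedged sketch: OmpSs-style function task with data dependencies (names are hypothetical). */
#define N 1024

#pragma omp task in(a[0;N], b[0;N]) out(c[0;N])
void vec_add(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];   /* every call to vec_add creates a task */
}

int main(void)
{
    static float x[N], y[N], z[N], w[N];
    vec_add(x, y, z);         /* first task: produces z */
    vec_add(z, y, w);         /* second task: consumes z, so it runs after the first */
    #pragma omp taskwait      /* wait for all generated tasks before using w */
    return 0;
}

Because the dependencies are declared on the function arguments, the runtime can schedule independent calls concurrently without an explicit parallel region, which matches the thread-pool execution model described above.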

OpenACC

OpenACC is an industry standard for heterogeneous computing proposed at the Supercomputing Conference 2011. OpenACC follows the OpenMP approach, with annotations on sequential code using compiler directives (pragmas) indicating those regions of code susceptible to be executed on the GPU.

The execution model targeted by OpenACC API-enabled implementations is host-directed execution with an attached accelerator device, such as a GPU. Much of a user application executes on the host. Compute intensive regions are offloaded to the accelerator device under control of the host. The device executes parallel regions, which typically contain work-sharing loops, or kernels regions, which typically contain one or more loops which are executed as kernels on the accelerator. Even in accelerator-targeted regions, the host may orchestrate the execution by allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the compute region, queuing the device code, waiting for completion, transferring results back to the host, and de-allocating memory (Figure 2). In most cases, the host can queue a sequence of operations to be executed on the device, one after the other (Wolfe, 2013).

The current problems with OpenACC are that it supports only the fork-join model and that only commercial compilers (PGI and CAPS) support its directives (Wolfe, 2013; Reyes, López, Fumero & Sande, 2012). In the last year, only one open source implementation has appeared (accULL) (Reyes & López-Rodríguez, 2012).

Figure 2. OpenACC execution model

Source: Barcelona Supercomputing Center, p. 11. http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf
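As a concrete illustration of this host-directed model, the following minimal sketch (not from the paper) offloads one loop and makes the host-device data movement explicit with data clauses; the function name, array size and clause choices are assumptions for illustration only.

/* Hedged sketch: OpenACC offloading with explicit data movement.
   copyin() transfers the input to the device, copyout() returns the result. */
#define N 4096

void scale(const float *restrict in, float *restrict out, float alpha)
{
    #pragma acc parallel loop copyin(in[0:N]) copyout(out[0:N])
    for (int i = 0; i < N; i++)
        out[i] = alpha * in[i];   /* executed as a kernel on the accelerator */
}

The alternative #pragma acc kernels directive, used later in the Jacobi implementation, lets the compiler decide how to map the enclosed loops to the device instead of stating it explicitly.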


OpenMPC

OpenMPC (OpenMP extended for CUDA) is a framework that hides the complexity of the programming model and memory model from the user (Lee & Eigenmann, 2010). OpenMPC consists of a standard OpenMP API plus a new set of directives and environment variables to control important CUDA-related parameters and optimizations.

OpenMPC addresses two important issues in GPGPU programming: programmability and tunability. As a front-end programming model, OpenMPC provides abstractions of the complex CUDA programming model and high-level controls over various optimizations and CUDA-related parameters. OpenMPC includes a fully automatic compilation and user-assisted tuning system. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants.

OpenMPC uses the Cetus compiler (Dave, Bae, Min & Lee, 2009) for automatic source-to-source parallelization. The C source code goes through three levels of analysis:

- Privatization
- Reduction variable recognition
- Induction variable substitution

OpenMPC adds a number of pragmas to annotate OpenMP parallel regions and select optimization regions. The added pragmas have the following form: #pragma <>.

OpenMP release 4

OpenMP is the most widely used framework for programming parallel software, with support in most of the existing compilers. With the explosion of multicore and manycore systems, OpenMP has gained acceptance in the parallel programming community and among hardware vendors. From its creation to version 3, the focus of the API was CPU environments, but with the introduction of GPUs and vector accelerators, the new release 4 includes support for external devices (OpenMP, 2013). Historically, OpenMP supported the Single Instruction Multiple Data (SIMD) model only through the fork-join model (Figure 3), but in this new release the task-based model (Duran et al., 2008; Podobas, Brorsson & Faxén, 2010) has been introduced to gain performance with more parallelism on external devices. The most important directives introduced were target, teams and distribute. These directives permit a group of threads to be distributed on a special device and the result to be copied back to host memory (Figure 3).

Figure 3. OpenMP 4 execution model

Source: Parallel OpenMP. http://www.theclassifiedsplus.com/video/video/axnA3kcLHK4/intel-parallel-openmp.html
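A minimal sketch (not from the paper) of the device directives named above follows; the loop, array names and sizes are illustrative assumptions. target offloads the region, teams creates a league of thread teams, distribute combined with parallel for spreads the iterations across them, and map() handles the host-device copies.

/* Hedged sketch: OpenMP 4.0 device offloading with target, teams and distribute. */
#define N 4096

void vec_add(const float *a, const float *b, float *c)
{
    #pragma omp target map(to: a[0:N], b[0:N]) map(from: c[0:N])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];    /* runs on the device; c is copied back to the host afterwards */
}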


JACOBI ITERATIVE METHOD

Iterative methods are suitable for large scale linear equations. There are three commonly used iterative methods: Jacobi's method, the Gauss-Seidel method and the SOR iterative method (Gravvanis, Filelis-Papadopoulos & Lipitakis, 2013; Huang, Teng, Wahid & Ko, 2009). The convergence speed of the last two methods is faster than that of Jacobi's iterative method, but they lack parallelism; they have advantages over the Jacobi method only when implemented in sequential fashion and executed on traditional CPUs. On the other hand, Jacobi's iterative method has inherent parallelism and is suitable to be implemented on CUDA or vector accelerators to run concurrently on many cores. The basic idea of the Jacobi method is to convert the system A x = b, with A = D + L + U (D the diagonal of A, and L, U its strictly lower and upper triangular parts), into the equivalent system of Equation (1), and then solve it iteratively with Equation (2):

x = D^{-1} (b - (L + U) x)        (1)

On each iteration we solve:

x^{(k)} = D^{-1} (b - (L + U) x^{(k-1)})        (2)

where the values from the (k-1)-th iteration are used to compute the values of the k-th iteration. The pseudocode for the Jacobi method (Dongarra et al., 2008):

choose an initial guess x^(0) to the solution x
for k = 1, 2, …
    for i = 1, 2, …, n
        xi = 0
        for j = 1, 2, …, i-1, i+1, …, n
            xi = xi + a_{i,j} x_j^{(k-1)}
        end
        xi = (b_i - xi) / a_{i,i}
    end
    x^{(k)} = x
    check convergence; continue if necessary
end

This iterative method can be implemented in parallel form, using shared or distributed memory (Margaris, Souravlas & Roumeliotis, 2014) (Figure 4). With distributed memory, it needs some explicit synchronization and data copies; with shared memory, it needs work distribution and data merging in memory. It uses the following component-wise update:

x_i^{(k)} = (b_i - sum_{j != i} a_{i,j} x_j^{(k-1)}) / a_{i,i},  i = 1, …, n        (3)

Figure 4. Parallel form of the Jacobi method on shared memory

Source: Alsemmeri (n.d.). Parallel Jacobi Algorithm. https://www.cs.wmich.edu/~elise/courses/cs626/s12/PARALLEL-JACOBI-ALGORITHM11.pptx
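For reference, a plain sequential C version of the pseudocode above might look as follows; the matrix layout, the names and the convergence test are illustrative assumptions, and the OpenMP, OpenACC and CUDA versions in the following sections parallelize this same structure.

/* Hedged sketch: sequential Jacobi iteration for A x = b (row-major A[n][n]).
   Stops after max_iter sweeps or when the largest component change is below eps. */
#include <math.h>

void jacobi(int n, const double *A, const double *b,
            double *x, double *x_new, int max_iter, double eps)
{
    for (int k = 0; k < max_iter; k++) {
        double diff = 0.0;
        for (int i = 0; i < n; i++) {
            double sigma = 0.0;
            for (int j = 0; j < n; j++)
                if (j != i)
                    sigma += A[i * n + j] * x[j];      /* sum of off-diagonal terms */
            x_new[i] = (b[i] - sigma) / A[i * n + i];  /* divide by the diagonal entry */
            diff = fmax(diff, fabs(x_new[i] - x[i]));
        }
        for (int i = 0; i < n; i++)   /* x <- x_new for the next sweep */
            x[i] = x_new[i];
        if (diff < eps)               /* convergence check */
            break;
    }
}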


OPENMP IMPLEMENTATION

int n=LIMIT_N;
int m=LIMIT_M;
A[n][m];        // the D-1 matrix
Anew[n][m];
y_vector[n];    // b vector
//fill the matrix with initial conditions
…
#pragma omp parallel for shared (m, n, Anew, A)
for (int j = 1; j < n-1; j++)
{
    for (int i = 1; i < m-1; i++ )
    {
        Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmaxf(error, fabsf(Anew[j][i]-A[j][i]));
    }
}
#pragma omp parallel for shared (m, n, Anew, A)
for (int j = 1; j < n-1; j++)
{
    for (int i = 1; i < m-1; i++ )
    {
        A[j][i] = Anew[j][i];
    }
}
if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
iter++;
}
…//print the result

In this implementation, the fork-join model appears in the section annotated with #pragma omp parallel for shared (m, n, Anew, A), where every thread (normally as many as there are cores) in the system runs a copy of the code on a different data section and shares all the variables named in the shared() clause. When the size of m and n is less than or equal to the number of cores, the performance is similar on GPUs and CPUs, but when the size is much higher than the number of cores available, the performance of GPUs increases because the level of parallelism is higher (Fowers, Brown, Cooke & Stitt, 2012; Zhang, Miao & Wang, 2009).

OPENACC IMPLEMENTATION

int n=LIMIT_N;
int m=LIMIT_M;
A[n][m];        // the D-1 matrix
Anew[n][m];
y_vector[n];    // b vector
//fill the matrix with initial conditions
…
#pragma omp parallel for shared (m, n, Anew, A)
#pragma acc kernels
for( int j = 1; j < n-1; j++)
{
    for( int i = 1; i < m-1; i++ )
    {
        Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
    }
}
#pragma omp parallel for shared (m, n, Anew, A)
#pragma acc kernels
for( int j = 1; j < n-1; j++)
{
    for( int i = 1; i < m-1; i++ )
    {
        A[j][i] = Anew[j][i];
    }
}
if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
iter++;
}
…//print the result

In this implementation a new annotation appears, #pragma acc kernels, which indicates the code that will be executed on the accelerator. This simple annotation hides a complex implementation: the CUDA kernel, the copy of data from host to device and from device back to host, the definition of the grid of threads, and the


data manipulation (Amorim & Haase, 2009; Sanders & Kandrot, 2011; Zhang et al., 2009).

PURE CUDA IMPLEMENTATION (Wang, n.d.)

…
int dimB, dimT;
dimT = 256;
dimB = (dim / dimT) + 1;
float err = 1.0;

// set up the memory for GPU
float * LU_d;
float * B_d;
float * diag_d;
float *X_d, *X_old_d;
float * tmp;

cudaMalloc( (void **) &B_d, sizeof(float) * dim );
cudaMalloc( (void **) &diag_d, sizeof(float) * dim );
cudaMalloc( (void **) &LU_d, sizeof(float) * dim * dim);

cudaMemcpy( LU_d, LU, sizeof(float) * dim * dim, cudaMemcpyHostToDevice);
cudaMemcpy( B_d, B, sizeof(float) * dim, cudaMemcpyHostToDevice);
cudaMemcpy( diag_d, diag, sizeof(float) * dim, cudaMemcpyHostToDevice);

cudaMalloc( (void **) &X_d, sizeof(float) * dim);
cudaMalloc( (void **) &X_old_d, sizeof(float) * dim);
cudaMalloc( (void **) &tmp, sizeof(float) * dim);
…
//call to cuda kernels
// 2. Compute X by A x_old
cudaMemcpy( X_old_d, x_old, sizeof(float) * dim, cudaMemcpyHostToDevice);
matMultVec<<<dimB, dimT>>>(LU_d, X_old_d, tmp, dim, dim); // use x_old to compute LU X_old and store the result in tmp
substract<<<dimB, dimT>>>(B_d, tmp, X_d, dim); // get the (B - LU X_old), which is stored in X_d
diaMultVec<<<dimB, dimT>>>(diag_d, X_d, dim); // get the new X

// 3. copy the new X back to the Host Memory
cudaMemcpy( X, X_d, sizeof(float) * dim, cudaMemcpyDeviceToHost);

// 4. calculate the norm of X_new - X_old
substract<<<dimB, dimT>>>(X_old_d, X_d, tmp, dim);
VecAbs<<<dimB, dimT>>>(tmp, dim);
VecMax<<<dimB, dimT>>>(tmp, dim);
// copy the max value from Device to Host
cudaMemcpy(max, tmp, sizeof(float), cudaMemcpyDeviceToHost);

// cuda kernel for vector multiplication
__global__ void matMultVec(float * mat_A, float * vec, float * rst, int dim_row, int dim_col)
{
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x; // Get the row index
    int aIdx;
    while(rowIdx < dim_row)
    {
        rst[rowIdx] = 0; // clean the value at first
        for (int i = 0; i < dim_col; i++)
        {
            aIdx = rowIdx * dim_col + i;         // Get the index for the element a_{rowIdx, i}
            rst[rowIdx] += (mat_A[aIdx] * vec[i] ); // do the multiplication
        }
        rowIdx += gridDim.x * blockDim.x;
    }
}

// cuda kernel for vector subtraction
__global__ void substract(float *a_d, float *b_d, float *c_d, int dim)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while ( tid < dim )
    {
        c_d[tid] = a_d[tid] - b_d[tid];
        tid += gridDim.x * blockDim.x;
    }
}

// cuda kernel for the maximum of a vector (reduction)
__global__ void VecMax(float * vec, int dim)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (dim > 1)
    {
        int mid = dim / 2; // get the half size
        if (tid < mid) // filter the active threads
        {
            if (vec[tid] < vec[tid+mid] ) // get the larger one between vec[tid] and vec[tid+mid]
                vec[tid] = vec[tid+mid];  // and store the larger one in vec[tid]
        }
        //deal with the odd case
        if (dim % 2 ) // if dim is odd... we need to care about the last element
        {
            if (tid == 0 ) // only use vec[0] to compare with vec[dim-1]
            {
                if (vec[tid] < vec[dim-1] )
                    vec[tid] = vec[dim-1];
            }
            __syncthreads();
        }
        __syncthreads(); // sync all threads
        dim /= 2; // make the vector half the size
    }
}

The effort of writing three kernels, managing the logic of the grid dimensions, copying data from host to GPU and from GPU to host, synchronizing the threads within the groups of threads on the GPU, and handling details such as the thread ID calculation demands detailed knowledge of the hardware device and a considerable amount of code.

TEST TECHNIQUE

We take the three implementations of the Jacobi method (pure OpenMP, OpenACC, OpenMPC), run them on the SUT (System Under Test) of Table 1, and make multiple runs with square matrices of incremental sizes (Table 2), taking the processing time to analyze the performance (Kim, n.d.; Sun & Gustafson, 1991) against the number of code lines needed for the algorithm.

Table 1. Characteristics of the System under Test

System: Supermicro SYS-1027GR-TRF
CPU: Intel® Xeon® 10-core E5-2680 V2 CPUs @ 2.80 GHz
Memory: 32 GB DDR3 1600 MHz
GPU: NVIDIA Kepler K40, 2880 CUDA cores, 12 GB memory
OS: Red Hat Enterprise Linux Server release 6.4
Compiler: PGI Accelerator C/C++ Compiler 14, release 9, for Linux
L2 Cache: 256K
L3 Cache: 25MB


Table 2. Matrix size performance comparison

Matrix size (N)   Implementation            Mean running time (seconds)
2048              OpenMP with 20 threads    26.578
2048              OpenACC                   74.970
2048              OpenMPC                   69.890
4096              OpenMP with 20 threads    74.280
4096              OpenACC                   92.320
4096              OpenMPC                   75.120
8096              OpenMP with 20 threads    116.23
8096              OpenACC                   98.00
8096              OpenMPC                   101.420
16192             OpenMP with 20 threads    522.320
16192             OpenACC                   190.230
16192             OpenMPC                   222.320

RESULT

The performance of the three frameworks (OpenMP v. 4, OpenACC, OpenMPC) is similar on square matrices of size 2048; however, when the size increases past 4096, the performance is slightly better on the OpenACC implementation. With huge matrices, greater than 12288, other factors such as the L2 and L3 caches have an impact on the process of copying data from host to device (Bader & Weidendorfer, 2009; Barragan & Steves, 2011; Gupta, Xiang & Zhou, 2013). The performance cost of using a framework instead of writing CUDA for the GPU directly was just a bit (<5%), but the number of code lines was 70% lower. This shows that the frameworks offer useful programming tools with a computational performance advantage; however, the performance gains on GPUs lie in the region of algorithms with a high level of parallelism and low data coupling with the host code (Dannert, Marek & Rampp, 2013).

CONCLUSION

The new devices for high performance computing need standardized methods and languages that permit interoperability, easy programming and well defined interfaces for integrating the data interchange between memory segments of the processors (CPUs) and the devices. Besides, it is necessary that the languages support working with two or more devices in parallel, using the same code but running the segments of high parallelism automatically. OpenMP is the de facto standard for the shared memory programming model, but its support for heterogeneous devices (GPUs, accelerators, FPGAs, etc.) is at a very early stage; the new frameworks and industrial APIs need help towards a growing and maturing standard.

FINANCING

This research was developed with the authors' own resources, as part of the doctoral thesis process at Universidad Distrital Francisco José de Caldas.

FUTURE WORKS

It is necessary to analyze the impact of the frameworks with multiple devices on the same host for managing problems of memory locality, unified memory between CPU and devices, and I/O operations from the devices, using the most recent version of each standard framework. With the support for GPU devices in the following version of GCC 5, it will be possible to obtain a compiler performance comparison between commercial and standard open source solutions. Furthermore, it is necessary to review the impact of using clusters with multiple devices, message passing and MPI integration for high performance computing using the language extensions (Schaa & Kaeli, 2009).

ACKNOWLEDGMENT

We express our acknowledgment to the director of the high performance computing center of Universidad Industrial de Santander, the director of the advanced computing center of Universidad Distrital, and Pak Liu, technical responsible for application characterization, profiling and testing at the HPC Advisory Council, www.hpcadvisorycouncil.com


REFERENCES

Alsemmeri, M. (n.d.). CS 6260. Advanced Parallel Computations Course. Retrieved October 3, 2014, from https://cs.wmich.edu/elise/courses/cs626/home.htm

Amorim, R. & Haase, G. (2009). Comparing CUDA and OpenGL implementations for a Jacobi iteration. … & Simulation, 2009 …, 22-32. doi:10.1109/HPCSIM.2009.5192847

Bader, M. & Weidendorfer, J. (2009). Exploiting memory hierarchies in scientific computing. 2009 International Conference on High Performance Computing & Simulation, 33-35. doi:10.1109/HPCSIM.2009.5192891

Barragan, E. H. & Steves, J. J. (2011). Performance analysis on multicore system using PAPI. In 2011 6th Colombian Computing Congress (CCC) (pp. 1-5). IEEE. doi:10.1109/COLOMCC.2011.5936277

Dannert, T., Marek, A. & Rampp, M. (2013). Porting Large HPC Applications to GPU Clusters: The Codes GENE and VERTEX. Retrieved from http://arxiv.org/abs/1310.1485v1

Dave, C., Bae, H., Min, S. & Lee, S. (2009). Cetus: A source-to-source compiler infrastructure for multicores. Computer (0429535). Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5353460

Dongarra, J., Golub, G. H., Grosse, E., Moler, C. & Moore, K. (2008). Netlib and NA-Net: Building a Scientific Computing Community. IEEE Annals of the History of Computing, 30(2), 30-41. doi:10.1109/MAHC.2008.29

Durán, A. et al. (2008). Extending the OpenMP tasking model to allow dependent tasks. Lecture Notes in Computer Science, 5004, 111-122. doi:10.1007/978-3-540-79561-2_10

Fowers, J., Brown, G., Cooke, P. & Stitt, G. (2012). A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications. Proceedings of the ACM/SIGDA …, 47. doi:10.1145/2145694.2145704

Gravvanis, G. A., Filelis-Papadopoulos, C. K. & Lipitakis, E. A. (2013). On numerical modeling performance of generalized preconditioned methods. Proceedings of the 6th Balkan Conference in Informatics - BCI '13, 23. doi:10.1145/2490257.2490266

Gupta, S., Xiang, P. & Zhou, H. (2013). Analyzing locality of memory references in GPU architectures. … of the ACM SIGPLAN Workshop on Memory …, 1-2. doi:10.1145/2492408.2492423

Hill, M. & Marty, M. (2008). Amdahl's law in the multicore era. Computer, 41(7), 33-38. doi:10.1109/MC.2008.209

Huang, P., Teng, D., Wahid, K. & Ko, S.-B. (2009). Performance evaluation of hardware/software co-design of iterative methods of linear systems. 2009 3rd International Conference on Signals, Circuits and Systems (SCS), 1-5. doi:10.1109/ICSCS.2009.5414164

Kim, Y. (2013). Performance tuning techniques for GPU and MIC. 2013 Programming weather, climate, and earth-system models on heterogeneous multi-core platforms. Boulder, CO: National Oceanic and Atmospheric Administration.

Lee, S. & Eigenmann, R. (2010). OpenMPC: Extended OpenMP programming and tuning for GPUs. Proceedings of the 2010 ACM/IEEE International …, (November), 1-11. doi:10.1109/SC.2010.36

Margaris, A., Souravlas, S. & Roumeliotis, M. (2014). Parallel Implementations of the Jacobi Linear Algebraic Systems Solve. arXiv preprint arXiv:1403.5805, 161-172. Retrieved from http://arxiv.org/abs/1403.5805

Meijerink, J. & Vorst, H. van der. (1977). An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Mathematics of Computation, 31(137), 148-162. Retrieved from http://www.ams.org/mcom/1977-31-137/S0025-5718-1977-0438681-4/

Nickolls, J. & Dally, W. (2010). The GPU computing era. IEEE Micro, 30(2), 56-69. doi:10.1109/MM.2010.41

OpenMP, A. (2013). OpenMP application program interface, v. 4.0. OpenMP Architecture Review Board. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:OpenMP+Application+Program+Interface#1

Podobas, A., Brorsson, M. & Faxén, K. (2010). A comparison of some recent task-based parallel programming models, 1-14. Retrieved from http://soda.swedish-ict.se/3869/

Reyes, R., López, I., Fumero, J. & Sande, F. de. (2012). A Comparative Study of OpenACC Implementations. Jornadas Sarteco. Retrieved from http://www.jornadassarteco.org/js2012/papers/paper_150.pdf

Reyes, R. & López-Rodríguez, I. (2012). accULL: An OpenACC implementation with CUDA and OpenCL support. Euro-Par 2012 Parallel …, 2(228398), 871-882. Retrieved from http://link.springer.com/chapter/10.1007/978-3-642-32820-6_86

Sanders, J. & Kandrot, E. (2011). CUDA by Example: An Introduction to General-Purpose GPU Programming. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:CUDA+by+Example#1

Schaa, D. & Kaeli, D. (2009). Exploring the multiple-GPU design space. 2009 IEEE International Symposium on Parallel & Distributed Processing, 1-12. doi:10.1109/IPDPS.2009.5161068

Sun, X. & Gustafson, J. (1991). Toward a better parallel performance metric. Parallel Computing, 1093-1109. Retrieved from http://www.sciencedirect.com/science/article/pii/S0167819105800286

Vetter, J. (2012). On the road to Exascale: lessons from contemporary scalable GPU systems. … /A* CRC Workshop on Accelerator Technologies for …. Retrieved from http://www.acrc.a-star.edu.sg/astaratipreg_2012/Proceedings/Presentation - Jeffrey Vetter.pdf

Wang, Y. (n.d.). Jacobi iteration on GPU. Retrieved September 25, 2014, from http://doubletony-pblog.azurewebsites.net/jacboi_gpu.html

Wolfe, M. (2013). The OpenACC Application Programming Interface, 2.

Zhang, Z., Miao, Q. & Wang, Y. (2009). CUDA-Based Jacobi's Iterative Method. Computer Science-Technology and …, 259-262. doi:10.1109/IFCSTA.2009.68
