Tecnura ISSN: 0123-921X [email protected] Universidad Distrital Francisco José de Caldas Colombia
Hernández B, Esteban; Montoya Gaviria, Gerardo de Jesús; Montenegro, Carlos Enrique Parallel programming languages on heterogeneous architectures using openmpc, ompss, openacc and openmp Tecnura, vol. 18, núm. 2, diciembre, 2014, pp. 160-170 Universidad Distrital Francisco José de Caldas Bogotá, Colombia
Available in: http://www.redalyc.org/articulo.oa?id=257059813014
Tecnura http://revistas.udistrital.edu.co/ojs/index.php/Tecnura/issue/view/687 DOI: http://doi.org/10.14483/udistrital.jour.tecnura.2014.DSE1.a14
Reflection article
Parallel programming languages on heterogeneous architectures using OpenMPC, OmpSs, OpenACC and OpenMP
Esteban Hernández B.*, Gerardo de Jesús Montoya Gaviria**, Carlos Enrique Montenegro***
Received: June 10th, 2014. Accepted: November 4th, 2014
Citation: Hernández, E., Gaviria, G. de J. M., & Montenegro, C. E. (2014). Parallel programming languages on heterogeneous architectures using OPENMPC, OMPSS, OPENACC and OPENMP. Revista Tecnura, 18 (Edición especial doctorado), 160–170. doi: 10.14483/udistrital.jour.tecnura.2014.DSE1.a14
ABSTRACT
Over the last 10 years a major new player has emerged in the field of parallel programming. GPUs have become important in scientific computing because they offer high computing performance, low cost and simplicity of implementation. However, one of the most important challenges is the programming languages used for these devices; the effort of recoding algorithms originally designed for CPUs is a critical problem. In this paper we review three of the principal frameworks for programming CUDA devices and compare them with the new directives introduced in the OpenMP 4 standard by solving the Jacobi iterative method.
Keywords: CUDA, Jacobi method, OmpSs, OpenACC, OpenMP, parallel programming.
* Network engineer with a master's degree in software engineering and free software construction, and a minor degree in applied mathematics and network software construction. Currently pursuing doctoral studies in Engineering at Universidad Distrital and working as principal architect at RUNT, with a focus on parallel programming, high performance computing, computational numerical simulation and numerical weather forecasting. E-mail: [email protected]
** Engineer in meteorology from the University of Leningrad, with a doctorate in physical-mathematical sciences from Moscow State University; pioneer in the area of meteorology in Colombia, in charge of the meteorology graduate program at Universidad Nacional de Colombia, researcher and director of more than 12 graduate theses in dynamic meteorology and numerical forecasting, air quality, and efficient use of climate and weather models. He is currently a full professor at the Faculty of Geosciences of Universidad Nacional de Colombia. E-mail: gdmontoyag@unal.edu.co
*** Systems engineer with a PhD and a master's degree in Informatics, director of the research group GIIRA, with a focus on social network analysis, e-learning and data visualization. He is currently an associate professor at the Engineering Faculty of Universidad Distrital. E-mail: cemonte- [email protected]
Tecnura • p-ISSN: 0123-921X • e-ISSN: 2248-7638 • Vol. 18 - Special Edition Doctorate • December 2014 • pp. 160-170 [ 160 ] Parallel programming languages on heterogeneous architectures using openmpc, ompss, openacc and openmp Esteban Hernández B., Gerardo de Jesús Montoya Gaviria, Carlos Enrique Montenegro
INTRODUCTION

Over the last 10 years, GPUs have become the principal element of massively parallel processors and of the new approach to parallel programming. The GPU has evolved from a graphics-specific accelerator into a general-purpose computing device, and the present is considered to be the era of GPU computing (Nickolls & Dally, 2010). However, the main obstacle to wide adoption by the programming community has been the lack of standards that allow the different existing hardware solutions to be programmed in a unified way (Nickolls & Dally, 2010). The most important player in GPU solutions is Nvidia®, with the CUDA® programming language and its own compiler (nvcc) (Hill & Marty, 2008), with thousands of installed systems and a strong presence on the TOP500 supercomputer list; portability, however, is its main problem. Several community projects and hardware alliances have proposed solutions to this issue. OmpSs, OpenACC and OpenMPC have emerged as the most promising solutions (Vetter, 2012), all built on the OpenMP base model. In the last year, the OpenMP board released version 4 of its application program interface (OpenMP, 2013), with support for external devices (including GPUs and vector processors). In this paper we compare implementations of Jacobi's method in these four frameworks to show the advantages and disadvantages of each.

METHODOLOGY

The frameworks used work as extensions based on #pragma directives of the C language, offering the simplest way to program without complicated development or external elements. In the next sections we describe the frameworks and give some implementation examples. In the last part, we show the pure CUDA kernel implementations.

OmpSs

OmpSs (a programming model from the Barcelona Supercomputing Center, based on OpenMP and StarSs) is a framework focused on the task decomposition paradigm for developing parallel applications in cluster environments with heterogeneous architectures. It provides a set of compiler directives that can be used to annotate sequential code. Additional features have been added to support the use of accelerators such as GPUs. OmpSs is based on StarSs, a task-based programming model in which a serial application is annotated with directives that are translated by the compiler. With it, the same program that runs sequentially on a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing task-based parallelization, the runtime system moves the data as needed between the different nodes and GPUs, minimizing the impact of communication by using affinity scheduling, caching, and by overlapping communication with computation.

OmpSs is based on the OpenMP programming model, with modifications to its execution and memory models. It also provides some extensions for synchronization, data motion and heterogeneity support.

1) Execution model: OmpSs uses a thread-pool execution model instead of the traditional OpenMP fork-join model. The master thread starts the execution and all other threads cooperate in executing the work it creates (whether from work-sharing or task constructs). Therefore, there is no need for a parallel region. Nesting of constructs allows other threads to generate work as well (Figure 1).
2) Memory model: OmpSs assumes that multiple address spaces may exist. As such, shared data may reside in memory locations that are not directly accessible from some of the computational resources. Therefore, all parallel code can only safely access private data and shared data that has been marked explicitly with the OmpSs extended syntax. This assumption holds even for SMP machines, as the implementation may reallocate shared data to improve memory accesses (e.g., NUMA).
3) Extensions:
Function tasks: OmpSs allows function declarations or definitions to be annotated with a task directive, in the style of Cilk (Durán, Pérez, Ayguadé, Badia & Labarta, 2008). In this case, any call to the function creates a new task that will execute the function body. The data environment of the task is captured from the function arguments.
Dependency synchronization: OmpSs integrates the StarSs dependence support (Durán et al., 2008). It allows tasks to be annotated with three clauses: input, output and inout. They express, respectively, that a given task depends on some data produced before, that it will produce some data, or both. The syntax of the clauses allows specifying scalars, arrays, pointers and pointed data.

Figure 1. OmpSs execution model
Source: Barcelona Supercomputing Center, p. 11. http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf

OpenACC

OpenACC is an industry standard for heterogeneous computing proposed at the Supercomputing Conference 2011. OpenACC follows the OpenMP approach: sequential code is annotated with compiler directives (pragmas) that indicate the regions of code suitable for execution on the GPU.

The execution model targeted by OpenACC API-enabled implementations is host-directed execution with an attached accelerator device, such as a GPU. Much of a user application executes on the host. Compute-intensive regions are offloaded to the accelerator device under control of the host. The device executes parallel regions, which typically contain work-sharing loops, or kernels regions, which typically contain one or more loops that are executed as kernels on the accelerator. Even in accelerator-targeted regions, the host may orchestrate the execution by allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the compute region, queuing the device code, waiting for completion, transferring results back to the host, and de-allocating memory (Figure 2). In most cases, the host can queue a sequence of operations to be executed on the device, one after the other (Wolfe, 2013).

The current problems with OpenACC are that it supports only the fork-join model and that only commercial compilers support its directives (PGI, Cray and CAPS) (Wolfe, 2013; Reyes, López, Fumero & Sande, 2012). In the last year, only one open source implementation (accULL) has added support (Reyes & López-Rodríguez, 2012).

Figure 2. OpenACC execution model
Source: Barcelona Supercomputing Center, p. 11. http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf
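The host-directed sequence described for OpenACC (allocate device memory, transfer data, run the compute region, copy the results back) can be sketched with two standard OpenACC directives. This is a minimal illustration and not code from the paper: the function name saxpy and its parameters are hypothetical, and a compiler without OpenACC support ignores the pragmas and runs the loop sequentially on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for i in [0, n).
 * The OpenACC pragmas are directives only; without OpenACC support
 * they are ignored and the loop runs on the host. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* Data region: x is copied to the device only (copyin),
     * y is copied to the device and back to the host (copy). */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* Compute region offloaded under control of the host. */
        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

The data region makes the transfers explicit: x only travels to the device, whereas y travels in both directions, which corresponds to the transfer, compute and transfer-back steps listed for the execution model.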
OpenMPC

OpenMPC (OpenMP extended for CUDA) is a framework to hide the complexity of program-

OpenMPC adds a number of pragmas to annotate OpenMP parallel regions and to select optimization regions. The added pragmas have the following form: #pragma cuda <
Figure 3. OpenMP 4 execution model
Source: Intel Parallel OpenMP. http://www.theclassifiedsplus.com/video/video/axnA3kcLHK4/intel-parallel-openmp.html
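As a hedged illustration of the device support added in OpenMP 4.0 (Figure 3), the sketch below offloads a reduction loop with the target construct. It is not taken from the paper: the function name sum_squares is hypothetical, and when no device or no OpenMP support is available the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum of squares of v[0..n-1] using
 * OpenMP 4.0 device constructs. */
double sum_squares(const double *v, size_t n)
{
    assert(v != NULL);
    double sum = 0.0;

    /* target: run the region on the default device;
     * map(to:)     copies v to the device,
     * map(tofrom:) copies sum in and back out. */
    #pragma omp target map(to: v[0:n]) map(tofrom: sum)
    #pragma omp teams distribute parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        sum += v[i] * v[i];
    }
    return sum;
}
```

The map clauses play the role that explicit host-to-device and device-to-host copies play in hand-written CUDA code, while the loop body stays identical to the sequential version.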
JACOBI ITERATIVE METHOD

Iterative methods are suitable for large-scale systems of linear equations. There are three commonly used iterative methods: Jacobi's method, the Gauss-Seidel method and the SOR method (Gravvanis, Filelis-Papadopoulos & Lipitakis, 2013; Huang, Teng, Wahid & Ko, 2009).

The last two converge faster than Jacobi's method but lack parallelism. They have an advantage over Jacobi's method only when implemented sequentially and executed on traditional CPUs. Jacobi's iterative method, on the other hand, has inherent parallelism, which makes it suitable for implementation on CUDA or vector accelerators, running concurrently on many cores. The basic idea of Jacobi's method is to convert the system Ax = b into an equivalent system, which is then solved with Equation (1) and Equation (2):

\[ A x = b, \qquad A = D + (L + U) \quad\Longrightarrow\quad x = D^{-1}\bigl(b - (L + U)\,x\bigr) \tag{1} \]

where D is the diagonal of A and L + U holds its off-diagonal entries. On each iteration we solve:

\[ x_i^{(k)} = \frac{1}{a_{ii}} \Bigl( b_i - \sum_{j \neq i} a_{ij}\, x_j^{(k-1)} \Bigr), \qquad i = 1, \dots, n \tag{2} \]

where the values from iteration (k-1) are used to compute the values for the kth iteration. The pseudocode for Jacobi's method (Dongarra et al., 2008) is:

    Choose an initial guess x(0) to the solution x.
    for k = 1, 2, …
        for i = 1, 2, …, n
            x̄_i = 0
            for j = 1, 2, …, i-1, i+1, …, n
                x̄_i = x̄_i + a_{i,j} x_j^{(k-1)}
            end
            x̄_i = (b_i - x̄_i) / a_{i,i}
        end
        x^{(k)} = x̄
        check convergence; continue if necessary
    end

This iterative method can be implemented in parallel form, using shared or distributed memory (Margaris, Souravlas & Roumeliotis, 2014) (Figure 4). With distributed memory, it needs explicit synchronization and data copies between processes. With shared memory, it needs data distribution and merging in memory. It uses the following method:

(3)

Figure 4. Parallel form of Jacobi method on shared memory (Alsemmeri, n.d.)
Source: Parallel Jacobi Algorithm https://www.cs.wmich.edu/~elise/courses/cs626/s12/PARALLEL-JACOBI-ALGORITHM11.pptx

OPENMP IMPLEMENTATION

    int n = LIMIT_N;
    int m = LIMIT_M;
    float A[n][m];      // the D-1 matrix
    float Anew[n][m];
    float y_vector[n];  // b vector
    // fill the matrix with initial conditions
    …
    // reduction(max:error) avoids a data race on the shared error variable
    #pragma omp parallel for shared(m, n, Anew, A) reduction(max:error)
    for (int j = 1; j < n-1; j++)
    {
        for (int i = 1; i < m-1; i++)
        {
            Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
        }
    }
    #pragma omp parallel for shared(m, n, Anew, A)
    for (int j = 1; j < n-1; j++)
    {
        for (int i = 1; i < m-1; i++)
        {
            A[j][i] = Anew[j][i];
        }
    }
    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
    iter++;
    }  // closes the iteration loop opened in the elided part above
    …  // print the result

In this listing, the fork-join model appears in the regions annotated with #pragma omp parallel for shared(m, n, Anew, A): every thread (normally one per core) runs a copy of the loop body on a different section of the data, sharing all variables named in the shared() clause. When m and n are less than or equal to the number of cores, the performance is similar on GPUs and CPUs; when they are much larger than the number of available cores, the GPU performs better because its level of parallelism is higher (Fowers, Brown, Cooke & Stitt, 2012; Zhang, Miao, & Wang, 2009).

OPENACC IMPLEMENTATION

    int n = LIMIT_N;
    int m = LIMIT_M;
    float A[n][m];      // the D-1 matrix
    float Anew[n][m];
    float y_vector[n];  // b vector
    // fill the matrix with initial conditions
    …
    #pragma omp parallel for shared(m, n, Anew, A) reduction(max:error)
    #pragma acc kernels
    for (int j = 1; j < n-1; j++)
    {
        for (int i = 1; i < m-1; i++)
        {
            Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
        }
    }
    #pragma omp parallel for shared(m, n, Anew, A)
    #pragma acc kernels
    for (int j = 1; j < n-1; j++)
    {
        for (int i = 1; i < m-1; i++)
        {
            A[j][i] = Anew[j][i];
        }
    }
    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
    iter++;
    }  // closes the iteration loop opened in the elided part above
    …  // print the result

This implementation introduces a new annotation, #pragma acc kernels, which marks the code to be executed on the accelerator. This simple annotation hides a complex CUDA kernel implementation, the copying of data from host to device and from device to host, the definition of the grid of threads, and the data manipulation (Amorim & Haase, 2009; Sanders & Kandrot, 2011; Zhang et al., 2009).

CUDA IMPLEMENTATION

    float err = 1.0;
    // set up the memory for GPU
    float *LU_d;
    float *B_d;
    float *diag_d;
    float *X_d, *X_old_d;
    …
    cudaMemcpy(X_old_d, x_old, sizeof(float) * dim, cudaMemcpyHostToDevice);
    matMultVec<<
    …
    // 3. copy the new X back to the Host Memory
    cudaMemcpy(X, X_d, sizeof(float) * dim, cudaMemcpyDeviceToHost);
    // 4. calculate the norm of X_new - X_old
    substract<<
            rowIdx += gridDim.x * blockDim.x;
        }
        __syncthreads();
    }

    // cuda kernel for vector subtraction
    __global__ void substract(float *a_d,
                              float *b_d,
                              float *c_d, int dim)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (tid < dim)
        {
            c_d[tid] = a_d[tid] - b_d[tid];
            tid += gridDim.x * blockDim.x;
        }
    }

    // cuda kernel for the maximum of a vector (in-place tree reduction);
    // note that __syncthreads() only synchronizes the threads of one block
    __global__ void VecMax(float *vec, int dim)
    {
        int tid = threadIdx.x + blockIdx.x * blockDim.x;
        while (dim > 1)
        {
            int mid = dim / 2;   // get the half size
            if (tid < mid)       // filter the active thread
            {
                // get the larger one between vec[tid] and vec[tid+mid]
                // and store the larger one in vec[tid]
                if (vec[tid] < vec[tid+mid])
                    vec[tid] = vec[tid+mid];
            }
            // deal with the odd case
            if (dim % 2)         // if dim is odd, we need to care about the last element
            {
                if (tid == 0)    // only use vec[0] to compare with vec[dim-1]
                {
                    if (vec[tid] < vec[dim-1])
                        vec[tid] = vec[dim-1];
                }
            }
            __syncthreads();     // sync all threads
            dim /= 2;            // make the vector half size short.
        }
    }

Writing the three kernels, managing the logic of the grid dimensions, copying data from host to GPU and from GPU to host, synchronizing the threads within the thread groups on the GPU, and handling details such as the thread ID calculation all demand considerable effort, both in understanding the hardware and in writing the code.

TEST TECHNIQUE

We took the three implementations of the Jacobi method (pure OpenMP, OpenACC, OpenMPC) and ran them on the SUT (System Under Test) described in Table 1, performing multiple runs with square matrices of increasing size (Table 2) and recording the processing time in order to analyze performance (Kim, n.d.; Sun & Gustafson, 1991) against the number of lines of code needed by the algorithm.

Table 1. Characteristics of System under Test

    System     Supermicro SYS-1027GR-TRF
    CPU        Intel® Xeon® 10-core E5-2680 V2 CPUs @ 2.80 GHz
    Memory     32GB DDR3 1600MHz
    GPU        NVIDIA Kepler K40 GPUs, 2880 CUDA cores, memory 12GB
    OS         Red Hat Enterprise Linux Server release 6.4
    Compiler   PGI Compiler Accelerator Fortran/C/C++ 14 release 9 for Linux
    L2 Cache   256K
    L3 Cache   25MB
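The measurement procedure described in the test technique (repeated runs per matrix size, mean processing time) can be sketched as follows. This is an assumption-laden illustration, not the authors' harness: the paper does not state which timer or how many repetitions were used, and the names jacobi_sweep and mean_sweep_seconds are hypothetical. The sketch uses the standard C clock() function, which reports CPU time; for multi-threaded or GPU runs a wall-clock timer would be the more appropriate choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* One Jacobi sweep over the interior of an n x n grid stored row-major.
 * Returns the largest absolute change between A and Anew. */
float jacobi_sweep(size_t n, const float *A, float *Anew)
{
    assert(A != NULL && Anew != NULL);
    float error = 0.0f;
    for (size_t j = 1; j + 1 < n; j++) {
        for (size_t i = 1; i + 1 < n; i++) {
            float v = 0.25f * (A[j * n + i + 1] + A[j * n + i - 1]
                             + A[(j - 1) * n + i] + A[(j + 1) * n + i]);
            float d = v - A[j * n + i];
            if (d < 0.0f) d = -d;          /* absolute difference */
            if (d > error) error = d;      /* running maximum */
            Anew[j * n + i] = v;
        }
    }
    return error;
}

/* Mean CPU time in seconds of `repeats` sweeps on an n x n grid.
 * Returns a negative value if repeats <= 0 or allocation fails. */
double mean_sweep_seconds(size_t n, int repeats)
{
    if (repeats <= 0) return -1.0;
    float *A = calloc(n * n, sizeof *A);
    float *Anew = calloc(n * n, sizeof *Anew);
    if (A == NULL || Anew == NULL) {
        free(A);
        free(Anew);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) A[i] = 1.0f;   /* top boundary row = 1 */

    clock_t start = clock();
    for (int r = 0; r < repeats; r++) {
        (void)jacobi_sweep(n, A, Anew);
    }
    clock_t end = clock();

    free(A);
    free(Anew);
    return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

Calling mean_sweep_seconds for each matrix size of interest yields one mean value per size, which is the kind of figure reported per implementation in the performance table.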
Table 2. Matrix size performance comparison.

    Matrix size (N)   Implementation             Mean running time (seconds)
    2048              OpenMP with 20 threads      26.578
                      OpenACC                     74.970
                      OpenMPC                     69.890
    4096              OpenMP with 20 threads      74.280
                      OpenACC                     92.320
                      OpenMPC                     75.120
    8096              OpenMP with 20 threads     116.23
                      OpenACC                     98.00
                      OpenMPC                    101.420
    16192             OpenMP with 20 threads     522.320
                      OpenACC                    190.230
                      OpenMPC                    222.320

RESULT

The performance of the three frameworks (OpenMP v. 4, OpenACC, OpenMPC) is similar on square matrices of size 2048; however, if the size increases to 4096, the performance is slightly higher with the OpenACC implementation. With huge matrices, larger than 12288, other factors such as the L2 and L3 caches affect the copying of data from host to device (Bader & Weidendorfer, 2009; Barragan & Steves, 2011; Gupta, Xiang, & Zhou, 2013). The performance difference between using a framework and programming the GPU directly in CUDA was small (<5%), but the number of lines of code was 70% lower. This shows that the frameworks offer useful programming tools with a computational performance advantage; however, the performance gains on GPUs occur in regions of the algorithms with a high level of parallelism and low data coupling with the host code (Dannert, Marek, & Rampp, 2013).

CONCLUSION

The new devices for high performance computing need standardized methods and languages that permit interoperability, easy programming and well-defined interfaces for integrating the data interchange between the memory segments of the processors (CPUs) and the devices. Besides, languages need to support working with two or more devices in parallel using the same code, while automatically running the highly parallel segments on them. OpenMP is the de facto standard for the shared memory programming model, but its support for heterogeneous devices (GPUs, accelerators, FPGAs, etc.) is at a very early stage; the new frameworks and industrial APIs need to contribute to a growing and maturing standard.

FINANCING

This research was carried out with the authors' own resources, as part of the doctoral thesis process at Universidad Distrital Francisco José de Caldas.

FUTURE WORKS

It is necessary to analyze the impact of frameworks for multiple devices on the same host, addressing the management of memory locality, unified memory between CPU and devices, and I/O operations from devices, using the most recent version of each framework's standard. With the GPU device support planned for the upcoming GCC 5, it will be possible to compare compiler performance between commercial and open source solutions. Furthermore, it is necessary to review the impact of using clusters with multiple devices, message passing and MPI integration for high performance computing using the language extensions (Schaa & Kaeli, 2009).

ACKNOWLEDGMENT

We express our gratitude to the director of the high performance computing center of Universidad Industrial de Santander, the director of the advanced computing center of Universidad Distrital, and Pak Liu, technical lead for application characterization, profiling and testing at the HPC Advisory Council (www.hpcadvisorycouncil.com).
REFERENCES

Alsemmeri, M. (n.d.). CS 6260. Advanced Parallel Computations Course. Retrieved October 3, 2014, from https://cs.wmich.edu/elise/courses/cs626/home.htm
Amorim, R. & Haase, G. (2009). Comparing CUDA and OpenGL implementations for a Jacobi iteration. … & Simulation, 2009. …, 22–32. doi:10.1109/HPCSIM.2009.5192847
Bader, M. & Weidendorfer, J. (2009). Exploiting memory hierarchies in scientific computing. 2009 International Conference on High Performance Computing & Simulation, 33–35. doi:10.1109/HPCSIM.2009.5192891
Barragan, E. H. & Steves, J. J. (2011). Performance analysis on multicore system using PAPI. In 2011 6th Colombian Computing Congress (CCC) (pp. 1–5). IEEE. doi:10.1109/COLOMCC.2011.5936277
Dannert, T., Marek, A. & Rampp, M. (2013). Porting Large HPC Applications to GPU Clusters: The Codes GENE and VERTEX. Retrieved from http://arxiv.org/abs/1310.1485v1
Dave, C., Bae, H., Min, S. & Lee, S. (2009). Cetus: A source-to-source compiler infrastructure for multicores. Computer (0429535). Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5353460
Dongarra, J., Golub, G. H., Grosse, E., Moler, C. & Moore, K. (2008). Netlib and NA-Net: Building a Scientific Computing Community. Annals of the History of Computing, IEEE, 30(2), 30–41. doi:10.1109/MAHC.2008.29
Durán, A. et al. (2008). Extending the OpenMP tasking model to allow dependent tasks. Lecture Notes in Computer Science, 5004, 111–122. doi:10.1007/978-3-540-79561-2_10
Fowers, J., Brown, G., Cooke, P., & Stitt, G. (2012). A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications. Proceedings of the ACM/SIGDA …, 47. doi:10.1145/2145694.2145704
Gravvanis, G. A., Filelis-Papadopoulos, C. K. & Lipitakis, E. A. (2013). On numerical modeling performance of generalized preconditioned methods. Proceedings of the 6th Balkan Conference in Informatics on - BCI '13, 23. doi:10.1145/2490257.2490266
Gupta, S., Xiang, P. & Zhou, H. (2013). Analyzing locality of memory references in GPU architectures. … of the ACM SIGPLAN Workshop on Memory …, 1–2. doi:10.1145/2492408.2492423
Hill, M. & Marty, M. (2008). Amdahl's law in the multicore era. Computer, 41(7), 33–38. doi:10.1109/MC.2008.209
Huang, P., Teng, D., Wahid, K. & Ko, S.-B. (2009). Performance evaluation of hardware/software co-design of iterative methods of linear systems. 2009 3rd International Conference on Signals, Circuits and Systems (SCS), 1–5. doi:10.1109/ICSCS.2009.5414164
Kim, Y. (2013). Performance tuning techniques for GPU and MIC. 2013 Programming weather, climate, and earth-system models on heterogeneous multi-core platforms. Boulder, CO, National Oceanic and Atmospheric Administration.
Lee, S. & Eigenmann, R. (2010). OpenMPC: Extended OpenMP programming and tuning for GPUs. Proceedings of the 2010 ACM/IEEE International …, (November), 1–11. doi:10.1109/SC.2010.36
Margaris, A., Souravlas, S. & Roumeliotis, M. (2014). Parallel Implementations of the Jacobi Linear Algebraic Systems Solve. arXiv Preprint arXiv:1403.5805, 161–172. Retrieved from http://arxiv.org/abs/1403.5805
Meijerink, J. & Vorst, H. van der. (1977). An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Mathematics of Computation, 31(137), 148–162. Retrieved from http://www.ams.org/mcom/1977-31-137/S0025-5718-1977-0438681-4/
Nickolls, J. & Dally, W. (2010). The GPU computing era. Micro, IEEE, 30(2), 56–69. doi:10.1109/MM.2010.41
OpenMP, A. (2013). OpenMP application program interface, v. 4.0. OpenMP Architecture Review Board. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:OpenMP+Application+Program+Interface#1
Podobas, A., Brorsson, M. & Faxén, K. (2010). A comparison of some recent task-based parallel programming models, 1–14. Retrieved from http://soda.swedish-ict.se/3869/
Reyes, R., López, I., Fumero, J. & Sande, F. De. (2012). A Comparative Study of OpenACC Implementations. Jornadas Sarteco. Retrieved from http://www.jornadassarteco.org/js2012/papers/paper_150.pdf
Reyes, R. & López-Rodríguez, I. (2012). accULL: An OpenACC implementation with CUDA and OpenCL support. Euro-Par 2012 Parallel …, 2(228398), 871–882. Retrieved from http://link.springer.com/chapter/10.1007/978-3-642-32820-6_86
Sanders, J. & Kandrot, E. (2011). CUDA by Example. An Introduction to General-Purpose GPU Programming …. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:CUDA+by+Example#1
Schaa, D. & Kaeli, D. (2009). Exploring the multiple-GPU design space. 2009 IEEE International Symposium on Parallel & Distributed Processing, 1–12. doi:10.1109/IPDPS.2009.5161068
Sun, X. & Gustafson, J. (1991). Toward a better parallel performance metric. Parallel Computing, 1093–1109. Retrieved from http://www.sciencedirect.com/science/article/pii/S0167819105800286
Vetter, J. (2012). On the road to Exascale: lessons from contemporary scalable GPU systems. … /A* CRC Workshop on Accelerator Technologies for …. Retrieved from http://www.acrc.a-star.edu.sg/astaratipreg_2012/Proceedings/Presentation - Jeffrey Vetter.pdf
Wang, Y. (n.d.). Jacobi iteration on GPU. Retrieved September 25, 2014, from http://doubletony-pblog.azurewebsites.net/jacboi_gpu.html
Wolfe, M. (2013). The OpenACC Application Programming Interface, 2.
Zhang, Z., Miao, Q. & Wang, Y. (2009). CUDA-Based Jacobi's Iterative Method. Computer Science-Technology and …, 259–262. doi:10.1109/IFCSTA.2009.68