
Tecnura, ISSN: 0123-921X
Universidad Distrital Francisco José de Caldas, Bogotá, Colombia

Hernández B., Esteban; Montoya Gaviria, Gerardo de Jesús; Montenegro, Carlos Enrique. Parallel programming languages on heterogeneous architectures using OpenMPC, OmpSs, OpenACC and OpenMP. Tecnura, vol. 18, núm. 2, diciembre, 2014, pp. 160-170.

Available in: http://www.redalyc.org/articulo.oa?id=257059813014
Journal's homepage: http://revistas.udistrital.edu.co/ojs/index.php/Tecnura/issue/view/687
DOI: http://doi.org/10.14483/udistrital.jour.tecnura.2014.DSE1.a14

Reflexión Parallel programming languages on heterogeneous architectures using openmpc, ompss, openacc and openmp

Lenguajes para programación paralela en arquitecturas heterogéneas utilizando openmpc, ompss, openacc y openmp

Esteban Hernández B.*, Gerardo de Jesús Montoya Gaviria**, Carlos Enrique Montenegro***

Received: June 10th, 2014. Accepted: November 4th, 2014.

Citation / Para citar este artículo: Hernández, E., Montoya Gaviria, G. de J., & Montenegro, C. E. (2014). Parallel programming languages on heterogeneous architectures using OPENMPC, OMPSS, OPENACC and OPENMP. Revista Tecnura, 18 (Edición especial doctorado), 160-170. doi: 10.14483/udistrital.jour.tecnura.2014.DSE1.a14

ABSTRACT

In the field of parallel programming, a new big player has emerged in the last 10 years. GPUs have taken on a relevant importance in scientific computing because they offer high computational performance, low cost and simplicity of implementation. However, one of the most important challenges is the programming languages used for these devices; the effort of recoding algorithms originally designed for CPUs is a critical problem. In this paper we review three of the principal frameworks for programming CUDA devices and compare them with the new directives introduced in the OpenMP 4 standard, solving the Jacobi iterative method.

Keywords: CUDA, Jacobi method, OmpSs, OpenACC, OpenMP, parallel programming.

* Network Engineer with a Master's degree in Engineering and Free Software Construction, and minor degrees in applied mathematics and network software construction. Currently a doctoral student in Engineering at Universidad Distrital and principal architect at RUNT. His work focuses on parallel programming, high performance computing, computational numerical simulation and numerical weather forecasting. E-mail: [email protected]
** Engineer in Meteorology from the University of Leningrad, with a doctorate in physical-mathematical sciences from Moscow State University. Pioneer of meteorology in Colombia, in charge of the meteorology graduate program at Universidad Nacional de Colombia, researcher and director of more than 12 graduate theses in dynamic meteorology, numerical forecasting, air quality, and the efficient use of climate and weather models. He is currently a full professor in the Faculty of Geosciences at Universidad Nacional de Colombia. E-mail: [email protected]
*** Systems Engineer, with PhD and Master's degrees in Informatics, director of the research group GIIRA, focused on social network analysis, e-learning and data visualization. He is currently an associate professor in the Engineering Faculty at Universidad Distrital. E-mail: [email protected]


INTRODUCTION

Over the last 10 years, GPUs have become a principal element of the new approach to parallel programming; they have evolved from graphics-specific accelerators to general-purpose computing devices, and the present time is considered to be the era of GPUs (Nickolls & Dally, 2010). However, the main obstacle to large adoption by the community has been the lack of standards that allow programming the different existing hardware solutions in a unified form (Nickolls & Dally, 2010). The most important player in GPU solutions is NVIDIA®, with the CUDA® programming language and its own compiler (nvcc) (Hill & Marty, 2008), with thousands of installed solutions and a prominent presence in the Top500 supercomputer list, while portability remains the main problem. Some community projects and hardware alliances have proposed solutions to resolve this issue: OmpSs, OpenACC and OpenMPC have emerged as the most promising ones (Vetter, 2012), all using the OpenMP base model. In the last year, the OpenMP board released the version 4 application program interface (OpenMP, 2013), with support for external devices (including GPUs and vector processors). In this paper we compare four implementations of Jacobi's iterative method, to show the advantages and disadvantages of each framework.

METHODOLOGY

The frameworks used here work as extensions based on #pragma directives for the C language, offering the simplest way to program without complicated development or external elements. In the next sections we describe the frameworks and give some implementation examples. In the last part, we show the pure CUDA kernel implementation.

OmpSs

OmpSs (a programming model from the Barcelona Supercomputing Center based on OpenMP and StarSs) is a framework focused on the task decomposition paradigm for developing parallel applications on cluster environments with heterogeneous architectures. It provides a set of compiler directives that can be used to annotate a sequential code, and additional features have been added to support the use of accelerators like GPUs. OmpSs is based on StarSs, a task-based programming model: a serial application is annotated with directives that are translated by the compiler. With it, the same program that runs sequentially in a node with a single GPU can run in parallel on multiple GPUs, either local (single node) or remote (cluster of GPUs). Besides performing a task-based parallelization, the runtime moves the data as needed between the different nodes and GPUs, minimizing the impact of communication by using affinity scheduling, caching, and by overlapping communication with the computational tasks.

OmpSs is based on the OpenMP programming model with modifications to its execution and memory model. It also provides some extensions for synchronization, data motion and heterogeneity support.

1) Execution model: OmpSs uses a thread-pool execution model instead of the traditional OpenMP fork-join model. The master thread starts the execution and all other threads cooperate executing the work it creates (whether it is from work sharing or task constructs). Therefore, there is no need for a parallel region. Nesting of constructs allows other threads to generate work as well (Figure 1).

2) Memory model: OmpSs assumes that multiple address spaces may exist. As such, shared data may reside in memory locations that are not directly accessible from some of the computational resources. Therefore, all parallel code can only safely access private data and shared data which has been marked explicitly with the extended syntax. This assumption is true even for SMP machines, as the implementation may reallocate shared data to improve memory accesses (e.g., NUMA).


3) Extensions:
Function tasks: OmpSs allows annotating function declarations or definitions with a task directive (Durán, Pérez, Ayguadé, Badia & Labarta, 2008). In this case, any call to the function creates a new task that will execute the function body. The data environment of the task is captured from the function arguments.
Dependency synchronization: OmpSs integrates the StarSs dependence support (Durán et al., 2008). It allows annotating tasks with three clauses: input, output and in/out. They allow expressing, respectively, that a given task depends on some data produced before, that it will produce some data, or both. The syntax in the clause allows specifying scalars, arrays, pointers and pointed data.

Figure 1. OmpSs execution model

Source: Barcelona Supercomputing Center, p. 11. http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf
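As an illustration of the function tasks and dependency clauses described above, the following minimal sketch (not taken from the paper) annotates a function as a task so that the second call waits for the data produced by the first. The function names, the vector size and the in/out clause spelling are assumptions; some OmpSs versions use input/output instead, and the array-section syntax a[0;N] denotes start and length.

/* Hedged sketch: OmpSs-style function task with data dependencies (names are hypothetical). */
#define N 1024

#pragma omp task in(a[0;N], b[0;N]) out(c[0;N])
void vec_add(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];   /* every call to vec_add creates a task */
}

int main(void)
{
    static float x[N], y[N], z[N], w[N];
    vec_add(x, y, z);         /* first task: produces z */
    vec_add(z, y, w);         /* second task: consumes z, so it runs after the first */
    #pragma omp taskwait      /* wait for all generated tasks before using w */
    return 0;
}

Because the dependencies are declared on the function arguments, the runtime can schedule independent calls concurrently without an explicit parallel region, which matches the thread-pool execution model described above.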

OpenACC

OpenACC is an industry standard for heterogeneous computing proposed at the Supercomputing Conference 2011. OpenACC follows the OpenMP approach, with annotations on sequential code using compiler directives (pragmas) indicating those regions of code susceptible to be executed on the GPU.

The execution model targeted by OpenACC API-enabled implementations is host-directed execution with an attached accelerator device, such as a GPU. Much of a user application executes on the host. Compute intensive regions are offloaded to the accelerator device under control of the host. The device executes parallel regions, which typically contain work-sharing loops, or kernels regions, which typically contain one or more loops which are executed as kernels on the accelerator. Even in accelerator-targeted regions, the host may orchestrate the execution by allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the compute region, queuing the device code, waiting for completion, transferring results back to the host, and de-allocating memory (Figure 2). In most cases, the host can queue a sequence of operations to be executed on the device, one after the other (Wolfe, 2013).

The current problems with OpenACC are that it supports only the fork-join model and that only commercial compilers (PGI and CAPS) support its directives (Wolfe, 2013; Reyes, López, Fumero & Sande, 2012). In the last year, only one open source implementation has appeared (accULL) (Reyes & López-Rodríguez, 2012).

Figure 2. OpenACC execution model

Source: Barcelona Supercomputing Center, p. 11. http://www.training.prace-ri.eu/uploads/tx_pracetmo/OmpSsQuickOverviewXT.pdf
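As a concrete illustration of this host-directed model, the following minimal sketch (not from the paper) offloads one loop and makes the host-device data movement explicit with data clauses; the function name, array size and clause choices are assumptions for illustration only.

/* Hedged sketch: OpenACC offloading with explicit data movement.
   copyin() transfers the input to the device, copyout() returns the result. */
#define N 4096

void scale(const float *restrict in, float *restrict out, float alpha)
{
    #pragma acc parallel loop copyin(in[0:N]) copyout(out[0:N])
    for (int i = 0; i < N; i++)
        out[i] = alpha * in[i];   /* executed as a kernel on the accelerator */
}

The alternative #pragma acc kernels directive, used later in the Jacobi implementation, lets the compiler decide how to map the enclosed loops to the device instead of stating it explicitly.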


OpenMPC

OpenMPC (OpenMP extended for CUDA) is a framework that hides the complexity of the programming model and memory model from the user (Lee & Eigenmann, 2010). OpenMPC consists of a standard OpenMP API plus a new set of directives and environment variables to control important CUDA-related parameters and optimizations.

OpenMPC addresses two important issues in GPGPU programming: programmability and tunability. As a front-end programming model, OpenMPC provides abstractions of the complex CUDA programming model and high-level controls over various optimizations and CUDA-related parameters. OpenMPC includes a fully automatic compilation and user-assisted tuning system. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants.

OpenMPC uses the Cetus compiler (Dave, Bae, Min & Lee, 2009) for automatic source-to-source parallelization. The C source code goes through three levels of analysis:

- Privatization
- Reduction variable recognition
- Induction variable substitution

OpenMPC adds a number of pragmas to annotate OpenMP parallel regions and select optimization regions. The added pragmas have the following form: #pragma <>.

OpenMP release 4

OpenMP is the most widely used framework for programming parallel software, with support in most of the existing compilers. With the explosion of multicore and manycore systems, OpenMP has gained acceptance in the parallel programming community and among hardware vendors. From its creation to version 3, the focus of the API was CPU environments, but with the introduction of GPUs and vector accelerators, the new release 4 includes support for external devices (OpenMP, 2013). Historically, OpenMP supported the Single Instruction Multiple Data (SIMD) model only through the fork-join model (Figure 3), but in this new release the task-based model (Duran et al., 2008; Podobas, Brorsson & Faxén, 2010) has been introduced to gain performance with more parallelism on external devices. The most important directives introduced were target, teams and distribute. These directives permit a group of threads to be distributed on a special device and the result to be copied back to host memory (Figure 3).

Figure 3. OpenMP 4 execution model

Source: Parallel OpenMP. http://www.theclassifiedsplus.com/video/video/axnA3kcLHK4/intel-parallel-openmp.html
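A minimal sketch (not from the paper) of the device directives named above follows; the loop, array names and sizes are illustrative assumptions. target offloads the region, teams creates a league of thread teams, distribute combined with parallel for spreads the iterations across them, and map() handles the host-device copies.

/* Hedged sketch: OpenMP 4.0 device offloading with target, teams and distribute. */
#define N 4096

void vec_add(const float *a, const float *b, float *c)
{
    #pragma omp target map(to: a[0:N], b[0:N]) map(from: c[0:N])
    #pragma omp teams distribute parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];    /* runs on the device; c is copied back to the host afterwards */
}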


JACOBI ITERATIVE METHOD

Iterative methods are suitable for large scale linear equations. There are three commonly used iterative methods: Jacobi's method, the Gauss-Seidel method and the SOR iterative method (Gravvanis, Filelis-Papadopoulos & Lipitakis, 2013; Huang, Teng, Wahid & Ko, 2009). The convergence speed of the last two methods is faster than that of Jacobi's iterative method, but they lack parallelism; they have advantages over the Jacobi method only when implemented in sequential fashion and executed on traditional CPUs. On the other hand, Jacobi's iterative method has inherent parallelism and is suitable to be implemented on CUDA or vector accelerators to run concurrently on many cores. The basic idea of the Jacobi method is to convert the system A x = b, with A = D + L + U (D the diagonal of A, and L, U its strictly lower and upper triangular parts), into the equivalent system of Equation (1), and then solve it iteratively with Equation (2):

x = D^{-1} (b - (L + U) x)        (1)

On each iteration we solve:

x^{(k)} = D^{-1} (b - (L + U) x^{(k-1)})        (2)

where the values from the (k-1)-th iteration are used to compute the values of the k-th iteration. The pseudocode for the Jacobi method (Dongarra et al., 2008):

choose an initial guess x^(0) to the solution x
for k = 1, 2, …
    for i = 1, 2, …, n
        xi = 0
        for j = 1, 2, …, i-1, i+1, …, n
            xi = xi + a_{i,j} x_j^{(k-1)}
        end
        xi = (b_i - xi) / a_{i,i}
    end
    x^{(k)} = x
    check convergence; continue if necessary
end

This iterative method can be implemented in parallel form, using shared or distributed memory (Margaris, Souravlas & Roumeliotis, 2014) (Figure 4). With distributed memory, it needs some explicit synchronization and data copies; with shared memory, it needs work distribution and data merging in memory. It uses the following component-wise update:

x_i^{(k)} = (b_i - sum_{j != i} a_{i,j} x_j^{(k-1)}) / a_{i,i},  i = 1, …, n        (3)

Figure 4. Parallel form of the Jacobi method on shared memory

Source: Alsemmeri (n.d.). Parallel Jacobi Algorithm. https://www.cs.wmich.edu/~elise/courses/cs626/s12/PARALLEL-JACOBI-ALGORITHM11.pptx
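For reference, a plain sequential C version of the pseudocode above might look as follows; the matrix layout, the names and the convergence test are illustrative assumptions, and the OpenMP, OpenACC and CUDA versions in the following sections parallelize this same structure.

/* Hedged sketch: sequential Jacobi iteration for A x = b (row-major A[n][n]).
   Stops after max_iter sweeps or when the largest component change is below eps. */
#include <math.h>

void jacobi(int n, const double *A, const double *b,
            double *x, double *x_new, int max_iter, double eps)
{
    for (int k = 0; k < max_iter; k++) {
        double diff = 0.0;
        for (int i = 0; i < n; i++) {
            double sigma = 0.0;
            for (int j = 0; j < n; j++)
                if (j != i)
                    sigma += A[i * n + j] * x[j];      /* sum of off-diagonal terms */
            x_new[i] = (b[i] - sigma) / A[i * n + i];  /* divide by the diagonal entry */
            diff = fmax(diff, fabs(x_new[i] - x[i]));
        }
        for (int i = 0; i < n; i++)   /* x <- x_new for the next sweep */
            x[i] = x_new[i];
        if (diff < eps)               /* convergence check */
            break;
    }
}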


OPENMP IMPLEMENTATION

int n=LIMIT_N;
int m=LIMIT_M;
A[n][m];        // the D-1 matrix
Anew[n][m];
y_vector[n];    // b vector
//fill the matrix with initial conditions
…
#pragma omp parallel for shared (m, n, Anew, A)
for (int j = 1; j < n-1; j++)
{
    for (int i = 1; i < m-1; i++ )
    {
        Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmaxf(error, fabsf(Anew[j][i]-A[j][i]));
    }
}
#pragma omp parallel for shared (m, n, Anew, A)
for (int j = 1; j < n-1; j++)
{
    for (int i = 1; i < m-1; i++ )
    {
        A[j][i] = Anew[j][i];
    }
}
if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
iter++;
}
…//print the result

In this implementation, the fork-join model appears in the section annotated with #pragma omp parallel for shared (m, n, Anew, A), where every thread (normally as many as there are cores) in the system runs a copy of the code on a different data section and shares all the variables named in the shared() clause. When the size of m and n is less than or equal to the number of cores, the performance is similar on GPUs and CPUs, but when the size is much higher than the number of cores available, the performance of GPUs increases because the level of parallelism is higher (Fowers, Brown, Cooke & Stitt, 2012; Zhang, Miao & Wang, 2009).

OPENACC IMPLEMENTATION

int n=LIMIT_N;
int m=LIMIT_M;
A[n][m];        // the D-1 matrix
Anew[n][m];
y_vector[n];    // b vector
//fill the matrix with initial conditions
…
#pragma omp parallel for shared (m, n, Anew, A)
#pragma acc kernels
for( int j = 1; j < n-1; j++)
{
    for( int i = 1; i < m-1; i++ )
    {
        Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
    }
}
#pragma omp parallel for shared (m, n, Anew, A)
#pragma acc kernels
for( int j = 1; j < n-1; j++)
{
    for( int i = 1; i < m-1; i++ )
    {
        A[j][i] = Anew[j][i];
    }
}
if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
iter++;
}
…//print the result

In this implementation a new annotation appears, #pragma acc kernels, which indicates the code that will be executed on the accelerator. This simple annotation hides a complex implementation: the CUDA kernel, the copy of data from host to device and from device back to host, the definition of the grid of threads, and the


data manipulation (Amorim & Haase, 2009; Sanders & Kandrot, 2011; Zhang et al., 2009).

PURE CUDA IMPLEMENTATION (Wang, n.d.)

…
int dimB, dimT;
dimT = 256;
dimB = (dim / dimT) + 1;
float err = 1.0;

// set up the memory for GPU
float * LU_d;
float * B_d;
float * diag_d;
float *X_d, *X_old_d;
float * tmp;

cudaMalloc( (void **) &B_d, sizeof(float) * dim );
cudaMalloc( (void **) &diag_d, sizeof(float) * dim );
cudaMalloc( (void **) &LU_d, sizeof(float) * dim * dim);

cudaMemcpy( LU_d, LU, sizeof(float) * dim * dim, cudaMemcpyHostToDevice);
cudaMemcpy( B_d, B, sizeof(float) * dim, cudaMemcpyHostToDevice);
cudaMemcpy( diag_d, diag, sizeof(float) * dim, cudaMemcpyHostToDevice);

cudaMalloc( (void **) &X_d, sizeof(float) * dim);
cudaMalloc( (void **) &X_old_d, sizeof(float) * dim);
cudaMalloc( (void **) &tmp, sizeof(float) * dim);
…
//call to cuda kernels
// 2. Compute X by A x_old
cudaMemcpy( X_old_d, x_old, sizeof(float) * dim, cudaMemcpyHostToDevice);
matMultVec<<<dimB, dimT>>>(LU_d, X_old_d, tmp, dim, dim); // use x_old to compute LU X_old and store the result in tmp
substract<<<dimB, dimT>>>(B_d, tmp, X_d, dim); // get the (B - LU X_old), which is stored in X_d
diaMultVec<<<dimB, dimT>>>(diag_d, X_d, dim); // get the new X

// 3. copy the new X back to the Host Memory
cudaMemcpy( X, X_d, sizeof(float) * dim, cudaMemcpyDeviceToHost);

// 4. calculate the norm of X_new - X_old
substract<<<dimB, dimT>>>(X_old_d, X_d, tmp, dim);
VecAbs<<<dimB, dimT>>>(tmp, dim);
VecMax<<<dimB, dimT>>>(tmp, dim);
// copy the max value from Device to Host
cudaMemcpy(max, tmp, sizeof(float), cudaMemcpyDeviceToHost);

// cuda kernel for vector multiplication
__global__ void matMultVec(float * mat_A, float * vec, float * rst, int dim_row, int dim_col)
{
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x; // Get the row index
    int aIdx;
    while(rowIdx < dim_row)
    {
        rst[rowIdx] = 0; // clean the value at first
        for (int i = 0; i < dim_col; i++)
        {
            aIdx = rowIdx * dim_col + i;         // Get the index for the element a_{rowIdx, i}
            rst[rowIdx] += (mat_A[aIdx] * vec[i] ); // do the multiplication
        }
        rowIdx += gridDim.x * blockDim.x;
    }
}

// cuda kernel for vector subtraction
__global__ void substract(float *a_d, float *b_d, float *c_d, int dim)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while ( tid < dim )
    {
        c_d[tid] = a_d[tid] - b_d[tid];
        tid += gridDim.x * blockDim.x;
    }
}

// cuda kernel for the maximum of a vector (reduction)
__global__ void VecMax(float * vec, int dim)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (dim > 1)
    {
        int mid = dim / 2; // get the half size
        if (tid < mid) // filter the active threads
        {
            if (vec[tid] < vec[tid+mid] ) // get the larger one between vec[tid] and vec[tid+mid]
                vec[tid] = vec[tid+mid];  // and store the larger one in vec[tid]
        }
        //deal with the odd case
        if (dim % 2 ) // if dim is odd... we need to care about the last element
        {
            if (tid == 0 ) // only use vec[0] to compare with vec[dim-1]
            {
                if (vec[tid] < vec[dim-1] )
                    vec[tid] = vec[dim-1];
            }
            __syncthreads();
        }
        __syncthreads(); // sync all threads
        dim /= 2; // make the vector half the size
    }
}

The effort of writing three kernels, managing the logic of the grid dimensions, copying data from host to GPU and from GPU to host, synchronizing the threads within the groups of threads on the GPU, and handling details such as the thread ID calculation demands detailed knowledge of the hardware device and a considerable amount of code.

TEST TECHNIQUE

We take the three implementations of the Jacobi method (pure OpenMP, OpenACC, OpenMPC), run them on the SUT (System Under Test) of Table 1, and make multiple runs with square matrices of incremental sizes (Table 2), taking the processing time to analyze the performance (Kim, n.d.; Sun & Gustafson, 1991) against the number of code lines needed for the algorithm.

Table 1. Characteristics of the System under Test

System: Supermicro SYS-1027GR-TRF
CPU: Intel® Xeon® 10-core E5-2680 V2 CPUs @ 2.80 GHz
Memory: 32 GB DDR3 1600 MHz
GPU: NVIDIA Kepler K40, 2880 CUDA cores, 12 GB memory
OS: Red Hat Enterprise Linux Server release 6.4
Compiler: PGI Accelerator C/C++ Compiler 14, release 9, for Linux
L2 Cache: 256K
L3 Cache: 25MB


Table 2. Matrix size performance comparison

Matrix size (N)   Implementation            Mean running time (seconds)
2048              OpenMP with 20 threads    26.578
2048              OpenACC                   74.970
2048              OpenMPC                   69.890
4096              OpenMP with 20 threads    74.280
4096              OpenACC                   92.320
4096              OpenMPC                   75.120
8096              OpenMP with 20 threads    116.23
8096              OpenACC                   98.00
8096              OpenMPC                   101.420
16192             OpenMP with 20 threads    522.320
16192             OpenACC                   190.230
16192             OpenMPC                   222.320

RESULT

The performance of the three frameworks (OpenMP v. 4, OpenACC, OpenMPC) is similar on square matrices of size 2048; however, when the size increases past 4096, the performance is slightly better on the OpenACC implementation. With huge matrices, greater than 12288, other factors such as the L2 and L3 caches have an impact on the process of copying data from host to device (Bader & Weidendorfer, 2009; Barragan & Steves, 2011; Gupta, Xiang & Zhou, 2013). The performance cost of using a framework instead of writing CUDA for the GPU directly was just a bit (<5%), but the number of code lines was 70% lower. This shows that the frameworks offer useful programming tools with a computational performance advantage; however, the performance gains on GPUs lie in the region of algorithms with a high level of parallelism and low data coupling with the host code (Dannert, Marek & Rampp, 2013).

CONCLUSION

The new devices for high performance computing need standardized methods and languages that permit interoperability, easy programming and well defined interfaces for integrating the data interchange between memory segments of the processors (CPUs) and the devices. Besides, it is necessary that the languages support working with two or more devices in parallel, using the same code but running the segments of high parallelism automatically. OpenMP is the de facto standard for the shared memory programming model, but its support for heterogeneous devices (GPUs, accelerators, FPGAs, etc.) is at a very early stage; the new frameworks and industrial APIs need help towards a growing and maturing standard.

FINANCING

This research was developed with the authors' own resources, as part of the doctoral thesis process at Universidad Distrital Francisco José de Caldas.

FUTURE WORKS

It is necessary to analyze the impact of the frameworks with multiple devices on the same host for managing problems of memory locality, unified memory between CPU and devices, and I/O operations from the devices, using the most recent version of each standard framework. With the support for GPU devices in the following version of GCC 5, it will be possible to obtain a compiler performance comparison between commercial and standard open source solutions. Furthermore, it is necessary to review the impact of using clusters with multiple devices, message passing and MPI integration for high performance computing using the language extensions (Schaa & Kaeli, 2009).

ACKNOWLEDGMENT

We express our acknowledgment to the director of the high performance computing center of Universidad Industrial de Santander, the director of the advanced computing center of Universidad Distrital, and Pak Liu, technical responsible for application characterization, profiling and testing at the HPC Advisory Council, www.hpcadvisorycouncil.com


REFERENCES

Alsemmeri, M. (n.d.). CS 6260. Advanced Parallel Computations Course. Retrieved October 3, 2014, from https://cs.wmich.edu/elise/courses/cs626/home.htm

Amorim, R. & Haase, G. (2009). Comparing CUDA and OpenGL implementations for a Jacobi iteration. … & Simulation, 2009 …, 22-32. doi:10.1109/HPCSIM.2009.5192847

Bader, M. & Weidendorfer, J. (2009). Exploiting memory hierarchies in scientific computing. 2009 International Conference on High Performance Computing & Simulation, 33-35. doi:10.1109/HPCSIM.2009.5192891

Barragan, E. H. & Steves, J. J. (2011). Performance analysis on multicore system using PAPI. In 2011 6th Colombian Computing Congress (CCC) (pp. 1-5). IEEE. doi:10.1109/COLOMCC.2011.5936277

Dannert, T., Marek, A. & Rampp, M. (2013). Porting Large HPC Applications to GPU Clusters: The Codes GENE and VERTEX. Retrieved from http://arxiv.org/abs/1310.1485v1

Dave, C., Bae, H., Min, S. & Lee, S. (2009). Cetus: A source-to-source compiler infrastructure for multicores. Computer (0429535). Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5353460

Dongarra, J., Golub, G. H., Grosse, E., Moler, C. & Moore, K. (2008). Netlib and NA-Net: Building a Scientific Computing Community. IEEE Annals of the History of Computing, 30(2), 30-41. doi:10.1109/MAHC.2008.29

Durán, A. et al. (2008). Extending the OpenMP tasking model to allow dependent tasks. Lecture Notes in Computer Science, 5004, 111-122. doi:10.1007/978-3-540-79561-2_10

Fowers, J., Brown, G., Cooke, P. & Stitt, G. (2012). A performance and energy comparison of FPGAs, GPUs, and multicores for sliding-window applications. Proceedings of the ACM/SIGDA …, 47. doi:10.1145/2145694.2145704

Gravvanis, G. A., Filelis-Papadopoulos, C. K. & Lipitakis, E. A. (2013). On numerical modeling performance of generalized preconditioned methods. Proceedings of the 6th Balkan Conference in Informatics - BCI '13, 23. doi:10.1145/2490257.2490266

Gupta, S., Xiang, P. & Zhou, H. (2013). Analyzing locality of memory references in GPU architectures. … of the ACM SIGPLAN Workshop on Memory …, 1-2. doi:10.1145/2492408.2492423

Hill, M. & Marty, M. (2008). Amdahl's law in the multicore era. Computer, 41(7), 33-38. doi:10.1109/MC.2008.209

Huang, P., Teng, D., Wahid, K. & Ko, S.-B. (2009). Performance evaluation of hardware/software co-design of iterative methods of linear systems. 2009 3rd International Conference on Signals, Circuits and Systems (SCS), 1-5. doi:10.1109/ICSCS.2009.5414164

Kim, Y. (2013). Performance tuning techniques for GPU and MIC. 2013 Programming weather, climate, and earth-system models on heterogeneous multi-core platforms. Boulder, CO: National Oceanic and Atmospheric Administration.

Lee, S. & Eigenmann, R. (2010). OpenMPC: Extended OpenMP programming and tuning for GPUs. Proceedings of the 2010 ACM/IEEE International …, (November), 1-11. doi:10.1109/SC.2010.36

Margaris, A., Souravlas, S. & Roumeliotis, M. (2014). Parallel Implementations of the Jacobi Linear Algebraic Systems Solve. arXiv preprint arXiv:1403.5805, 161-172. Retrieved from http://arxiv.org/abs/1403.5805

Meijerink, J. & Vorst, H. van der. (1977). An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Mathematics of Computation, 31(137), 148-162. Retrieved from http://www.ams.org/mcom/1977-31-137/S0025-5718-1977-0438681-4/

Nickolls, J. & Dally, W. (2010). The GPU computing era. IEEE Micro, 30(2), 56-69. doi:10.1109/MM.2010.41

OpenMP, A. (2013). OpenMP application program interface, v. 4.0. OpenMP Architecture Review Board. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:OpenMP+Application+Program+Interface#1

Podobas, A., Brorsson, M. & Faxén, K. (2010). A comparison of some recent task-based parallel programming models, 1-14. Retrieved from http://soda.swedish-ict.se/3869/

Reyes, R., López, I., Fumero, J. & Sande, F. de. (2012). A Comparative Study of OpenACC Implementations. Jornadas Sarteco. Retrieved from http://www.jornadassarteco.org/js2012/papers/paper_150.pdf

Reyes, R. & López-Rodríguez, I. (2012). accULL: An OpenACC implementation with CUDA and OpenCL support. Euro-Par 2012 Parallel …, 2(228398), 871-882. Retrieved from http://link.springer.com/chapter/10.1007/978-3-642-32820-6_86

Sanders, J. & Kandrot, E. (2011). CUDA by Example: An Introduction to General-Purpose GPU Programming. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:CUDA+by+Example#1

Schaa, D. & Kaeli, D. (2009). Exploring the multiple-GPU design space. 2009 IEEE International Symposium on Parallel & Distributed Processing, 1-12. doi:10.1109/IPDPS.2009.5161068

Sun, X. & Gustafson, J. (1991). Toward a better parallel performance metric. Parallel Computing, 1093-1109. Retrieved from http://www.sciencedirect.com/science/article/pii/S0167819105800286

Vetter, J. (2012). On the road to Exascale: lessons from contemporary scalable GPU systems. … /A* CRC Workshop on Accelerator Technologies for …. Retrieved from http://www.acrc.a-star.edu.sg/astaratipreg_2012/Proceedings/Presentation - Jeffrey Vetter.pdf

Wang, Y. (n.d.). Jacobi iteration on GPU. Retrieved September 25, 2014, from http://doubletony-pblog.azurewebsites.net/jacboi_gpu.html

Wolfe, M. (2013). The OpenACC Application Programming Interface, 2.

Zhang, Z., Miao, Q. & Wang, Y. (2009). CUDA-Based Jacobi's Iterative Method. Computer Science-Technology and …, 259-262. doi:10.1109/IFCSTA.2009.68
