JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

Parallware Trainer: Interactive Tool for Experiential Learning of Parallel Programming using OpenMP and OpenACC

Manuel Arenaz, Sergio Ortega, Ernesto Guerrero, and Fernanda Foertter

Abstract—STEM education plays a key role in the sustained growth and stability of the US economy and worldwide. There is currently a shortage of skilled STEM workers, and that gap is expected to widen over the next decade. It is key to broaden the audience of STEM people trained in parallel programming targeting parallel architectures like Intel Xeon, IBM Power, NVIDIA GPU and Intel Xeon Phi. In this regard, the standards OpenMP 4.5 and OpenACC 2.5 offer pragma-based parallel programming paradigms that promise performance portability and higher productivity. This paper presents Parallware Trainer, a new interactive tool for high-productivity STEM education and training in parallel programming using OpenMP 4.5 and OpenACC 2.5. It enables experiential learning by providing an interactive, real-time GUI with editor capabilities to assist in the design, implementation and benchmarking of OpenMP/OpenACC-enabled parallel code. We envision Parallware Trainer as a key enabler for STEM education from PhD down to undergraduate level in computer science, maths, physics, and related disciplines. This paper also describes a success story resulting from a GPU Hackathon organized at the Supercomputing Center of Galicia (CESGA). We present the progress of a two-person team of the EDANYA group learning how to address the parallelization of a simulation code for the prediction of tsunamis used by the National Oceanic and Atmospheric Administration (NOAA).

Index Terms—STEM education, experiential learning, parallel programming, OpenMP 4.5, OpenACC 2.5, Parallware Trainer.

Fig. 1. The HPC education and training pyramid.

I. INTRODUCTION

HPC education and training is organized today mostly around courses, workshops and hackathons. As shown in the pyramid of Fig. 1, in courses the participants passively listen to a lecture or presentation, and apply parallel programming concepts to simple example codes in hands-on sessions. Workshops increase learning retention through interactive training activities between the participants. And hackathons are events where people come together to solve problems, participating in groups of about 2-5 individuals that take out their laptops and dive into their own problems. Hackathons allow experiential learning as participants work with a coach to learn through immediate practice in the optimization and parallelization of their own code. However, hackathons are expensive small events that train only a few people throughout the year.

The view of the landscape from Berkeley [5] addresses the so-called parallel challenge, which is described as "Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers". Thus, a Parallel Bridge is used to illustrate that Software is the main problem in bridging the gap between user Applications and the parallel Hardware industry. Today, the HPC field still widely recognizes that software is pain #1. This paper addresses this by extending the parallel bridge as shown in Fig. 2. The new tower Code highlights that the features of the code implemented by the programmer directly impact the productivity of the parallelization process. Thus, best practices on parallel programming typically recommend, for example, using stride-1 memory accesses and preferring structures-of-arrays instead of arrays-of-structures.

The main contribution of this paper is Parallware Trainer [1], a new commercial software product that aims at bringing the benefits of workshops and hackathons to a broader audience of STEM people. It is a new interactive, real-time GUI for high-productivity HPC education and training that enables self-learning of best practices on parallel programming with OpenMP 4.5 and OpenACC 2.5.

M. Arenaz is with the Department of Computer Engineering at University of A Coruña, and with Appentra Solutions, e-mail: [email protected].
S. Ortega and E. Guerrero are with the EDANYA group, University of Malaga, Spain, e-mail: [email protected], [email protected].
F. Foertter is with ORNL, e-mail: [email protected].
Manuscript received September 8, 2017; revised September 8, 2017.
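The two best-practice recommendations above (stride-1 accesses and structures-of-arrays) can be made concrete with a short C sketch of our own (the type and function names are illustrative, not from the paper):

```c
#include <stddef.h>

/* Array-of-structures (AoS): the fields of one element are contiguous,
 * so a loop that reads a single field walks memory with a large stride. */
struct particle_aos { double x, y, z, mass; };

/* Structure-of-arrays (SoA): each field is its own contiguous array,
 * so the same loop performs stride-1 accesses that vectorize and
 * coalesce well on CPUs and GPUs. */
struct particles_soa { double *x, *y, *z, *mass; };

double total_mass_aos(const struct particle_aos *p, size_t n) {
    double m = 0.0;
    for (size_t i = 0; i < n; i++)
        m += p[i].mass;      /* stride = sizeof(struct particle_aos) */
    return m;
}

double total_mass_soa(const struct particles_soa *p, size_t n) {
    double m = 0.0;
    for (size_t i = 0; i < n; i++)
        m += p->mass[i];     /* stride-1: consecutive doubles */
    return m;
}
```

Both functions compute the same result; the difference is purely in the memory access pattern that the compiler and hardware see.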

Fig. 2. An extension of the parallel bridge of Berkeley's view of the parallel computing landscape. The new tower Code inserted between Applications and parallel Hardware highlights the impact of the program code on the productivity of the parallelization process.

Powered by the hierarchical classification engine of Parallware technology [4], Parallware Trainer discovers parallel patterns in sequential code, provides a ranking of parallelization strategies, and generates pragma-based parallel source code using the standards OpenMP 4.5 and OpenACC 2.5. The tool is also pre-loaded with sample codes that cover the most important parallel patterns used in the codes available in the CORAL Benchmarks [6] and the NAS Parallel Benchmarks [7]. As shown in Fig. 2, programmers interact with compilers through Parallware's parallel patterns, which hide the complexity of the dependence terminology that is difficult to understand for scientists and engineers. Overall, Parallware Trainer reduces the costs of HPC training and education, the main advantages being: (1) reduction of learning effort and increase of learning retention through learning by doing, (2) high availability 24x7, and (3) a broader audience of STEM people not located near HPC training sites.

The rest of the paper is organized as follows. Section II discusses related work. Section III presents the new tool Parallware Trainer, describing its GUI layout and its technical features. The current technological roadmap under development in order to find the product-market fit is also sketched. Section IV describes the experience of staff of the EDANYA group using Parallware Trainer to learn how to parallelize a simulation code that helps the NOAA predict tsunamis. Finally, Section V presents conclusions and future work.

II. RELATED WORK

There are not many tools dedicated to training in parallel programming with OpenMP and OpenACC. HPC centers organize courses and workshops [2] that typically teach the most relevant parallel programming concepts using step-by-step instructions and hands-on labs.

Production-level compilers (e.g., Intel ICC, GNU GCC, NVIDIA PGI) are not of practical use to help understand the technical reasons behind success and failure in the parallelization of a code. Compiler user messages are written using the notation and terminology of classical dependence analysis theory, a mathematical approach to discover parallelism in sequential codes. Thus, compiler user messages typically report failures in discovering parallelism by pointing to source code instructions that may introduce true/output/anti-dependences during parallel execution. In contrast, Parallware uses a new computational approach that consists of a hierarchical classification scheme for dependence analysis. Parallware Trainer reports algorithmic features found in the code in terms of parallel patterns [4], such as fully parallel loops, parallel scalar reductions and parallel sparse reductions. In addition, it provides a ranking of several parallelization strategies that are applicable to the code, and allows the user to generate, study and run implementations of those strategies. These features are not typically available in production-level compilers.

Web-based HPC training [3] typically includes webinar series, video tutorials and code samples. However, these HPC training environments do not enable experiential learning because the environment does not provide any feedback about the problems encountered when the concepts are applied to the code of the developer. Parallware Trainer is a step forward that could be integrated in third-party web-based training environments as well.

III. PARALLWARE TRAINER

Parallware Trainer [1] is a new interactive commercial tool for high-productivity HPC education and training using OpenMP 4.5 and OpenACC 2.5. It allows experiential learning by providing an interactive, real-time GUI with editor capabilities to assist in the design and implementation of parallel code. Powered by the hierarchical classification engine of Parallware technology, it discovers parallelism using parallel patterns, and implements those patterns using the standards OpenMP 4.5 and OpenACC 2.5 (see the video tutorials "How to use Parallware Trainer" available at www.parallware.com).
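Two of the pattern classes just mentioned can be illustrated with short OpenMP-annotated C loops (a sketch of ours; the function names are illustrative, not Parallware output):

```c
#include <stddef.h>

/* Fully parallel loop: no loop-carried dependences, so every
 * iteration may execute on a different thread. */
void scale(double *a, const double *b, double k, size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        a[i] = k * b[i];
}

/* Parallel scalar reduction: every iteration updates the same scalar,
 * a dependence that the reduction clause resolves by giving each
 * thread a private partial sum combined at the end of the loop. */
double dot(const double *a, const double *b, size_t n) {
    double s = 0.0;
    #pragma omp parallel for reduction(+:s)
    for (long i = 0; i < (long)n; i++)
        s += a[i] * b[i];
    return s;
}
```

A classical dependence analyzer would report the update of `s` as a loop-carried dependence; the pattern view instead names it a scalar reduction and points directly at the `reduction(+:s)` fix.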

Fig. 3. Main screen of the GUI of Parallware Trainer. Layout composed of five panels: 1) Project manager, 2) Source code editor, 3) Parallel code editor, 4) Execution console, and 5) Parallware console. When clicking on a gutter attached to a scope (e.g., function, loop), the GUI shows a window that enables the user to provide additional hints to control the behaviour of the Parallware core technology.

A. Graphical User Interface (GUI)

The layout of the GUI of Parallware Trainer is shown in Fig. 3. It provides an environment for the editing, compilation and execution of sequential and OpenMP/OpenACC-enabled parallel code. Next, the panels of the main screen are described in detail.

1) Project Manager: Handles multiple Parallware Trainer projects, with drag-and-drop of the C/C++/Fortran source code files that compose your projects. A Parallware Trainer project consists of a directory with the tree structure shown in Fig. 4a. It separates the source code (src), binaries (obj), executables (bin), external libraries (lib), external header files (inc), and other resources such as data input files (res). The project directory also contains a Makefile generated automatically (see file ATMUX.make in Fig. 4a), which provides targets to clean, build and run the code of the project (see an excerpt of ATMUX.make in Fig. 4b). Note that the project directory is self-contained, and all the targets can also be executed from a terminal (outside of the Parallware Trainer GUI).

2) Source Code Editor: Edits multiple source code files of the active project. The GUI provides syntax highlighting for C/C++/Fortran. Each scope of the source code (e.g., functions, loops) has an attached gutter that enables the user to supply hints for controlling the analyses performed by Parallware core technology. As shown in Fig. 3, the user can select the parallel programming standard (OpenMP -or Default- and OpenACC), the device (CPU -or Default-, GPU and PHI), and the parallel programming paradigm (Loop -or Default-, Offload, Task and SIMD). The GUI also allows the user to select between three parallel implementations of parallel reductions widely used in the benchmarks CORAL and NASA NPB: Atomic access, Built-in reduction and Variable privatization (see details in [4]). Note that unsupported combinations of hints are conveniently reported to the user.

3) Parallel Code Editor: Handles multiple parallel versions of the same source code, the one corresponding to the active tab in the source code editor (see the version pw_atmux.c of the sequential code atmux.c in Fig. 3). By default, Parallware Trainer creates a parallel version automatically generated by Parallware core technology (see Fig. 3, version named pw_atmux.c). The user is allowed to create more parallel versions for the same sequential code (see Fig. 3, version named atmux_synchopt_manual.c). Such versions can be edited to fine-tune the OpenMP/OpenACC pragmas and clauses as needed to improve the performance of the parallel code. Finally, note that the GUI requests additional information from the user as needed (see the check list at the bottom of the editor of file pw_atmux.c).

4) Execution Console: The GUI provides two execution profiles (see Fig. 3): Sequential, to build and run the code using one thread; and Parallel, to enable OpenMP/OpenACC capabilities and run the code using multiple threads. The user is allowed to choose the preferred compiler suite (e.g., GCC, ICC, PGCC) and specify the flags needed to build and run the code.

5) Parallware Console: The user is allowed to browse the messages reported by Parallware core technology after the analysis of the source code, the one corresponding to the active tab in the source code editor. As shown in Fig. 3, Parallware reports the parallel patterns found in each loop, giving details about the source code instructions, variables and operators

involved in the pattern (see Fig. 3, loop at line 9, message "Parallel sparse reduction on variable 'y'"). It also reports a ranking of applicable parallelization strategies, the top of the ranking being the strategy automatically selected and implemented (see the sparse accumulation at line 16 protected with an OpenMP atomic pragma). The resulting OpenMP/OpenACC pragmas inserted by Parallware can be studied in the parallel code editor.

We close the description of the GUI of Parallware Trainer by mentioning that the tool is shipped with a tutorial and several self-learning examples of increasing technical complexity. The snapshots shown in Fig. 5 present four well-known sample codes with very different technical features (e.g., parallel pattern, unbalanced workload, arithmetic intensity). For the sake of clarity, a list of exercises proposed to the student for the sample code ATMUX is also presented.
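To illustrate the kind of code involved, here is a hedged sketch of ours of a CSR transposed matrix-vector product in the style of the ATMUX sample (the actual benchmark source may differ): the write through the index array makes the accumulation a sparse reduction, and the atomic pragma protects it.

```c
#include <stddef.h>

/* y = A^T * x for a matrix A in compressed sparse row (CSR) format.
 * The write y[col[j]] is indirect, so different rows i can update the
 * same entry of y; the atomic pragma makes each accumulation safe. */
void atmux(double *y, const double *x, const double *val,
           const int *col, const int *rowptr, int nrows) {
    #pragma omp parallel for
    for (int i = 0; i < nrows; i++) {
        for (int j = rowptr[i]; j < rowptr[i + 1]; j++) {
            #pragma omp atomic
            y[col[j]] += val[j] * x[i];
        }
    }
}
```

The alternative strategies in the ranking (built-in reduction over y, or privatizing a copy of y per thread) trade synchronization cost against memory footprint.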

B. Technological Roadmap for 2017-2018

Early prototypes of Parallware Trainer have been tested in PRACE PATC courses at the Barcelona Supercomputing Center (BSC) since 2015. Since February 2017 we have been conducting an early access program of Parallware Trainer to find the product-market fit. As of this writing we have 50 participants from 7 supercomputing centers (OLCF, NERSC, BSC, CESGA, LRZ, EPCC, TACC), 26 international universities and 7 companies or other organizations. In addition, key opinion leaders(1) are already supporting the rolling launch of Parallware Trainer at SC17. The feedback received from the HPC community reinforces our vision that Parallware Trainer may become a key enabler in the STEM field for self-learning of parallel programming in computer science, maths, physics, and related disciplines at PhD and undergraduate levels.

The technological roadmap of Parallware Trainer is guided by best practices on parallel programming with OpenMP and OpenACC. We have analyzed by hand the OpenMP/OpenACC implementations of well-known benchmark suites [4], more specifically, the CORAL Benchmarks [6], the NAS Parallel Benchmarks [7] and ORNL's XRayTrace miniapp [8]. As a result, Parallware Trainer currently supports:
• The most popular parallel patterns, namely, Parallel Forall, Parallel Scalar Reduction and Parallel Sparse Reduction (see details in [4]).
• The parallel programming paradigms Loop and Offload for modern CPU devices (e.g., Intel Xeon, IBM Power) and NVIDIA GPUs.
• OpenMP and OpenACC implementations of parallel scalar/sparse reductions using the approaches Atomic access, Built-in reduction and Variable privatization (see more details in [4]).

The HPC race to build (pre-)exascale supercomputers by 2024 is leading to the design of increasingly complex hardware and software stacks. From a hardware perspective, the technological roadmap of Parallware Trainer is aligned with the US Exascale Computing Project (ECP) [9] and will also support the Intel Xeon Phi accelerator. From a software perspective, we will add support for the parallel programming paradigms Task and SIMD, which are expected to have an increasingly important role in future (pre-)exascale supercomputers. Note that all of these options are shown in Fig. 3 in the snapshot of Parallware Trainer.

(1) (1) Fernanda Foertter (ORNL), recently elected as SIGHPC Education Vice Chair, co-designer of the technical features and the GUI of Parallware Trainer; (2) Xavier Martorell (BSC), promoter of early usage of Parallware Trainer in PRACE PATC courses at BSC; and (3) Dirk Pleiter (JSC), helping to increase awareness of Parallware Trainer at SC17.

Fig. 4. Internals of Parallware Trainer projects: (a) Tree view (project ATMUX); (b) Makefile (excerpt of ATMUX.make).
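The Task and SIMD paradigms named in the roadmap can be sketched with two small OpenMP examples of ours (illustrative only; the cutoff value and function names are assumptions, not tool output):

```c
#include <stddef.h>

/* SIMD paradigm: vectorize one loop across the lanes of a core. */
void axpy(double *y, const double *x, double a, size_t n) {
    #pragma omp simd
    for (long i = 0; i < (long)n; i++)
        y[i] += a * x[i];
}

/* Task paradigm: irregular work units scheduled dynamically; here
 * each half of a recursive sum becomes a task. Below the cutoff the
 * recursion falls back to a plain serial loop. */
double tree_sum(const double *x, size_t n) {
    if (n < 4) {                       /* serial cutoff (assumed) */
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += x[i];
        return s;
    }
    double lo = 0.0, hi = 0.0;
    #pragma omp task shared(lo)
    lo = tree_sum(x, n / 2);
    #pragma omp task shared(hi)
    hi = tree_sum(x + n / 2, n - n / 2);
    #pragma omp taskwait                /* join both halves */
    return lo + hi;
}
```

Both functions are also correct when compiled without OpenMP, in which case the pragmas are ignored and the code runs sequentially.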

IV. CASE STUDY: EDANYA GROUP

In the scope of the early access program of Parallware Trainer, we organized a GPU Hackathon at the Supercomputing Center of Galicia (CESGA) to collect direct feedback on the usage of the tool for self-learning of parallel programming using OpenMP and OpenACC (see http://www.appentra.com/cesgahack/). Hereafter, we present the self-learning experience of a two-person team coming from the EDANYA group of the University of Malaga (Spain).

The goal of the team was learning how to address the parallelization of a family of models for the simulation of geophysical flows. Among them, the code Tsunami-HySEA [10] has become very popular in the tsunami modeling community. Tsunami-HySEA is the numerical model of the HySEA family specifically designed for earthquake-generated tsunami simulations. It combines robustness, reliability and good accuracy in a GPU-based implementation that runs faster than real time. The Tsunami-HySEA model implements in the same code the three parts of an earthquake-generated tsunami: generation, propagation, and coastal inundation. In the generation stage, Okada's fault deformation model is used to predict the initial bottom deformation that is transmitted instantaneously to the sea surface, generating the tsunami wave. The propagation and coastal inundation are modeled by the well-known shallow-water PDE system, which is discretized by means of a second-order finite volume scheme. Tsunami-HySEA has been adopted as the official code for the Spanish and Italian tsunami early warning systems and, recently, has been successfully evaluated for the National Tsunami Hazard Mitigation Program (U.S.) and has also been tested at the National Center for Tsunami Research (NCTR) of the National Oceanic and Atmospheric Administration (NOAA).

A. Preparation of the GPU Hackathon

The GPU Hackathon at CESGA was oriented to R+D groups responsible for the development of simulation codes written in C/C++/Fortran. The EDANYA group presented Tsunami-HySEA, a C++ code that approximates the shallow-water PDE system using a second-order finite volume method. An interview with Prof. Manuel Castro, member of the EDANYA team and responsible for the tsunami modeling project, established the main goal of the team: develop an OpenACC version and compare its performance with the existing CUDA version.

The conversations also revealed that the Tsunami-HySEA code is too complex for a two-person team to address its parallelization in a three-day GPU Hackathon. As a result, the EDANYA team was requested to write a miniapp that contains a simple version of Tsunami-HySEA. As shown in Table I, the miniapp is a C code that focuses on the propagation of the tsunami in a 2D domain. It was designed to keep both the physics and the complexity of Tsunami-HySEA. This miniapp reduces the order of the problem by one, allowing more resolution in the vertical domain while keeping reasonably low running times. Both codes implement the same hydrostatic shallow-water model, which is run using different bathymetry inputs.

Fig. 5. Tutorial and sample examples shipped with Parallware Trainer: (a) Sample codes; (b) Exercises proposed for ATMUX.

TABLE I. Comparison of the miniapp used by the EDANYA team with respect to the real code Tsunami-HySEA.

                        Tsunami-HySEA               miniapp
  Model                 Hydrostatic shallow water   Hydrostatic shallow water
  Tsunami features      Generation, Propagation,    Propagation
                        Inundation
  Domain                3D                          2D
  No. layers            1                           2
  Numerical scheme      Finite volumes              Finite volumes
  No. volumes           10·10^6 – 60·10^6           10·10^3 – 60·10^4
  Bathymetry input      Real world                  Simple math function
  Programming language  C++                         C

Fig. 6. Profiling of the miniapp of the EDANYA team using gprof.

Listing 1. Pseudocode of the OpenMP-enabled EDANYA miniapp.

  void main() {
    #pragma omp parallel shared(var0, var1, varn, x, reconstructions, \
                                npar, ncapas, dx, dt, tiempo_actual, heps)
    {
      do {
        var1 = calcula_iteracion(var0, dt);
        dt = calcula_dt(var0);
        #pragma omp barrier
      } while (tiempo_actual ...

Listing 2. Pseudocode of the OpenACC-enabled EDANYA miniapp.

  void main() {
    #pragma acc data copyin(reconstructions, var1)
    {
      do {
        var1 = calcula_iteracion(var0, dt);
        dt = calcula_dt(var0);
      } while (tiempo_actual ...

  Problem size (No. volumes)    10000          20000          50000           100000
  OpenMP 3.1 running on multicore processor
  1 thread     65.28   −      260.03   −      1839.99   −      7276.63   −
  2 threads    42.07  1.55×   163.56  1.59×   1043.50  1.76×   4420.28  1.65×
  4 threads    19.22  3.40×    75.11  3.46×    486.22  3.78×   2093.91  3.48×
  6 threads    13.15  4.96×    51.50  5.05×    321.65  5.72×   1418.75  5.13×
  8 threads    10.03  6.51×    39.07  6.66×    243.21  7.57×   1072.03  6.79×
  OpenACC 2.5 running on accelerator processor
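Listing 1 is pseudocode with a truncated stopping condition; a self-contained runnable sketch of the same time-stepping pattern looks as follows. The physics kernels and the constant 0.25 time step are placeholders of ours, not the EDANYA code:

```c
#define N 1000

static double var0[N], var1[N];

/* Placeholder update kernel (NOT the EDANYA physics): one explicit
 * step of a toy advection-like scheme over the volume array. */
static void calcula_iteracion(double *out, const double *in, double dt) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        out[i] = in[i] + dt * (in[(i + 1) % N] - in[i]);
}

/* Placeholder CFL-like time-step computation: constant here. */
static double calcula_dt(const double *v) {
    (void)v;
    return 0.25;
}

/* Time-stepping loop of Listing 1: advance one step, recompute dt,
 * repeat until the simulated time reaches t_final. Returns the
 * number of steps taken. */
int run_simulation(double t_final) {
    double t = 0.0, dt = 0.25;
    int steps = 0;
    do {
        calcula_iteracion(var1, var0, dt);
        dt = calcula_dt(var1);
        for (int i = 0; i < N; i++)   /* new state becomes current */
            var0[i] = var1[i];
        t += dt;
        steps++;
    } while (t < t_final);
    return steps;
}
```

The structure matches the listings: the only loop-carried state is the solution array and the time step, which is why the OpenMP version needs a barrier between iterations while the OpenACC version keeps the data resident on the device across the whole do-while loop.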

V. CONCLUSIONS AND FUTURE WORK

This paper shows evidence that Parallware Trainer has the potential to become an effective tool to enable experiential learning of parallel programming using OpenMP 4.5 and OpenACC 2.5. Powered by Parallware core technology, the tool eases the discovery of the most popular parallel patterns used in OpenMP/OpenACC-enabled scientific applications, namely, Parallel Forall, Parallel Scalar Reduction and Parallel Sparse Reduction. The GUI supports implementing, building, running and benchmarking parallel patterns on multicore CPUs (e.g., Intel Xeon, IBM Power) and accelerator devices (e.g., NVIDIA GPU, Intel Xeon Phi).

The experience of the EDANYA team in the GPU Hackathon at CESGA showed that our methodology based on parallel patterns is: (1) simple, as STEM people without prior knowledge of parallel programming with OpenMP/OpenACC succeeded in discovering parallelism in sparse codes; (2) powerful, as the parallel patterns describe how to rewrite sequential code into efficient OpenMP/OpenACC-enabled parallel code. Starting from scratch and in only 3 days, the EDANYA team implemented a scalable OpenMP parallel version of their miniapp, and defined the roadmap for the development of an OpenACC version for NVIDIA GPUs.

As future work we plan to continue the validation of Parallware Trainer and its pattern-based methodology through more hackathons and control groups. We will improve the GUI for self-training, providing basic, intermediate and advanced-level courses as well as progress metrics to enable self-evaluation. In order to widen our target audience, we will add support for Fortran, for SIMD execution, and for tasking with asynchronous execution, which are expected to be key in the upcoming (pre-)exascale supercomputers. Finally, we also plan to conduct proof-of-concept experiments using a new SaaS product based on Parallware Trainer.

ACKNOWLEDGMENT

This research has been partially supported by the Spanish Government and FEDER through Research project MTM2015-70490-C2-1-R and Andalusian Government Research project P11-FQM-8179. Also thanks to the Supercomputing Centre of Galicia (CESGA) for supporting the organization of the GPU Hackathon and for providing access to the FinisTerrae supercomputer.

REFERENCES

[1] Parallware Trainer, http://www.parallware.com/, Sep. 2017.
[2] Swiss National Supercomputing Center (CSCS), Directive Based GPU Programming: OpenACC and OpenMP, http://www.cscs.ch/events/event_detail/index.html?tx_seminars_pi1%5BshowUid%5D=161, Sep. 2017.
[3] NVIDIA, Accelerate Applications on GPUs with OpenACC Directives, https://developer.nvidia.com/how-to-openacc, Sep. 2017.
[4] M. Arenaz, O. Hernandez, and D. Pleiter, The Technological Roadmap of Parallware and its Alignment with the OpenPOWER Ecosystem, International Workshop on OpenPOWER for HPC (IWOPH17), co-located with ISC17, 2017.
[5] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick, A view of the parallel computing landscape, Commun. ACM 52(10):56-67, October 2009. DOI: https://doi.org/10.1145/1562764.1562783.
[6] Department of Energy (DoE), CORAL Benchmark Codes, https://asc.llnl.gov/CORAL-benchmarks/, 2014.
[7] D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S. Schreiber, H.D. Simon, V. Venkatakrishnan, and S.K. Weeratunga, The NAS Parallel Benchmarks - Summary and Preliminary Results, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91), pp. 158-165, 1991. https://www.nas.nasa.gov/publications/npb.html, DOI: 10.1145/125826.125925.
[8] M. Berril, XRayTrace miniapp, https://code.ornl.gov/mbt/RayTrace-miniapp, 2017.
[9] Exascale Computing Project (ECP), Messina Update: The US Path to Exascale in 16 slides. See Programming Models, Development Environment and Tools (Intel Xeon, IBM Power, Intel Xeon Phi, NVIDIA GPU, OpenMP, OpenACC, task-based, LLVM, FLANG). https://www.hpcwire.com/2017/04/26/messina-update-u-s-path-exascale-15-slides/, last checked July 2017.
[10] M. de la Asunción, M.J. Castro, E.D. Fernández-Nieto, J.M. Mantas, S. Ortega, and J.M. González-Vida, Efficient GPU implementation of a two waves TVD-WAF method for the two-dimensional one layer shallow water system on structured meshes, Computers & Fluids 80:441-452, 2013.

Manuel Arenaz is CEO at Appentra Solutions and professor at the University of A Coruña (Spain). He holds a PhD in Computer Science from the University of A Coruña (2003) on advanced compiler techniques for automatic parallelization of scientific codes. Recently, he co-founded Appentra Solutions to commercialize products and services that take advantage of Parallware, a new technology for semantic analysis of scientific HPC codes.

Sergio Ortega obtained the Degree in Mathematics at the University of Malaga (2005), the Master in the department of Mathematical Analysis within the program of Physics and Mathematics (FISYMAT) (2007), and his PhD "High order finite volume schemes: GPU implementation and its application to the simulation of geophysical flows" (2016). Since 2007 he has worked in the EDANYA group of the University of Malaga.

Ernesto Guerrero-Fernández studied Industrial Engineering at the University of Malaga (2015), followed by a master in Industrial Mathematics at the University of (2017). He is currently pursuing his Ph.D. in applied mathematics with the EDANYA group at the University of Malaga under the supervision of Manuel J. Castro, where he is studying shallow water models and robust numerical algorithms for hyperbolic equations.

Fernanda Foertter is HPC User Support Specialist and Programmer at the US Department of Energy's (DOE's) Oak Ridge Leadership Computing Facility (OLCF). She is a computer geek interested in quantum chemistry, application development, parallelization and clusters. Recently elected to the position of SIGHPC Education Committee Vice Chair, where she will support the goals of SIGHPC Education to promote increased knowledge and greater interest in the educational and scientific aspects of HPC.