OPENMP 4.5 VALIDATION AND VERIFICATION TESTSUITE

DESIGN AND IMPLEMENTATION FOR OFFLOADING FEATURES

by

Jose Manuel Monsalve Diaz

A Master Thesis submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering

Winter 2020

© 2020 Jose Manuel Monsalve Diaz. All Rights Reserved.

Approved: Sunita Chandrasekaran, Ph.D. Professor in charge of Master Thesis on behalf of the Advisory Committee

Approved: Guang R. Gao, Ph.D. Co-Professor in charge of Master Thesis on behalf of the Advisory Committee

Approved: Kenneth E. Barner, Ph.D. Chair of the Department of Electrical and Computer Engineering

Approved: Levi Thompson, Ph.D. Dean of the College of Engineering

Approved: Douglas J. Doren, Ph.D. Interim Vice Provost for Graduate and Professional Education and Dean of the Graduate College

ACKNOWLEDGMENTS

This material is based upon work supported by the U.S. Department of Energy, Office of Science, the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration under contract number DE-AC05-00OR22725.

This project is a joint effort between several laboratories of the Department of Energy and the University of Delaware. In particular, Oak Ridge National Laboratory and Argonne National Laboratory are both large contributors to this project. It has been an effort of many people, who are acknowledged on our website for their contributions. While my contribution has been considerable, there is also a large group of people involved in this project whose contributions are invaluable. In particular, Dr. Swaroop Pophale (ORNL), Dr. Oscar Hernandez (ORNL), Dr. David E. Bernholdt (ORNL), Dr. Hal Finkel (ANL), and Professor Sunita Chandrasekaran (UD) are major leaders of this project, which is part of the SOLLVE initiative of the Exascale Computing Project. Additionally, Sergio Pino, MS, as well as undergraduate students Joshua Davis, Kyle Friedline, and Thomas Huber have contributed heavily to this project and have created many of the tests presented in this work.

Special recognition goes to other external collaborators, including many members of the OpenMP ARB, application developers who have submitted tests and ideas, and vendors that have contributed hardware donations. In particular, special recognition to AMD and for their donations of GPGPU accelerators, which we have used to run this testsuite. Also, special thanks to all the vendors that have provided us with feedback through the use of this software.

To my parents, my brother, my wife and all my family. They are who I would like to be one day.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF LISTINGS
ABSTRACT

Chapter

1 INTRODUCTION

2 OBJECTIVES AND PROBLEM FORMULATION

3 BACKGROUND AND MOTIVATION

3.1 A brief history and overview of OpenMP
3.2 OpenMP Accelerators Offloading support
3.3 Implementation of OpenMP offloading support in compilers
3.3.1 Challenges for compilers
3.3.2 Translating code for GPGPU device offloading
3.3.2.1 Dynamic Parallelism
3.3.2.2 If-Master coordination
3.3.2.3 Executor/Inspector
3.4 Compilers supporting OpenMP Offloading
3.4.1 GNU GCC
3.4.2 LLVM
3.4.3 Other Vendors and Devices

4 RELATED WORK

5 TEST SUITE INFRASTRUCTURE DESIGN

5.1 Test design and review process
5.2 Infrastructure
5.2.1 Development environment and website
5.2.2 Folder Structure
5.2.3 Tests structure, Header file and Module
5.2.4 Makefile
5.2.4.1 Rules for Makefile
5.2.4.2 Options for Makefile
5.2.4.2.1 CC, CXX, and FC: Compiler selection
5.2.4.2.2 OMP_VERSION
5.2.4.2.3 SOURCES
5.2.4.2.4 TESTS_TO_RUN
5.2.4.2.5 VERBOSE and VERBOSE_TESTS
5.2.4.2.6 LOG and LOG_ALL
5.2.4.2.7 LOG_DIR and BIN_DIR
5.2.4.2.8 SYSTEM, MODULE_LOAD and ADD_BATCH_SCHED
5.2.4.2.9 NO_OFFLOADING
5.2.4.2.10 REPORT_ONLINE_TAG and REPORT_ONLINE_APPEND
5.3 System customization
5.4 Results, Logs and Reports
5.4.1 Raw format
5.4.2 Summary Report
5.4.3 JSON format
5.4.4 CSV Format
5.4.5 HTML format
5.4.6 Online Report
5.5 Online Result report tool
5.5.1 The create tag operation
5.5.2 The obtain result operation
5.5.3 The delete result operation
5.5.4 The update result operation
5.5.5 The append result operation
5.6 Measuring overhead in current OpenMP offloading implementations

6 TEST EXAMPLES

6.1 Offloading to multiple devices
6.2 Handling task dependencies
6.3 Mapping C++ features
6.4 Mapping linked-list to device
6.5 Matrix Multiplication

7 FINDINGS AND RESULTS OF THIS WORK

7.1 Test bed configurations
7.2 Specification findings
7.3 Testsuite results
7.3.1 Summary of used systems, compilers and compiler versions
7.3.2 Programming language evolution
7.3.3 Compiler version results, errors and evolution
7.3.3.1 AOMP
7.3.3.2 IBM XL
7.3.3.3 GNU GCC
7.3.3.4 LLVM/Clang
7.4 Overhead
7.4.1 Offloading on Summit
7.4.2 Offloading on Fatnode
7.4.3 Combined Constructs
7.4.4 The effect of number of teams and number of threads

8 CONCLUSIONS AND FUTURE WORK

8.1 The future of OpenMP

REFERENCES

LIST OF TABLES

5.1 List of operations supported by the OMPVV header files for test formatting

5.2 Set of rules available in the Makefile

5.3 List of supported out-of-the-box compilers with the used flags for each

7.1 List of used compiler vendors, languages and their versions

7.2 List of compilers per system, compiler versions and used flags

LIST OF FIGURES

1.1 Top 500 List: number of systems reported in the list using accelerators (left axis), and evolution of the average number of cores per socket (yellow line and right axis)

1.2 Overlapping the evolution of OpenMP and OpenACC directive-based programming model specification releases with Figure 1.1

1.3 Counting elements per OpenMP specification document: number of pages in the PDF, number of constructs and directives defined, number of API functions defined, and number of environment variables defined

3.1 OpenMP 4.0+ Execution model

5.1 Workflow for developing the Validation and Verification Suite

5.2 Project folder tree structure

5.3 Snapshot of the HTML report generated by the testsuite

5.4 Diagram of the Online Report Infrastructure

6.1 Task graph created by Code 6.2

6.2 UML Diagram of the example in Listings 6.3, 6.4, and 6.5

7.1 Testbed systems. The different execution environments used in this study

7.2 Percentage of Pass and Fail in all results per programming language

7.3 AOMP Version evolution

7.4 IBM XL Version evolution

7.5 GNU GCC Version evolution

7.6 LLVM/Clang Version evolution. YTK corresponds to the CORAL Clang Compiler

7.7 Overhead measurement for offloading directives on Fatnode cluster

7.8 Overhead measurement for offloading directives on Summit

7.9 Comparing combined OpenMP directives vs. nesting of those directives

7.10 target teams distribute varying the number of teams. Effects of the number of teams on the overhead of the runtime on multiple compilers and systems

7.11 Effects of the number of teams and number of threads on the overhead of the runtime on multiple compilers and systems

LIST OF LISTINGS

3.1 Example of complex sequential and parallel regions inside the target

3.2 Mapping code from 3.1 into dynamic parallelism

3.3 Mapping code from 3.1 into if-master coordination

3.4 Mapping code from 3.1 into executor/inspector

5.1 Example of SOURCES Makefile option

5.2 Verbosity levels. Lines in green and starting with === are added by VERBOSE=1. Lines in purple and starting with --- are added by VERBOSE_TESTS=1

5.3 System description file example

5.4 Summary Report Example

5.5 The structure of the JSON report file

5.6 Online result output example

5.7 Online report TAG and APPEND options

5.8 Overhead measurement testing methodology

5.9 Contrasting Combined and Nested Directives

6.1 Offloading computation to multiple devices

6.2 Testing a task graph with dependencies

6.3 An implementation of a mapper class that will allocate memory on host and device

6.4 Inheritance from the mapper class of Listing 6.3

6.5 Inheritance from the mapper class of Listing 6.4

6.6 Mapping linked list to device

6.7 Matrix Multiplication

ABSTRACT

The OpenMP language features have been evolving to keep pace with the rapid development of hardware platforms. Device offloading was introduced in 2013 in the OpenMP 4.0 specification. Six years and two specification releases later, offloading features have improved considerably. Nevertheless, new features have also increased the length and complexity of the specification and of the corresponding implementations. Compilers and systems that support OpenMP offloading are therefore more prone to errors, whether from bugs in the implementation or from a misinterpretation of the specification. Several compiler vendors have implemented OpenMP 4.5 offloading and are currently working on OpenMP 5.0 implementations. The list includes GCC, Clang/LLVM, XL, CCE, AOMP, and ICC. There are three mainstream acceleration devices that can be used for offloading: Intel's Xeon Phi, AMD GPUs, and NVIDIA GPUs.

On the other hand, the Department of Energy (DOE) applications and leading supercomputer systems tend to push the bleeding edge of features ratified in the OpenMP specification and tend to expose implementations' rough edges. It is critical for current and future DOE supercomputers and applications that compilers and systems correctly support OpenMP offloading features.

This work presents the design and implementation of a testing infrastructure for OpenMP 4.5 offloading features. Our tests not only evaluate the OpenMP implementations but also expose ambiguities in the OpenMP 4.5 specification. We also evaluate compiler implementations using some kernels extracted from production DOE applications. This helps in assessing the interaction of different OpenMP directives independent of other application artifacts. We see this as a synergistic eff