Evaluating techniques for parallelization tuning in MPI, OmpSs and MPI/OmpSs

Author: Vladimir Subotić
Advisors: Prof. Jesús Labarta, Prof. Mateo Valero, Prof. Eduard Ayguadé

A THESIS SUBMITTED IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor per la Universitat Politècnica de Catalunya
Departament d'Arquitectura de Computadors
Barcelona, 2013

Abstract

Parallel programming is used to partition a computational problem among multiple processing units and to define how they interact (communicate and synchronize) in order to guarantee the correct result. The performance that is achieved when executing the parallel program on a parallel architecture is usually far from the optimal: computation imbalance and excessive interaction among processing units often cause lost cycles, reducing the efficiency of parallel computation.

In this thesis we propose techniques oriented to better exploit parallelism in parallel applications, with special emphasis on techniques that increase asynchronism. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would allow more flexible communication mechanisms, providing higher tolerance to slower networks and external noise.

In the first part of this thesis, we study the potential for tuning MPI parallelism. More specifically, we explore automatic techniques to overlap communication and computation. We propose a speculative messaging technique that increases the overlap and requires no changes to the original MPI application. Our technique automatically identifies the application's MPI activity and reinterprets that activity using optimally placed non-blocking MPI requests. We demonstrate that this overlapping technique increases the asynchronism of MPI messages, maximizing the overlap and consequently leading to execution speedup and higher tolerance to bandwidth reduction. However, in the case of realistic scientific workloads, we show that the overlapping potential is significantly limited by the pattern by which each MPI process locally operates on MPI messages.

In the second part of this thesis, we study the potential for tuning hybrid MPI/OmpSs applications. We try to gain a better understanding of the parallelism of hybrid MPI/OmpSs applications in order to evaluate how these applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale on a parallel machine with hundreds of cores per node. Furthermore, we investigate how this high parallelism within each node would reflect on the constraints of the interconnect. We especially focus on identifying critical code sections in MPI/OmpSs. We devised a technique that quickly evaluates, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits.

Also, this thesis studies techniques to quickly explore the potential OmpSs parallelism inherent in applications. We provide mechanisms for the programmer to easily evaluate the potential parallelism of any task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore the potential of automating the iterative approach by capturing the programmers' experience into an expert system that can autonomously lead the process of finding efficient task decompositions.
Also, throughout the work on this thesis, we designed development tools that can be useful to other researchers in the field. The most advanced of these tools is Tareador – a tool to help porting MPI applications to the MPI/OmpSs programming model. Tareador provides a simple interface to propose some decomposition of a code into OmpSs tasks. Then, based on the proposed decomposition, Tareador dynamically calculates data dependencies among the annotated tasks and automatically estimates the potential OmpSs parallelization. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Tareador has already proved itself useful by being included in the academic classes on parallel programming at UPC.

Acknowledgement

On January 12th 2007, I started my Ph.D. studies at the UPC. In the first two hours of "sitting in front of a computer", I managed to format my partition. Today, I'm about to defend my doctorate in computer science. I'm not the same guy from my first day, both personally and professionally. Many people helped me on my way. I owe them many thanks. I'm very thankful to Jesus, who taught me to be methodological and direct – I would say a brutal engineer. I'm thankful to Mateo, whose encouragement and support I learned to recognize and fully appreciate, always with a significant delay. I must thank Edu for "adopting" me close to the end of my studies and helping me finish my Ph.D. in a surprisingly good mood. I'm thankful to Jose Carlos, who helped me enormously in writing papers. Also, I wish to thank my numerous colleagues from the Barcelona Supercomputing Center who shared with me their working and non-working hours and made my Ph.D. time both more productive and enjoyable. I must single out Saša and Srdjan, who showed me many tricks of the trade. And I would like to specially mention Uri Prat, a big guy who had amazing patience with me, even in the days when I was a technical idiot. Finally, this thesis would not be possible without my girlfriend Jelena. She has stuck with me through thick and thin.

Errata

My graduate work has been financially supported by the following projects:

• Computacion de altas prestaciones V (TIN2007-60625).
• TEXT – Towards EXaflop applicaTions. EU Commission, Information Society Technologies, IST-2007-261580.
• BSC-IBM MareIncognito project. IBM Research.
• Intel-BSC Exascale Lab. Intel Corporation.

Also, throughout my Ph.D. studies, I received a grant from the Barcelona Supercomputing Center.

Contents

Acknowledgement

Contents

List of Figures

List of Tables

1 Introduction
  1.1 Goals
  1.2 Approach
  1.3 Contributions
  1.4 Document structure

2 Background
  2.1 Parallel machines
    2.1.1 Processor architecture trends
    2.1.2 Memory Organization
    2.1.3 Message Passing Cost
  2.2 Parallel programming models
    2.2.1 MPI
    2.2.2 OpenMP
    2.2.3 OmpSs
    2.2.4 Example - MPI/OmpSs vs. MPI/OpenMP
  2.3 Tools
    2.3.1 mpitrace
    2.3.2 Valgrind
    2.3.3 Paraver
    2.3.4 Dimemas

3 Motivation
  3.1 MPI programming
    3.1.1 Bulk-synchronous programming
    3.1.2 Communication Computation Overlap
  3.2 Our effort in tuning MPI parallelism
    3.2.1 Automatic overlap
  3.3 MPI/OmpSs programming
    3.3.1 Hiding communication delays
    3.3.2 Additional parallelism within an MPI process
  3.4 Our effort in tuning MPI/OmpSs parallelism
    3.4.1 Identifying parallelization bottlenecks
    3.4.2 Searching for the optimal task decomposition

4 Infrastructure
  4.1 Simulation aware tracing
    4.1.1 Illustration of the methodology
  4.2 Framework to identify potential overlap
    4.2.1 Implementation details
  4.3 Framework to replay MPI/OmpSs execution
    4.3.1 Implementation details
  4.4 Tareador – Framework to identify potential dataflow parallelism
    4.4.1 Implementation details
    4.4.2 Usage of Tareador

5 Overlapping communication and computation in MPI scientific applications
  5.1 Characteristic application behaviors
  5.2 Automatic Communication-Computation Overlap at the MPI Level
    5.2.1 Automatic overlap applied on the three characteristic behaviors
  5.3 Speculative Dataflow – A proposal to achieve automatic overlap
    5.3.1 Protocol of speculative dataflow
    5.3.2 Emulation
    5.3.3 Hardware support
    5.3.4 Conclusions and future research directions
  5.4 Quantifying the potential benefits of automatic overlap
    5.4.1 Experimental Setup
    5.4.2 Patterns of production and consumption
    5.4.3 Simulating potential overlap
    5.4.4 Conclusions and future research directions

6 Task-based dataflow parallelism
  6.1 Identifying critical code sections in dataflow parallel execution
    6.1.1 Motivation
    6.1.2 Motivation example interpreted by the state-of-the-art techniques
    6.1.3 Experiments
    6.1.4 Conclusion and future research directions
  6.2 Tareador: exploring parallelism inherent in applications
    6.2.1 Motivating example
    6.2.2 Experiments
    6.2.3 Conclusion and future research directions
  6.3 Automatic exploration of potential parallelism
    6.3.1 The search algorithm
    6.3.2 Heuristic 2: When to stop refining the decomposition
    6.3.3 Working environment
    6.3.4 Experiments
    6.3.5 Conclusion

7 Related Work
  7.1 Simulation methodologies for parallel computing
  7.2 Overlapping communication and computation
  7.3 Identifying parallelization bottlenecks
  7.4 Parallelization development tools

8 Conclusion
  8.1 Future work
    8.1.1 Parallelism for everyone: my view

9 Publications

Bibliography

List of Figures

2.1 The evolution of computing platforms. Chart originally from SPIRAL project website [72].
2.2 Example MPI code
2.3 Execution of the example MPI code
2.4 simple OpenMP code
2.5 simple OpenMP code
2.6 Pointer chasing application parallelized with OpenMP
2.7 Porting sequential C to OmpSs
2.8 OmpSs implementation of Cholesky
2.9 Dependency graph of Cholesky
2.10 Simple MPI code to study potential lookahead

3.1 The case of nonoverlapped MPI
3.2 The case of overlapped MPI
3.3 Overlap with chunks
3.4 Example of nonoverlapped MPI
3.5 Example of overlapped MPI/OmpSs
3.6 MPI/OmpSs execution with multiple cores per MPI process

4.1 Simulating different implementations of broadcast
4.2 The framework for studying overlap integrates Valgrind, Dimemas, and Paraver.
4.3 The environment integrates Mercurium code translator, MPISS tracer, tasks extractor, Dimemas simulator and Paraver visualization tool.
4.4 The code translation inserts functions that signal OmpSs pragma annotations.
4.5 Environment methodology
4.6 The Tareador framework integrates Mercurium code translator, Valgrind tracer, Dimemas simulator, Paraver and dependency graph visualization tool
4.7 Translation of the input code required by the framework.
4.8 Collecting trace of the original sequential and the potential OmpSs execution.

5.1 Three characteristic MPI behaviors that suffer from lack of overlap
5.2 Non-overlapped execution
5.3 Overlapped execution
5.4 Three characteristic behaviors with chunked overlap
5.5 mock-up for the evaluation
5.6 Influence of the network bandwidth on execution time
5.7 Bandwidth usage for the link bandwidth of 2.5GB/s.
5.8 Bandwidth usage for the link bandwidth of 250MB/s
5.9 Trace of a micro-imbalanced application
5.10 Overlapping in pipeline executions
5.11 Production and consumption patterns of various applications
5.12 Production and consumption patterns of the restructured Sweep3D
5.13 Paraver visualization for the non-overlapped and overlapped executions of NAS-CG.
5.14 Speedup of overlapped execution over original execution
5.15 Execution time for original and overlapped (real and ideal patterns) execution
5.16 Factor of bandwidth reduction for which the overlapped execution maintains the performance of the original execution on full bandwidth
5.17 Benefit of only advancing sends or only postponing receives
5.18 Simulation of the overlapped executions on the real and ideal production/consumption patterns.

6.1 Code of the motivating example.
6.2 Data-dependency graph of the motivating example.
6.3 The resulting speedup when accelerating different tasks.
6.4 Parallelism of OmpSs applications.
6.5 Number of cores active during execution.
6.6 Tasks profile
6.7 Speedup when one task is speeded up by 2x (OmpSs codes).
6.8 HM transpose: number of active cores
6.9 LU factorization: number of active cores
6.10 Parallelism of MPI/OmpSs applications.
6.11 Tasks profile
6.12 Speedup when one task is speeded up by 2x (MPI/OmpSs codes).
6.13 Execution of different possible taskifications for a code composed of four parts.
6.14 Exploring potential decomposition of Cholesky code
6.15 Number of task instances and the potential parallelism for various taskifications of Cholesky.
6.16 Speedup and parallel efficiency for T6 for various number of cores.
6.17 Exploring task decompositions of HP Linpack.
6.18 Number of task instances and the potential parallelism of each taskification.
6.19 Paraver visualization of the first 63 tasks and the dependencies among them (taskification T4, BS=256).
6.20 Speedup and parallel efficiency for T9 for various number of cores.
6.21 Algorithm for exploring possible task decompositions.
6.22 Iterative refinement of decompositions.
6.23 The environment that automatically explores possible task decompositions.
6.24 Code translation for automatic task decomposition.
6.25 Jacobi on 4 cores
6.26 HM transpose on 4 cores
6.27 Cholesky on 4 cores
6.28 Sparse LU on 4 cores
6.29 Sparse LU on 8 cores
6.30 Sparse LU on 16 cores

List of Tables

5.1 Average patterns of production and consumption
5.2 Number of network buses used in Dimemas for each application.
6.1 Speedup for different number of cores.

1 Introduction

Technological evolution demands increasing amounts of computational power and drives the appearance of brand new large-scale parallel machines. Today, virtually all science and engineering is based on high-accuracy numerical simulations. These simulations are extremely computation-intensive programs. In order to finish the computation in a limited time, these programs must execute in parallel. Therefore, parallel processing is becoming an indispensable part of modern science.

However, it is very hard to make a parallel machine work efficiently. As the number of processors in a system grows to hundreds and beyond, organizing inter-processor communication becomes a very important issue. In such large systems, inefficient communication and synchronization can introduce long execution stalls. These stalls prevent the processors from computing useful work and seriously harm performance.

In order to eliminate processor stalls, the computer architecture community tries to accelerate communication and synchronization mechanisms by investing more in interconnects. Hardware vendors constantly deliver more powerful networks. These networks constantly advance technological parameters, providing higher bandwidth and lower latency.

As a result of this trend, the cost of the interconnect is becoming a significant share of the total cost of these parallel machines [32]. Moreover, the trend of the Top 500 list [87] forecasts that the share of the interconnect in both the power and the cost of the whole system will keep increasing.

Therefore, the trend of simply improving the technological parameters of interconnects, and paying for those improvements, becomes economically unsustainable. Many recent studies show that, in High Performance Computing (HPC), interconnection networks are over-designed and yet underutilized [35]. Also, despite their increasing cost, new high-end interconnects deliver just a slight improvement in the overall performance. Thus, the community must optimize the utilization of network resources. Rather than trying to create new, faster and lately very expensive cycles in the network, the community must learn how to profit more from the already existing cycles.

To tackle the issue of under-utilization of parallel systems, in this thesis we explore techniques for parallelization tuning. We especially target techniques that increase asynchronism in parallel execution. Theoretically, this type of parallelization tuning promises multiple benefits. First, it should mitigate communication and synchronization delays, thus increasing the overall performance. Furthermore, parallelization tuning should expose additional parallelism and therefore increase the scalability of execution. Finally, increased asynchronism would allow more flexible communication mechanisms, providing higher tolerance to slower networks, as well as to external noise.

1.1 Goals

The major goal of this thesis is to explore techniques for parallelization tuning in MPI and MPI/OmpSs. In studying MPI execution, we evaluate automatic techniques for hiding communication delays. We explore techniques that can automatically overlap communication and computation, without the need to refactor the original legacy MPI code. Furthermore, we explore techniques for optimizing MPI/OmpSs execution. We especially focus on pinpointing bottlenecks of MPI/OmpSs execution. Also, we explore how MPI/OmpSs execution can be improved by changing the task decomposition of the code.

Also, throughout the work on this thesis, we strove to produce useful development infrastructure that can be used by both research and industry.

Nowadays, very few people can efficiently program in parallel [38, 49]. However, as the existing hardware becomes increasingly parallel, parallel programming will inevitably become mainstream. Therefore, this thesis also aims to produce new development environments that can further increase the overall understanding of parallelism. We hope that our development environments can facilitate the adoption of parallel programming, especially of the MPI/OmpSs programming model.

1.2 Approach

The experimental part of this thesis is based on trace-driven simulation. Initially, our study targets parallelization tuning of MPI execution. However, simulating MPI execution is a very hard problem. The simulation must process a high number of separate MPI processes that communicate and synchronize among themselves and share resources (e.g. the interconnection network). Thus, simulators of MPI execution are very computation-intensive and hard to parallelize. Our experience tells us that the mainstream solution for simulating MPI execution is trace-driven simulation at a high level of abstraction.

We customize the conventional trace-driven methodology by migrating the feature modeling effort from the simulation phase to the tracing phase. In the conventional trace-driven simulation, modeling a new feature is done in the simulation phase. The tracer instruments the application and collects the trace of the run. Then, the simulator replays the collected trace to reconstruct the time-behavior of the studied application on a parallel machine. When testing a new feature, the developer must extend the simulator in order to incorporate the effects of the inspected feature into the resulting time-behavior. On the other hand, in this thesis, we design a novel simulation methodology in which modeling a new feature is done in the tracing phase. In our simulation approach, we extend the tracer in order to introduce the effects of the inspected feature already into the collected trace. Then, the unchanged simulator can replay the collected trace and propagate the effect of the modeled feature throughout the simulated parallel execution.

Our methodology allowed us to simulate the effects of low-level features in large-scale systems. Our approach stresses the tracing part of the simulation, by moving the feature modeling effort into this phase. Since each MPI process is traced concurrently, the intensive computation for feature modeling is naturally parallelized across all MPI processes.

This parallelization of the feature modeling computation allowed us to build very powerful environments. In many of the developed tools, we implement tracers based on binary translation environments. These tracers instrument execution at the level of a single instruction and model very low-level features. On the other hand, an unchanged MPI simulator propagates the effects of the modeled feature throughout a very large-scale parallel machine.

Based on the presented methodology, we created new useful development environments. More specifically, we designed three development environments. The first environment automatically evaluates the potential of communication/computation overlap in MPI applications. The second environment extends the legacy MPI replaying toolchain (mpitrace, Dimemas) into a new MPI/OmpSs replaying toolchain (mpisstrace, Dimemas). Finally, the last environment is Tareador – a tool that guides the parallelization of sequential applications. Tareador proved to be especially useful for teaching parallel programming, so it was included in the academic program of UPC.

1.3 Contributions

The first contribution of the thesis consists of exploring techniques for parallelization tuning. The goal of these techniques is to hide communication delays and increase parallelism in scientific parallel applications:

1. In our study of techniques for tuning parallelism in MPI execution, we focus on exploring the potential of communication-computation overlap. We introduce speculative dataflow – a technique that automatically overlaps communication and computation in MPI applications, without the need to restructure the target legacy code. We describe the protocol of speculative dataflow, prove its feasibility and demonstrate its potential benefits in characteristic MPI applications. Furthermore, we show that in real scientific applications, overlap can achieve significant execution speedup, as well as higher tolerance to bandwidth reduction. However, we point out that the potential overlap is often seriously limited by the application's intrinsic computation pattern. In the case of one application, we illustrate how that computation pattern can be changed by refactoring the application. However, we conclude that it is impractical to do such manual refactoring in each application. In search of a dynamic technique for restructuring computation patterns, we start our study of task-based dataflow programming models (OmpSs and MPI/OmpSs).

2. In our study of techniques for tuning parallelism in MPI/OmpSs execution, we first identify parallelization bottlenecks in existing unmodified MPI/OmpSs applications. Years of practice in optimizing applications point out that the major issue is focus – identifying the code section whose optimization would yield the highest overall application speedup. We illustrate that, due to the irregular parallelism of MPI/OmpSs, the programmer can hardly identify the critical code section. Furthermore, we demonstrate that in many applications, the choice of the critical section decisively depends on the configuration of the target machine. For instance, in HP Linpack, optimizing a task that takes 0.49% of the total computation time yields an overall speedup of less than 0.25% on one machine and at the same time yields an overall speedup of more than 24% on a different machine. To tackle this issue, we devised an automatic approach that, for a given target parallel machine, identifies the critical code sections of an MPI/OmpSs application. Compared to the state-of-the-art research, our approach accounts for more influences, and estimates the potential benefits of the optimization in advance, before incurring any coding effort.

3. We further explore techniques to introduce OmpSs parallelism into existing MPI applications. OmpSs potentially extracts very irregular parallelism – parallelism that the programmer cannot identify himself. Thus, for a programmer without any development support, it is very hard to anticipate whether some task decomposition exposes parallelism or not. To that end, we provide mechanisms for the programmer to quickly evaluate the potential parallelism of any OmpSs task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine. Finally, we explore the potential of automating the iterative approach by capturing the programmers' experience into an expert system that can autonomously lead the process of finding efficient task decompositions.

The second contribution of the thesis consists in creating an infrastructure that can be used for future research, as well as for educating programmers about parallel programming. To that end, we developed two environments:

1. mpisstrace – an environment for replaying MPI/OmpSs parallel execution. Already existing BSC tools allow tracing MPI execution with mpitrace and replaying that execution with Dimemas. Based on this infrastructure, we extended the MPI tracing library in order to instrument MPI/OmpSs codes, and Dimemas in order to support task-based dataflow execution. As a result, we obtained a framework that is fully compatible with the legacy BSC tool-chain, but also supports simulating MPI/OmpSs execution.

2. Tareador – a tool to assist porting MPI applications to MPI/OmpSs. Having an MPI application, the programmer can very simply propose some decomposition of the code into tasks. Then, Tareador dynamically identifies data dependencies among the annotated tasks and reconstructs the potential parallel time-behavior. If the programmer is satisfied with the obtained parallelization, Tareador can further assist the process of porting the application by identifying the input and output parameters of each task. Tareador has already proved to be so useful in exploring parallelism that it was included in the academic program of UPC.

1.4 Document structure

The document is organized as follows. Chapter 2 presents the state-of-the-art technology related to the topic of this thesis. It provides a survey on the hardware architecture, programming models and development tools related to this thesis. Chapter 3 illustrates the performance issues in parallel computing and introduces the techniques for parallelization tuning addressed in this thesis. In Chapter 4, we describe the development environments designed during the work on this thesis. Furthermore, Chapter 5 presents our work in the field of tuning MPI parallelism. More specifically, it describes our work in the field of overlapping communication and computation in MPI applications. Furthermore, Chapter 6 presents our work in the field of tuning MPI/OmpSs parallelism. Namely, it describes our techniques for identifying parallelization bottlenecks in MPI/OmpSs and exploring the potential MPI/OmpSs parallelism inherent in applications.

Chapter 7 presents the previous work related to the research covered in this thesis, while Chapter 8 draws the conclusions of this thesis and presents the directions of future research in this field. Finally, Chapter 9 lists the papers that we published throughout the work on this thesis.

2 Background

This Chapter presents the background and state-of-the-art relevant to this thesis. Section 2.1 presents background on the architecture of parallel machines. It focuses on the classification of parallel machines into shared-memory and distributed-memory machines, especially analyzing the cost of communication among separate processing units. Section 2.2 describes the mainstream parallel programming models. It revisits the most widely used programming models for distributed-memory parallel machines (MPI) and shared-memory parallel machines (OpenMP). Also, we describe the OmpSs programming model that extends OpenMP providing semantics for expressing dataflow parallelism. Finally, in Section 2.3 we present the legacy tools that are used throughout this thesis. These tools are the starting point of the thesis – our initial extensions finally graduated into fully independent tools.

2.1 Parallel machines

The constant need for higher performance caused the appearance of machines with parallel architecture. A conventional sequential computer consists of the processor connected to the memory via a datapath. Intensive computation makes each of these three parts a bottleneck. As directly accelerating these parts proved unfeasible, the solution was found in multiplying resources. However, in order to take advantage of this multiplicity, the execution must exhibit parallelism. The parallelism can be implicit (hidden from the programmer) or explicit (directly expressed by the programmer). In the rest of this Section, we present architectural concepts related to parallel processing.

2.1.1 Processor architecture trends

The initial trend in manufacturing more powerful computers was building processors with higher clock frequency. Processor chips are the key components of computers. The most straightforward approach for making faster computers is making a computer that operates at a higher clock frequency. For a long time, the vendors provided computers with increased frequency, making that property the selling point for all processor chips. Frequency scaling provides an automatic (free) speedup for any software – due to frequency scaling, unchanged software automatically achieves a speedup proportional to the frequency improvement (gray region in Figure 2.1). However, increased frequency dramatically increases the chip's power consumption. Therefore, computing platforms have hit the so called "power wall" [68], with frequency scaling no longer appealing beyond 3 GHz.

As the trend of increasing clock frequency became unsustainable, the vendors started investing more effort in designing architectural improvements on the chip. In 1965, Gordon Moore made an empirical observation that the number of transistors on a processor chip doubles every 18-24 months. This observation, called Moore's law [77], still holds. The increased number of transistors in the chip is used for architectural improvements, such as additional functional units, additional registers, wider paths, etc. This resource abundance increases the chip's floating-point peak performance (light-red region in Figure 2.1). However, unlike frequency scaling, the resource abundance provides no automatic speedup of execution. Therefore, in order to achieve execution speedup, either hardware or software must include additional logic that leverages the additional resources.

Figure 2.1: The evolution of computing platforms. Chart originally from the SPIRAL project website [72].

The increased number of transistors in the cores leads to architectural improvements that increase the processor's internal use of parallelism. This type of parallelism is extracted in the hardware and entirely hidden from the programmer. There are two types of internal processor parallelism:

1. Pipelining [47]: The hardware breaks each instruction into pipeline stages, so different stages of different instructions can execute concurrently. Typically, an instruction is broken into the stages of fetch, decode, execute and write-back. The hardware itself checks for data dependencies among different instructions, and allows the independent pipeline stages to execute concurrently. This type of parallelism is called instruction-level parallelism (ILP). The available degree of parallelism theoretically increases with the number of pipeline stages. Current processors have between 2 and 26 pipeline stages.

2. Superscalar execution [47]: The hardware consists of multiple independent functional units. The processor is multi-issue – it issues multiple instructions at the same time in order to utilize multiple ALUs, FPUs, load/store units, etc. These multiplied resources can be utilized in parallel execution of different instructions, as long as data dependencies are satisfied. Depending on how the data dependencies among instructions are calculated, these processors can be classified into superscalar processors and very long instruction word (VLIW) processors. In superscalar processors, the hardware dynamically detects data dependencies, which significantly increases the architectural complexity of the chip. On the other hand, in VLIW processors, the compiler resolves dependencies, generating long instructions that explicitly specify which operations can execute concurrently.

However, since the techniques of implicit parallelism showed only limited potential, new computer architectures targeted higher performance by allowing the programmer to explicitly expose parallelism. The two techniques presented above exposed parallelism without any involvement of the programmer – the user programmed a sequential control flow but the underlying system automatically extracted parallelism. However, implicit parallelism proved to be insufficient. An alternative approach puts multiple independent cores in one chip and allows the programmer to have a different control flow in each of the cores. These parallel architectures provide more flexibility in using the computation resources, but also increase the complexity of programming.

2.1.2 Memory Organization

Parallel computers can have a shared or distributed-memory organization. Here, a clear distinction should be made between how the memory is physically organized in hardware and how the memory is perceived by the programmer. With respect to the physical organization of the memory, a parallel machine can have shared memory or distributed memory. Moreover, the machine can also have a hybrid organization where, at the level of one node, the machine has shared memory, while different nodes among themselves operate on distributed memory. However, from the programmer's point of view, all these physical systems can appear as systems with a shared or a distributed address space. The rest of this Section describes the differences between shared-memory and distributed-memory systems.

Distributed-Memory Organization

Distributed-memory machines (DMM) are computers with physically distributed memory. Each node is an independent unit that consists of a processor and a local memory. An interconnection network connects all the nodes and allows communication among them. Each node can only access its local memory. If a node needs data that is not in its local memory, the data needs to be transferred by sending messages through the network.

Distributed-memory machines improve efficiency by investing in faster and more intelligent networks. In DMMs, the nodes are usually connected by point-to-point interconnection links. Each node connects to a finite number of neighboring nodes. The network topology is regular, often a hypercube or a tree. Since each node can send a message only to its neighboring nodes, the limited connectivity significantly restricts programming. Initially, communication between nodes that have no direct connection had to be controlled by the software of the intermediate nodes. However, new intelligent network adapters enabled data transfers to or from the local memory without participation of the host processor. This allows the host processor to compute efficiently while, in the background, a transfer to or from its memory is in progress. Furthermore, state-of-the-art networks optimize communications by dedicating special links for executing multicast transfers.

A distributed-memory machine consists of loosely coupled processing units, making it easy to assemble but difficult to program. DMMs can be assembled using off-the-shelf desktop computers. However, to achieve high performance, the nodes must be interconnected using a fast network. On the other hand, DMMs are very difficult to program. The programmer must explicitly specify the data decomposition of the problem across processing units with separate address spaces. Also, the programmer must explicitly organize inter-processor communication, making both the sender and the receiver aware of the transfer. Moreover, the programmer must take special care about data partitioning among the nodes, because delivering some data from one node to another may be very expensive. Thus, the data layout must be selected carefully to minimize the amount of exchanged data.

Shared-Memory Organization

Shared-memory machines (SMM) are computers with a physically shared memory. An SMM consists of multiple processors, a shared physical memory (global memory), and the network that connects the processors and the memory. The processors communicate by reading and writing shared variables. The global memory usually provides a common address space over a set of memory modules. The programming model allows coordination of processors through accesses to the common address space. However, concurrent accesses to the shared data must be coordinated in order to avoid race conditions with unpredictable effects.

The user's perception of shared memory facilitates programming, at the cost of a difficult implementation of the machine. Communication through shared variables allows easy parallelization. The programmer is less concerned about data locality, expecting the machine to serve all requests for data. However, providing fast global memory access to multiple processors makes the technical realization of an SMM a serious effort. Increasing the number of processors in the system further stresses the performance of the global memory. Thus, the shared resource of global memory is an impediment to high scalability of these machines. Scaling these machines beyond tens of cores is very hard.

The most common parallel programming model for SMMs is threaded execution. Multiple threads have separate control flows but can access the shared global memory. A programmer expresses parallelism using the parallel constructs offered by the programming model. The programming model maps user threads on system kernel threads, while the operating system maps kernel threads to processors. These mapping algorithms are partly or entirely hidden from the programmer. Also, the operating system can take advantage of hardware parallelism by starting concurrent execution of different sequential programs on different processors.

Initial implementations of SMMs usually relied on uniform memory access (UMA) architectures. A UMA architecture provides a uniform access time from any processor to the single shared memory. Thus, UMA platforms are often called symmetric multiprocessors (SMP). The interconnect is typically a central bus that connects a small number of processors to the global memory. In this architecture, the interconnect becomes a bottleneck that limits the number of processors in the system.

Besides caches, there is no other memory private to the processors. Caches allow faster access times and prevent all data requests from going to the interconnect.

Today, many implementations of SMMs rely on the NUMA (non-uniform memory access) architecture. NUMA architectures offer the programmer the perception of a shared address space, although the underlying architecture is based on physically distributed memory. Thus, NUMA platforms are often called distributed shared-memory (DSM) machines. The perception of a shared address space is achieved using a cache coherence protocol. The coherence protocol guarantees that a memory access goes to the latest version of the variable, independent of where the variable is physically stored. However, coherence causes the access time to memory to depend on the physical location of the accessed data. An access to data that is stored locally is faster than an access to data that is stored in the local memory of some other node.

2.1.3 Message Passing Cost

This thesis focuses on achieving more efficient inter-processor communication. The efficiency of communication can significantly determine the overall performance of the parallel system. In this subsection, we explore in more detail the factors that determine the cost of message passing among nodes. Also, we explore possible techniques to reduce the transfer time of messages. The message transfer time is the sum of the time needed to prepare the message for transmission and the time needed for the message to traverse the network to its destination. The system parameters that determine the transfer time are:

• Startup time: Startup time is the processing time local to the node that is needed before sending and after receiving the message. This time accounts for the time needed to pack the data into the message (adding the message header, tail and error-correction information). Also, this time accounts for executing routing algorithms and for interfacing the processor with the router.

• Latency: On its way to the destination, the message has to make a finite number of hops via links that directly connect routers. The latency time is the product of the latency of a router and the length of the routing path (the number of hops needed to reach the destination).

• Bandwidth: Bandwidth determines the speed at which the data traverses the network. The bandwidth of the network depends on the clock rate and the width of the links, as well as on the speed of the routing and buffering within the routers. A first-order model that combines these three parameters is sketched below.
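
For orientation only, the transfer time of a single message is commonly approximated by combining the three parameters above. The symbols used here (t_s for the startup time, t_h for the per-hop router latency, l for the number of hops, m for the message size and B for the link bandwidth) are introduced purely for this sketch and do not come from the thesis:

t_{msg} \approx t_s + l \cdot t_h + \frac{m}{B}

Reducing any of the three terms shortens the transfer, which is the first family of improvements discussed below; the alternative explored in this thesis is to hide the whole cost behind useful computation.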

One way of improving message passing among nodes is to reduce the transfer time. Network vendors constantly deliver more expensive networks that have more resources. First, networks with a faster clock rate can provide higher bandwidth and lower latency. Latency can further be improved by increased connectivity among nodes. Furthermore, new network interfaces can offload message processing from the node, thus reducing the startup time. Also, the message transfer time can be significantly reduced by the programmer. By organizing a good data layout, the programmer can reduce the total amount of data that needs to be communicated. Moreover, the programmer can amortize the startup overhead by communicating in bulk – accumulating more data and communicating in larger messages.

However, this thesis focuses on different techniques that improve message passing – instead of reducing the time of the transfer, our goal is to hide the transfer time by overlapping it with useful computation. As already mentioned, state-of-the-art parallel machines can transfer messages without direct involvement of the processors. Thus, while the message is in transfer, the processor can be busy doing some useful computation. This approach potentially allows long transfers to execute without stalling the involved processors.

2.2 Parallel programming models

The search for the "holy grail" of parallel computing has failed – there is no efficient and highly applicable automatic parallelization. The biggest idea in the field of parallel computation was to find techniques that would automatically expose parallelism in applications. However, despite decades of work [78], automatic parallelization showed very limited potential. Today, the only viable solution is to rely on the programmer to expose parallelism.

For distributed-memory machines, the mainstream programming models are message passing (MP) and partitioned global address space (PGAS). The most popular implementation of the MP model is the Message Passing Interface (MPI) [81].

In MPI, the programmer must partition the workload among processes with separate address spaces. Also, the programmer must explicitly define how the processes communicate and synchronize in order to solve the problem. Conversely, the PGAS model is implemented in languages such as UPC [20], X10 [25] and Chapel [24]. The PGAS model provides a global view for expressing both data structures and the control flow. Thus, as opposed to message passing, the programmer writes the code as if a single process is running across many processors.

On the other hand, OpenMP [31] is a de-facto standard for programming shared-memory machines. OpenMP extends the sequential programming model with a set of directives to express shared-memory parallelism. These directives allow exposing fork-join parallelism, often targeting independent loop iterations. The OmpSs programming model [37] extends OpenMP, offering semantics to express dataflow parallelism. Compared to the fork-join parallelism exposed by OpenMP, the parallelism of OmpSs can be much more irregular. Throughout this thesis, we focus mainly on OmpSs as the programming model for shared-memory machines. We also explore the potential of MPI/OmpSs, a hybrid programming model that integrates OmpSs and MPI.
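
To give a first flavor of the dataflow semantics mentioned above, the fragment below annotates three pieces of work as tasks with in/out dependence clauses, letting the runtime build the task graph and order the tasks by their data dependencies. This is only an illustrative sketch (the exact OmpSs clause syntax has varied between releases and is presented in Section 2.2.3), and the helper functions are invented for this example:

/* Illustrative OmpSs-style dataflow tasking (hypothetical helpers; clause
   syntax differs slightly between OmpSs releases). */
#define N 1024

static void init(double a[N])                   { for (int i = 0; i < N; i++) a[i] = i; }
static void transform(double a[N], double b[N]) { for (int i = 0; i < N; i++) b[i] = 2.0 * a[i]; }
static void reduce(double b[N])                 { double s = 0; for (int i = 0; i < N; i++) s += b[i]; (void)s; }

void pipeline(void)
{
    double a[N], b[N];

    #pragma omp task out(a)          /* this task produces a */
    init(a);

    #pragma omp task in(a) out(b)    /* depends on the producer of a */
    transform(a, b);

    #pragma omp task in(b)           /* depends on the producer of b */
    reduce(b);

    #pragma omp taskwait             /* wait until the small task graph drains */
}

The point of the annotations is that the runtime, not the programmer, discovers that the three tasks must execute in this order, and that it may run unrelated tasks in parallel with them.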

2.2.1 MPI

Message Passing Interface (MPI) [81] is the most widely used programming model for programming distributed parallel machines. To facilitate writing message-passing programs, the MPI standard defines the syntax and semantics of useful library routines. Today, MPI is the dominant programming model in high-performance computing.

In the MPI programming model, multiple MPI processes compute in parallel and communicate by calling MPI library routines. At the initialization of the program, a fixed set of processes is created. Typically, the optimal performance is achieved when each MPI process is mapped on a separate core. For easier coordination among processes, the MPI interface provides functionality for communication, synchronization and virtual topologies.

The most essential functionality of MPI is point-to-point communication. The most popular library routines are: MPI_Send, to send a message to some specified process; and MPI_Recv, to receive a message from some specified process.

Point-to-point operations are especially useful for implementing irregular communication patterns. A point-to-point operation can take a synchronous, asynchronous, buffered, or ready form, providing stronger or weaker synchronization among the communicating processes. The ability to probe for messages allows MPI to support asynchronous communication. In asynchronous mode, the programmer can issue many outstanding MPI operations.

Collective operations allow communication among all processes in a process group. The process group may consist of the entire process pool, or it may be a user-defined subset of the entire pool. A typical operation is MPI_Bcast (broadcast), in which the specified root process sends the same message to all the processes in the specified group. A reverse operation is MPI_Gather, in which the specified root process receives one message from each of the processes in the specified group. Other collective operations implement more sophisticated communication patterns.

Throughout its evolution, the MPI standard has introduced new features to facilitate easier and more efficient parallel programming. The initial MPI-1 specification focused on message passing within a static runtime environment. Additionally, MPI-2 includes new features such as parallel I/O, dynamic process management, one-sided communication, etc. Furthermore, recent research studies [50] motivated the MPI community to include non-blocking collective operations as a part of the MPI-3 standard.
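
As a concrete illustration of the two collectives named above, the hedged sketch below broadcasts one parameter from the root and then gathers one result per process back to the root; the variable names and values are invented for this example and do not come from the thesis:

#include <stdlib.h>
#include <mpi.h>

/* Minimal sketch of MPI_Bcast and MPI_Gather over MPI_COMM_WORLD. */
int main(int argc, char *argv[])
{
    int rank, size;
    double param = 0.0, local_result;
    double *all_results = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        param = 3.14;                      /* value initially known only to the root */

    /* the root (rank 0) sends the same value to every process in the group */
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    local_result = param * rank;           /* every process computes its share */

    if (rank == 0)
        all_results = malloc(size * sizeof(double));

    /* the reverse operation: the root receives one value from every process */
    MPI_Gather(&local_result, 1, MPI_DOUBLE,
               all_results, 1, MPI_DOUBLE,
               0, MPI_COMM_WORLD);

    free(all_results);                     /* free(NULL) is a no-op on non-root ranks */
    MPI_Finalize();
    return 0;
}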

A simple program

In this Section, we present a simple program and explain the runtime properties of MPI execution. The example shows a simple code with only two sections of useful work (function compute) and one section that exchanges data (function MPI_Sendrecv). Each MPI process executes function compute on the local buffer buff1, then sends the calculated buffer buff1 to its neighbor. At the same time, each process receives buffer buff2 from another neighbor, and again executes function compute on the received buffer. The processes communicate in a one-sided ring pattern – each process receives the buffer from the process whose rank is lower by 1, and sends the calculated buffer to the process whose rank is higher by 1.

All the processes start independently, learning about the parallel environment by calling MPI_Init. MPI execution starts by calling the MPI agent (mpirun -n x ./binary.exe) that spawns the specified number (x) of MPI processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* The original figure's include names, BUFSIZE value and compute() body are
   not recoverable here; illustrative stand-ins are supplied so that the
   example compiles. */
#define BUFSIZE 1024

void compute(float *buff)
{
    for (int i = 0; i < BUFSIZE; i++)
        buff[i] = (float)i;   /* placeholder for the useful local computation */
}

int main(int argc, char *argv[])
{
    float buff1[BUFSIZE], buff2[BUFSIZE];
    int numprocs;
    int myid;
    int tag = 1;
    int my_dest, my_src;

    MPI_Status stat;
    MPI_Init(&argc, &argv);

    /* find out how big the SPMD world is */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    /* and this process' rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* At this point, all programs are running equivalently;
       the rank distinguishes the roles of the programs in the SPMD model */

    /* compute on the local data (buff1) */
    compute(buff1);

    /* exchange data (send buff1 and receive buff2) */
    my_dest = (myid + 1) % numprocs;
    my_src = (myid + numprocs - 1) % numprocs;
    MPI_Sendrecv(/* sending buffer */          buff1, BUFSIZE, MPI_FLOAT,
                 /* destination MPI process */ my_dest, tag,
                 /* receiving buffer */        buff2, BUFSIZE, MPI_FLOAT,
                 /* source MPI process */      my_src, tag,
                 MPI_COMM_WORLD, &stat);

    /* compute on the received data (buff2) */
    compute(buff2);

    /* MPI programs end with MPI_Finalize; this is a weak synchronization point */
    MPI_Finalize();

    return 0;
}

Figure 2.2: Example MPI code

In the studied case (Figure 2.3), the MPI execution starts with 2 independent MPI processes. By calling MPI_Init, each process learns about the MPI parallel environment. All the MPI processes are grouped into the universal communicator (MPI_COMM_WORLD).

Figure 2.3: Execution of the example MPI code (timelines of MPI process 0 and MPI process 1: both create the communicator, compute on buff1, exchange buff1/buff2 while waiting for the arrival of buff2, compute on buff2, and meet in MPI_Finalize).

Using the universal communicator, each process identifies the total number of MPI processes in the system (MPI_Comm_size) and gets the unique rank of the process (MPI_Comm_rank). Each MPI process, knowing its rank and the size of the universal communicator, identifies its role in the execution of the parallel program. Each process identifies the part of the total workload that is assigned to it. Also, each process identifies the ranks of the neighboring processes with which it should communicate in order to get the job done. In the presented example, based on myid and numprocs, each process calculates the ranks of its neighboring processes (my_dest and my_src) to generate the one-sided ring communication pattern.

When the computation on the local data finishes, the processes communicate to exchange the data and start the next phase of computing on the local data. Each process calculates the buffer buff1 in the function compute. Then, the process sends the processed buff1 to the neighboring process. At the same time, the process receives a message from its other neighboring process and stores the content of that message into the local buffer buff2. Then, the process locally computes on buff2 in another instantiation of function compute. Thus, MPI process 0 calculated on its local buff1, and then, after the MPI_Sendrecv call, it calculated on its local buff2 (which was buff1 local to MPI process 1).

When the useful work finishes, all the processes call MPI_Finalize to announce the end of the parallel section. MPI_Finalize implicitly calls a barrier, waiting for all the MPI processes from MPI_COMM_WORLD to come to this point. When all the processes reach the barrier, the joint work is guaranteed to be finished, and all the processes can exit the parallel execution independently. The parallel execution finishes.

This simple example also illustrates one of the main topics of this thesis – communication delays that are characteristic of MPI execution. When both processes finish executing compute(buff1), they initiate their transfers at the same moment. While the messages are in transit, both processes are stalled without doing any useful work. Chapter 3 further illustrates this problem and presents some of the possible solutions.

Different MPI implementations

The MPI standard defines a high-level user interface, while the low-level protocols may vary significantly depending on the implementation. MPI provides a simple-to-use portable interface for a basic user, setting a standard for hardware vendors of what they need to provide. This opens space for various MPI implementations that have different features and performance characteristics.

Depending on the implementation, MPI messaging may use different messaging protocols. An MPI message-passing protocol describes the internal methods and policies employed to accomplish message delivery. Two common protocols are:

• eager – an asynchronous protocol in which a send operation can execute without an acknowledgement from the matching receive; and

• rendezvous – a synchronous protocol in which a send operation can execute only upon the acknowledgement from the matching receive.

The eager protocol is faster, as it requires no "handshaking" synchronization. However, this relaxation of synchronization comes at the cost of increased memory usage for message buffering. Thus, a common implementation uses the eager protocol only for messages that are shorter than a specified threshold value. On the other hand, messages larger than the threshold are transferred using the rendezvous protocol.

Also, it is common that a very long message is partitioned into chunks, with each chunk being transferred using a separate rendezvous protocol.

Also, MPI implementations provide different interpretations of the independent progress of transfers. Independent progress defines whether the network interface is responsible for assuring progress on communications, independently of MPI library calls being made. This feature is especially important for messages that use the rendezvous protocol. For example, suppose rank 0 sends a non-blocking transfer to rank 1 using the rendezvous protocol. If rank 0 reaches its MPI_Isend before rank 1 reaches the matching receive, rank 0 issues a handshaking request and leaves the non-blocking send routine. Later, when rank 1 enters its corresponding MPI_Recv, it acknowledges the handshake, allowing rank 0 to send the message. The strict interpretation of independent progress mandates that rank 0 sends the actual message as soon as it receives the acknowledgement from rank 1. Conversely, the weak interpretation mandates that rank 0 must enter some MPI routine in order to process the acknowledgement and prepare for the actual message transfer. Here, the weak interpretation of independent progress will be very performance degrading if, after the non-blocking send, rank 0 enters a very long computation with no MPI routine calls. Most of the state-of-the-art networks provide the strict implementation of progress by introducing interrupt-driven functionality in the network interface.

Depending on the computational power of the network interface, an MPI implementation can provide a different ability for communication/computation overlap. The MPI standard specifies semantics for asynchronous communication that offer significant performance opportunities. However, in some machines, the processors are entirely responsible for assuring that the message reaches its destination. On the other hand, many state-of-the-art networks provide intelligent network adapters that take care of delivering the message, allowing the processor to dedicate itself to useful computation. This way it is possible to achieve overlap of communication and computation – a feature that is considered of major importance for high parallel performance.

Also, an MPI implementation can provide a different interpretation of offload. Offload is the feature that enables the processor to pass the overhead of MPI routines to the network interface.

both overlap and independent progress to be achieved without host processor overhead. In applications with frequent calls to the MPI library, offload can significantly boost the performance [19].
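To make the consequence of the weak interpretation concrete, the following sketch (our illustration, not code from any particular MPI implementation) shows the common workaround of periodically entering the MPI library through MPI_Test so that a pending rendezvous send can progress during a long computation; the function long_computation_step is a hypothetical placeholder for application work.

    #include <mpi.h>

    void long_computation_step(int step);   /* hypothetical application work */

    /* Sketch: progressing a non-blocking send under weak independent progress. */
    void send_with_manual_progress(double *buf, int count, int dest, MPI_Comm comm)
    {
        MPI_Request req;
        int done = 0;

        MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &req);

        /* Under the weak interpretation, the rendezvous acknowledgement is only
           processed inside MPI calls, so the library is entered between steps. */
        for (int step = 0; step < 1000; step++) {
            long_computation_step(step);
            if (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
        if (!done)
            MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

With the strict interpretation of progress (or with offload), the MPI_Test calls are unnecessary: the transfer advances while long_computation_step executes.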

2.2.2 OpenMP

OpenMP (Open Multi-Processing) [31] is the mainstream programming model for programming shared-memory parallel systems. OpenMP is an application programming interface (API) that supports multi-platform shared-memory programming in C, C++, and Fortran. It uses a portable model that provides programmers with a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer. OpenMP allows integration with MPI, providing a hybrid MPI/OpenMP model for parallel programming at large scale. OpenMP is a programming model based on fork-join parallelism. On reaching a parallel section, a master thread forks a specified number of slave threads. All the threads run concurrently, with the runtime environment mapping threads to different processors. When the parallel section finishes, the slave threads join back into the master. Finally, the master continues through the sequential section of the program. OpenMP provides a set of programming language extensions that the programmer can use to expose parallelism in the program. OpenMP extends the sequential programming model with a set of directives (pragmas in C/C++) to express shared-memory parallelism. Also, it provides a set of runtime library routines and environment variables that allow the programmer to dynamically modify parallel execution. The OpenMP language extensions can be classified into: 1) control structures for expressing parallelism; 2) data environment constructs for communicating among threads; and 3) synchronization constructs for coordinating threads. In the following paragraphs, we further explain these three types of extension. Control structures alter the flow of execution. There are two kinds of constructs for expressing parallelism. The parallel directive encapsulates a block of code and creates a set of threads that concurrently execute that block. The multiple concurrent threads execute different execution instances of that block of code. The second control structure is the do directive (for in C/C++) that allows the work to be divided among an existing set of threads.
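As a minimal sketch of these two kinds of control structures (our example, not code from the thesis), the parallel directive below creates a team of threads, and the work-sharing for directive then divides the iterations of a loop among the existing threads:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel                /* creates a team of threads */
        {
            printf("hello from thread %d\n", omp_get_thread_num());

            #pragma omp for                 /* divides the iterations among the team */
            for (int i = 0; i < 16; i++)
                printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
        }                                   /* threads join at the end of the region */
        return 0;
    }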

Data environment constructs enable communication among threads. When the master thread reaches a parallel construct, it creates new slave threads with private execution contexts. This enables a new thread to execute without interfering with the stack frames of other threads. Data environment constructs allow the programmer to choose whether some variable will be shared among threads. Each variable has one of the three sharing attributes (a short sketch follows the list):

• shared: A variable with shared scope clause has a single storage location in memory during the parallel construct. Using this unique memory location, the threads can communicate through read/write operations.

• private: A variable with private scope clause has a separate storage location for each thread during the parallel construct. All read/write operations on this location are protected from the accesses of other threads.

• reduction: The reduction clause is used for variables that are the target of reduction operations. This sharing attribute mixes the properties of the shared and private attributes, allowing the compiler to optimize accesses to such variables.
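A minimal sketch of the three sharing attributes (an illustration of ours, with a hypothetical array a of N elements): the array is shared, a temporary is private to each thread, and the accumulator is declared as a reduction variable.

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        int  a[N];                 /* shared: one storage location for all threads   */
        int  tmp;                  /* private: every thread gets its own copy         */
        long sum = 0;              /* reduction: private partial sums combined at end */

        #pragma omp parallel for shared(a) private(tmp) reduction(+:sum)
        for (int i = 0; i < N; i++) {
            tmp  = i * i;          /* safe: tmp is private to the executing thread    */
            a[i] = tmp;            /* safe: each iteration writes a different element */
            sum += a[i];           /* partial sums are reduced into the shared sum    */
        }

        printf("sum = %ld\n", sum);
        return 0;
    }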

Synchronization constructs coordinate the execution of threads. OpenMP threads communicate via read/write operations to the shared variables. However, without any thread synchronization, concurrent accesses to shared variables may induce race conditions. Two common OpenMP synchronization mechanisms are mutual exclusion and event synchronization. Mutual exclusion guarantees that a shared variable is accessed exclusively by one thread. Exclusivity is achieved with the critical directive that encapsulates a block of code that can be accessed by one thread at a time. Event synchronization signals the occurrence of some event across multiple threads. The most common event synchronization construct is the barrier directive that defines a point where each thread waits for all the other threads to catch up. Besides critical and barrier, OpenMP provides other directives that can model more complicated synchronization patterns.
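The following short sketch (again our own example) combines the two mechanisms: the critical directive serializes updates to a shared counter, and the barrier directive makes every thread wait before one thread reports the result.

    #include <stdio.h>

    int main(void)
    {
        int hits = 0;                         /* shared variable */

        #pragma omp parallel
        {
            /* mutual exclusion: only one thread at a time updates hits */
            #pragma omp critical
            hits++;

            /* event synchronization: wait until every thread has contributed */
            #pragma omp barrier

            /* one thread reports the final value */
            #pragma omp single
            printf("threads that checked in: %d\n", hits);
        }
        return 0;
    }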

A simple OpenMP example

This Section analyzes a simple loop parallelization using OpenMP. Figure 2.4 shows a simple OpenMP code. The code consists of one loop that calls function compute

in each iteration. Before the loop, the buffer is initialized (function initialize), and after the loop, the buffer is validated (function validate). In each iteration of the loop, function compute executes on a different element of array a, making all the iterations of the loop independent. Thus, the appropriate OpenMP parallelization of this code uses the parallel construct to make all the iterations of the loop execute concurrently. The original sequential code is parallelized by adding only one line of code. Figure 2.5 illustrates the OpenMP parallel execution of the presented code. The program initializes with only one thread active – the main thread.

int main(int argc, char *argv[]) {
    const int N = 100000;
    int i, a[N];

    initialize(a);

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        compute(a[i]);

    validate(a);

    return 0;
}

Figure 2.4: simple OpenMP code

Figure 2.5: Parallel execution of the simple OpenMP code. The master thread executes initialize(a); on reaching the omp for loop it creates the slave threads; master and slave threads divide the iterations of the parallel loop (compute(a)) among themselves and execute them concurrently; when the loop finishes, an implicit barrier synchronizes all threads and the slave threads are destroyed; finally, the master thread executes validate(a).

After finishing function initialize, the main thread reaches the parallelized loop and spawns the specified number of slave threads (3 in this case). The four threads partition among themselves the iterations of the loop and start computing concurrently. The parallel for loop ends with an implicit barrier that waits for each thread to finish executing the iterations that are assigned to it. When all the threads finish their portion of work, the barrier condition is fulfilled. The main thread destroys all the slave threads and proceeds with the sequential section of the program (function validate). It is important to note that an OpenMP programmer just specifies how to divide the work between threads, but not how to execute that work on the underlying machine. For the programmer, a thread is a separate control flow that operates on the same memory as the main thread. Given that abstraction, the programmer does not care whether an OpenMP thread is implemented as an OS thread or a Pthread. Also, the programmer leaves it to the compiler (or runtime) to decide how to partition the iterations across the threads. However, to achieve higher data locality, OpenMP provides the programmer with special semantics to specify how the iterations should be distributed across threads. In the presented simple code, the scope of each variable is determined by the default OpenMP rules. By default, all the variables are shared if not declared differently. Thus, array a is shared. Each iteration of the parallelized loop accesses a different element of the array, avoiding any conflict of concurrent accesses to a shared variable. The loop index i, however, is an exception to the default OpenMP rules. Within a parallel construct, the loop index is automatically declared as a private variable. Thus, conflicts are avoided by making a private copy of the loop index for each thread. After the parallel loop finishes, these private copies are destroyed, and the value of i is assumed to be undefined. Again, our simple example requires no explicit synchronization. Synchronization is primarily used to coordinate accesses to shared variables. Since each iteration of the loop computes on a different element of array a, there is no need for explicit synchronization inside the parallel loop. However, in order to do the validation of the array (function validate), all the threads must finish their computation of loop iterations. Again, explicit synchronization constructs are avoided, because the parallel loop implicitly ends with a barrier. The barrier assures that all the concurrent threads have finished their work, so the main thread can proceed to validate.

(a) sequential code:

    p = listhead;
    while (p) {
        process(p);
        p = next(p);
    }

(b) parallelization with a parallel loop (the list is first translated into an array):

    num_elements = 0;
    p = listhead;
    while (p) {
        listitem[num_elements++] = p;
        p = next(p);
    }
    #pragma omp parallel for
    for (int i = 0; i < num_elements; i++)
        process(listitem[i]);

(c) parallelization with OpenMP tasks:

    #pragma omp parallel
    {
        #pragma omp single
        {
            p = listhead;
            while (p) {
                #pragma omp task
                process(p);
                p = next(p);
            }
        }
    }

Figure 2.6: Pointer chasing application parallelized with OpenMP

OpenMP tasks

Although the parallelization of the presented code appears easy and elegant, it remains a question whether OpenMP loop parallelization is widely applicable. OpenMP is tailored for applications with array-based computation. These applications have very regular parallelism and regular control structures. Thus, OpenMP can identify all work units at compile time and statically assign them to multiple threads. However, irregular parallelism is inherent in many applications such as tree data structure traversal, adaptive mesh refinement and dense linear algebra. These applications would be very hard to parallelize using only basic OpenMP syntax. Let us consider a possible OpenMP parallelization of the sequential code from Figure 2.6a. The program consists of a while loop that updates each element of the list. The code cannot be parallelized just by adding a parallel loop construct, because the list traversal would be incorrect. Thus, in order to use a parallel loop construct, the list first has to be translated into an array (Figure 2.6b). However, this translation causes an inadmissible overhead. In order to tackle this issue, OpenMP introduces support for tasks. Tasks are code segments that may be deferred to a later time. Compared to the already introduced work units, tasks are much more independent from the execution threads. First, a task is not bound to a specific thread – it can be dynamically scheduled on any of the active threads. Also, a task has its own data environment, instead of inheriting the data environment from the thread. Moreover, tasks may be synchronized among

themselves, rather than synchronizing only separate threads. This implementation of OpenMP tasks allows much higher expressibility of irregular parallelism. Figure 2.6c illustrates the possible parallelization of the studied code using OpenMP tasks. The master thread runs and dynamically spawns a task for each instantiation of the function process. On each instantiation, the content of pointer p is copied into the separate data environment of the task. Since the inputs to the task are saved, the task can execute later in time. Thus, the master thread sequentially traverses the list and dynamically spawns tasks. The spawned tasks are executed by the pool of worker threads. When the main thread finishes spawning all tasks, it joins the worker pool. Therefore, despite the irregular control structures, OpenMP tasks allow elegant parallelization. Since tasks have proven to be very powerful in exposing irregular parallelism, the OpenMP community concentrates on developing new tasking constructs that could further facilitate programming. Thus, the upcoming OpenMP API version 4.0 introduces data-dependencies between tasks. Namely, the 4.0 standard introduces a new task clause: depend(dependence-type : list), where dependence-type can be in, out or inout, while list is one or more storage locations. The depend clause allows specifying additional constraints on the scheduling of tasks. Hence, a task may be dependent on a previous sibling task (a task spawned from the same parent task), if the depend clauses of these tasks result in a RAW, WAW or WAR dependency on some storage location specified in the list (the specified storage locations must be either identical or disjoint). Furthermore, the OpenMP 4.0 API offers the taskyield directive that denotes an execution point at which the current task can be preempted in favor of the execution of some other task.
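A hedged sketch of the depend clause (our example, using the 4.0 syntax described above): two producer tasks write arrays x and y, and a consumer task declares in dependences on both, so it can be scheduled only after the producers complete. The helper functions fill and sum are hypothetical.

    #include <stdio.h>

    #define N 512

    static void  fill(float *b, int n, float v) { for (int i = 0; i < n; i++) b[i] = v; }
    static float sum(const float *b, int n)     { float s = 0; for (int i = 0; i < n; i++) s += b[i]; return s; }

    int main(void)
    {
        static float x[N], y[N];

        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: x)      /* producer of x */
            fill(x, N, 1.0f);

            #pragma omp task depend(out: y)      /* producer of y */
            fill(y, N, 2.0f);

            #pragma omp task depend(in: x, y)    /* runs only after both producers */
            printf("total = %f\n", sum(x, N) + sum(y, N));
        }                                        /* tasks complete at the implicit barrier */
        return 0;
    }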

2.2.3 OmpSs

OmpSs [37] is a parallel programming model based on dataflow execution. OmpSs is an effort to extend OpenMP with new directives to support dataflow parallelism. Compared to the fork-join parallelism exposed by OpenMP, the parallelism of OmpSs can be much more irregular and distant. Similar to the integration of MPI and OpenMP, OmpSs can integrate with MPI in the MPI/OmpSs hybrid parallel programming model. OmpSs slightly extends C, C++ and Fortran, providing semantics to express task-based dataflow parallelism. There are two essential annotations needed to port a sequential application to OmpSs (Figure 2.7):

• task decomposition – marking, with pragma statements, which functions should be executed as tasks; and

• directionality of parameters – marking, inside the pragmas, how the passed arguments are used within the taskified function. The specified directionality can be input, output or inout.

#pragma omp task input(A) output(B)
void compute(float *A, float *B) {
    ...
}

int main() {
    ...
    compute(a, b);
    ...
}

Figure 2.7: Porting sequential C to OmpSs

Given the OmpSs annotations, the runtime can schedule all tasks out-of-order, as long as the data dependencies are satisfied. The main thread starts and, when it reaches a taskified function, it instantiates that function as a task and proceeds. Based on the parameters’ directionality, the runtime places the obtained task instance in the dependency graph of all tasks. Then, considering the dependency graph, the runtime is free to dynamically schedule the execution of tasks on multiple worker threads. Also, whenever the main thread reaches a blocking condition (e.g. a barrier), it helps the worker threads by executing tasks. To increase parallelism, the runtime automatically renames data objects to avoid false dependencies (dependencies caused by buffer reuse). The renaming technique is similar to the ones introduced in superscalar processors [80] or optimizing compilers [54]. Renaming avoids false dependencies by eliminating write-after-read and write-after-write dependencies. Other programming models avoid these dependencies by

forcing the programmer to explicitly require per-thread copies of the variables. Conversely, OmpSs resolves this problem using automatic renaming. Whenever an array that is passed to a task has output directionality, the runtime automatically allocates a new array and operates on it. The runtime also allows configuring the amount of memory used for double buffering, to avoid the hazard of intensive swapping. The information passed in pragmas allows efficient scheduling of tasks. The programmer can add the highpriority clause to some task, indicating that the task has higher scheduling priority. These tasks are scheduled as soon as possible. On the other hand, while scheduling tasks of regular priority, the runtime strives to exploit data locality. The scheduler favors running a task in the thread that just generated one of the input parameters of the task. Also, the scheduler tries to isolate threads in the data-dependency graph of tasks, thus reducing the possibility of two different threads accessing the same data. Furthermore, in the case of imbalanced execution, threads can steal work from one another. In the case of work stealing, the thread chooses the task instance that has spent the most time in the waiting queue, thus increasing the probability that the input data for that instance has already been evicted from the cache of the thread it is stealing from. Also, OmpSs allows simple semantics to be used for programming heterogeneous architectures. By adding only one directive to the pragma construct, OmpSs can declare that instances of some task are to be executed on hardware accelerators. Reading these annotations, the runtime schedules the execution of the specified task on the dedicated hardware and automatically moves all the data needed for that task. This feature significantly facilitates the programming of heterogeneous architectures, as was proven for programming the Cell B./E. [11] and Nvidia GPUs [9]. The dataflow parallelism introduced in OmpSs has proven to be powerful in extracting parallelism, and it is finding its place in the standards of the mainstream parallel programming models. Perez et al. [69] showed that in many applications, OmpSs significantly outperforms fork-join based programming models such as OpenMP [31] and [42]. As already mentioned in Section 2.2.2, task-based dataflow parallelism is introduced in the OpenMP 4.0 standard.
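To make the renaming mechanism described above concrete, the hedged sketch below (our example, following the annotation style of Figure 2.7) reuses a single temporary buffer across iterations. Without renaming, the reuse of tmp creates write-after-read and write-after-write dependencies that serialize the iterations; with renaming, the runtime gives each produce its own storage, so different iterations may run concurrently. The produce and consume tasks are hypothetical.

    #define N     1024
    #define ITERS 8

    #pragma omp task output(tmp)
    void produce(float *tmp)              /* writes tmp: WAW with the previous produce */
    {
        for (int i = 0; i < N; i++) tmp[i] = (float)i;
    }

    #pragma omp task input(tmp) output(res)
    void consume(float *tmp, float *res)  /* reads tmp: WAR with the next produce */
    {
        for (int i = 0; i < N; i++) res[i] = 2.0f * tmp[i];
    }

    int main(void)
    {
        static float tmp[N];              /* single buffer, reused in every iteration */
        static float res[ITERS][N];

        for (int it = 0; it < ITERS; it++) {
            produce(tmp);
            consume(tmp, res[it]);
        }
        return 0;
    }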

Example of irregular parallelism – Cholesky

Figure 2.8 shows the OmpSs parallelization of the sequential Cholesky code. In order to parallelize the Cholesky code with OmpSs, only four code lines need to be added. All four functions called from compute are encapsulated into tasks using #pragma omp task directives. For each of these functions, the pragma directives also specify the directionality of the function parameters. This way of coordinating tasks on the shared variables is much easier for the programmer than determining which variables should be shared or private among the threads. After adding the annotations, the resulting code has the same logical structure as the original sequential code. Also, note that compiling this OmpSs code with a non-OmpSs compiler simply ignores the OmpSs pragmas and creates a binary for the corresponding sequential execution. In parallel execution of this code, the annotated tasks can execute out-of-order, as long as the data-dependencies are satisfied. The program starts with only one active thread – the master thread. When the master thread reaches a taskified function, it instantiates that function into a task and wires the new task instance into the task dependency graph. Considering the dependency graph, the runtime schedules out-of-order execution of tasks. Compared to OpenMP, OmpSs potentially exposes more distant and irregular parallelism. For example, some of the instances of task sgemm_tile are mutually independent, while some are data-dependent (Figure 2.9). This type of irregular concurrency would be very hard to express with OpenMP. However, the OmpSs runtime dynamically exposes the potential parallelism, keeping the programmer unaware of the actual dependencies among tasks. Also, in parallelizing instances of sgemm_tile, OpenMP would introduce an implicit barrier at the end of the loop. On the other hand, OmpSs omits this barrier, allowing instances of sgemm_tile to execute concurrently with some instances of task ssyrk_tile (Figure 2.9). Again, the programmer alone could hardly identify and expose this potential concurrency.

Interoperability with MPI

OmpSs integrates with MPI in a manner similar to the integration of MPI and OpenMP. The result is the hybrid MPI/OmpSs [63] programming model, in which the work is parallelized across separate address spaces using MPI, while the work of each MPI process is parallelized using OmpSs.

#pragma omp task input(NB) inout(A)
void spotrf_tile(float *A, unsigned long NB);

#pragma omp task input(A, B, NB) inout(C)
void sgemm_tile(float *A, float *B, float *C, unsigned long NB);

#pragma omp task input(T, NB) inout(B)
void strsm_tile(float *T, float *B, unsigned long NB);

#pragma omp task input(A, NB) inout(C)
void ssyrk_tile(float *A, float *C, long NB);

void compute(long NB, long DIM, float *A[DIM][DIM]) {

    for (long j = 0; j < DIM; j++) {

        for (long k = 0; k < j; k++) {
            for (long i = j+1; i < DIM; i++) {
                sgemm_tile( A[i][k], A[j][k], A[i][j], NB);
            }
        }

        for (long i = 0; i < j; i++) {
            ssyrk_tile( A[j][i], A[j][j], NB);
        }

        spotrf_tile( A[j][j], NB);

        for (long i = j+1; i < DIM; i++) {
            strsm_tile( A[j][j], A[i][j], NB);
        }
    }
}

Figure 2.8: OmpSs implementation of Cholesky

MPI/OmpSs synergizes dataflow execution with message passing, providing high and robust performance. MPI/OmpSs allows a programmer to taskify functions with MPI transfers and thus relate the messaging events to dataflow dependencies. For example, a task with an MPI_Recv of some buffer gets that buffer from the network and locally stores it (output pragma directionality) in memory. This way, the arrival of the MPI message is related to the data-dependencies among tasks. Then, the runtime can schedule OmpSs tasks in a way that overcomes the strong synchronization points of pure MPI execution. Marjanovic et al. [63] showed that, compared to MPI, MPI/OmpSs delivers not only better peak GFlops/s performance, but also better tolerance to network limitations and external perturbations, such as OS jitter.

Figure 2.9: Dependency graph of Cholesky ((a) legend of the four task types: spotrf_tile, strsm_tile, sgemm_tile, ssyrk_tile; (b) the dependency graph of the generated task instances).

In MPI/OmpSs, the runtime dedicates one high-priority thread exclusively to executing tasks with MPI transfers. A task with an MPI call may be very inefficient, because it may spend significant time stalled, waiting on some blocking transfer. It would be very beneficial if the runtime could preempt the stalled task and dedicate the computing resources to some task that is ready to do useful work. OmpSs solves this problem by instantiating as many worker threads as there are cores in the node, plus one additional thread dedicated only to executing tasks with MPI calls. Then, whenever the MPI thread stalls waiting for a message, a worker thread preempts the MPI thread in order to do some useful work. When the blocking MPI call completes, the MPI thread takes back control of the core and proceeds. Practice shows that it is very beneficial to give the communication thread higher priority than the computation threads. This allows instant preemption as soon as the blocking MPI call finishes, providing a very efficient and flexible mechanism for assuring progress of MPI messages. An alternative implementation of MPI/OmpSs [62] offers a restart directive that can restart a task that gets blocked on MPI calls. When a communication task gets blocked on some MPI call, it automatically restarts and goes back to the ready queue, offering its computation resources to other tasks with useful computation. However, this implementation requires more programming effort.

2.2.4 Example - MPI/OmpSs vs. MPI/OpenMP

Compared to MPI/OpenMP, MPI/OmpSs provides a higher potential for lookahead across MPI iterations. Lookahead of depth d means that, for sending the message produced in iteration n, the execution must complete all the computation of iteration n − d. Thus, lookahead of depth 0 means that sending a message of iteration n requires that all the tasks in iteration n are executed. Similarly, lookahead of depth 1 means that sending a message of iteration n requires that all the tasks from iteration n − 1 are executed. In this Section, based on a simple example, we illustrate the potential lookahead for the MPI/OpenMP and MPI/OmpSs programming models. Figure 2.10 illustrates a simple MPI code. The code consists of sections that produce buffer a (the loop calling compute1 and independent_work1), communicate the data (sending buffer a and receiving buffer c in MPI_SendRecv), and consume the received buffer c (the loop calling compute2 and independent_work2). The only data-dependencies in this code are from all instances of compute1 to MPI_SendRecv, and from MPI_SendRecv to all instances of compute2. In this program, MPI/OmpSs can easily achieve lookahead and overlap. An intuitive OmpSs parallelization encapsulates all the functions into OmpSs tasks. In particular, task MPI_SendRecv is marked to be a communication task. The runtime automatically detects all the data-dependencies – from all instances of compute1 to

MPI_SendRecv, and from MPI_SendRecv to all instances of compute2. There are no dependencies between task instances of independent_work1 and independent_work2. The communication task (MPI_SendRecv) gets an outstanding scheduling priority – a priority that allows it to preempt other tasks. Since they give a dependency to the communication task, the instances of compute1 get a high scheduling priority. Therefore, the instances of compute1 are scheduled first for parallel execution. As soon as they complete, the runtime schedules MPI_SendRecv. As MPI_SendRecv blocks on receiving array c, the runtime preempts it and schedules parallel execution of tasks independent_work1 and independent_work2. As soon as the message arrives, task MPI_SendRecv finishes, and all the tasks that have not executed so far are free to execute concurrently. This type of execution brings two benefits:

• overlap of computation and communication – while an MPI process waits for the incoming message, instances of independent_work1 and independent_work2 do some useful work

• lookahead – an MPI process can perform its communication (MPI_SendRecv) before completing all instances of independent_work1. This allows parallel execution of tasks from different sides of the communication request (instances of independent_work1 can execute concurrently with instances of independent_work2).

int main(int argc, char *argv[]) {
    const int N = 100000;
    int i, a[N], b[N], c[N];

    for (i = 0; i < N; i++) {
        compute1(a[i]);
        independent_work1(b[i]);
    }

    MPI_SendRecv( /* send buf */ a,
                  /* recv buf */ c);

    for (i = 0; i < N; i++) {
        compute2(c[i]);
        independent_work2(b[i]);
    }

    return 0;
}

Figure 2.10: Simple MPI code to study potential lookahead

On the other hand, in this example, MPI/OpenMP can achieve neither lookahead nor overlap. Using an OpenMP parallel for construct, the programmer can parallelize the loops in the code. However, this construct has an implicit barrier at the end of the loop, so MPI_SendRecv must wait for all the instances of compute1 and independent_work1 from the first loop to finish. Also, since each iteration of the second loop calls compute2, none of the iterations can start before the transfer is finished. The resulting parallel execution has:

• no overlap – during MPI_SendRecv the processors are idle

• no lookahead – there is no concurrency between some work before the communication and some work after the communication.

In order to achieve lookahead using OpenMP, the programmer must use advanced synchronization techniques or severe restructuring of the code. Since functions independent_work1 and independent_work2 are independent of the other tasks, the programmer can restructure the code and put these functions into separate loops. Then the programmer must break the MPI_SendRecv call into a non-blocking send and a non-blocking receive call. Thus, the programmer must explicitly assure that the communication will be overlapped with computation. Furthermore, to achieve lookahead, the programmer must avoid the simple parallel for construct and therefore avoid the implicit barrier it forces. Thus, the programmer must implement a customized synchronization that will assure correctness of the execution. The complexity of this approach makes it impractical for implementation in complex applications.
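For illustration only, the following hedged sketch indicates the kind of restructuring described above, applied to the code of Figure 2.10: the combined transfer is broken into MPI_Irecv/MPI_Isend issued by a single thread, the nowait clause removes implicit barriers around the independent loops (the thesis states these loops are mutually independent), and the independent work proceeds while the transfer is in flight. The element-wise work functions are assumed to take pointers so that they can actually produce the arrays, and MPI is assumed to be initialized with at least MPI_THREAD_SERIALIZED; none of this is code taken from the thesis.

    #include <mpi.h>

    #define N 100000

    /* hypothetical work functions, adapted from Figure 2.10 */
    void compute1(int *x);          void compute2(int *x);
    void independent_work1(int *x); void independent_work2(int *x);

    void overlapped_iteration(int peer, MPI_Comm comm)
    {
        static int a[N], b[N], c[N];
        MPI_Request req[2];

        #pragma omp parallel
        {
            #pragma omp for                       /* implicit barrier: a must be complete */
            for (int i = 0; i < N; i++) compute1(&a[i]);

            #pragma omp single nowait             /* one thread starts the transfers */
            {
                MPI_Irecv(c, N, MPI_INT, peer, 0, comm, &req[0]);
                MPI_Isend(a, N, MPI_INT, peer, 0, comm, &req[1]);
            }

            #pragma omp for nowait                /* independent work overlaps the transfer */
            for (int i = 0; i < N; i++) independent_work1(&b[i]);

            #pragma omp for nowait
            for (int i = 0; i < N; i++) independent_work2(&b[i]);

            #pragma omp barrier                   /* the requests are surely posted by now */

            #pragma omp single                    /* wait for the transfers before compute2 */
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

            #pragma omp for
            for (int i = 0; i < N; i++) compute2(&c[i]);
        }
    }

Even this small sketch shows why the approach does not scale to complex applications: the correctness of every nowait and barrier must be reasoned about by hand.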

2.3 Tools

Our study is based on the infrastructure for trace-driven simulation developed at the Barcelona Supercomputing Center. The work in this thesis also included further development of these tools. Furthermore, we designed new Valgrind-based tools and integrated them into the already existing environment. All the development infrastructure designed throughout this thesis is open-source. In this Section, we describe only the tools used to build our ecosystem. We comment on other research/commercial tools related to our study in Chapter 7.

2.3.1 mpitrace

mpitrace [55] is a tracing library for instrumenting parallel execution. mpitrace intercepts calls to certain functions and emits to the trace the events that mark these occurrences. By default, it intercepts MPI functions, timestamps the occurrence of the calls and emits that information to the trace. In addition, mpitrace can record information collected from various hardware counters. The output of mpitrace is a tracefile that can be fed to Paraver for visualization or to Dimemas for further simulation of parallel execution. mpitrace is implemented as a light, easy to use, dynamic library. The instrumentation is activated by dynamically preloading the tracing library, requiring no changes to the original application’s code. The intercepting mechanism is very light, introducing an overhead of around 200 ns per intercepted call, and around 6 µs for reading hardware counters [64]. Furthermore, mpitrace uses conventional optimization techniques, such as: sampling (switching the instrumentation on/off), filtering (configuring which subset of the events should be tracked), and buffering (collecting events locally and flushing them to disk in bulk).

2.3.2 Valgrind

Valgrind [66] is a virtual machine that uses just-in-time (JIT) compilation techniques. The original code of an application never runs directly on the host processor. Instead, the code is first translated into a temporary, simpler, processor-neutral form called the Intermediate Representation (IR). The developer is then free to apply any transformation to the IR, before Valgrind translates the IR back into machine code and lets the host processor run it. Valgrind is a very fertile environment for developing new tools based on binary translation. The Valgrind framework is divided into three main areas: core and guest maintenance (coregrind), translation and instrumentation (LibVEX), and user instrumentation libraries. The user can exploit this modular organization and build a new tool as a plugin instrumentation library. Valgrind tools are most useful for inspecting memory-related issues in user-space programs; they are not suitable for time-specific issues or kernel-space instrumentation/debugging.

2.3.3 Paraver

Paraver [56] is a parallel program visualization and analysis tool. Paraver provides a qualitative perception of the application’s time-behavior by visual inspection. Moreover, it provides a quantitative analysis of the run. Paraver reveals bottlenecks in parallel execution and discovers areas of suboptimal performance. This functionality is essential in the process of optimizing parallel applications. Paraver is not tied to any programming model. It provides a flexible three-level model of parallelism onto which the parallel execution is mapped. Currently, Paraver supports MPI, OpenMP, OmpSs, pthreads and CUDA, as well as hybrid programming models such as MPI/OpenMP and MPI/OmpSs. Paraver is a flexible data browser that provides great analysis power. Paraver’s flexibility comes from the fact that the trace format is independent of the programming model. Thus, the visualization requires no changes in order to support some new programming model. In order to visualize the time-behavior of some new programming model, the programming model only needs to express its performance data in Paraver’s model-independent trace format. On the other hand, Paraver’s expressive power comes from the possibility to easily program different metrics. The tool provides filter and arithmetic operations on the essential metrics in order to generate new, derived and more complex metrics. This way, from one trace, the tool can provide different views with different performance metrics. To capture the experts’ knowledge, any view can be saved as a Paraver configuration file. Recomputing the view on new data is then as simple as loading a file.

2.3.4 Dimemas

Dimemas [45] is an open-source tracefile-based simulator for the analysis of message-passing applications on a configurable parallel platform. The Dimemas simulator reconstructs the time behavior of a parallel application on a machine modeled by a set of performance parameters. Dimemas also allows modeling aspects of the MPI library and the operating system. As input, Dimemas takes the trace produced by mpitrace and the configuration file of the modeled parallel machine. As output, Dimemas generates trace files processable by two performance analysis tools: Paraver and Vampir [65]. These visualization tools enable the user to qualitatively inspect the simulated

parallel time-behavior. The initial Dimemas architecture model considered networks of shared-memory processor (SMP) nodes. The Dimemas configuration file defines the modeled target machine, specifying the number of SMP nodes, the number of cores per node, the relative CPU speed, the memory bandwidth, the memory latency, etc. The simulation also allows different task-to-node mappings. The communication model consists of a linear model and nonlinear effects, such as network congestion. The configuration file parametrizes the network by specifying the bandwidth, the latency, and the number of global buses (modeling how many messages can concurrently travel throughout the network). Also, each processor is characterized by the number of input/output ports that determine its injection rate into the network. Dimemas simulates the trace by replaying computation records and re-evaluating communication records. A Dimemas trace consists basically of two types of records: computation records, specifying the duration of a computation burst; and communication records, specifying transfer parameters such as the sender, the receiver, the message size, the tag, whether the operation was blocking or non-blocking, etc. During the simulation, the computation records are simply replayed – each computation burst lasts as specified in the trace. Conversely, each communication record is re-evaluated – each transfer request is evaluated considering the specified configuration of the target machine, and the newly calculated transfer time is incorporated into the simulated run.

3 Motivation

Nowadays, programmers struggle to deliver high-performance parallel software. In order to solve some problem in parallel, the programmer must organize a coordinated execution of multiple processing units. Parallel programming models facilitate this process by offering various parallelization constructs. High-level parallelization constructs allow easy parallelization, but often result in unsatisfactory performance. On the other hand, low-level parallelization constructs potentially provide high performance, but demand a lot of programming effort. Consequently, practice shows that parallel executions often experience performance that is far from optimal. In the following Sections we further explain the inefficiencies of parallel programming. We also introduce some techniques for parallelization tuning.

3.1 MPI programming

MPI programming especially suffers from suboptimal performance, because the programmer must directly control various performance-related details, e.g. algorithm

complexity, load balance, cache conflicts, and noise. MPI provides very low-level mechanisms of explicit parallelization. The programmer must explicitly partition the workload among processes with separate address spaces. Also, the programmer must explicitly orchestrate communication among processes. Furthermore, in order to improve the performance, the programmer must optimize the application by hand. These optimizations may be very complex and, as a rule, significantly reduce the programmer’s productivity and the clarity of the code.

3.1.1 Bulk-synchronous programming

Pressured by the high overhead of MPI routines, MPI programmers have tried to minimize the number of transfers. Initially, MPI routines executed on the host processor (without offload). In order to reduce this overhead, programmers strove to generate fewer calls to MPI routines. This reasoning led to establishing bulk-synchronous programming [48] as a practical standard for MPI programming. Bulk-synchronous programming has one clear goal – maximizing the size of each MPI transfer and therefore reducing the overhead per sent byte. However, bulk-synchronous programming causes significant communication delays. Bulk-synchronous programming tries to maximize the uninterrupted computation in order to maximize the size of each MPI transfer. However, during the long computation phase, the network is idle. On the other hand, during the communication phase, the processors are idle, waiting for the communication to finish. These communication delays can significantly harm the overall performance, especially at large scale. To mitigate communication delays, the community makes an effort to accelerate interconnection networks. Faster networks reduce message transfer time and consequently shorten communication delays. However, modern networks achieve high bandwidth and low latency more often in benchmarks than in realistic workloads. Recent studies [35] show that for scientific codes, the networks are over-designed and underutilized. Therefore, there is a need for a new approach to solve the problem of communication delays. The goal of a new approach should be not to create new and faster cycles on the network, but rather to profit more from the already existing cycles.

3.1.2 Communication Computation Overlap

Communication-computation overlap is already recognized as a promising technique to reduce communication delays. Overlap enables processors to be busy computing and communicating at the same time. Thus, it allows message transfer time to be overlapped with useful computation. Overlap provides two potential benefits to parallel execution: 1) execution speedup; and 2) relaxation of network constraints without a consequent performance penalty. Overlap is usually achieved by software optimization techniques that leverage non-blocking MPI communication. The needed code refactoring decouples each blocking transfer into a non-blocking initialization and the transfer’s wait-request. Computation between the initialization and the wait-request is useful computation that is independent of the ongoing transfer. This computation can overlap and hide message transfer time. Therefore, the more independent computation the optimization exposes, the higher is the probability of overlap actually occurring. In the following Section, we describe this mechanism in more detail by applying the needed refactoring to a simple code.

The mechanism of overlap

Figure 3.1 illustrates the case of bulk-synchronous programming that suffers from lack of overlap. The source code (Figure 3.1a) consists of one loop. In each iteration of the loop, the program locally calculates the message (function work), and then communicates the message using a blocking transfer (function blocking_MPI_Sendrecv). Figure 3.1b illustrates the execution of this code on two MPI processes. The sections of computation and communication are disjoint, providing no overlap. Conversely, Figure 3.2 illustrates the overlapped version of the previous code. Figure 3.2a shows the required code refactoring. Function work splits into three smaller functions (pre_independent_work, transfer_dependent_work and post_independent_work). Function transfer_dependent_work is the only one that actually uses the message received in the previous iteration of the loop and produces the message that is sent in the current iteration of the loop. Functions pre_independent_work and post_independent_work are independent of the transferred buffers. Also, the blocking transfer from Figure 3.1a splits into the initialization of the transfer (nonblocking_MPI_Sendrecv) and the wait-request (wait_for_last_MPI_Sendrecv).

for ( ... ) {
    work();
    blocking_MPI_Sendrecv();
}

(a) code; (b) execution

Figure 3.1: The case of nonoverlapped MPI

for ( ... ) {
    pre_independent_work();
    if (!first_iteration)
        wait_for_previous_MPI_Sendrecv();
    transfer_dependent_work();
    nonblocking_MPI_Sendrecv();
    post_independent_work();
}
wait_for_last_MPI_Sendrecv();

(a) code; (b) execution

Figure 3.2: The case of overlapped MPI

Figure 3.2b shows that the sections of independent work (the green sections) now overlap with the message transfer, thus reducing the communication delay. However, this code restructuring significantly reduces the clarity and maintainability of the code. Two lines in the non-overlapped code now expand into seven lines that expose and overlap independent work. As a result, the obtained structure of the code significantly obscures its purpose. The obtained overlapped code has serious maintainability issues, causing very low programming productivity.

3.2 Our effort in tuning MPI parallelism

Our effort in tuning MPI parallelism targets eliminating communication and synchronization delays. Our goal is to mitigate these stalls by introducing an automatic technique that achieves maximal potential overlap and requires no code refactoring of the targeted application.

for ( ... ) {
    wait_for_previous_MPI_Sendrecv( 1/3 );
    work( 1/3 );
    nonblocking_MPI_Sendrecv( 1/3 );
    wait_for_previous_MPI_Sendrecv( 2/3 );
    work( 2/3 );
    nonblocking_MPI_Sendrecv( 2/3 );
    wait_for_previous_MPI_Sendrecv( 3/3 );
    work( 3/3 );
    nonblocking_MPI_Sendrecv( 3/3 );
}

(a) code; (b) execution

Figure 3.3: Overlap with chunks

3.2.1 Automatic overlap

The state-of-the-art overlapping techniques achieve limited overlap and require significant code restructuring. The presented by-hand optimization provides limited overlap, because it overlaps transfer time only with transfer-independent work. Moreover, the required code intervention demands significant refactoring that seriously harms programming productivity.

Our technique of automatic overlap

In this thesis, we design a technique that maximizes the potential overlap in an application. The technique consists of four mechanisms:

• Message chunking: Each original MPI message is partitioned into independent chunks consisting of one or more data elements.

• Advancing sends: Each chunk is sent as soon as it is produced.

• Double buffering: Two different buffers are used to distinguish the chunks being consumed in the current iteration from the incoming chunks to be consumed in the next iteration.

• Postponing receptions: The wait for each chunk is issued only at the moment when the chunk is really needed for consumption.

Compared to the overlapping technique illustrated in Figure 3.2, our technique additionally breaks the original message into chunks and then overlaps the communication

time of each chunk independently. Figure 3.3 illustrates overlap with chunks in the case when the original message is broken into three chunks. The idea is that, ideally, each third of the computation in some iteration produces one third of the message to be sent in that iteration. Similarly, each third of the received message in some iteration enables the execution of one third of the computation in the next iteration. Then, the first third of the message can be sent after 33% of the first iteration, and it must arrive before the start of the second iteration. Thus, the transfer of this first chunk can be overlapped with the remaining 66% of the computation in the first iteration. Therefore, in our technique there is no need for extracting independent work – work that does not produce/consume the message. Instead, this technique overlaps the transfer of one part of the original message with the production/consumption of the rest of that message. Also, it is important to note that the patterns by which each MPI process locally computes on transferred data can seriously limit the potential for overlap. The overlap of chunks comes from the potential of the execution to advance partial sends and postpone partial receptions. In the presented example, the first chunk is produced after 33% of the iteration, allowing it to be overlapped with the remaining 66% of the computation in that iteration. However, if that first chunk were produced only at 80% of the iteration, its transfer could overlap with only 20% of the computation. Thus, the production/consumption patterns can seriously limit the potential for advancing/postponing message chunks. Therefore, our study of automatic overlap pays special attention to the internal computation patterns of applications. The goal of this thesis is to explore techniques that achieve the chunked overlap from Figure 3.3b without degrading the maintainability of the code. The code restructuring that is needed to achieve the chunked overlapped execution seriously hurts the maintainability of the code (Figure 3.3a). Thus, our goal is to explore techniques that will achieve the chunked overlapped execution (Figure 3.3b) but still preserve the clarity of the code from Figure 3.1a.

Our contribution in exploring automatic overlap

This thesis tries to design and evaluate an automatic technique that, using specialized hardware support, extracts the maximal potential overlap without the need to

restructure the source code of an application (Chapter 5). In Section 5.2, we further describe our technique of automatic overlap. Furthermore, we introduce a new overlapping technique that we named speculative dataflow. Speculative dataflow is a speculative technique that, using specialized hardware support, implements automatic overlap without any restructuring of the original MPI application. We demonstrate the feasibility of this speculative dataflow (Section 5.3) and evaluate its potential in the case of real scientific MPI applications (Section 5.4). Also, throughout our study, we designed a development environment that can be very useful in further studies of overlap (Section 4.2). Given only the executable of a legacy MPI application, the environment automatically evaluates the potential overlap. Moreover, the visualization support allows comparison between the original and the overlapped execution of the targeted application. Using our environment, a programmer can quickly evaluate the potential benefits of overlap in any application. Thus, prior to any implementation effort, the programmer can estimate the potential benefits of the intended overlapping technique and decide whether the implementation is worth the effort. Furthermore, using the visualization support, the programmer can identify bottlenecks of the intended implementation and explore solutions to overcome them.

3.3 MPI/OmpSs programming

As already mentioned in Section 2.2.3, MPI/OmpSs generates very efficient and flexible execution. OmpSs enables out-of-order execution of tasks within each MPI process. Moreover, the OmpSs runtime can dynamically schedule MPI transfers in order to efficiently hide communication delays. Also, OmpSs can expose additional parallelism within each MPI process, therefore increasing the overall scalability of the parallel execution.

3.3.1 Hiding communication delays

Figure 3.4 illustrates another case of non-overlapped pure MPI execution. In each iteration of the outer loop, the program computes the message and then sends the message using a blocking MPI call (blocking_MPI_Sendrecv). The program produces the MPI message in four independent iterations of the inner loop (four calls to function work).

for ( ... ) {
    for ( 0:4 ) {
        work();
    }
    blocking_MPI_Sendrecv();
}

(a) code; (b) execution

Figure 3.4: Example of nonoverlapped MPI

#pragma omp task ...
pre_independent_work();
#pragma omp task ...
transfer_dependent_work();
#pragma omp task ...
post_independent_work();
#pragma omp task ...
blocking_MPI_Sendrecv();

for ( ... ) {
    for ( 0:4 ) {
        pre_independent_work();
        transfer_dependent_work();
        post_independent_work();
    }
    blocking_MPI_Sendrecv();
}

(a) code; (b) execution; (c) execution with noise affecting traffic

Figure 3.5: Example of overlapped MPI/OmpSs

Without any code restructuring, this code achieves no overlap (Figure 3.4b). In order to port this MPI code to MPI/OmpSs (Figure 3.5), the programmer must do a minor refactoring. The programmer must separate the code that actually computes the message (transfer_dependent_work) from the independent work (pre_independent_work and post_independent_work). No refactoring of the transferring routine (blocking_MPI_Sendrecv) is needed. Finally, the programmer must encapsulate all the presented functions into tasks (Figure 3.5a).

Then, the OmpSs runtime can dynamically schedule the instantiated tasks in order to achieve overlap (Figure 3.5b). The runtime identifies the task with an MPI call (blocking_MPI_Sendrecv) and sets it as the task with outstanding priority. Furthermore, the runtime identifies all tasks that give a dependency to the communication task and sets their priority to high. Thus, the OmpSs runtime first executes the tasks that generate the message (red tasks). After these tasks finish, the runtime schedules the task with the MPI call. As the communication task blocks waiting for the message, the runtime preempts it and schedules the tasks with independent work (green tasks). This way, the runtime dynamically overlaps the message transfer time with the computation that executes independent work. Furthermore, MPI/OmpSs execution is more tolerant to external noise. Figure 3.5c shows the case when one of the messages (the red transfer) becomes late (e.g. due to some contention in the network). Since the message is late, the second process cannot start the second iteration with instances of transfer-dependent tasks (red tasks). However, the runtime can schedule the execution of independent work at the beginning of iteration 1. Thus, the lower process starts its iteration 1 with independent work, and as soon as the message arrives, it switches to computing its transfer-dependent work. This causes the lower process in iteration 1 to be late sending the message needed by the upper process in iteration 2. However, this latency is smaller than the originally introduced latency. In the following iterations, the message latency relaxes further, so both MPI processes reach a stable state at the beginning of iteration 4.

3.3.2 Additional parallelism within an MPI process

Moreover, OmpSs provides additional parallelism within each MPI process. If the target machine provides multiple cores per MPI process, tasks within each MPI process can execute concurrently. Figure 3.6 shows how the presented MPI/OmpSs code can execute if each MPI process is parallelized across two cores. For example, let us consider the case when all red tasks are mutually independent, and all green tasks are mutually dependent. Thus, instances of red tasks can execute in parallel, significantly reducing the total execution time. On the other hand, due to data dependencies, instances of the green task must be serialized. This example also illustrates how hard it is to anticipate how an MPI/OmpSs application will execute on a different target machine.

(timelines of two MPI processes, each running on two cores, cpu #0 and cpu #1, over iterations 0 and 1)

Figure 3.6: MPI/OmpSs execution with multiple cores per MPI process

Concurrency of red tasks accelerates the computation local to each MPI process and consequently reduces the total computation time. However, accelerated MPI processes put more pressure on the interconnect, whose speed did not scale accordingly. Transfer-independent work (green tasks) can overlap communication delays. Still, since green tasks cannot execute in parallel, they can overlap communication delays only for one core per MPI process. All these interrelated execution properties make it hard to predict the performance of an MPI/OmpSs application on a parallel platform.

3.4 Our effort in tuning MPI/OmpSs parallelism

This thesis also explores techniques for tuning MPI/OmpSs parallelism. We want to investigate techniques to hide stalls in MPI/OmpSs execution, especially in the case of a highly parallel target platform. First, we explore techniques to identify parallelization bottlenecks in MPI/OmpSs execution. Instrumenting an MPI/OmpSs code, our technique pinpoints the critical code section – the code section whose optimization yields the highest overall speedup. Second, we explore strategies for exposing and increasing OmpSs parallelism in MPI applications. We describe an iterative approach to test various task decompositions of the code and select the one that exposes the highest parallelism for the selected target machine.

3.4.1 Identifying parallelization bottlenecks

MPI/OmpSs exposes very irregular parallelism, making it hard to identify the best way to optimize the execution. In the example from Figure 3.6, there are many ways to accelerate the parallel execution. One way is to increase the number of cores per MPI process and increase the concurrency of execution of the red tasks. On the other hand, using faster cores would accelerate both red and green tasks. A faster network would reduce transfer time. Manual code refactoring could target red tasks (because they are computation intensive) or green tasks (because they are serialized). Dedicated accelerators may execute some sections of the code. However, even in this simple code, each of these optimizations would significantly affect the execution and change task scheduling in a very unpredictable way. Thus, it is very hard to estimate the potential benefits of these optimization techniques and identify which technique is the most cost-effective. We further illustrate this issue in the motivating example from Section 6.1.1. This thesis tries to gain a better understanding and control of the parallelism exposed by MPI/OmpSs, to evaluate how MPI/OmpSs applications would execute on future machines and to predict the execution bottlenecks that are likely to emerge. We explore how MPI/OmpSs applications could scale if the target parallel machine offers hundreds of cores per node. Furthermore, we investigate how this high parallelism within the node would reflect on the constraints of the interconnect. We especially focus on identifying critical code sections in MPI/OmpSs. We explore techniques to quickly evaluate, for a given MPI/OmpSs application and the selected target machine, which code section should be optimized in order to gain the highest performance benefits. Throughout our study of parallelization bottlenecks, we developed mpisstrace (Section 4.3) – a development environment that enables trace-based simulation of MPI/OmpSs execution. The environment simulates MPI/OmpSs execution on a configurable parallel platform, allowing detailed analysis of various influences. Using our environment, a programmer can easily explore the scalability of an MPI/OmpSs code. Furthermore, using the visualization support, a programmer can qualitatively inspect parallelization bottlenecks. Finally, in Section 6.1 we demonstrate how a programmer can use the environment to automatically identify the critical code section – the code section whose optimization would yield the highest benefit to the overall execution

49 time.

3.4.2 Searching for the optimal task decomposition

Another way to tune parallelism in MPI/OmpSs is to select a different task decomposition. Figure 3.6 shows that executing each MPI process on two cores already generates execution stalls. One way to reduce these stalls could be to refine the decomposition in order to avoid the serialization of the green tasks. Another way would be to refine the red tasks in order to extract more computation that is independent of the message transfer. However, testing different decompositions is very hard. To test a decomposition, a programmer must generate correct MPI/OmpSs code that implements that decomposition and then measure the obtained parallelism. We further illustrate this issue in the motivating example from Section 6.2.1.

This thesis studies techniques to quickly explore the potential parallelism in applications. We provide mechanisms for the programmer to easily evaluate the potential parallelism of any task decomposition. Furthermore, we describe an iterative trial-and-error approach to search for a task decomposition that will expose sufficient parallelism for a given target machine (Section 6.2). Finally, we explore the potential of automating the iterative approach by capturing the programmers' experience into an expert system that can autonomously lead the process of finding efficient task decompositions (Section 6.3).

In our study of potential parallelism in applications, we designed Tareador (Section 4.4) – a tool that helps port MPI applications to the MPI/OmpSs programming model. Tareador provides a simple interface to propose a decomposition of a code into OmpSs tasks. Then, based on the proposed decomposition, Tareador dynamically identifies data dependencies among the annotated tasks and automatically estimates the potential OmpSs parallelism. Furthermore, Tareador gives additional hints on how to complete the process of porting the application to OmpSs. Using Tareador, through a top-to-bottom trial-and-error iterative approach, a programmer can test various task decompositions and find one that exposes sufficient parallelism to efficiently use the target parallel machine. Also, we designed an autonomous driver that runs Tareador to automatically explore potential task decompositions of an application.

4 Infrastructure

The research in this thesis is based on trace-driven simulation using the tools developed at BSC (Section 2.3). The initial trace-driven infrastructure includes mpitrace, Dimemas and Paraver. The mpitrace library is a dynamic library that intercepts MPI-related events and emits them to the trace. Paraver is a performance analysis tool that can visualize the traces obtained with mpitrace. Furthermore, Dimemas can replay the traces obtained with mpitrace to simulate the execution with a changed configuration of the target parallel machine.

In conventional trace-driven simulation, simulating a new execution feature is done completely in the simulator. The tracing library instruments the execution of the code and inserts into the trace the events that describe that execution. Then, the simulator consumes the obtained trace, replaying the collected events and calculating new time-stamps according to the specified configuration of the target parallel machine. Thus, accounting for a new execution feature is done entirely in the simulator – the simulator takes the new feature into account when calculating the time-stamps of the simulated execution.

However, the conventional methodology makes it impossible to simulate low-level architectural features of a highly parallel target machine. The problem arises from the need for simulators to be sequential. For instance, when simulating MPI execution, any change in the local time-stamps of one MPI process may change the order of messages on the network. Since the network is a shared resource, the simulator must explicitly synchronize all MPI processes on every network access. Thus, the simulator can hardly be parallelized. On the other hand, simulating numerous MPI processes results in a very long simulation. Thus, the simulation time makes it impractical to simulate a low-level architectural feature of a big parallel system. To overcome this problem, we had to design a different simulation methodology that is based on modifying the trace in such a way that it models the new feature to be explored.

4.1 Simulation-aware tracing

Using the state-of-the-art tools for trace-based simulation, we developed a new simulation technique that models the simulated feature in an earlier part of the methodology – during the tracing of the application. As in conventional trace-driven simulation, our tracer instruments the execution and generates the trace of the real run. We will call this trace the authentic trace. However, apart from the records needed for the authentic trace, the tracer emits to the trace additional records related to the studied feature. Then, from the authentic trace and the additional events, we derive the artificial trace – what would be the trace of the potential execution of the same application if it included the studied feature. Then, the unchanged replay simulator replays both traces, providing a comparison between the actually executed run and the potential run that includes the studied feature.

Our methodology allows simulating the effect of a low-level feature on a highly parallel machine. By making the tracing process responsible for modeling a new execution feature, our methodology shifts the major computation effort from the simulation phase to the tracing phase. Since each MPI process is traced concurrently, the feature-modeling computation is naturally parallelized across MPI processes. The parallelization of the feature-modeling effort allowed us to make very complex simulations – simulations that model very low-level features on very large-scale parallel machines.

Figure 4.1: Simulating different implementations of broadcast. [The figure shows Dimemas trace excerpts for four processes: (a) original trace – broadcast implemented with a collective call; (b) changed trace – broadcast implemented with point-to-point calls (one-to-many); (c) changed trace – broadcast implemented with point-to-point calls (cyclic); (d) changed trace – broadcast implemented with point-to-point calls (logarithmic).]

In order to instrument low-level properties of the execution (such as memory accesses), we often designed tracers based on binary translation tools (Valgrind tools). These tracers can instrument the execution at the level of a single instruction. The effect of the new feature on every single instruction is calculated and incorporated into the artificial trace of each MPI process. Finally, the simulator replays the trace, propagating the effect of the new feature across the whole MPI execution.

4.1.1 Illustration of the methodology

This Section illustrates one simple application of our methodology. We present an example of using this methodology to test various implementations of the collective broadcast call.

A Dimemas trace consists of computation bursts and communication requests. A computation burst specifies only the duration of the computation that an MPI process executes sequentially. A communication request specifies the communication among processes, marking the processes that are involved in the communication and the parameters of the transfer. A communication request can represent a point-to-point or a collective communication.

Figure 4.1 illustrates our methodology on an example that explores the effects of different collective communication patterns. Figure 4.1a illustrates the authentic trace collected by the mpitrace library. All the processes compute for some time and then communicate using a broadcast call. After the communication, the processes continue computing. From the authentic trace, we generate offline various artificial traces that use different implementations of the broadcast. Each artificial trace is a copy of the authentic trace with the broadcast event replaced by an equivalent point-to-point implementation. Thus, the first implementation uses the one-to-many transfer pattern (Figure 4.1b), the second uses the cyclic transfer pattern (Figure 4.1c) and the third uses the logarithmic transfer pattern (Figure 4.1d). Then, all four traces (one authentic and three artificial ones) are replayed with the legacy Dimemas simulator, showing the potential performance of all of the implementations.

In the following Sections, we introduce some advanced simulation environments that we developed using the described methodology. Section 4.2 describes the environment for identifying the potential communication/computation overlap in applications. Section 4.3 describes the environment that replays MPI/OmpSs execution. Finally, Section 4.4 describes Tareador – the environment that identifies the potential parallelism in sequential codes.
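As a concrete illustration of this offline trace rewriting, the following sketch generates the point-to-point records of the one-to-many variant (Figure 4.1b) from a broadcast rooted at process 0. The record strings and the emit_record() helper are simplifications introduced for the example, not the actual Dimemas trace syntax.

    #include <stdio.h>

    /* Hypothetical, simplified record emitter; the real Dimemas trace
     * format is richer than the strings printed here. */
    static void emit_record(int process, const char *record)
    {
        printf("PROCESS %d: %s\n", process, record);
    }

    /* Rewrite one broadcast rooted at 'root' into the one-to-many
     * point-to-point pattern of Figure 4.1b. */
    static void rewrite_broadcast_one_to_many(int root, int nprocs)
    {
        char buf[64];
        for (int p = 0; p < nprocs; p++) {
            if (p == root)
                continue;
            snprintf(buf, sizeof(buf), "Send ( -> %d )", p);
            emit_record(root, buf);   /* the root sends to every other process */
            snprintf(buf, sizeof(buf), "Recv ( %d -> )", root);
            emit_record(p, buf);      /* every other process receives from the root */
        }
    }

    int main(void)
    {
        rewrite_broadcast_one_to_many(/* root */ 0, /* nprocs */ 4);
        return 0;
    }

The cyclic and logarithmic variants of Figures 4.1c and 4.1d differ only in which (sender, receiver) pairs are emitted.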

4.2 Framework to identify potential overlap

For a programmer that wants to increase the overlap in his application, it is very important to anticipate the benefits of the targeted technique, and thus decide in advance whether implementing that technique is worth the effort. Refactoring applications to achieve more overlap is an optimization that can be very time consuming. Thus, in order to engage in this effort, the programmer should be convinced that the optimization will yield a significant performance improvement. Therefore, it would be very useful to have an environment that automatically estimates the performance benefits of the targeted overlapping technique.

To respond to this demand, we designed a simulation framework that automatically quantifies the potential benefits of overlap in scientific MPI applications. To our knowledge, this is the first work that uses a simulation methodology to study overlap. The simulated overlapping technique works at the MPI level, by automatically capturing all MPI messages and trying to overlap those messages with useful computation of the application. The overlapping technique consists of the following mechanisms: message chunking, advancing sends, double buffering, and postponing receptions (as described in Section 3.2.1). Moreover, the programmer can propose a custom overlapping technique and test how much an application would benefit from it.

Our framework is an automatic and easy-to-use approach to obtain a rich simulation output that can significantly increase the overall understanding of communication/computation overlap. The simulation framework allows us to get predictions more quickly, and furthermore to evaluate the impact of different network properties. The main advantage of this approach is that it automatically predicts the benefits of overlap in scientific MPI applications, without the need to know or understand the application's source code. The second advantage is that our simulation framework can visualize the simulation's output, allowing us to qualitatively inspect the differences between the non-overlapped and overlapped executions. Using this feature, a programmer can identify bottlenecks in the overlapping technique and try to avoid them.

The simulation framework (Figure 4.2) is based on the integration of three widely used tools: the binary translation tool Valgrind, the network simulator Dimemas, and the visualization tool Paraver. An MPI application executes in parallel, with each process running on its own Valgrind virtual machine. Each of these virtual machines implements an instance of the designed tracing tool. The tool instruments the original application and extracts the trace of the authentic (non-overlapped) execution, while at the same time it generates what would be the trace of the artificial (overlapped) execution. Then, the Dimemas simulator uses the traces obtained from each MPI process and offline reconstructs the application's time-behavior on a configurable parallel platform. Finally, Paraver visualizes the obtained time-behaviors, allowing us to qualitatively study the effects of the communication-computation overlap.

The second output of the framework is the production/consumption patterns of the messages.


Figure 4.2: The framework for studying overlap integrates Valgrind, Dimemas, and Paraver.

As explained in Section 3.2.1, production/consumption patterns directly determine the potential for advancing partial sends and postponing partial receives, thus strongly influencing the overall potential for overlap. Hence, our study must relate the potential overlap in the application to the application's computation patterns. Therefore, our framework also records the patterns by which each process produces the buffers that it sends and consumes the buffers that it receives.

4.2.1 Implementation details

Our major implementation effort consisted of designing a Valgrind tracing tool. The tracer leverages two key Valgrind functionalities for dynamic analysis of applications: wrapping function calls and tracking memory accesses (loads and stores). The tool wraps each MPI call to read the parameters of the transfer and tracks each memory access to monitor accesses to the transferred data. Furthermore, the tool needs additional data structures to keep track of the transfers' state and the production/consumption progress of every chunk. Finally, the tracer obtains time-stamps by scaling the number of executed instructions by the average MIPS rate observed in a real run.

In every run, the tracing tool generates one non-overlapped (authentic) and two overlapped (artificial) Dimemas traces. The non-overlapped trace describes the original execution of a legacy code by emitting two types of Dimemas trace records: computation records specifying the length of the original computation bursts; and communication records specifying the MPI message parameters.

In addition to that tracing methodology, the first overlapped trace identifies, within the original computation bursts, the points where partial data can be sent or is needed. Then it automatically splits each original message into various chunks and injects the chunked communication requests after the identified data is fully produced (for sends) and where it is actually first needed (for receptions). Furthermore, in order to stress the influence of production/consumption patterns, the tool generates the second overlapped trace, which assumes that the application's production/consumption patterns are ideal. This tracing methodology models an ideal computation pattern by uniformly distributing the chunked communication requests throughout the original computation bursts.

To model overlapped execution, the Valgrind tracer must intercept and process each of the application's load and store accesses. For each load in the application, the tool checks whether the requested element belongs to some incoming chunk that has not been received so far. If so, the tracer emits a Dimemas wait-for-receive record for the corresponding chunk and marks that chunk as already received. Therefore, the tool guarantees that the wait for each incoming chunk is at the point where that chunk is needed for the first time. Similarly, for each store in the application, the tool checks whether the accessed element belongs to some chunk to be sent. If so, it refreshes the time of the chunk's last update. The tool uses that information in the trace generation process described below.

Also, the tracer tool has to intercept each MPI call in order to reinterpret the original communication using the new chunks. When the tracer intercepts a receive call, it emits a Dimemas non-blocking-receive record for each chunk of the original message. This way, the tracer initiates the transfers of chunks and proceeds, waiting for the chunks to be received as late as possible – when those chunks are actually needed for consumption. On intercepting an MPI send call, the tool consults the time of the last update of every chunk in the message – the information generated during the production tracking. Using this data, the tracer emits a Dimemas send record for every chunk at the moment of the last update of that chunk, therefore generating a trace in which every chunk is sent at the exact moment when its final version is produced.

The tracer breaks each collective communication into the corresponding sequence of point-to-point transfers, and then overlaps each obtained transfer independently. Note that collective communication operations are performed in Dimemas without assuming any collective hardware support on the network.
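As an illustration of these per-access checks, the sketch below assumes hypothetical chunk-bookkeeping structures and trace emitters; in the real tool the two callbacks are fired by Valgrind's instrumentation of loads and stores.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical bookkeeping for one chunk of an MPI message. */
    typedef struct {
        uintptr_t base;         /* first byte of the chunk              */
        size_t    size;         /* chunk length in bytes                */
        bool      received;     /* already waited for (incoming chunk)  */
        uint64_t  last_update;  /* time-stamp of last store (outgoing)  */
    } chunk_t;

    /* Assumed helpers provided elsewhere in the tracer. */
    chunk_t *find_incoming_chunk(uintptr_t addr);
    chunk_t *find_outgoing_chunk(uintptr_t addr);
    void     emit_wait_for_receive(const chunk_t *c);
    uint64_t current_timestamp(void);

    /* Called for every load executed by the application. */
    void on_load(uintptr_t addr)
    {
        chunk_t *c = find_incoming_chunk(addr);
        if (c && !c->received) {
            emit_wait_for_receive(c);  /* wait exactly where the chunk is first needed */
            c->received = true;
        }
    }

    /* Called for every store executed by the application. */
    void on_store(uintptr_t addr)
    {
        chunk_t *c = find_outgoing_chunk(addr);
        if (c)
            c->last_update = current_timestamp();  /* record production progress */
    }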

The tracer obtains time-stamps in terms of the number of instructions executed in computation bursts outside of MPI calls. To represent time, the number of executed instructions is scaled by the average MIPS rate observed in a real run. The adopted notion of time obviously ignores many real-world factors. The first is that it does not count time inside MPI routines. That is done intentionally, because the MPI overheads could significantly exceed the lengths of the computation phases that we are interested in. Also, the adopted model excludes the effects of memory hierarchy misses (cache, TLB, ...) and context switches. This design decision keeps the model simple and easily controllable. It captures the application properties of interest and isolates the system from undesirable and hard-to-control effects. However, the model can be extended in the future to include additional parameters.
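For instance, under this model a computation burst of 250 million instructions, traced in a run that sustained an average of 2,500 MIPS, would be assigned a duration of 250·10^6 / (2,500·10^6) = 0.1 s; the numbers here are purely illustrative.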

Recording message production/consumption

As already explained, the potential overlap depends strongly on the message production/consumption patterns. To study production patterns, the tracer must record all stores to a buffer that is to be sent. We adopt as one production interval of a buffer the execution between two consecutive send calls of that buffer. During this interval, the tracer records each store to the communicated buffer and marks the store's relative time since the last send call. At the end of the interval, the tracer flushes the collected records, normalizing the relative time of each store by the duration of the interval. The obtained records now contain the information about all the writes into the production buffer and the percentage of production-interval progress at which each of those writes occurred.

Similarly, to study consumption patterns, the tracer has to record all loads from the received buffer. We adopt as one consumption interval of a buffer the execution between two consecutive receive calls of that buffer. The loads from the buffer are examined, and a list of all time-stamped loads is generated. Finally, the obtained list contains all the loads from the received message and the percentage of the computation phase progress at which each of them occurred.
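A minimal sketch of the production-interval bookkeeping described above, with hypothetical structures: every store is logged relative to the previous send of the buffer and, at the next send, normalized by the interval length (consumption intervals are handled symmetrically with loads and receive calls).

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical per-buffer production interval. */
    typedef struct {
        double *store_times;      /* time of each store since the last send */
        size_t  count, cap;
        double  interval_start;   /* time of the previous send call         */
    } production_interval_t;

    /* Record one store into the tracked buffer. */
    void record_store(production_interval_t *pi, double now)
    {
        if (pi->count == pi->cap) {
            pi->cap = pi->cap ? 2 * pi->cap : 64;
            pi->store_times = realloc(pi->store_times,
                                      pi->cap * sizeof(*pi->store_times));
        }
        pi->store_times[pi->count++] = now - pi->interval_start;
    }

    /* At the next send call: normalize each store by the interval length,
     * i.e. report at which percentage of the production interval it occurred. */
    void flush_interval(production_interval_t *pi, double now)
    {
        double length = now - pi->interval_start;
        for (size_t i = 0; i < pi->count; i++)
            printf("store at %.1f%% of the production interval\n",
                   100.0 * pi->store_times[i] / length);
        pi->count = 0;
        pi->interval_start = now;   /* a new interval starts at this send */
    }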

Predicting future sends

In order to monitor the production of data that will be sent, it is crucial to know the address of the sending buffer before the actual send is issued. Without this knowledge, the tracer cannot know the bounds of the buffer that should be instrumented. To tackle this issue, we determine future transfers by assuming deterministic execution of the studied applications. More precisely, we assume that each MPI process has a deterministic sequence of MPI transfers and memory allocation requests. Then, two consecutive tracings of the program are used to obtain the final traces. The first pass just marks the sequence of MPI transfers that occurred, while the second pass assumes repetition of that sequence. The first pass collects only temporal traces of MPI transfers and all memory allocations. The second pass reads the sequence of transfers from the previous execution. Only these transfers, with their parameters, are now predicted to occur. Also, during the second pass, memory allocation requests and their returned addresses are intercepted and compared to the allocations of the first pass. Based on a comparison of the returned addresses, the second pass dynamically updates the addresses of the expected transfers. This algorithm guarantees that the addresses of the future transfers are determined early enough so the tracer can monitor the entire production of the transferred buffer.
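The address-remapping step of the second pass could look roughly like the sketch below; the structures that hold the first-pass log are hypothetical and error handling is omitted.

    #include <stddef.h>
    #include <stdint.h>

    /* One memory allocation, as logged during the first pass and
     * re-observed during the second pass. */
    typedef struct {
        uintptr_t old_base;   /* address returned in the first pass  */
        uintptr_t new_base;   /* address returned in the second pass */
        size_t    size;
    } alloc_map_t;

    /* One MPI transfer predicted from the first-pass log. */
    typedef struct {
        uintptr_t buffer;     /* first-pass address of the send buffer */
        size_t    size;
        int       dest;
    } predicted_send_t;

    /* Called in the second pass whenever the matching allocation request
     * is intercepted: remember how its address changed. */
    void on_alloc_second_pass(alloc_map_t *entry, uintptr_t new_base)
    {
        entry->new_base = new_base;
    }

    /* Translate the first-pass address of a predicted send buffer into its
     * second-pass address, so the production of the buffer can be monitored
     * before the send is actually issued. */
    uintptr_t remap_buffer(const predicted_send_t *s,
                           const alloc_map_t *maps, size_t nmaps)
    {
        for (size_t i = 0; i < nmaps; i++) {
            if (s->buffer >= maps[i].old_base &&
                s->buffer <  maps[i].old_base + maps[i].size)
                return maps[i].new_base + (s->buffer - maps[i].old_base);
        }
        return s->buffer;   /* buffers outside tracked allocations keep their address */
    }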

4.3 Framework to replay MPI/OmpSs execution

For novel programming models, such as MPI/OmpSs, it is very important to prove that the programming model can efficiently utilize future parallel machines. MPI/OmpSs mixes message passing and dataflow dynamic scheduling, exposing very irregular parallelism. Thus, the programmer can hardly anticipate how an MPI/OmpSs execution would perform on a different parallel machine. Moreover, it would be very hard to estimate the potential benefits of optimizing some section of the code or deploying some part of the application on dedicated accelerators. Therefore, it is very hard to estimate the efficiency of MPI/OmpSs on new heterogeneous parallel machines. To tackle this problem, we designed a trace-based simulation framework to replay MPI/OmpSs execution on a parallel machine. The legacy MPI simulation environment (mpitrace, Dimemas, Paraver) allows replaying MPI execution while varying the properties of the target parallel machine.

We extended the legacy environment, enabling it to replay MPI/OmpSs execution. This allowed us to conduct various parametric studies of MPI/OmpSs.

Our environment deals with very complex parallel execution, offers fast and flexible simulation and provides a rich output. To our knowledge, this is the first simulation environment that can simulate parallel execution integrating MPI with a dataflow programming model. The tracing part of the simulation is very lightweight, reporting to the trace only events related to MPI and OmpSs activity. On the other hand, the post-mortem simulation can process 4,000 tasks in less than 10 seconds, and 2 million tasks in about 10 hours. Also, the environment provides conventional techniques for speeding up trace-driven simulation, such as filtering and sampling. Finally, we believe that the biggest contribution of the environment is its flexibility and rich output. The user can easily change the targeted platform and visually (qualitatively) inspect how the dataflow parallelism will react. In Section 6.1, we show some of the possible usages of the environment. More specifically, we illustrate three usages of the environment:

1. Educating newcomer programmers: The programmer can use the environment to easily and quickly simulate MPI/OmpSs execution on a configurable parallel platform, and study in detail various influences on the dataflow execution. In particular, the programmer can use the visualization support to qualitatively inspect the simulated time-behaviors and learn more about the mechanisms of dataflow parallelism.

2. Identifying potential parallelism: The programmer can use the environment to automatically simulate how his application would perform on target machines that provide different degrees of parallelism.

3. Identifying parallelization bottlenecks: The programmer can use the environment to automatically identify the critical section of the code – the section whose optimization would yield the highest benefit to the overall execution time.


Figure 4.3: The environment integrates Mercurium code translator, MPISS tracer, tasks extractor, Dimemas simulator and Paraver visualization tool.

4.3.1 Implementation details

The main idea of our methodology is 1) to instrument the execution without OmpSs parallelism; and then 2) to reconstruct what would be the execution with OmpSs parallelism. The input code executes without task parallelism (ignoring the OmpSs pragma annotations). Thus, the authentic trace describes the execution without OmpSs parallelism. Still, at runtime, we evaluate the pragma annotations and incorporate the evaluation results into the trace of the run. Then, offline, from the collected enriched authentic trace, we reconstruct the final artificial trace – what would be the trace of the potential run with OmpSs parallelism. Finally, we replay the reconstructed artificial trace, scheduling the OmpSs tasks in a dataflow manner.

The environment (Figure 4.3) consists of a Mercurium-based code translator, the MPISS tracer, the tasks extractor, the Dimemas replay simulator and the Paraver visualization tool. A Mercurium-based tool translates the input OmpSs code into a sequential code with inserted functions that annotate the pragma primitives. The obtained code executes sequentially. While tracing that execution, the MPISS tracer emits to the trace additional user events that signal OmpSs primitives. From the collected trace, the tasks extractor reconstructs the trace of the potential OmpSs execution. Then, Dimemas replays the reconstructed trace to simulate the potential OmpSs execution. Finally, Paraver can visualize the simulated execution.

Note that simulating MPI/OmpSs codes differs only in the tracing, where the environment traces the corresponding MPI execution and then extracts OmpSs tasks.

The input code is a correct MPI/OmpSs code. The application executes without OmpSs parallelism (as a pure MPI execution). Therefore, the authentic trace describes the corresponding MPI execution. On the other hand, offline processing of the trace extracts the OmpSs parallelism and generates the final artificial trace that describes the MPI/OmpSs execution. Finally, the artificial MPI/OmpSs trace is fed to Dimemas in order to reconstruct the potential parallel time-behavior. For reasons of simplicity, in the following subsections we further describe the methodology steps for simulating OmpSs execution (without MPI parallelism).

Code translator

Our Mercurium-based tool forces in-order execution of tasks (Figure 4.4). The translated code is a sequential code with empty functions annotating occurrences of OmpSs primitives. Thus, for each section of the execution, the introduced functions specify: 1) to which task that section belongs; and 2) what the input and output parameters of that section are. The translated code is executed sequentially and traced with the MPISS tracing library.

(a) input code:

    #pragma omp task input(A) output(B)
    void compute(float *A, float *B) {
        ...
    }

    int main () {
        ...
        compute(a,b);
        ...
    }

(b) translated code:

    void compute(float *A, float *B) {
        ...
    }

    int main () {
        ...
        mpisstrace_start_task("compute");
        mpisstrace_input_par("A", a);
        mpisstrace_output_par("B", b);
        compute(a,b);
        mpisstrace_end_task();
        ...
    }

Figure 4.4: The code translation inserts functions that signal OmpSs pragma annotations.

Tracer (MPISS tracing library)

The most important feature of the tracer is to detect dependencies at run-time and mark them in the trace (Figure 4.5). In addition to the functionality of the MPI tracing library, the MPISS tracer intercepts the functions inserted during code translation. On intercepting mpisstrace_start_task (mpisstrace_end_task), the tracer emits to the trace an event marking the start (end) of a task. Thus, the collected trace also carries the information of which section of the trace belongs to which task. More importantly, on intercepting the functions mpisstrace_input_par and mpisstrace_output_par, the MPISS tracer calculates data dependencies among the identified tasks and emits that information into the trace. Thus, the trace obtained with the MPISS tracer specifies how the sequential execution is decomposed into OmpSs tasks and what the data dependencies among the instantiated tasks are.
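A minimal sketch of how this detection could be organized: a table remembers, for every data object, the task that last produced it, so an intercepted input parameter whose producer differs from the current task yields a dependency event. The structures and the emitter are hypothetical simplifications of the MPISS tracer.

    #include <stddef.h>

    #define MAX_OBJECTS 1024

    /* Which task last wrote each data object (hypothetical bookkeeping). */
    typedef struct {
        const void *object;
        int         producer_task;
    } last_writer_t;

    static last_writer_t writers[MAX_OBJECTS];
    static size_t        nwriters;
    static int           current_task;

    void emit_dependency_event(int from_task);   /* assumed trace emitter */

    /* Intercepted mpisstrace_start_task(). */
    void on_start_task(int task_id)
    {
        current_task = task_id;
    }

    /* Intercepted mpisstrace_input_par(): a read of an object produced by
     * another task becomes a dependency event in the trace. */
    void on_input_par(const void *object)
    {
        for (size_t i = 0; i < nwriters; i++)
            if (writers[i].object == object &&
                writers[i].producer_task != current_task) {
                emit_dependency_event(writers[i].producer_task);
                return;
            }
    }

    /* Intercepted mpisstrace_output_par(): the current task becomes the
     * last producer of the object. */
    void on_output_par(const void *object)
    {
        for (size_t i = 0; i < nwriters; i++)
            if (writers[i].object == object) {
                writers[i].producer_task = current_task;
                return;
            }
        if (nwriters < MAX_OBJECTS)
            writers[nwriters++] = (last_writer_t){ object, current_task };
    }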

Tasks extractor (Trace translator)

From the collected trace, the tasks extractor reconstructs the potential trace with task-based dataflow parallelism. For each task, the tasks extractor crops from the sequential trace the segment that belongs to that task and instantiates a new task in the OmpSs trace. Moreover, for each dependency user event from the sequential trace, the tasks extractor creates a two-sided synchronization event in the reconstructed OmpSs trace (Figure 4.5). Thus, the trace derived with the tasks extractor instantiates a separate trace entity for each task, defining the dependencies among the tasks using semantics that the Dimemas simulator can process.

Replay simulator

We extended Dimemas in order to simulate MPI/OmpSs execution (Figure 4.5). We implemented task synchronization as an intra-node instantaneous MPI transfer that specifies the source and the destination task. This implementation allows Paraver to visualize both MPI communications among processes and data dependencies among tasks. Furthermore, we implemented a new scheduling policy that optimizes MPI/OmpSs execution. The new scheduling policy optimizes parallel execution by favoring tasks that are on the critical path of execution. Furthermore, when simulating MPI/OmpSs execution, the scheduler assigns outstanding priority to tasks that contain MPI calls, and allows these tasks to preempt regular (non-MPI) tasks. This latter scheduling feature significantly improves execution by prioritizing messaging tasks.
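The priority rule can be summarized as in the sketch below; the structures are hypothetical and the code only illustrates the policy, not the actual Dimemas implementation.

    #include <stdbool.h>

    /* Hypothetical, simplified view of a ready task in the simulator. */
    typedef struct {
        bool   contains_mpi_call;     /* messaging tasks get outstanding priority  */
        double critical_path_length;  /* remaining work on the task's longest path */
    } ready_task_t;

    /* Returns nonzero if task a should be scheduled before task b. */
    int higher_priority(const ready_task_t *a, const ready_task_t *b)
    {
        if (a->contains_mpi_call != b->contains_mpi_call)
            return a->contains_mpi_call;   /* MPI tasks may preempt non-MPI tasks */
        return a->critical_path_length > b->critical_path_length;
    }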

[Figure 4.5 is organized in three stages: tracing the in-order execution of tasks with the MPISS tracer (dependencies found at runtime are coded as user events and flushed to the trace), extracting tasks (dependency events are translated into two-sided synchronization primitives between task_1–task_4), and simulating the OmpSs execution with Dimemas (tasks replayed on 2 cpus per node, obeying the obtained inter-task synchronization).]

The MPISS tracer intercepts the functions inserted by code translation, detects dependencies among tasks and marks these dependencies in the trace (example: at the beginning of task_2 the tracer identifies a dependency from task_1 and emits to the trace event(dependency, 1)). The tasks extractor translates dependency user events into two-sided synchronizations (example: event(dependency, 1) is translated into an outgoing dependency (dep → 2) in task_1 and an incoming dependency (dep ← 1) in task_2). Dimemas replays the tasks obeying the defined dependencies (example: due to the events (dep → 2) in task_1 and (dep ← 1) in task_2, task_2 cannot start before task_1 finishes).

Figure 4.5: Environment methodology.

4.4 Tareador – Framework to identify potential dataflow parallelism

Dataflow programming models can extract very distant and irregular parallelism, a type of parallelism that a programmer himself can hardly identify. Tareador allows the programmer to start from a sequential application and, using a set of simple constructs, propose some decomposition of the sequential code into tasks. Then, Tareador dynamically instruments the annotated code and at run-time detects the actual data dependencies among the proposed tasks. Furthermore, Tareador automatically estimates the potential parallelism of the proposed decomposition, providing to the programmer the data-dependency graph of the tasks, as well as the simulation of the potential parallel time-behavior. In a similar manner, the programmer can start from an MPI application, propose some decomposition of MPI processes into OmpSs tasks and observe the parallelism of the potential MPI/OmpSs execution.

4.4.1 Implementation details

The idea of the framework is to: 1) run a sequential code with annotations that mark the task decomposition; 2) dynamically detect the memory usage of each annotated task; 3) identify dependencies among all task instances; and 4) simulate the parallel execution of the annotated tasks. First, Tareador instruments all annotated tasks in the order of their instantiation. That way, the instrumentation can monitor accesses to all memory objects and thus identify data dependencies among tasks. Considering the detected dependencies, Tareador creates the dependency graph of all tasks and, finally, simulates the OmpSs execution. Moreover, Tareador can visualize the simulated time-behavior and offer deeper insight into the OmpSs execution.

The framework (Figure 4.6) takes the input code and passes it through a tool chain that consists of a Mercurium-based code translator, a Valgrind-based tracer, the Dimemas replay simulator, and the Paraver and dependency-graph visualization tools. The input code is a complete OmpSs code or a sequential code with only light annotations specifying the proposed taskification. A Mercurium-based tool translates the input code into a sequential code with inserted functions annotating entry to and exit from each task. The obtained code is compiled and executed sequentially.


Figure 4.6: The Tareador framework integrates Mercurium code translator, Valgrind tracer, Dimemas simulator, Paraver and dependency graph visualization tool

The Valgrind tracer dynamically instruments this sequential execution. The tracer makes the authentic trace of the (actually executed) sequential execution, while at the same time it reconstructs what would be the artificial trace of the (potential) OmpSs execution. Processing the obtained traces, the Dimemas simulator reconstructs the parallel time-behavior on a reconfigurable platform. Paraver can visualize the simulated time-behaviors, allowing a detailed study of the differences between the (instrumented) sequential and the (corresponding simulated) OmpSs execution. Furthermore, simple offline processing translates the OmpSs trace into the task dependency graph that can be visualized through the Graphviz graph visualization environment [1].

Note that, by instrumenting an MPI code with annotated tasks, Tareador can reconstruct the potential parallel MPI/OmpSs execution. The input code is a complete MPI/OmpSs code or an MPI code with only light annotations specifying the proposed taskification. The translated code is executed in pure MPI fashion. Each MPI process runs on top of one instance of the Valgrind virtual machine that implements the designed tracer. The tracer makes the authentic trace of the MPI execution, and the artificial trace of the (potential) MPI/OmpSs execution. Both traces are fed to Dimemas for the simulation of the parallel execution. For reasons of simplicity, in the following subsections we further describe the methodology steps for instrumenting sequential applications and identifying the potential OmpSs parallelism (without MPI parallelism).

Input code

The input code can be an OmpSs code or a sequential code with light annotations that specify the task decomposition. The input code has to specify which functions (parts of the code) should be executed as tasks, but not the directionality of the function parameters. Thus, the input code can be a sequential code with only annotations that specify some task decomposition. Figure 4.7 (left) shows an example of a sequential code with an annotated taskification choice.

Code translator

Our Mercurium-based tool translates the input code into a code with forced serialization of tasks. The obtained code is a sequential code with empty functions (hooks) annotating when the execution enters and exits a task (Figure 4.7).

The translated code is then compiled with a native sequential compiler, and the binary of the sequential execution is passed to the Valgrind tracer for further instrumentation. It is important to note that the functions tareador_start_task and tareador_end_task may be injected directly into the input code. This feature allows a very flexible specification of the task decomposition, even in applications with a very irregular code layout. Nesting of tasks is also supported.

Tracer

Leveraging Valgrind functionalities, the tracer instruments the execution and makes two Dimemas traces: the authentic trace describing the instrumented sequential execution; and the artificial trace describing the potential OmpSs execution. The tracer uses the following Valgrind functionalities: 1) intercepting the inserted hooks in order to track which task is currently being executed; 2) intercepting all memory allocations in order to maintain the pool of data objects in memory; and 3) intercepting memory accesses in order to identify data dependencies among tasks. Using the obtained information, the tracer generates the authentic trace of the original (actually executed) sequential execution. Concurrently with that process, the tracer reconstructs the artificial trace of the potential (simulated) OmpSs execution.

The tool instruments accesses to all memory objects and derives the data dependencies among tasks.

(a) input code:

    #pragma omp task
    void compute(float *A, float *B) {
        ...
    }

    int main () {
        ...
        compute(a,b);
        ...
    }

(b) translated code:

    void compute(float *A, float *B) {
        ...
    }

    int main () {
        ...
        tareador_start_task("compute");
        compute(a,b);
        tareador_end_task();
        ...
    }

Note: The input code must specify the entry/exit of each task. Thus, the input code does not have to be a complete OmpSs code, but rather a sequential code only with a specified proposed decomposition.

Figure 4.7: Translation of the input code required by the framework.

By intercepting all dynamic allocations and releases of memory (allocs and frees), the tool maintains the pool of all dynamic memory objects. Similarly, by intercepting all static allocations and releases of memory (mmaps and munmaps), and reading the debugging information of the executable, the tool maintains the pool of all static memory objects. The tracer tracks all memory objects, intercepting and recording accesses to them at the granularity of one byte. Based on these records, and knowing in which task the execution is at each moment, the tracer detects all read-after-write dependencies and interprets them as dependencies among tasks.

The tool creates the authentic trace of the executed sequential run and, considering the identified task dependencies, creates the artificial trace of the potential OmpSs run (Figure 4.8). When generating the original trace, the tool describes the actually executed run by putting in the trace computation records stating the length of each computation burst in terms of the number of instructions. Additionally, when reconstructing the trace of the potential OmpSs run, the tracer breaks the original computation bursts into tasks, and then synchronizes the created tasks according to the identified data dependencies.
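The sketch below illustrates byte-granularity read-after-write detection with a flat shadow table that records the last writer task of every tracked byte; the fixed-size table and the report function are hypothetical simplifications (the real tool tracks dynamically maintained memory objects).

    #include <stddef.h>

    #define TRACKED_BYTES (1u << 20)   /* illustrative fixed-size shadow region */

    /* shadow_writer[i] = id of the task that last wrote byte i (0 = none). */
    static int shadow_writer[TRACKED_BYTES];
    static int current_task;

    void report_dependency(int from_task, int to_task);   /* assumed emitter */

    /* Intercepted task-entry hook. */
    void on_task_entry(int task_id)
    {
        current_task = task_id;
    }

    /* Called for every store to a tracked byte. */
    void on_byte_store(size_t offset)
    {
        shadow_writer[offset] = current_task;
    }

    /* Called for every load from a tracked byte: a read-after-write across
     * tasks becomes a dependency between the writer and the reader task. */
    void on_byte_load(size_t offset)
    {
        int writer = shadow_writer[offset];
        if (writer != 0 && writer != current_task)
            report_dependency(writer, current_task);
    }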

[Figure 4.8 shows the collected sequential trace (a single process of CPU bursts) next to the derived OmpSs trace (task_1–task_4 with their dependency records) and the resulting OmpSs simulation on 2 cpus per node.]

Note: The tracer describes the OmpSs trace by breaking the original computation bursts into tasks and synchronizing the created tasks according to the identified data dependencies.

Figure 4.8: Collecting trace of the original sequential and the potential OmpSs execution.

4.4.2 Usage of Tareador

Based on Tareador, we designed a top-to-bottom trial-and-error approach that can be used to port sequential applications to OmpSs. The approach requires no knowledge or understanding of the target code. The programmer starts by proposing a very coarse-grain task decomposition of the studied sequential code. Then, Tareador estimates the potential parallelism of the proposed decomposition and plots the potential parallel time-behavior. Based on this output, the programmer decides how to refine the decomposition to achieve higher parallelism. The programmer repeats these steps until finding a decomposition that exposes satisfactory parallelism. In Section 6.2, we illustrate this iterative process on the examples of the sequential code of Cholesky and the MPI code of HP Linpack. Furthermore, in Section 6.3 we describe an effort to automate the process of exploring potential decompositions of sequential codes.

Given the final decomposition, the programmer can use Tareador to get hints to complete the process of exposing OmpSs parallelism. Tareador detects, for each parameter of a task, whether it is used as input, output or inout. Moreover, Tareador warns the programmer about the objects that are accessed in tasks but not passed through the parameter list. Finally, Tareador can be used as a debugging tool – the programmer can run it on an already existing MPI/OmpSs code to automatically detect all misuses of the memory. By doing all these checks with Tareador, the programmer can make sure that an MPI/OmpSs code is correct from the point of view of OmpSs parallelization.

5 Overlapping communication and computation in MPI scientific applications

Overlapping communication and computation is believed to be a very promising technique to optimize MPI execution. As already mentioned in Section 3.1.2, overlap is the concurrency of computation and communication that results in hiding transfer delays. The overlap leads to two clear benefits: 1) overall speedup of the parallel execution; and 2) relaxation of the network demand without a consequent degradation of the parallel performance. However, state-of-the-art techniques that increase overlap require serious refactoring of the target application and significantly reduce the maintainability of the code. Therefore, the goal of the research presented in this Chapter is to achieve higher overlap without any refactoring of the legacy code.

The rest of the Chapter is organized as follows. In Section 5.1, we present three characteristic behaviors in MPI bulk-synchronous programming and show that these behaviors seriously suffer from a lack of overlap.

Section 5.2 introduces automatic overlap at the level of MPI calls. The technique increases overlap by breaking each message into independent chunks and maximizing the overlap of each chunk. In Section 5.3, we present speculative dataflow – a hardware-assisted technique that achieves automatic overlap at the level of MPI calls without any need to restructure the original legacy code. We prove the feasibility of the technique and demonstrate that it improves the execution of the three characteristic behaviors. Moreover, Section 5.4 quantifies the potential benefits of automatic overlap in real-world scientific applications. The study especially focuses on the different patterns of production and consumption that seriously influence the overall potential of overlap.

5.1 Characteristic application behaviors

The mainstream methodology for MPI programming is bulk-synchronous programming, which suffers from a lack of overlap of communication and computation. Bulk-synchronous programming tends to maximize the size of each MPI transfer and therefore reduce the overhead per sent byte. We identify three characteristic behaviors of applications written using bulk-synchronous programming. These three behaviors (Figure 5.1) express a significant lack of overlap that causes performance losses.

The first characteristic behavior is balanced execution (Figure 5.1a). In balanced execution, in each iteration, all MPI processes have the same amount of computation. During this computation phase, the network is completely idle. After the computation phase finishes, all MPI processes start communicating. During this communication phase, all the processors are idle, waiting for the communication to finish. Moreover, a second problem appears from the fact that all the processes demand network service at the same time. This type of "bursty" traffic causes a lot of network contention and puts serious pressure on the interconnect. The length of the communication phase depends only on the technological parameters of the network, such as bandwidth and latency. Therefore, relaxing network constraints (decreasing bandwidth or increasing latency) directly increases the duration of the communication phase, and the overall length of the parallel execution.

The second characteristic behavior is micro-imbalanced execution (Figure 5.1b).

[Figure 5.1 shows the timelines of two or three MPI processes over four iterations for each behavior: (a) balanced execution; (b) micro-imbalanced execution; (c) pipeline execution.]

Figure 5.1: Three characteristic MPI behaviors that suffer from lack of overlap

Micro-imbalanced execution is globally balanced – at the level of the whole execution, all MPI processes compute for the same time. At the same time, micro-imbalanced execution is locally imbalanced – at the level of one iteration, different MPI processes compute for different times. Although the application is globally balanced, the imbalance at the level of one iteration causes significant communication delays. Whenever the sending process has more computation than the corresponding receiving processes, the receiver stays idle until the sender finishes the computation. Moreover, this transfer is completely non-overlapped and causes communication delays.

The third characteristic behavior is pipeline execution (Figure 5.1c). In pipeline execution, the end of the computation of one process provides data for the next process to start. As data is sent at the end of the computation, the second process cannot start until the previous one finishes.

Moreover, apart from waiting for the sending process to finish its computation, the receiving process must also wait for the complete transfer to arrive.

5.2 Automatic Communication-Computation Overlap at the MPI Level

Overlapping communication and computation at the MPI level consists of overlapping MPI transfers with the computation in which the data elements of these MPI transfers are produced and consumed. This can be achieved using the following four techniques, combined in the sketch that follows the list.

• Message chunking: Each original MPI message is partitioned into independent chunks consisting of one or more data elements.

• Advancing sends: Each chunk is sent as soon as it is produced.

• Double buffering: Two different buffers are used to separate the chunks being consumed in the current iteration from the incoming chunks to be consumed in the next iteration.

• Postponing receptions: Each chunk is waited for only at the moment when it is actually needed for consumption.
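The sketch below, assuming a fixed chunk count and placeholder produce_chunk/consume_chunk routines, shows how these four techniques could be combined by hand with standard non-blocking MPI calls; it illustrates the mechanisms only, whereas the automatic technique studied in this Chapter requires no source changes.

    #include <mpi.h>

    #define NCHUNKS     4
    #define CHUNK_ELEMS 1024

    void produce_chunk(double *chunk, int c);        /* placeholder computation */
    void consume_chunk(const double *chunk, int c);  /* placeholder computation */

    /* Sender side: message chunking + advancing sends. */
    void send_iteration(double *buf, int dest, MPI_Comm comm)
    {
        MPI_Request req[NCHUNKS];
        for (int c = 0; c < NCHUNKS; c++) {
            produce_chunk(buf + c * CHUNK_ELEMS, c);
            /* each chunk leaves as soon as it is produced */
            MPI_Isend(buf + c * CHUNK_ELEMS, CHUNK_ELEMS, MPI_DOUBLE,
                      dest, /* tag = chunk id */ c, comm, &req[c]);
        }
        MPI_Waitall(NCHUNKS, req, MPI_STATUSES_IGNORE);
    }

    /* Receiver side: double buffering + postponed receptions.  The receives
     * for the chunks consumed in this iteration are assumed to have been
     * posted during the previous iteration (or before the first one). */
    void recv_iteration(double *bufs[2], MPI_Request req[2][NCHUNKS],
                        int src, int iter, MPI_Comm comm)
    {
        int cur = iter % 2, nxt = (iter + 1) % 2;

        /* post the receives for the next iteration into the other buffer */
        for (int c = 0; c < NCHUNKS; c++)
            MPI_Irecv(bufs[nxt] + c * CHUNK_ELEMS, CHUNK_ELEMS, MPI_DOUBLE,
                      src, c, comm, &req[nxt][c]);

        for (int c = 0; c < NCHUNKS; c++) {
            /* wait for a chunk only at the moment it is consumed */
            MPI_Wait(&req[cur][c], MPI_STATUS_IGNORE);
            consume_chunk(bufs[cur] + c * CHUNK_ELEMS, c);
        }
    }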

Figure 5.2 shows the traditional case of non-overlapped MPI communication. Here, process A must wait until the MPI message that consists of four data elements is fully produced during iteration i. Then, process A sends the MPI message to process B, so that process B can execute its iteration i + 1. Thus, there is no overlap of the communication of the MPI message with any of the computation phases in the iterations i and i + 1. Consequently, both processes suffer the corresponding communication delays.

Conversely, Figure 5.3 shows the case of using the four mentioned techniques to overlap communication and computation. Using automatic overlap, the application can overlap the communication of a chunk with the computation that produces succeeding chunks at the sender's side and with the computation that consumes preceding chunks at the receiver's side.

Figure 5.2: Non-overlapped execution. Figure 5.3: Overlapped execution. [Both figures show processes A and B over iterations i and i+1. In the non-overlapped case, the whole MPI message is transferred between the computation phases. In the overlapped case, the message is split into chunks whose production intervals (Tp1–Tp4) at process A and consumption intervals (Tc1–Tc4) at process B overlap with the MPI transfer, with double buffering at the receiver.]

For example, the communication delays of chunk p1 can be overlapped with the computation time to produce the following chunks p2 and p3 (Tp2 + Tp3) and also with the computation time to consume chunk c0 (Tc0). In general, the transfer of a chunk i can be overlapped with the following computation time:

    \sum_{j=i+1}^{n-1} Tp_j + \sum_{j=0}^{i-1} Tc_j, \quad 0 \le i \le n-1    (5.1)

where Tp_j and Tc_j are the production and consumption time intervals to process the chunk j, and n is the total number of chunks in an MPI message.

Additionally, the double buffering technique is used to prevent overwriting of the communicated data at the receiver's side. As illustrated in Figure 5.3, the chunk p0 might arrive at process B during iteration i, instead of at the next iteration i + 1. Therefore, it can conflict with the previous value of p0 that is already there. The double buffering technique solves this anti-dependency by storing the message for iteration i + 1 in a different buffer from the values used in the current iteration i.

It is important to note that the equation above describes the ideal case in which the computation time available to overlap the transfers is the highest possible for all chunks. However, an application can use a production/consumption pattern that is less favorable for overlap than the pattern above, and thus the total amount of computation time available for overlap may be drastically reduced. For example, if an application first consumes the last produced chunk, there is no computation time available to overlap this particular chunk. Even worse, if an application produces and consumes all chunks at the same time, there is no computation time available to overlap any of the chunks. The diversity of production/consumption patterns and their influence on the overlapping potential will be analyzed in detail by our simulation framework.

[Figure 5.4 revisits the three timelines of Figure 5.1 with chunked overlap: (a) balanced execution; (b) micro-imbalanced execution; (c) pipeline execution.]

Figure 5.4: Three characteristic behaviors with chunked overlap

5.2.1 Automatic overlap applied on the three characteristic behaviors

This Section illustrates the effect of automatic overlap at the MPI level on the described characteristic MPI behaviors. For ease of illustration, we assume that the overlapping technique breaks each original computation into two independent chunks. We also assume that the production/consumption patterns are ideal.

Assuming an ideal pattern means that, after finishing n% of the iteration, the sender has produced n% of the whole message. Similarly, on the receiver's side, after receiving n% of the original message, the receiver can process n% of the corresponding computation interval.

In balanced execution, automatic overlap can bring a maximal speedup of 2 and a significant relaxation of network constraints (Figure 5.4a). In the balanced execution without overlap (Figure 5.1a), the phases of computation and communication are disjoint, and the total execution time is the sum of the two. With automatic overlap, each process sends one half of the message at the half of iteration 0. Then iteration 1 can start as soon as this half of the message arrives. Therefore, if the transfer time of half of the message takes less than half of the computation in one iteration, the parallel execution can proceed with no communication delays. Therefore, if in the non-overlapped balanced execution the communication phase takes as much time as the computation phase, automatic overlap can hide all the communication delays and achieve the maximum speedup of 2x. On the other hand, automatic overlap can also significantly relax the network constraints. As long as the transfer time of the original message takes less time than the computation in one iteration, automatic overlap can hide all communication delays. Also, the traffic becomes less "bursty", since automatic overlap with n chunks breaks one big transfer burst into n smaller bursts.

Automatic overlap can also bring a significant execution speedup to micro-imbalanced execution. In the non-overlapped execution (Figure 5.1b), process 1 cannot start its iteration 1 until process 0 finishes its iteration 0. Conversely, with automatic overlap (Figure 5.4b), process 1 can start its iteration 1 as soon as process 0 finishes the first half of its iteration 0. This way, the start of iteration 1 of process 1 overlaps with the end of iteration 0 of process 0. This overlap of iterations leads to a smoother execution that has fewer communication delays and achieves better performance.

Finally, in pipeline executions, automatic overlap achieves better parallelization and leads to high speedup. In the presented non-overlapped pipeline execution (Figure 5.1c), process 1 can start its iteration 1 only after process 0 finishes its whole iteration 0. Thus, iterations 0 and 1 are completely disjoint. On the other hand, when using automatic overlap (Figure 5.4c), process 1 can start its iteration 1 after process 0 finishes only one half of its iteration 0. This way, iteration 1 partially overlaps with iteration 0. This overlap leads to better hiding of communication delays and an overall speedup of the application.

Theoretically, for p processes and an automatic overlap that breaks the original message into c chunks, the maximal speedup is min{p, c}.

5.3 Speculative Dataflow – A proposal to achieve auto- matic overlap

In this Section, we propose speculative dataflow – a speculative technique that imple- ments automatic overlap. Speculative dataflow introduces mechanisms that achieve automatic overlap without restructuring the code of the application. The technique allows each MPI process to advance sends by speculatively sending parts of the orig- inal message. Furthermore, the technique allows each process to postpone receives by waiting parts of the incoming message only when these parts are needed. Finally, speculative dataflow must ensure correctness by guaranteeing that it can recover from miss-speculation. Our main goal is to show that the protocol is feasible and that it is beneficial for the three targeted behaviors from Figure 5.1. Also, we discuss in more details the possible implementation of the mechanism, specially focusing on the recovery mechanism and the potential hardware support. Furthermore, in Section 5.4, we investigate in more depth the potential benefits of this mechanism in real-world scientific applications.

5.3.1 Protocol of speculative dataflow

The protocol uses speculation to advance partial sends. Speculative dataflow can advance sends using two mechanisms: 1) a mechanism to identify which buffer will be sent in the future; and 2) a mechanism to identify when some part of the message is produced and ready to be sent. In the general case, these two mechanisms cannot be both efficient and always correct. Therefore, our technique implements these mechanisms as speculative, allowing them to be aggressive while providing that errors can be recovered from.

The protocol also postpones partial receives. Automatic overlap breaks the whole message into chunks and identifies the moment when each of the chunks is needed for consumption.

for consumption. Therefore, speculative dataflow can postpone partial receives by waiting for each part of the message to be received exactly when that part is needed for consumption.

To complete the speculation protocol, the receiving process has to control speculation. The receiver is in a speculative phase if it computes using data that was sent speculatively and is still unconfirmed by the sender. While in the speculative phase, the receiver produces temporary data, i.e., data that is not committed to memory. If the sender confirms the transfer, the receiver commits the temporary data to memory. If, instead of a confirmation, the receiver gets a notification of mis-speculation, it activates its recovery mechanism and discards the temporary data.
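The following is a minimal sketch of how the sender and receiver sides of this protocol could be organized. It only illustrates the description above and is not the thesis's implementation: the types and helper callbacks (send_chunk, wait_chunk, commit_temp, recover) are hypothetical.

#include <stdbool.h>
#include <string.h>

#define NCHUNKS 4

typedef struct {
    bool sent[NCHUNKS];     /* sender side: chunk already pushed speculatively */
    bool received[NCHUNKS]; /* receiver side: chunk has arrived                */
} spec_state;

/* Sender: invoked when the production hooks detect that chunk i of a buffer
 * predicted to be re-sent has (presumably) reached its final value.           */
static void on_chunk_produced(spec_state *s, int i, void (*send_chunk)(int))
{
    if (!s->sent[i]) { send_chunk(i); s->sent[i] = true; }
}

/* Receiver: invoked when the consumption hooks detect that the computation is
 * about to read chunk i; it blocks only for that chunk, not the whole message. */
static void on_chunk_needed(spec_state *s, int i, void (*wait_chunk)(int))
{
    if (!s->received[i]) { wait_chunk(i); s->received[i] = true; }
}

/* End of a speculative phase on the receiver: commit the temporary results if
 * the sender confirmed the speculative transfer, otherwise discard them and
 * re-execute the mis-speculated phase (the recovery model of Section 5.3.1).   */
static void end_speculative_phase(spec_state *s, bool sender_confirmed,
                                  void (*commit_temp)(void), void (*recover)(void))
{
    if (sender_confirmed)
        commit_temp();
    else
        recover();
    memset(s, 0, sizeof *s);
}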

Conditions and assumptions

In order to keep the complexity of the initial implementation low, we adopt a set of simplifying assumptions. Our speculation technique relies on the highly repetitive pattern of MPI codes [41]. For each intercepted MPI_Send call, we predict that this MPI call will repeat – that during the execution, the same buffer will again be sent to the same destination. We implemented the mechanism for detecting the end of a data chunk's production assuming a sequential order of data production. In other words, we consider that the production of some element of the buffer signifies that all previous elements in the buffer are already produced and ready to be sent. We model the recovery mechanism as a re-computation of the mis-speculated phase. In a future implementation of speculative dataflow, we expect to include the hardware mechanisms for checkpoint and recovery proposed in the scope of multi-threaded speculation [73].

5.3.2 Emulation

In order to gain experience and identify potential problems, we designed an emulation framework for the proposed solution. This emulation environment consists of a modified MPI library and a set of probes used to intercept every memory access. These software hooks detect memory usage of the data that is involved in communication. To make an initial evaluation of the potential benefit of our proposal, we imported a set of benchmarks into the emulation framework and executed them on MareNostrum.

Figure 5.5: Software mock-up for the evaluation

The emulation environment consists of wrappers for MPI calls and manually inserted software hooks for instrumenting memory accesses. Code lines in black belong to the original MPI code, while lines in red represent the injected calls. The role of every new procedure is marked in Figure 5.5. Each MPI routine is intercepted and replaced with a new customized MPI routine that enforces speculative dataflow. Function write_update instruments message production and signals when the process has finished producing some chunk of the message. Function check_avail_read instruments message consumption and signals when the process wants to access some chunk that has not been received yet. Finally, function check_avail_write instruments message production and signals an anti-dependency hazard – the process wants to write into some chunk although that chunk from the previous iteration has not been sent yet. The described mock-up environment requires manual insertion of software hooks. Our target is to achieve the same functionality without modifying even the binary of the target application. Thus, we need a hardware mechanism that detects the production and consumption events and fires exception handling routines that drive our speculative protocol. In Section 5.3.3, we describe hardware support that can drastically reduce the overhead of our pure software solution.
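To make the placement of these hooks concrete, the sketch below shows how a typical receive/compute/send iteration could be instrumented. The hook names (write_update, check_avail_read, check_avail_write) are taken from Figure 5.5, but their signatures, the surrounding loop and the placeholder computation are our own illustration, not the actual mock-up code; the MPI_Recv/MPI_Send calls stand for the intercepted wrapper versions.

#include <mpi.h>

#define N 4096   /* elements per message (illustrative) */

/* Hook prototypes as we imagine them; the real mock-up may differ.              */
void write_update(double *buf, int elem);      /* out-buffer element written      */
void check_avail_read(double *buf, int elem);  /* in-buffer element about to be read */
void check_avail_write(double *buf, int elem); /* out-buffer element about to be overwritten */

void iteration(double *in, double *out, int left, int right)
{
    MPI_Recv(in, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    for (int i = 0; i < N; i++) {
        check_avail_read(in, i);    /* stall only if the chunk containing i is not here yet   */
        check_avail_write(out, i);  /* anti-dependency: has the previous send of this chunk left? */
        out[i] = 0.5 * in[i];       /* placeholder computation                                 */
        write_update(out, i);       /* chunk finished? then it can be sent speculatively       */
    }

    MPI_Send(out, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
}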

Emulation results

We used the three characteristic MPI behaviors to test the potential of speculative dataflow. For each of the characteristic behaviors, we wrote a micro-benchmark and

adapted it for instrumentation inside our emulation environment.

Figure 5.6: Influence of the network bandwidth on execution time

First, we tested speculative dataflow on a balanced application with 8 processes. The application consists of computation and a two-way ring communication pattern. We executed the application with both the non-overlapped and the overlapped model. Then we processed the obtained traces using the Dimemas simulator, to plot how execution time depends on the network bandwidth. Figure 5.6 shows the relative performance normalized to the performance of the non-overlapped execution with unlimited bandwidth. The results show that speculative dataflow can achieve a significant speedup. For example, for a network bandwidth of 250MB/s (the bandwidth of MareNostrum), speculative dataflow achieves a speedup of 1.72 over the non-overlapped execution. The results also show that speculative dataflow achieves significant bandwidth relaxation without a consequent performance penalty. More specifically, to achieve the performance of the overlapped execution with a bandwidth of 450MB/s, the non-overlapped execution requires a bandwidth of 4200MB/s. As mentioned above, speculative dataflow optimizes bandwidth usage by eliminating "bursty" traffic patterns. Figures 5.7 and 5.8 illustrate five iterations of non-overlapped execution, followed by five iterations of overlapped execution.

Figure 5.7: Bandwidth usage for the link bandwidth of 2.5GB/s

Figure 5.8: Bandwidth usage for the link bandwidth of 250MB/s

The upper pictures show the execution timeline, where blue regions represent computational phases and white regions denote communication delays. The lower plots show the system bandwidth usage throughout the execution (the time scale is the same as for the upper plots). Figure 5.7 illustrates the execution with a link bandwidth of 2.5GB/s, while Figure 5.8 is for a link bandwidth of 250MB/s. It is interesting to note that speculative dataflow "flattens" bandwidth usage peaks, reducing the needed bandwidth and speeding up the execution.

The second experiment uses the same set of processes and the same communication pattern, but now modeling a micro-imbalanced application. Figure 5.9 illustrates 20 iterations of the original execution followed by the same number of iterations in the "chunked mode". The vertical line separates the two modes. White color represents processor stalls and other colors represent different iterations. The same color in two different processes indicates matching iterations.

Figure 5.9: Trace of a micro-imbalanced application

Figure 5.10: Overlapping in pipeline executions

The results show that speculative dataflow reduces the time spent in stalls from 35% to 2.3%. Furthermore, in spite of the added overhead, speculative dataflow achieves a speedup of 1.51.

Figure 5.10 illustrates the potential benefits of speculative dataflow in pipeline applications. The figure shows the execution of the non-overlapped run, followed by a few iterations that include speculative dataflow. Again, white color represents stalls and matching colors represent matching iterations. Speculative dataflow eliminates almost all stalls from the non-overlapped execution and achieves a speedup of 3.32.

5.3.3 Hardware support

Profiling of the implementation indicated that the functions for intercepting memory accesses represent the highest overhead. The emulation environment calls these functions for each memory access of the application, while the actual chunk transfer happens only a few times per computation burst. In order to hide this overhead and try the proposed technique on legacy binaries, we propose a possible hardware device that takes over the hooks' detection functionality. Hardware support that detects such events and raises a signal would avoid a large fraction of the overhead and allow a very flexible implementation in the software signal handlers.

Hardware support should be the activation mechanism for the protocol of speculative dataflow. The hardware records the actual state of ongoing transfers, monitors the accesses to the data involved in communication and detects the events when some chunk should be sent/received. Upon detection of such an event, the hardware stalls the processor and raises a signal. The protocol of speculative dataflow handles the signal, executes the necessary communication operations and updates the hardware records about the state of ongoing transfers. Upon termination of the signal handler, the host processor regains control of the execution. Similar hardware structures have already been proposed in the field of speculative multithreading [93].

The proposed hardware is similar to a TLB. The difference is that a TLB raises a signal if the address is not found in the table, while our table should trigger the signal when a match occurs. The other properties of these two matching mechanisms are the same. Hence, the duality with the already implemented TLB mechanism supports the feasibility of our proposal.
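The following sketch illustrates the kind of lookup such a watch table would perform on every monitored access. It is our own illustration of the idea only: the table layout, the fields and the handler interface are assumptions, not a specification of the proposed hardware.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define WATCH_ENTRIES 64

/* One entry watches the address range of a single chunk of a communicated buffer. */
typedef struct {
    uintptr_t base;          /* first byte of the watched chunk                      */
    size_t    len;           /* chunk length in bytes                                */
    int       chunk_id;      /* which chunk of which transfer this range belongs to  */
    bool      watch_writes;  /* production (write) vs. consumption (read) event      */
    bool      valid;
} watch_entry;

static watch_entry table[WATCH_ENTRIES];

/* Handler installed by the speculative-dataflow runtime; in the envisioned design
 * it would run as a signal handler after the hardware stalls the processor.        */
extern void speculative_dataflow_handler(int chunk_id, bool is_write);

/* What the hardware would conceptually do on every monitored memory access:
 * compare the address against all valid entries and, unlike a TLB, raise the
 * event when a match IS found.                                                      */
static void on_memory_access(uintptr_t addr, bool is_write)
{
    for (int i = 0; i < WATCH_ENTRIES; i++) {
        watch_entry *e = &table[i];
        if (e->valid && e->watch_writes == is_write &&
            addr >= e->base && addr < e->base + e->len) {
            speculative_dataflow_handler(e->chunk_id, is_write);
            return;
        }
    }
    /* no match: the access proceeds with no overhead */
}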

5.3.4 Conclusions and future research directions

We showed that speculative dataflow could be an efficient technique to achieve automatic overlap at the level of MPI calls. We designed a speculative technique that can increase overlap in applications without any refactoring of the target code. We especially explored how speculative dataflow would affect the three characteristic behaviors of bulk-synchronous programming. We demonstrated that speculative dataflow can bring the two expected benefits of overlap: 1) execution speedup; and 2) bandwidth relaxation without a consequent performance penalty.

Throughout our study of speculative dataflow, we assumed very regular application behaviors. To continue researching overlap, we must test these assumptions against realistic workloads. Therefore, in the following Section, we use the environment described in Section 4.2 to measure the potential overlap in a set of real scientific codes.

5.4 Quantifying the potential benefits of automatic overlap

State-of-the-art scientific applications achieve very little overlap. Most of the techniques to increase overlap are based on code refactoring. These code optimizations are cumbersome and demand a lot of programming time. On the other hand, other techniques, such as speculative dataflow (Section 5.3), achieve overlap by relying on additional hardware support. In both cases, it is very hard to anticipate how much a real-world application would benefit from these overlapping techniques. Anticipating the potential benefits is very important, because the programmer can then evaluate the potential of the technique and decide whether its implementation is worth the effort.

So far, evaluating the potential benefits of overlapping communication and computation has not been sufficiently addressed. There have been attempts to treat this issue [76], but they mostly rely on a theoretical quantification of overlap, by trying to build representative mathematical models of the applications studied. However, in their modeling of scientific codes, these studies tend to suppress some very important execution properties and omit their influence on the final result. More specifically, these studies assume that applications have ideal computation patterns – that each MPI process linearly produces/consumes the buffers that are involved in communication. Our study invalidates this assumption and furthermore shows that the computation patterns of the data that is communicated decisively determine the potential overlap.

The major goal of this work is to offer to the community a fast and precise framework that estimates how much an MPI application can benefit from increased overlap. We designed a framework (Section 4.2) that takes a binary of a legacy MPI application and directly determines how much the application can profit from overlap. Our technique requires no intervention on the studied application. Using our framework,

prior to any implementation effort, a programmer can instrument the targeted application and identify the potential of the planned optimizations. Moreover, the framework can increase the understanding of overlap by providing visualization support that can compare non-overlapped and overlapped executions. All these features make the framework a very useful tool for studying overlap in real scientific applications.

5.4.1 Experimental Setup

We conducted our measurements on two popular benchmarks and four well-known scientific applications. The choice of the application suite provides a wide range of different program behaviors to be studied. For all the codes we use their original version, as directly obtained from the distribution. The application pool consists of Sweep3D [4], POP [3], Alya [51], SPECFEM3D [22] and the NAS [2] benchmarks BT and CG. Sweep3D is a wavefront application that solves a three-dimensional neutron transport problem. The problem size used is 50 × 50 × 50 with mk=10. POP, the Parallel Ocean Program, simulates oceans and their influence on climate. The input deck used is test with a grid size of 192 × 128 × 20. Alya is a multi-physics application that solves a variety of physics problems such as convection-diffusion-reaction, incompressible flows, compressible flows, turbulence, bi-phasic flows and so on. We used the NASTIN module that solves the incompressible Navier-Stokes equations. SPECFEM3D simulates earthquakes in complex three-dimensional geological models. The input deck used is test with 80 cells. Finally, BT and CG are two NAS parallel benchmarks. The problem size used is class B. Our test-bed system consists of 64 PowerPC 970 2.3 GHz processors interconnected with a Myrinet network that provides a unidirectional bandwidth of 250 MB/s. This basic system configuration corresponds to the MareNostrum supercomputer (as of June 10, 2010).

5.4.2 Patterns of production and consumption

For studying the potential of automatic overlap, it is very important to explore how the transferred messages are produced at the sender side and consumed at the receiver side. Each MPI program consists of computation bursts separated by MPI communication calls. A computation burst consumes the data received in the preceding MPI call and

produces the data for the forthcoming MPI call. Ideally, if the production of the buffer to be sent is linear and uniform across the burst, the process can send the first half of the message halfway through the computation burst. Similarly, if the consumption of the received buffer is linear and uniform across the burst, the process can postpone the blocking wait for the second half. Thus, computation patterns determine the potential for advancing sends and postponing receives and strongly influence the potential for overlap. Therefore, our first study focuses on characterizing the patterns of consumption and production of the communicated data. To that end, we developed an instrumentation tool based on Valgrind (Section 4.2.1). For each MPI transfer, the tool identifies the address of the communicated buffer and instruments all memory accesses to that buffer. This information allows us to determine when each element of a communicated buffer is produced or consumed.
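As an illustration of how such per-element access information can be condensed into the metrics reported later in Table 5.1, the sketch below computes, from a logged access trace, the fraction of the production interval needed to finalize a quarter, a half and the whole buffer. It is a hypothetical post-processing step written for this text, not the actual tool built on Valgrind.

#include <stdio.h>

/* One logged access to the communicated buffer, normalized to the burst:
 * t in [0,1] is the position inside the computation interval.               */
typedef struct { double t; int elem; int is_write; } access;

/* Fraction of the production interval after which the first `needed` elements
 * of an n-element buffer have all received their final (i.e., last) write.    */
static double production_point(const access *trace, int nlog, int n, int needed)
{
    double last_write[4096] = {0};          /* assumes n <= 4096 for the sketch */
    for (int i = 0; i < nlog; i++)
        if (trace[i].is_write && trace[i].elem < n)
            last_write[trace[i].elem] = trace[i].t;  /* keep the latest write time */

    double point = 0.0;                     /* the first `needed` elements are final  */
    for (int e = 0; e < needed; e++)        /* once the latest of their last writes   */
        if (last_write[e] > point)          /* has happened                           */
            point = last_write[e];
    return point;
}

int main(void)
{
    /* Tiny artificial trace: 4 elements, each rewritten twice. */
    access trace[] = { {0.10,0,1}, {0.20,1,1}, {0.30,2,1}, {0.40,3,1},
                       {0.60,0,1}, {0.70,1,1}, {0.80,2,1}, {0.95,3,1} };
    int n = 4, nlog = 8;
    printf("quarter: %.2f  half: %.2f  whole: %.2f\n",
           production_point(trace, nlog, n, 1),
           production_point(trace, nlog, n, 2),
           production_point(trace, nlog, n, 4));
    return 0;
}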

Results of tracing patterns

Due to complex computational algorithms, scientific applications tend to have diverse patterns of production and consumption (Figure 5.11). Figures on the left show production patterns, and figures on the right show consumption patterns. For each plot, the x axis represents the normalized time within the corresponding computation interval (from start to end), while the y axis represents an element's address within the accessed buffer. Points represent when each element was written, for production patterns, or read, for consumption patterns. Many such patterns have been identified within each application.

Figures 5.11a and 5.11b present the production and consumption patterns that are present in 50% of all transfers in Sweep3D. As shown on the y-axis, the communicated buffer has 600 elements, and all of them are revisited and accessed many times during one computation phase. This causes very late production of the final version of the elements and decreases the potential for advancing partial sends. As shown in Figure 5.11a, the final version of the first element is produced at 66.3% of the production interval, while the final version of the first quarter of the message is produced at 94.8% of the production interval. Figure 5.11b indicates that the potential for postponing partial receives is even lower. This is because the application reads all the elements of the received message very early in the consumption phase. Compared to the ideal patterns, where a received quarter of the message allows the program to execute 25% of the consumption interval, in Sweep3D a quarter of the message allows only 0.03% of the interval to be passed.

(a) SWEEP3D production pattern    (b) SWEEP3D consumption pattern
(c) NAS-BT production pattern     (d) NAS-BT consumption pattern
(e) POP production pattern        (f) POP consumption pattern

[Each panel plots, for every element of the communicated buffer, the percentage of the iteration progress at which that element is updated (production patterns) or read (consumption patterns).]

Figure 5.11: Production and consumption patterns of various applications

NAS-BT represents a separate class of applications with respect to the patterns of production and consumption. Figures 5.11c and 5.11d show that the communicated buffers are effectively accessed in very short bursts of time. Figure 5.11d shows that all of the elements of the received buffer are loaded four times, each time in an extremely short interval, implying that the data is copied to some other location from where it is consumed during the computation. The production pattern (Figure 5.11c) suggests that the only activity in the buffer is the packing of the message at the very end of the production phase (note the change in horizontal scale). As can be seen, in an application that uses out-of-place computation, the potential for advancing sends and postponing the reception of independent chunks is negligible.

Some characteristic patterns of POP are presented in Figures 5.11e and 5.11f. The production pattern shows the presence of data rewriting and late production of the final versions of the elements, starting from 99% of the production interval. Figure 5.11f shows that the received message is accessed only in a short portion of the computation time, but allowing 3.9% of the interval to be passed before waiting for any part of the incoming message. Patterns like these provide low potential for automatic overlap, but still more than in the case of the applications that use buffer copying techniques.

Characterization of real patterns

Due to the many different transfers in real codes, it is very hard to come up with a concise characterization of the production/consumption patterns of an application. In order to give at least a rough estimate of the potential for advancing sends and postponing receptions, we provide average values describing at what percentage of the computation phase progress certain portions of the message are produced and consumed. This data, presented in Table 5.1, can further be used to coarsely estimate the expected benefits of automatic overlap in the application. Since the critical transfers in Alya are one element long and no chunking is possible, only the data for producing the whole message and consuming the first element is presented.

The averaged patterns seem very unfavorable for overlap. The table on the left shows when in the production phase the final version of the message or of its part is produced. Most of the applications produce the first element very late in the execution, many of them after 95% of the production interval, thus providing negligible potential for advancing sends.

(a) Potential for advancing sends – percent of the production phase needed to produce a part of the message:

           1 elem    quarter   half      whole
  ideal    0%        25%       50%       100%
  NAS-BT   99.1%     99.37%    99.56%    99.98%
  NAS-CG   3.98%     27.98%    51.99%    99.97%
  Sweep3D  66.3%     94.8%     98.2%     99.8%
  POP      95.5%     96.62%    97.75%    99.99%
  SPECFEM  95.3%     96.48%    97.65%    99.99%
  Alya     –         –         –         98.8%

(b) Potential for postponing receptions – percent of the consumption phase that can be passed upon reception of a part of the message:

           nothing   quarter   half
  ideal    0%        25%       50%
  NAS-BT   13.68%    13.71%    13.74%
  NAS-CG   2.175%    18.35%    34.53%
  Sweep3D  0.02%     0.003%    0.004%
  POP      3.525%    3.53%     3.534%
  SPECFEM  0.032%    0.034%    0.036%
  Alya     0.4%      –         –

Table 5.1: Average patterns of production and consumption

The average patterns of consumption are even less favorable for overlap. For some applications, the reception of half of the incoming message allows less than 1% of the consumption phase to be executed. On the other hand, in NAS-BT, the averages indicate that a noticeable part (about 13%) of the consumption interval can be executed before waiting for any part of the incoming message, i.e., it does not depend on the incoming message. This transfer-independent computation provides significant potential for overlapping communication and computation.

Also, it is important to note that, due to the many different transfers and their patterns, this averaging may hide or misrepresent some of the results of interest. An example of this is NAS-BT, where in 63% of the consumption patterns all the elements of the received message are loaded in the first 1% of the consumption interval. Still, due to averaging with the other 37% of the transfers, the average value shows that 13.68% of the work in the consumption phase can execute before the reception of any part of the message. For this reason, studying all of the patterns independently may be beneficial in such cases.

The obtained results raise two issues. The first is that the measured patterns seem to be less beneficial than initially thought, so the question is whether code restructuring can rearrange the computation pattern to make it similar to the ideal linear patterns (this issue is tackled in the following Section). The second is that, even if a pattern shows some potential, it is important to be able to quantify the actual impact of the overlap on performance (Section 5.4.3 covers this issue). Furthermore, it is important to anticipate the benefit that pattern linearization can achieve in terms of

absolute performance. This information will be very useful to determine whether the restructuring effort is worthwhile.

Achieving more profitable patterns

We propose code restructuring to change the applications' computational patterns. Due to complex algorithms and high-level optimizations, restructuring a computational kernel in order to obtain sequential patterns is a difficult task. Therefore, our approach is to use a blocking technique that requires less knowledge about the computational kernel. The idea is to make the granularity of the transfers coarser than the granularity of the computation, so the buffer can be transferred at once, while the computation on the buffer is done by several independent computational iterations. Then, the computations accessing the same buffer are set in the desired order. This preserves the original computation patterns within each small computational iteration, while at the level of the whole buffer the computations can be arranged in such a way that the obtained patterns are "quasi-linear". Theoretically, increasing the ratio of the granularities of transfers and computations leads to patterns that are closer to the ideal.

Our code restructuring technique for obtaining more favorable computation patterns is tested using Sweep3D. This choice of application is made because Sweep3D has very interesting patterns, as shown in Figure 5.11. Furthermore, it represents a real scientific application, and its non-trivial structure of more than 3500 lines of code demonstrates the applicability of the approach in real scientific codes. One iteration in Sweep3D consists of a reception of the buffer, a computation that consumes the received data and produces the buffer that will be sent, and a sending of the produced data. As in many scientific codes, one input parameter determines the granularity of the iteration, thus influencing the size of the buffer and the length of the computation. In the case of Sweep3D, this parameter is called the k-plane. Our blocking technique converts this parameter into two independent parameters that define the granularity of the execution. One parameter represents the granularity of the communication, thus influencing the length of the transferred messages. The second parameter denotes the granularity of the computation, hence specifying the length of the working set of a computational iteration, as sketched below.
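A minimal sketch of this decoupling of granularities is shown below: one message carries comm_k k-planes at once, while the computation walks over the same planes in finer blocks of comp_k planes, ordered so that the per-buffer pattern becomes "quasi-linear". The structure follows the description above, but the helper names and loop layout are illustrative, not the actual Sweep3D modification.

/* Hypothetical helpers: coarse-grain transfers and fine-grain compute blocks. */
void recv_planes(double *buf, int nplanes);
void send_planes(double *buf, int nplanes);
void compute_block(double *buf, int first_plane, int nplanes);

void sweep_iteration(double *buf, int comm_k, int comp_k)
{
    /* coarse-grain communication: one transfer for the whole group of planes */
    recv_planes(buf, comm_k);

    /* fine-grain computation: several independent iterations over the buffer,
     * arranged in the order that linearizes production/consumption            */
    for (int k = 0; k < comm_k; k += comp_k)
        compute_block(buf, k, comp_k);

    send_planes(buf, comm_k);
}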

[Two panels plotting, for each element of the communicated buffer (0–600), the percentage of the iteration progress at which that element is updated (production) or read (consumption) in the restructured Sweep3D.]

(a) production pattern (b) consumption pattern

Figure 5.12: Production and consumption patterns of the restructured Sweep3D

Figure 5.12 presents the patterns for the case when the granularity of transfers is ten times coarser than the granularity of computations. Compared to the patterns of the original Sweep3D code presented in Figure 5.11, the new results show a significantly increased potential for advancing sends and postponing receives of the independent chunks. Now, the final version of the first elements is produced at 6.8% of the computation burst. When partitioning messages into 4 chunks, the final version of the first chunk can be produced and sent at around 30% of the computation interval. The consumption plot also shows a similar "quasi-sequential" pattern, hence allowing the receptions of the independent chunks to be delayed.

The code restructuring technique proved to be easily applicable in the case of Sweep3D. The proposed modifications affected less than 2% of the total code lines and required only a superficial knowledge of the application's algorithm. The obtained patterns seem much more favorable for overlap than the original patterns. However, in order to confirm the usefulness of this refactoring, we must show that the refactoring increased the potential overlap in the application.

5.4.3 Simulating potential overlap

This Section presents the simulation results that evaluate the overlapping potential in scientific MPI applications. The simulated overlapping technique works at the MPI level, automatically capturing all MPI messages and trying to overlap these messages with useful computation of the application. The overlapping technique consists

of the following mechanisms: message chunking, advancing sends, double buffering, and postponing receptions. These mechanisms are broadly described in Section 5.2. Our methodology takes into account a wide range of application behaviors, without requiring knowledge of the application's algorithmic structure. Additionally, the results of our simulation can be visualized, thereby allowing a qualitative comparison between the non-overlapped and the overlapped execution. Moreover, our framework allows us to simulate various network configurations and evaluate the impact of overlap on future networks.

For illustration purposes, Figure 5.13 presents a Paraver visualization of overlap in NAS-CG executed with four MPI processes. The overlapped execution achieves an 8% performance improvement with respect to the non-overlapped execution. Using the Paraver visualization, we can easily investigate the cause of this improvement. Figure 5.13 shows that the overlapping technique can significantly advance partial sends, allowing chunk transfers to partially overlap with the production of the succeeding chunks. However, the overlapping technique cannot postpone partial receives. Thus, to increase the application's overlapping potential, the programmer should rearrange the consumption patterns and expose the potential for postponing receives.
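To illustrate what the simulated transformation amounts to at the MPI level, the sketch below reinterprets a blocking send/receive pair as chunked non-blocking transfers, with each send issued as soon as its chunk is produced and each wait placed right before the chunk is consumed. This is a schematic rendering of the mechanisms listed above (chunking, advancing sends, postponing receptions), not the simulator's code; produce_chunk and consume_chunk are hypothetical helpers.

#include <mpi.h>

#define N       4096
#define NCHUNKS 4
#define CHUNK   (N / NCHUNKS)

void produce_chunk(double *out, int c);       /* hypothetical computation helpers */
void consume_chunk(const double *in, int c);

void overlapped_iteration(double *in, double *out, int left, int right)
{
    MPI_Request rreq[NCHUNKS], sreq[NCHUNKS];

    /* post all partial receives up front (double buffering would use a second `in`) */
    for (int c = 0; c < NCHUNKS; c++)
        MPI_Irecv(in + c * CHUNK, CHUNK, MPI_DOUBLE, left, c,
                  MPI_COMM_WORLD, &rreq[c]);

    for (int c = 0; c < NCHUNKS; c++) {
        /* postponed reception: wait only for the chunk about to be consumed */
        MPI_Wait(&rreq[c], MPI_STATUS_IGNORE);
        consume_chunk(in, c);

        /* advanced send: push each output chunk as soon as it is produced */
        produce_chunk(out, c);
        MPI_Isend(out + c * CHUNK, CHUNK, MPI_DOUBLE, right, c,
                  MPI_COMM_WORLD, &sreq[c]);
    }

    MPI_Waitall(NCHUNKS, sreq, MPI_STATUSES_IGNORE);
}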

Simulation results

The results of our simulation show and compare the execution of the same program with and without automatic overlap applied. The simulation of the overlapped execution is made assuming that each original message is partitioned into four chunks. Both Sancho's work [76] and our study show that finer-grain chunking brings no significant improvement compared to the granularity used in our evaluation. Furthermore, in order to study the influence of the computation pattern on the potential of overlap, we run two different simulations of the overlapped execution. The first takes into account the real production/consumption patterns of the code, while the second assumes the ideal patterns. The reconfigurability of the Dimemas simulator is used to change the parameters of the targeted system and to examine the influence of various network properties. All the results are obtained for runs on 64 processors. The Dimemas configuration for a homogeneous distributed multiprocessor is selected for the target parallel machine.

[Timelines over time of the non-overlapped and the overlapped execution; the legend distinguishes computation, MPI_Wait, and the synchronization from the MPI_Send to the MPI_Wait.]

Figure 5.13: Paraver visualization for the non-overlapped and overlapped executions of NAS-CG.

Every processor uses one input and one output port to connect to the network, which is modeled as an unlimited number of buses. That means that the network can support an unlimited number of simultaneous transfers, and the only limitation comes from the processors' injection rate into the network.

Figure 5.14 shows the overlapping speedup for both real and ideal computation patterns. Due to the different values of speedup in the applications, the plots for Sweep3D and SPECFEM3D are made on a different scale. Figure 5.14 shows that the overlap provides a speedup for a wide range of bandwidths. At high bandwidths, automatic overlap never causes performance degradation. On the other hand, Figure 5.14c demonstrates that in the range of very low bandwidths, automatic overlap can cause a slowdown. Also, Figures 5.14f and 5.14a show that the idealized patterns provide much higher overlap than the realistic ones. The following sections further discuss the results and reveal more about the mechanism of overlapping communication and computation.

[Six panels, (a)–(f).]

Figure 5.14: Speedup of overlapped execution over original execution

Sources of the potential overlap

The presented plots show three characteristic regions of system’s bandwidth in which different sources of the potential overlap dominate the performance. For extremely

high bandwidths, greater than one hundred times the bandwidth needed for the peak overlap speedup, the time spent in computation dominates the total execution time. In these cases, automatic overlap establishes finer-grain communication and dependencies among processors, thus achieving a higher degree of overlap of computations. Advancing sends and postponing receives allows the destination processor to start before the source processor finishes its computation and sends the complete message, therefore allowing concurrency between these two computation phases of different processors. This source of the potential overlap is especially pronounced in applications with microscopic imbalance of computation. In these applications, finer-grain transfers and dependencies lead to a significant hiding of the communication delays caused by the imbalance at the level of one iteration.

In the range of low bandwidths, more than one hundred times lower than the bandwidth needed for the peak overlap speedup, the execution time is dominated by the time spent in communication. The mechanism of automatic overlap breaks all the messages into chunks, thus allowing a higher concurrency of the created fine-grain partial transfers. By postponing receives, the destination processor can start its iteration before the complete message is received. Moreover, the receiver can start advancing its partial sends, therefore creating concurrency among the sending and receiving transfers. This source of the potential overlap is especially pronounced in applications in which, at the level of one iteration, processes communicate different amounts of data, while at the level of the whole application the amount of communicated data is similar for all of the processes. We will refer to this type of code as codes with microscopic imbalance of communication.

For intermediate bandwidths, in which the time spent in message transfers is comparable to the time spent in computation, the dominant source of the potential overlap lies in hiding the in-flight time of transfers. Automatic overlap increases the time gap between the initiation of a transfer and the wait for its completion, thus relaxing the time constraints for the transfer's completion. Therefore, automatic overlap increases the portion of the computation that can be used to overlap and hide communication stalls. Since the mechanism is based on hiding the transfer time by literally overlapping it with computation, it provides a maximal speedup of two. This source of overlap is present in all types of codes and it is especially pronounced in the case of balanced execution.

Figure 5.14 shows that all the applications experience the peak value of the speedup in the middle range of bandwidths, since that source of the overlap is present in all applications. In addition, due to its pipeline behavior, Sweep3D shows a significant speedup in both the high and the low range of bandwidths (Figure 5.14f). This is because a pipeline application can be considered as an execution that exhibits microscopic imbalance of both computation and communication.

The proposed technique of automatic overlap is based on the mechanism for advancing sends and postponing receptions of the independent chunks. A linear communication model would guarantee that an advanced send arrives earlier at the destination, thus providing better performance of the overlapped execution for any network bandwidth. However, due to limited network resources, the simulation allows the appearance of non-linear effects, such as network contention. At extremely high network bandwidths, network contention is very unlikely to happen, thus practically eliminating the non-linear effects from the Dimemas communication model. Consequently, in these ranges, the overlapped execution always performs better than the original execution, as evidenced by Figure 5.14. However, for low bandwidths, non-linear effects are more likely to occur. The partial sends cause a different serialization of the messages in the shared network, no longer guaranteeing that a chunk sent in advance will arrive earlier at its destination. This behavior can cause performance degradation, as evidenced in Figures 5.14a and 5.14c. Fortunately, this behavior is significant only in the range of very low network bandwidths. In state-of-the-art networks, the bandwidth is high enough to mitigate this effect.

Reduction of bandwidth requirements

Figure 5.15 compares the absolute execution time for the non-overlapped and the two overlapped runs. The results for SPECFEM (Figure 5.15a) show that the overlapped execution is much more tolerant to bandwidth reduction. Thus, going from the range of very high to the range of low bandwidths, the non-overlapped execution loses performance much faster than the overlapped execution. This delayed drop of performance is the source of the peak speedup that appears in the middle range of bandwidths, as evidenced by Figure 5.14e. The results also show that sometimes the non-overlapped execution at extremely high bandwidths cannot reach the performance of the overlapped run at moderate bandwidths.

[(a) SPECFEM; (b) Sweep3D]

Figure 5.15: Execution time for original and overlapped (real and ideal patterns) execution

For example, in the case of Sweep3D (Figure 5.15b), the performance of the overlapped execution with ideal patterns at a bandwidth of 5 MB/s cannot be achieved by the non-overlapped execution even with infinite bandwidth.

Figure 5.16 shows the factor of bandwidth reduction that would still allow the overlapped execution to have the same performance as the original execution that uses the full bandwidth. The plot is obtained by measuring the bandwidths at which the original and the overlapped execution deliver the same performance. Then, the x axis marks the bandwidth of the original execution, and the y axis shows how many times lower a bandwidth can be used with the overlapped execution to achieve the same performance. For example, the performance of the non-overlapped Sweep3D at 1GB/s could be achieved by the overlapped execution with a bandwidth that is 11.7 times lower when the real patterns of production and consumption are considered. On the other hand, the overlapped execution needs a 142 times lower bandwidth if the ideal patterns are assumed. The value marked on the y axis is referred to as the coefficient of tolerable bandwidth reduction, because it describes how much the bandwidth can be reduced when applying automatic overlap while preserving the performance of the original execution. Overall, the overlapped run requires a many times lower bandwidth to achieve the same performance as the original run on state-of-the-art bandwidths. Also, the constantly increasing plot in all of the figures shows that in the range of high bandwidths the overlapped run always has better performance than the original run.
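Restating the definition above in symbols (our own notation, introduced only for this text): if Perf(B) denotes the performance of an execution at bandwidth B, the coefficient plotted on the y axis is

\[
  C_{\mathrm{tol}}(B) \;=\; \frac{B}{B'}, \qquad
  B' = \min\{\, b : \mathrm{Perf}_{\mathrm{overlap}}(b) \ge \mathrm{Perf}_{\mathrm{original}}(B) \,\}.
\]

With this notation, the Sweep3D example above reads C_tol(1GB/s) = 11.7 for the real patterns and 142 for the ideal ones.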

[Six panels, (a)–(f).]

Figure 5.16: Factor of bandwidth reduction for which the overlapped execution maintains the performance of the original execution on full bandwidth

Correlation of the potential overlap with the patterns of production and consumption

Figure 5.14 shows that there is a significant performance difference between the runs that have real computation patterns and the runs that assume the ideal patterns. In order to study these differences and to determine the influence of the computation patterns on the speedup induced by overlap, we revisit Table 5.1.

Computation patterns determine the application's potential for overlap. As mentioned, NAS-BT has a very late production of chunks in all the computation bursts and a very early consumption in 63% of the bursts. Thus, the real patterns are unfavorable for overlap, providing a much lower speedup than in the case of the ideal patterns. On the other hand, POP has no potential for advancing sends in 75% of the bursts. However, in all of the consumption intervals POP has at least 3% of the computation phase before any part of the message is consumed, thus giving a maximum speedup of about 3% for the real patterns and a higher speedup when the patterns are idealized.

However, profitable patterns cannot guarantee a high potential overlap. NAS-CG has profitable patterns, but still the speedup obtained by overlap is quite low, for both the real and the assumed ideal patterns. This is because the application contains computation bursts of very different lengths, causing the potential for advancing sends and postponing receives in extremely short bursts to be very low in terms of absolute time. Consequently, in these short bursts, the time constraints for the transfers' completion are kept fixed, thereby presenting strong synchronization points and causing a lot of communication delays. Also, it is interesting to note that these short computation intervals make the code sensitive to changes in the serialization of chunks, as evidenced by the "jagged" plot in Figure 5.14b.

Sometimes, the speedup achieved by automatic overlap is attributed to the changed ordering of messages. In Alya, the speedup in the low range of bandwidths comes from breaking collectives into point-to-point communications. Also, a small peak in the middle range of bandwidths is due to a small overlap of computation and transfer time. Furthermore, the overlap in SPECFEM provides a higher speedup than what can be predicted considering the data from Table 5.1. This is because the processes that send more messages in one computation burst have an earlier production of chunks than the processes that send fewer messages, thus favoring processes with more intensive communication. For the idealized patterns, due to the significantly faster production and slower consumption of chunks, the speedup is much higher.

[Two panels, (a) and (b).]

Figure 5.17: Benefit of only advancing sends or only postponing receives

Figure 5.14f also illustrates the effect of the code restructuring for pattern linearization applied to Sweep3D. The real patterns provide little potential for advancing sends and postponing receives, thus offering a very low overlapping speedup. On the other hand, the ideal patterns provide a high speedup in all ranges of bandwidth. Figure 5.14f also shows the speedup achieved after the code restructuring for pattern linearization. The obtained "quasi-linear" patterns provide almost as much potential for the overlap as the ideal patterns.

Figure 5.17 shows the potential benefits of overlap if only advancing sends or only postponing receives is used. Figure 5.17a shows that applying only one of the mechanisms drastically decreases the overlapping potential. Furthermore, as explained above, the sources of overlap in the ranges of low and high bandwidths achieve a speedup only when the technique both advances sends and postpones receives. Since this is not the case here, the obtained speedup in these ranges is very low, as evidenced by Figure 5.17b.

MareNostrum case study

This Section estimates the potential benefits of overlap in scientific applications executed on the MareNostrum parallel platform. For that purpose, the target machine configuration in Dimemas is set to faithfully model MareNostrum. Table 5.2 shows the number

Note: In Figure (c), the equivalent bandwidth improvement for Sweep3D for both real and ideal patterns tends to infinity, and therefore it is not shown.

Figure 5.18: Simulation of the overlapped executions on the real and ideal production/consumption patterns.

of buses used in our experiments for each application. The number of buses is determined empirically, by comparing the Dimemas simulation with the real execution on the MareNostrum parallel computer.

Table 5.2: Number of network buses used in Dimemas for each application.

           Sweep3D  POP  Alya  SPECFEM3D  BT  CG
  buses    12       12   11    8          22  6

Overlap provides a small speedup for the real patterns and a decent speedup for the ideal patterns (Figure 5.18(a)). The unfavorable production/consumption patterns seriously limit the applications' overlapping potential. As anticipated from Table 5.1, the real patterns allow a speedup only in the case of NAS-CG. On the other hand, for the modeled ideal patterns, some applications achieve a significant speedup. The highest speedup is reached for Sweep3D, due to the wavefront behavior of the application. For this type of application (concurrent, pipeline), the chunking mechanism of overlap causes finer-grain dependencies among processes and potentially increases parallelism

among the processes.

The biggest benefit of overlap is that it allows the network bandwidth to be significantly relaxed without a consequent degradation of performance. Figure 5.18(b) shows that, in order to achieve the performance of the non-overlapped execution at 250MB/s, the overlapped execution needs much less bandwidth. Again, Sweep3D benefits from overlap the most and allows the network bandwidth to be reduced to 11.75MB/s while maintaining the performance of the original execution. Relaxation of the network bandwidth is very important because it means that, in order to achieve the performance of the original execution on a state-of-the-art network, the overlapped execution requires a much cheaper network.

Finally, the benefits achieved by applying automatic overlap sometimes cannot be reached by simply increasing the network bandwidth. Figure 5.18(c) shows the bandwidth required by the non-overlapped execution in order to achieve the performance of the overlapped execution at 250MB/s. In other words, it presents the overlap's equivalent in increased network bandwidth. The result for Sweep3D shows that for some applications the performance of the overlapped execution cannot be achieved with the non-overlapped execution at any bandwidth. Also, it is interesting to note that overlap brings little speedup in SPECFEM3D (Figure 5.18(a)), but the benefits achieved by overlap are equivalent to the benefits that could be achieved by increasing the network bandwidth almost four times.

5.4.4 Conclusions and future research directions

Our study confirms that overlapping communication and computation is a very promising solution for achieving more efficient communication. We show that overlap brings significant execution improvements, especially in the case of favorable production/consumption patterns. We showed that real scientific applications have diverse computation patterns that are often unfavorable for overlap. We confirm that for favorable patterns, overlap achieves two benefits: execution speedup and relaxation of bandwidth without a consequent degradation of performance. Moreover, we show that in applications with micro-imbalance of computation, the benefits achieved by overlap cannot be reached by simply increasing the network bandwidth.

Our study can be useful for researchers in the field to better understand the potential and the mechanism of overlap.

We designed a simulation framework that can simulate an application's overlapped execution automatically, without the need to know or change the application's legacy code. Compared to the previous studies in the field, our framework accounts for a wider range of application properties and allows studying overlap on diverse network configurations. Finally, the framework provides a useful visualization of the simulated time behaviors, so we can qualitatively compare the non-overlapped and the overlapped execution.

Also, our environment can be very useful for a programmer who intends to increase the overlap in his application. The programmer can use our environment to anticipate the potential impact of the overlapping technique that he wants to implement. Furthermore, the programmer can visually inspect the potential overlapped execution and conclude how to customize the planned implementation in order to increase the overlap.

The results of this study showed us that the overlap at the level of MPI calls is very limited by the application's production/consumption patterns. Unfavorable computation patterns decrease the potential for advancing sends and postponing receives, thus limiting the potential for automatic overlap. We also showed that code refactoring can rearrange these patterns in order to increase the potential overlap. However, this code refactoring requires a deep understanding of the targeted code, so applying it to many applications is infeasible. Therefore, we believe that it is of major importance to find a way to rearrange computation patterns without code refactoring. The following Chapter explores the OmpSs programming model. OmpSs introduces dynamic task scheduling that can rearrange computation patterns into ones that are more profitable for overlap.

6 Task-based dataflow parallelism

Our research on automatic overlap showed that there is a need for a mechanism to change the internal computation pattern by which each MPI process locally computes on the data that is involved in communication. We also illustrated that manual code restructuring can obtain patterns that are more favorable for overlap. However, this manual code restructuring requires significant programming effort. Thus, we want to explore the potential of hybrid programming models that integrate MPI with some shared memory parallel programming model. The shared memory programming model should rearrange the execution inside each MPI process and dynamically obtain computation patterns more suitable for overlap.

Our choice is to explore the integration of MPI and the OmpSs programming model, a programming model developed at BSC. OmpSs is a dataflow task-based programming model that can execute tasks out of order, as long as data dependencies are satisfied. In addition, OmpSs integrates with MPI into a hybrid MPI/OmpSs programming model in which the workload is parallelized across distributed address spaces using MPI, while the workload of each MPI process is parallelized using OmpSs. This integration leads

to higher performance, as well as to an execution more tolerant to network contention and OS noise [63]. A more detailed description of the MPI/OmpSs programming model can be found in Sections 2.2.3 and 2.2.4. Furthermore, the potential for overlap in MPI/OmpSs execution is further explained in Section 3.3.1.

In this Chapter, we explore two approaches for tuning MPI/OmpSs parallel execution – first by optimizing some sections of the code, and second by changing the task decomposition of the code. The rest of this Chapter is organized as follows. In Section 6.1, we explore the mechanisms to find the critical code sections in MPI/OmpSs applications. Furthermore, Section 6.2 describes how, using Tareador, a programmer can iteratively explore the potential task decompositions for a given application. Finally, in Section 6.3, we present the autonomous driver that iteratively runs Tareador to find the optimal task decomposition of an application.

6.1 Identifying critical code sections in dataflow parallel execution

Task-based dataflow programming models have proved to be very powerful in exposing a high level of parallelism. Dataflow can extract very irregular parallelism. Also, dataflow introduces significant asynchronism in the execution, providing performance that is more tolerant to external contention and OS noise [63]. Due to all these potential benefits, dataflow became the parallelization strategy for various programming models [69, 40, 53], as well as for application-specific backends [83].

However, due to the irregularity of dataflow parallelism, it is very hard to anticipate and control the parallelism exposed in the application. By varying the granularity of execution or the task decomposition of the code, the programmer generates a different number of tasks, thus changing the potential parallelism of the execution [84]. If there are too few tasks, the application may expose insufficient parallelism to keep the target machine utilized. However, if there are too many tasks, the runtime overhead may seriously harm the performance [63]. Thus, to optimize the performance, the programmer must control the parallelism released in the application and match it to the parallelism offered by the target machine.

Moreover, to fine-tune the execution of his application, the programmer may try

to optimize some section of the code, either by rewriting the code of that section or by executing that section on an accelerator. Years of practice in optimizing applications point out that the major issue is focus – identifying the code section whose optimization would yield the highest overall application speedup. In other words, prior to any optimization effort, the programmer must identify the critical section – the code section whose optimization would bring the highest reduction of the total execution time.

To address this issue, we designed an environment that automatically estimates the potential parallelism of an application and, furthermore, identifies critical code sections. The programmer can use this environment to estimate the potential optimization speedup. This is very important, because it allows the programmer to anticipate the benefits of the intended optimization and decide in advance whether the optimization is worth the effort. We show that in many applications the choice of the critical section of the code decisively depends on the configuration of the target machine. For instance, in HP Linpack, optimizing a task that takes 0.49% of the total computation time yields an overall speedup of less than 0.25% on one machine, and at the same time yields an overall speedup of more than 24% on a different machine.

6.1.1 Motivation

In this Section, we present a simple OmpSs application (Figure 6.1) and explore the potential benefits of optimizing different sections of the code. There are three functions (comp1–comp3) of equal duration. These functions are encapsulated into tasks. The code consists of one loop that calls each of the functions in every iteration. Depending on the loop index, each function selects on which buffers to compute. The only difference among the functions is the step (step1–step3) based on which they determine which buffers to use.

Although the application is simple, the resulting parallelism is very irregular. The task dependency graph (Figure 6.2) has sections of low concurrency, as well as sections of high concurrency (for example, tasks number 7, 8, 9, 5, 10 and 15 can all execute concurrently). This irregular parallelism leads to a scalability that is hard to anticipate. Thus, when executing on 2 cores the speedup is almost ideal. On the other hand, for 3 cores the speedup is ideal, while, surprisingly, adding one more core gives no additional speedup (Table 6.1).

#pragma omp task inout(A,B)
void comp1(float *A, float *B);
#pragma omp task inout(A,B)
void comp2(float *A, float *B);
#pragma omp task inout(A,B)
void comp3(float *A, float *B);

#define buf_cnt 19
#define iterations 11
#define step1 9
#define step2 10
#define step3 11
#define buffer_length 400

int main () {
  static float a[buf_cnt][buffer_length];

  for (int i = 0; i < iterations; i++) {
    /* The loop body is truncated in the source listing: each iteration calls
       comp1, comp2 and comp3, and each compN selects its two buffers from a[]
       based on the loop index and stepN. */
  }
}

Figure 6.1: Code of the motivating example.

Although the tasks in the code (Figure 6.1) appear to be of the same importance, accelerating different tasks yields very different overall speedups (Figure 6.3). Figure 6.3a shows the resulting speedup when one of the tasks is accelerated by a factor of 2. On a machine with 1 or 2 cores, all tasks are equally critical. However, on a machine with 3 or 4 cores, comp1 is significantly more critical than the other two tasks. Thus, the choice of the critical task depends on the configuration of the target machine. Moreover, if the tasks are to be accelerated by a factor of 100 (Figure 6.3b), on the machine with 4 cores, tasks comp1 and comp3 are equally critical. Thus, the choice of the critical task also depends on the potential acceleration factor. Therefore, in order to identify the critical code section, one has to take into account both the targeted machine and the acceleration factor of the possible optimization.

[Task dependency graph of the 33 task instances of the motivating example; nodes are colored by task type (comp1, comp2, comp3). The critical path of length 11 is marked in bold, and a slightly shorter path of length 10 is marked with dotted bold lines.]

Note: The numbers in the nodes annotate the order of the tasks in the execution.

Figure 6.2: Data-dependency graph of the motivating example.

Table 6.1: Speedup for different number of cores.

  cores     1      2      3      4
  speedup   1.00   1.94   3.00   3.00

6.1.2 Motivation example interpreted by the state-of-the-art techniques

A very common technique for performance analysis is tracing [55, 65, 67, 91]. Tracing environments instrument the execution and provide visualization of the obtained data. However, tracing presents the programmer with a vast amount of data, so the programmer must identify the bottlenecks on his own. In other words, tracing does not directly focus the programmer's attention on the bottleneck.

[Two panels: (a) scaling all tasks by a factor of 2; (b) scaling all tasks by a factor of 100. Each panel plots, for comp1, comp2 and comp3, the speedup relative to the nominal SMPSs execution against the number of cores per node (1 to 4).]

Note: The figure shows the resulting execution speedup when one of the tasks is speeded up. The overall speedup is calculated over the original execution – the execution in which each task has its original duration.

Figure 6.3: The resulting speedup when accelerating different tasks.

In other words, tracing does not directly focus the programmer's attention on the bottleneck. Profiling is usually more efficient, since it reduces the collected information into a short report. Traditional profilers report the percentage of computation time spent in each function. Thus, for a sequential application, gprof [46] identifies the most time-consuming code section, and that section is automatically the critical section. However, the same information derived from parallel profilers [44, 57, 75, 79, 88] cannot directly identify the critical code section. As shown in the motivation example from Section 6.1.1, all functions take the same portion of the total computation time, but not all functions are equally critical. Other approaches identify the critical code section as the code section that contributes the most to the critical path of the execution. The representatives of this technique are VTune [30] from Intel and Spartan [5] from the University of Illinois. In the motivation example, these techniques would identify only one critical path of length 11 (the bold path marked in Figure 6.2). Then, they would identify comp1 as the critical function, since it has 6 task instances in the critical path. However, these tools would overlook the path of length 10 (marked with dotted bold lines in Figure 6.2). This path is slightly shorter, but it has only 2 instances of task comp1. Thus, accelerating comp1 would reduce the first path faster than the second one, and the second path would prevail. One illustration of this behavior is that, in Figure 6.3b, accelerating functions

comp1 and comp3 often gives the same overall speedup. Tools like VTune and Spartan would fail to detect this effect. The newest approaches identify the sections of execution with low parallelism and try to find which sections of code to blame. Quartz [6] and Intel Thread Profiler [17, 29] try to quantify the potential parallelism of each section of the code. Similarly, Tallent et al. [85, 86] try to identify threads that are responsible for synchronizations that introduce stalls. Furthermore, they consider different policies of spreading the blame for low parallelization, from contexts in which spin-waiting occurs (victims), to directly blaming a lock holder (perpetrators). However, in the example from Section 6.1.1, looking at the execution on 3 cores (Table 6.1), these approaches would see no stalls, so they would not be able to identify any part of the code as more critical than another. We show that the previous approaches cannot capture some factors that proved decisive in the motivation example. First, the choice of the critical section depends on the parallel target machine on which the application executes. Second, depending on the factor of acceleration, different sections may be the most beneficial to accelerate. Finally, all the previous techniques work with a “measure-modify” approach. In this approach, the technique profiles the execution and points to the critical section. Then, the programmer optimizes that section and runs the application again, “hoping” that the optimization will bring some overall speedup. Conversely, our approach anticipates the benefits prior to any optimization effort by providing a “what if” analysis. This analysis evaluates in advance the overall speedup of the application executed on a specified parallel machine if a specified code section is accelerated by a specified optimization factor.

6.1.3 Experiments

Based on the infrastructure described in Section 4.3, we designed an automatic approach to pinpoint the critical code section of an MPI/OmpSs execution. First, we instrument an MPI/OmpSs code and derive the trace of that execution. Then, an automatic script drives the environment to make multiple simulations of the MPI/OmpSs execution, each of them with a different code section accelerated by some factor. Based on the collected results, the script identifies the code section whose acceleration yields the highest overall speedup. Finally, the script delivers to the user a single piece of information – the identification of the critical code section.
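As a rough illustration of what the driving script does, the following Python sketch (ours, not the actual script; the simulate callable is a hypothetical stand-in for the trace-replay simulator of Section 4.3) accelerates one task type at a time and keeps the one that yields the highest overall speedup:

def find_critical_section(simulate, task_types, factor=2.0):
    # simulate(accelerated=None, factor=1.0) is a hypothetical interface that
    # returns the simulated execution time of the traced run on the target
    # machine, optionally with one task type accelerated by `factor`.
    baseline = simulate()                       # nominal execution time
    best_task, best_speedup = None, 1.0
    for task in task_types:
        accelerated_time = simulate(accelerated=task, factor=factor)
        speedup = baseline / accelerated_time
        if speedup > best_speedup:
            best_task, best_speedup = task, speedup
    return best_task, best_speedup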

We partition this experiments Section into two parts: one studying OmpSs codes and the other studying MPI/OmpSs codes. For OmpSs applications we experiment with four well-known computation kernels, while for MPI/OmpSs codes we experiment with two large MPI/OmpSs applications. For each application, we identify the sections of code that should be optimized in order to increase parallelism.

OmpSs codes

In the study of OmpSs applications, we experimented with four representative kernels: Jacobi, Cholesky, LU and HM transpose. Jacobi executes on a problem size of 8192, using a block size of 128. Cholesky executes on a problem size of 8192, using a block size of 128. LU factorization executes on a problem size of 4096, with a block size of 128. Finally, HM transpose executes on a problem size of 4096, with a block size of 256. For Cholesky and HM transpose, we simulate the OmpSs execution of the first four iterations of the main loop. For Jacobi, we simulate the OmpSs execution of the first 32 iterations of the main loop, because the major source of parallelism comes from concurrency among iterations of the main loop. Finally, for LU factorization, we simulate the entire execution. All codes expose a substantial amount of parallelism that a many-core node can efficiently exploit (Figure 6.4). It is important to note that, with up to 32 cores in the target machine, all the applications achieve a speedup that is close to the ideal. However, beyond 32 cores, the level of parallelism differs across the applications. In particular,

legend: ideal, transpose, cholesky, lu, jacobi; the plot shows speedup (1-1024, logarithmic) against the number of cores (1-1024, logarithmic).

Note: Plotted speedup is the speedup of the OmpSs execution (using different numbers of cores) over the sequential execution (using always one core). The ideal speedup has value N if OmpSs executes on a machine with N cores.

Figure 6.4: Parallelism of OmpSs applications.

HM transpose, Cholesky, LU, and Jacobi achieve maximal speedups of 156.79, 65.36, 31.77, and 30.50, respectively (for 256 cores). Figure 6.5 illustrates machine utilization throughout the execution of the studied applications. For each application, we show the number of cores that are active in each moment of the execution. The plots are made for an unlimited number of cores in the target machine. Thus, these plots illustrate the application’s inherent parallelism (unbounded by the target parallel machine). Also, note that the total length of the plot represents the application’s total execution time, while the surface below the plot represents the application’s total computation time. The aforementioned Figure points to the presence of two parallelization patterns. The first pattern characterizes applications HM transpose (Figure 6.5a) and Jacobi

Each panel plots the number of active cores (y-axis) against time [ns] (x-axis): (a) HM transpose, peak of 376 cores; (b) Cholesky, peak of 186 cores; (c) LU factorization, peak of 214 cores; (d) Jacobi, peak of 63 cores.

Note: These Paraver views show the number of cores that are active in each moment of execution. All the applications execute with an unlimited number of cores on the target machine. It is interesting to note that the surface below the plot represents the application’s computation time, while the total length of the plot represents the application’s total execution time.

Figure 6.5: Number of cores active during execution.

HM transpose
task name            percentage [%]
main_task                  0.61
transpose                  2.77
transpose_exchange        96.63

Cholesky
task name            percentage [%]
main_task                  0.33
sgemm_tile                59.29
spotrf_tile                0.52
ssyrk_tile                 0.75
strsm_tile                39.12

LU factorization
task name            percentage [%]
main_task                  0.28
bdiv                       2.76
block_mpy_add             54.44
bmod                      38.29
fwd                        3.49
lu0                        0.53

Jacobi
task name            percentage [%]
main_task                  2.35
getfirstcol                3.18
getfirstrow                0.98
getlastcol                 1.48
getlastrow                 1.46
jacobi                    90.54

Note: The tables suppress functions that take less than 0.2% of the total computation time of the application.

Figure 6.6: Tasks profile.

The four panels – (a) HM transpose, (b) Cholesky, (c) LU factorization and (d) Jacobi – plot, for each task, the speedup relative to the nominal execution (1.0-2.0) against the number of cores (1-256).

Figure 6.7: Speedup when one task is speeded up by 2x (OmpSs codes).

(Figure 6.5d). In these applications, as the execution approaches the middle of the run, the number of cores that can be utilized concurrently increases. After reaching the peak concurrency, the parallelism drops until the end of the run. This way, HM transpose and Jacobi achieve peak parallelisms of 376 and 63, respectively. The other parallelization pattern characterizes the applications Cholesky (Figure 6.5b) and LU factorization (Figure 6.5c). In these applications, throughout the run, the execution passes through interleaved sections of high and low parallelism. Usually, each following section of high parallelism has a lower peak parallelism than the previous one. Thus, in Cholesky, the peak speedups of the first four iterations are 186, 123, 61 and 60, respectively, while in LU factorization, the peak speedups of the first four iterations are 214, 113, 113 and 84, respectively. Figure 6.6 shows the distribution of the computation time across different tasks. Applications HM transpose and Jacobi have one very dominant task – HM transpose spends 96.63% of its time in transpose_exchange, while Jacobi spends 90.54% of its time in jacobi. The other two applications have two dominant tasks – Cholesky spends 59.29% of its time in sgemm_tile and 39.12% in strsm_tile, while LU factorization spends 54.44% of its time in block_mpy_add and 38.29% in bmod.

(a) all tasks with the original duration; (b) transpose speeded up two times; (c) transpose_exchange speeded up two times. Each panel plots the number of active cores (peak 376) against time [ns].
Figure 6.8: HM transpose: number of active cores.

(a) all tasks with the original duration; (b) block_mpy_add speeded up two times; (c) bmod speeded up two times. Each panel plots the number of active cores (peak 214) against time [ns].
Figure 6.9: LU factorization: number of active cores.

In a sequential execution of all four applications, these dominant tasks would be the best candidates for optimization. However, in a parallel execution, the function that dominates the total computation time is not necessarily the function that is critical for the execution time. Figure 6.7 shows the speedup obtained by speeding up one of the functions by a factor of 2. In HM transpose and Jacobi, for a target machine with a low number of cores, optimizing the identified dominant function brings the highest benefits. However, as the number of cores in the target machine grows, these benefits significantly drop. Cholesky and LU show even more drastic behavior. For a machine with a small number of cores, the critical function is the most dominant one – sgemm_tile for Cholesky, and block_mpy_add for LU factorization. Nevertheless, as the target machine gets more cores, other tasks become critical. In fact, in both applications, when executing on 256 cores, accelerating the dominant function yields an overall speedup of less than 2%. The identification of the critical tasks can be related to the parallelization pattern of the application. In HM transpose and Jacobi, the dominant task is the critical one. In HM transpose, speeding up a task that takes a small portion of the computation time has no detectable effect on the execution of the application (Figure 6.8b). On the other hand, speeding up transpose_exchange by a factor of 2 (Figure 6.8c) significantly reduces both the computation and the execution time. It is interesting to note that the peak parallelism drops as well. This happens because, as transpose_exchange takes shorter time, the probability of concurrent execution of different instances of that task drops. On the other hand, in Cholesky and LU factorization, a task that takes a small share of the total computation time may become critical for the total execution time. In LU factorization (Figure 6.9a), the total computation time is 47.29s, while the total execution time is 1.55s. When speeding up block_mpy_add by a factor of 2, the computation time drops by 54.44%/2 = 27.22%, while the total execution time drops by only 1.97% (Figure 6.9b). On the other hand, when speeding up bmod by a factor of 2, the computation time drops by 38.29%/2 = 19.14%, while the total parallel execution time drops by 32.48% (Figure 6.9c). The figures show that speeding up block_mpy_add shrinks the sectors with high parallelism. On the other hand, speeding

up bmod shrinks the sectors with low parallelism, thus bringing the sectors with high parallelism closer together. Therefore, a small reduction in the total computation time can cause a significant reduction in the total execution time.
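The computation-time side of this arithmetic can be written compactly (the formula is our restatement of the percentages used above): if a task type accounts for a fraction f of the total computation time T_{comp} and its instances are accelerated by a factor s, the total computation time shrinks by

\Delta T_{comp} = f \left( 1 - \frac{1}{s} \right) T_{comp}

For s = 2 this gives the f/2 reductions quoted above (27.22% for block_mpy_add and 19.14% for bmod), while the effect on the total execution time depends on how the shortened instances sit on the critical path.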

MPI/OmpSs codes

In the study of MPI/OmpSs applications, we use the codes of HP Linpack [36] and SPECFEM-3D [22]. Linpack is one of the most famous parallel applications, used to rank parallel machines on the Top500 list [87]. We run it for a problem size of 65536 and a block size of 128. SPECFEM-3D simulates earthquakes in complex three-dimensional geological models. We run it with the configuration NSPEC = 198352 and NGLOB = 12912201. We simulate only four iterations of the main loop for each application. Both applications execute with 16 MPI processes, with each MPI process running on a separate node of the machine. Figure 6.10 shows the potential parallelism of the studied MPI/OmpSs codes. If the parallel machine has up to 8 cores per node, both applications expose enough parallelism to achieve a parallelization close to the ideal. However, for machines with more cores per node, the speedup grows until saturation. In particular, SPECFEM-3D saturates on 32, and Linpack on 64 cores per node. The resulting maximum speedups are 10.31 for SPECFEM-3D and 21.18 for Linpack. Finally, MPI/OmpSs applications especially emphasize the phenomenon where a task that takes a small portion of the computation time becomes the critical task in the execution. In SPECFEM-3D, scatter takes only 6.37% of the total computation time

legend: ideal, SPECFEM-3D, HP LINPACK; the plot shows speedup (1-1024, logarithmic) against the number of cores per node (1-1024, logarithmic).

Note: Plotted speedup is the speedup of the MPI/OmpSs execution (using different numbers of cores per node) over the MPI execution (using always one core per node). The ideal speedup has value N if MPI/OmpSs executes on a machine with N cores per node.

Figure 6.10: Parallelism of MPI/OmpSs applications.

SPECFEM-3D
task name       percentage [%]
compute_max           0.46
gather                4.15
process_elem         84.60
scatter               6.37
update_acc            1.75
update_disp           2.55

HP Linpack
task name       percentage [%]
fact                  0.49
dlaswp00N             0.34
dlaswp01N             0.63
dlaswp06N             1.29
update               96.99

Figure 6.11: Tasks profile.

The two panels – (a) SPECFEM-3D and (b) HP Linpack – plot, for each task, the speedup relative to the nominal execution (1.0-2.0) against the number of cores (1-256).

Figure 6.12: Speedup when one task is speeded up by 2x (MPI/OmpSs codes).

(Figure 6.11). Thus, speeding up scatter by a factor of 2 reduces the total computation time by 6.37%/2 = 3.14%. However, when the application executes with 64 cores per node, this small reduction of the total computation time results in an overall execution speedup of 44.16% (Figure 6.12a). HP Linpack shows even more extreme behavior. The dominant task in HP Linpack (update) takes 96.99% of the total computation time. Speeding up update by a factor of 2 reduces the computation time by 96.99%/2 = 48.50%. When HP Linpack executes with only 1 core per node, this reduction of the computation time results in an overall speedup of 95.46%. However, when HP Linpack executes with 64 cores per node, the same reduction of the computation time results in less than 10% of overall speedup. On the other hand, fact is the fourth most dominant function, taking only 0.49% of the total computation time (Figure 6.11). Speeding up fact by a factor of 2 reduces the computation time by 0.49%/2 = 0.24%. Still, when HP Linpack executes with 64 cores per node, this small reduction in computation time brings an overall application speedup of 24.54% (Figure 6.12b). This effect happens because task instances of update are highly parallelizable, while task instances of fact are extremely non-parallelizable. Although update takes a big portion of the computation time, instances of update parallelize across the available cores without throttling the execution. On the other hand, fact takes a small portion of the computation time, but instances of fact cannot execute concurrently on different cores. Furthermore, fact contains numerous MPI communications, so the execution of some instance of fact in one MPI process may also delay some other instance of fact in other

MPI processes.

6.1.4 Conclusion and future research directions

Task-based dataflow programming models potentially extract very irregular parallelism, making it hard to estimate the potential parallelism in applications. For a real application with thousands of tasks, this estimation exceeds the prediction ability of any human programmer. To tackle this problem, we designed an environment that quickly estimates the possible parallelization of an application on a parallel platform. Furthermore, the environment pinpoints the causes that limit the dataflow parallelism. These two features of the environment can significantly facilitate the development and optimization of applications based on dataflow programming models. We showed how, using the designed environment, a programmer can easily identify the critical task – a task that should be optimized in order to increase the scalability of the code. We identified that many parallel applications exhibit a phenomenon in which a task that takes a small portion of the total computation time may decisively contribute to the total parallel execution time. In particular, in the case of HP Linpack, function fact takes only 0.49% of the total computation time. However, when the application executes with 16 MPI processes, with 64 cores per MPI process, speeding up task fact by a factor of 2 results in an overall execution speedup of 24.54%. Also, we observed that this phenomenon is especially significant at large scale. Finally, it is our great hope that our environment can help a programmer understand the mechanisms of dataflow parallelism. The environment allows fast and flexible simulation, so various influences on the dataflow can be studied quickly. Moreover, the environment provides visualization support, so the programmer can qualitatively inspect the simulated dataflow execution. We believe that our environment can be of great help to all newcomers to dataflow programming models. To investigate other approaches for tuning MPI/OmpSs parallelism, in the following Sections we explore techniques for identifying the optimal task decomposition of the code.

6.2 Tareador: exploring parallelism inherent in applications

New proposals for large-scale programming models are persistently spawned, but most of these initiatives fail because they attract little interest from the community. It takes a giant leap of faith for a programmer to take an already working parallel application and port it to a novel programming model. This is especially problematic because the programmer cannot anticipate how the application would perform if it were ported to the new programming model, so he may doubt whether the porting is worth the effort. Moreover, the programmer usually lacks development tools that would make the process of porting easier. In order to port an application to a task-based programming language, the programmer must partition the sequential code into tasks and take advantage of the existing dataflow parallelism inherent in the application. However, obtaining the partitioning that achieves optimal parallelism is not trivial, because it depends on many parameters, such as the underlying data dependencies and the global problem partitioning. To assist the process of finding a partitioning that achieves high parallelism, we designed Tareador (Section 4.4) – a framework that a programmer can use to find the best strategy to expose dataflow parallelism in his application. Furthermore, we present an iterative approach that uses Tareador to find the optimal task decomposition of the code. The presented approach requires only superficial knowledge of the studied code and iteratively leads to the optimal partitioning strategy. The main objective of Tareador is to get a wider community involved with OmpSs by encouraging MPI programmers to port their applications to MPI/OmpSs. This encouragement is strongly related to assuring the programmer that he can benefit from the porting and that the porting would be easy. Therefore, our goal in this study is to illustrate how Tareador can be used to:

• Help an MPI programmer estimate how much parallelism he can achieve using MPI/OmpSs

• Help an MPI programmer find the optimal strategy to port his MPI application to MPI/OmpSs

6.2.1 Motivating example

Even if the sequential application is trivial, finding the optimal task decomposition can be a difficult job. Dataflow parallelism allows exploiting very irregular and distant parallelism – parallelism among sections of code that are far from each other. This type of parallelism is very hard for the programmer to identify and expose without any development support. The programmer must know the application in depth in order to identify all the data dependencies among tasks. Furthermore, even knowing all dependencies, the programmer must anticipate how all the tasks will execute in parallel and what parallelism these tasks can achieve. Figure 6.13 shows a simple sequential application composed of four computational parts (A, B, C and D), the data dependencies among those parts, and some of the possible taskification strategies. Although the application is very simple, it allows many possible decompositions that expose different amounts of parallelism. T0 puts all the code in one task and, in fact, represents the sequential code. T1 and T2 both break the application into two tasks but fail to expose any parallelism. On the other hand, T3 and T4 both break the application into 3 tasks, but while T3 achieves no parallelism, T4 exposes parallelism between C and D. Finally, T5 breaks the application into 4 tasks but achieves the same amount of parallelism as T4. Considering that increasing the number of tasks increases the runtime overhead, one can conclude that the optimal taskification is T4, because it gives the highest speedup with the lowest cost of the increased number of tasks. Nevertheless, compared to the presented trivial example, a real-world application would be more complex in various aspects. A real application would have hundreds of thousands of task instances, causing complex and densely populated dependency graphs. The large dependency graph would allow unpredictable scheduling decisions that would exploit very distant parallelism. Also, with task instances of different

The figure shows the application’s code (four parts A, B, C and D executed in sequence), the data dependencies among the parts, and the tasks created by taskifications T0-T5, laid out over time.

Figure 6.13: Execution of different possible taskifications for a code composed of four parts.

length, evaluating the potential parallelism would be even harder. Moreover, in an MPI/OmpSs application, the total number of tasks increases by a factor of the number of MPI processes. Furthermore, taskifying MPI calls spreads inter-task dependencies across MPI processes. Finally, the MPI transfers cause communication delays that also affect the overall parallelism. Due to all this complexity, it is infeasible for a programmer to do the described analysis and estimate the potential parallelism of a certain task decomposition. We believe that it would be very useful to have an environment that quickly anticipates the potential parallelism of a particular taskification. We designed such an environment and we show how it should be used to find the optimal taskification. Furthermore, we present two case studies that use a black-box approach to explore potential task decompositions of applications.

6.2.2 Experiments

Our experiments demonstrate how the programmer can use Tareador to explore the potential parallelization of an application. We present two case studies that explore the potential parallelization strategies in the sequential code of Cholesky and the MPI code of HP Linpack. Our experiments demonstrate the top-down trial-and-error approach that requires no knowledge of the studied code and ultimately leads to exposing dataflow parallelism in the code. The approach uses the following algorithm:

1. Propose a coarse-grained taskification of the code;

2. Given the taskification, estimate the potential parallelism and obtain the visualization of the parallel execution.

3. Based on the estimation of the potential parallelism, choose a finer-grained taskification that should expose more parallelism.

4. Return to step 2.

We start from the most coarse-grain taskification (T0), which puts the whole execution into one task. Then, using T0 as the baseline decomposition, we determine the potential parallelism of Ti normalized to T0 as the speedup of Ti over T0 when both taskifications execute on an idealized machine with an unlimited number of cores.
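Written as a formula (the notation t_{exec} is ours), the parallelism of a taskification Ti normalized to T0 is

P(T_i) = \frac{t_{exec}(T_0)}{t_{exec}(T_i)}

where t_{exec}(T) denotes the simulated execution time of taskification T on the idealized machine with an unlimited number of cores.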

Due to the huge overhead introduced by Tareador, we conduct the experiments on small input sets. Tareador counts executed instructions and translates that information to get a notion of time. Thus, the time spent in each code section should be considered not as an absolute value, but rather as a relative comparison of the time spent in different code sections.

Case study 1: Using Tareador on Cholesky

Our first case study explores the potential parallelization of the sequential code of Cholesky. We used Cholesky with a problem size of 1024 and various granularities of execution (block sizes of 8, 16, 32, and 64). We are primarily interested in the potential parallelism inherent in the application. Thus, we often simulate the parallel execution on an idealized target machine – a machine with an infinite number of cores. These results represent the upper bound of achievable parallelism. Finally, we show how this inherent parallelism results in speedup when the application executes on a realistic target machine. First, the framework instruments the application to obtain the profile that guides the taskification process. Table 6.14b shows the accumulated time spent in each function of the application. This information identifies the functions that need to execute concurrently in order to achieve significant parallelism. In this case, this is sgemm_tile, because it is the most time-consuming function. On the other hand, Table 6.14c shows the average duration of each function. This information identifies which function is a good candidate to be decomposed into smaller tasks. It is important to note that decreasing BS reduces the execution time of most of the functions, thus making the granularity of the execution finer. Considering the data shown in the described tables, we start the process of exploring parallelism by iteratively refining the decomposition. Figure 6.14a illustrates the code of Cholesky and the tested taskifications. On the other hand, Figure 6.15 shows the Tareador output for the selected taskifications – the number of created tasks and the obtained parallelism. The top-down trial-and-error iterative approach starts with the baseline taskification T0 that puts the whole sequential execution into one task. The next taskification, T1, puts each iteration of the main loop in one task. This strategy gives no additional parallelism compared to T0. Furthermore, T2 breaks each iteration

of the main loop into three separate loops and the call to function spotrf_tile. This decomposition exposes very limited parallelism. T3 additionally refines the decomposition by breaking the uppermost loop into its iterations.

taskification   T0   T1   T2   T3   T4   T5   T6

for (long j = 0; j < DIM; j++) {

    for (long k = 0; k < j; k++) {
        for (long i = j+1; i < DIM; i++) {
            sgemm_tile( &A[i][k][0], &A[j][k][0], &A[i][j][0], NB);
        }
    }

    for (long i = 0; i < j; i++) {
        ssyrk_tile( A[j][i], A[j][j], NB);
    }

    spotrf_tile( A[j][j], NB);

    for (long i = j+1; i < DIM; i++) {
        strsm_tile( A[j][j], A[i][j], NB);
    }
}

(a) Potential decompositions of Cholesky.

task name     BS-64   BS-32   BS-16    BS-8
spotrf_tile    2.54    0.89    0.32    0.11
strsm_tile    18.80    9.90    5.01    2.66
sgemm_tile    59.06   76.67   86.63   90.18
ssyrk_tile    19.57   12.40    7.30    3.81
main_task      0.03    0.14    0.74    3.23

(b) Distribution of time spent in tasks (%).

task name     BS-64   BS-32   BS-16    BS-8
spotrf_tile   0.259   0.054   0.012   0.004
strsm_tile    0.547   0.081   0.014   0.003
sgemm_tile    0.859   0.134   0.024   0.005
ssyrk_tile    0.569   0.101   0.020   0.005

(c) Average function duration (ms).

Figure 6.14: Exploring potential decomposition of Cholesky code

This decomposition achieves significant parallelism, reaching a parallelism of 11.09 for a block size of BS = 8. Further refining of the taskifications passes through T4, T5 and T6, reaching the maximal parallelism of 77.20 for decomposition T6 and block size BS = 16. Refining the decomposition increases the number of generated task instances (Figure 6.15a). Also, reducing the block size further augments the number of instances. A higher number of tasks virtually always provides higher potential parallelism (Figure 6.15b). However, in decomposition T6, the transition from block size 16 to block size 8 causes a drop in parallelism. This can be explained by the very fine granularity, which increases the loop-control overhead. The overhead increases the percentage of time spent in the main task. Consequently, for a block size of 8, the execution spends 3.23% of the total computation time in the main task (Table 6.14b). Since the main task is not parallelized, Amdahl’s law limits the potential parallelism to 100/3.23 = 30.96. Therefore, in this configuration, the achieved parallelism is 29.78 (Figure 6.15b). Figure 6.16 shows the speedup and the parallel efficiency of T6 for different numbers of cores. For all tested block sizes, T6 performs similarly for machines with up to four cores (Figure 6.16a). However, on machines with more cores, different block sizes achieve different speedups. Thus, on 128 cores, block sizes of 64, 32, 16 and 8 achieve speedups of 8.08, 27.18, 77.20 and 29.78, respectively. Furthermore, Figure

taskification   BS-64   BS-32    BS-16     BS-8
T1                  8      16       32       64
T2                 32      64      128      256
T3                 52     168      592    2.208
T4                 72     272    1.056    4.126
T5                 92     376    1.520    6.112
T6                120     816    5.984   45.760

(a) Total number of tasks created.

(b) Speedup normalized to T0: the plot shows, for taskifications T1-T6 and block sizes BS-64, BS-32, BS-16 and BS-8, the speedup over T0 using an unlimited number of cores per node (the highest plotted value is 77).

Note: In Figure 6.15b, all taskifications (T0-T6) execute in OmpSs fashion on an ideal target machine. Then, the parallelism of taskification Ti normalized to taskification T0 represents the speedup of taskification Ti over taskification T0.

Figure 6.15: Number of task instances and the potential parallelism for various taskifications of Cholesky.

6.16b shows the parallel efficiency (core utilization) – the ratio between the application’s speedup achieved on some parallel machine and the number of cores in that machine. The higher the parallelism of some configuration (Figure 6.15b), the bigger the machine that this configuration can utilize efficiently. Assuming that the utilization is efficient if the parallel efficiency is higher than 75%, the block sizes of 64, 32, 16 and 8 can efficiently utilize machines of approximately 8, 28, 88 and 41 cores, respectively. Also, it is interesting to note that for 16 cores, BS-8 achieves slightly higher efficiency than BS-16. However, on machines with more cores, BS-16 becomes by far more efficient than BS-8. Also, although BS-64 achieves the maximum speedup of 8.08 (Figure 6.16a), when executing on a machine with 8 cores it achieves an efficiency of only 75%.
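In formula form (our notation), the parallel efficiency on a machine with N cores is

E(N) = \frac{S(N)}{N}

where S(N) is the speedup achieved on that machine; the threshold used above corresponds to E(N) > 0.75.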

Case study 2: Using Tareador on HP Linpack

Our second case study explores the MPI/OmpSs execution of HP Linpack on a cluster of many-core nodes. We used HPL with a problem size of 8192 and with a 2x2 (PxQ) data decomposition. Also, we test various granularities of execution by running HPL with block sizes (BS) of 32, 64, 128, and 256. Our target machine consists of four many-core nodes, with one MPI process running on each node. Each node has an infinite number of cores, while the network connecting the nodes is ideal (infinite bandwidth and zero latency).

The two panels plot, for block sizes BS-64, BS-32, BS-16 and BS-8: (a) the speedup of T6 over T0 for a limited number of cores per node (1-512), and (b) the parallel efficiency of T6 for the same range of cores.

(a) Speedup of T6 for different granularities. (b) Parallel efficiency of T6 for different granularities.

Note: Parallel efficiency denotes the ratio between the application’s speedup achieved on some parallel machine and the number of cores of that parallel machine. In fact, the metric presents the average core utilization in the whole machine.

Figure 6.16: Speedup and parallel efficiency for T6 for various number of cores.

taskification   T0   T1   T2   T3   T4   T5   T6   T7   T8   T9

void update( ... ) {
    HPL_dtrsm( ... );
    HPL_dgemm( ... );
}

main() {
    ...
    for( j = 0; j < N; j += BS ) {
        panel_init( ... );
        if (cond0) fact( ... );
        init_for_pivoting ( ... );
        for(i = k; i < P; i += BS) {
            if(cond1) HPL_dlaswp01N( ... );
            HPL_spreadN( ... );
            if(cond2) HPL_dlaswp06N( ... );
            if(cond3) HPL_dlacpy( ... );
            HPL_rollN( ... );
            HPL_dlaswp00N( ... );
            update( ... );
        }
    }
    ...
}

(a) HPL and the evaluated taskifications.

section   task name           BS-32    BS-64    BS-128   BS-256
outer     panel_init          0.0003   0.0002   0.0001   0.0000
outer     fact                0.7525   1.2071   1.8795   3.2077
outer     init_for_pivoting   0.0246   0.0487   0.0925   0.1795
inner     HPL_dlaswp01N       0.2583   0.2917   0.2906   0.2815
inner     HPL_spreadN         0.1599   0.0800   0.0378   0.0181
inner     HPL_dlaswp06N       0.1222   0.1359   0.1367   0.1274
inner     HPL_rollN           0.3267   0.1619   0.0762   0.0363
inner     HPL_dlacpy          0.0857   0.0932   0.0929   0.0912
inner     HPL_dlaswp00N       0.3706   0.4485   0.4736   0.4837
update    HPL_dtrsm           0.8269   1.6674   2.8347   5.0772
update    HPL_dgemm          97.0683  95.8614  94.0813  90.4935

(b) Distribution of total execution time spent in tasks (%).

section   task name           BS-32    BS-64    BS-128   BS-256
outer     panel_init          0.0003   0.0003   0.0003   0.0003
outer     fact                1.3468   3.7670  11.3572  37.8463
outer     init_for_pivoting   0.0221   0.0761   0.2797   1.0594
inner     HPL_dlaswp01N       0.0073   0.0289   0.1130   0.4392
inner     HPL_spreadN         0.0022   0.0040   0.0073   0.0141
inner     HPL_dlaswp06N       0.0034   0.0135   0.0532   0.1987
inner     HPL_rollN           0.0046   0.0080   0.0148   0.0283
inner     HPL_dlacpy          0.0024   0.0092   0.0361   0.1423
inner     HPL_dlaswp00N       0.0052   0.0222   0.0921   0.3773
update    HPL_dtrsm           0.0117   0.0826   0.5514   3.9605
update    HPL_dgemm           1.3677   4.7492  18.3016  70.5900

(c) Average function duration (ms).

Note: In Tables 6.17b and 6.17c, apart from the statistics for each function of the code, we present the statistics for two logical sections: outer – consisting of panel_init, fact and init_for_pivoting – and inner – consisting of HPL_dlaswp01N, HPL_spreadN, HPL_dlaswp06N, HPL_rollN, HPL_dlacpy and HPL_dlaswp00N.

Figure 6.17: Exploring task decompositions of HP Linpack.

Again, the framework instruments the application to obtain the profile that guides the taskification process. Table 6.17b shows the accumulated time spent in each function of the application. In this example, a good taskification should provide concurrency among instances of function update, because the application spends at least 95.57% of the computation time in that function. On the other hand, Table 6.17c shows the average duration of each function. This information identifies which function is a good candidate to be decomposed into smaller tasks. In this example, function panel_init is very short, so breaking it into smaller tasks makes little sense. Also, it is important to note that decreasing BS reduces the execution time of most of the functions, thus generating a finer-grained execution. We pass the HPL code through the iterative approach for exploring potential taskifications (Figure 6.17a). We adopt the baseline taskification T0 that makes only one task per MPI process. T1 puts each iteration of the outer loop in one task, but this strategy exposes no additional parallelism. Furthermore, T2 breaks down the code into section outer and separate iterations of the inner loop, still providing no improvement in speedup. T3 additionally breaks down section outer, but with no increase in speedup. Finally, T4, compared to T2, separates section inner from function update and releases a significant amount of parallelism. Namely, it achieves speedups of 6.76, 12.28, 21.48 and 32.02 for block sizes of 256, 128, 64 and 32, respectively (Figure 6.18b). Also, T4 significantly increases the number of tasks in the application to 2.128, 8.336,

taskification    BS-32     BS-64    BS-128   BS-256
T1                1024       516       260      132
T2              66.314    16.778     4.298    1.130
T3              69.898    18.570     5.194    1.578
T4             131.600    33.040     8.336    2.128
T5             135.184    34.832     9.232    2.576
T6             327.458    81.826    20.450    5.122
T7             425.382   106.216    26.502    6.614
T8             331.042    83.618    21.346    5.570
T9             428.966   108.006    27.398    7.062

(a) Total number of tasks created.

(b) Speedup normalized to T0: the plot shows, for taskifications T1-T9 and block sizes BS-32, BS-64, BS-128 and BS-256, the speedup over T0 using an unlimited number of cores per node (the highest plotted value is 38).

Note: In Figure 6.18b, all taskifications (T0-T9) execute in MPI/OmpSs fashion on an ideal target machine. Then, the speedup of taskification Ti over taskification T0 represents the parallelism of taskification Ti normalized to taskification T0.

Figure 6.18: Number of task instances and the potential parallelism of each taskification.

33.040 and 131.600 for block sizes of 256, 128, 64 and 32, respectively (Figure 6.18a). Using Paraver visualization, the programmer can visually inspect the tested taskifications and understand the nature of the exposed parallelism. Taskification T2 joins sections inner and update into one task (Figure 6.17a). Since all these tasks are mutually dependent, T2 provides no parallelism. On the other hand, T4 breaks that task into separate tasks for inner and update. The Paraver view in Figure 6.19 reveals that each task inner depends on the task inner from the previous iteration of the loop, and each task update depends on the task inner from the same iteration of the loop. Since inner is much shorter than update, all dependent instances of inner can serialize quickly, and then the independent instances of update can execute concurrently. Further decomposition of outer, inner and update contributes little to the potential speedup (Figure 6.18b). Breaking outer, for block sizes of 256, 128 and 64, causes slightly higher parallelism of T5, T8 and T9, compared to T4, T6 and T7. On the other hand, breaking inner, for a block size of 32, causes significantly higher parallelism of T6, T7, T8 and T9, compared to T4 and T5. This effect happens because, for very high concurrency of update (speedup higher than 30), the critical path of the execution moves and starts passing through section inner. In these circumstances, breaking inner significantly increases parallelism by allowing concurrency of functions HPL_dlaswp00N, HPL_dlaswp01N and HPL_dlaswp06N. Finally, breaking update, for block size 32, causes slightly higher parallelism of T9 compared to T8. Figure 6.20 shows the speedup and parallel efficiency of T9 for different numbers

legend: outer, inner, updates

Figure 6.19: Paraver visualization of the first 63 tasks and the dependencies among them (taskification T4, BS=256).

of cores per node. The results show that high parallelism in the application is useful not to achieve high speedup on a small parallel machine, but rather to efficiently utilize a large parallel machine. Figure 6.20a shows that for a machine with 4 cores per node, T9 with all block sizes achieves a speedup of around 4, with a difference between the highest and the lowest of less than 2%. However, for a machine with 32 cores per node, T9 with block sizes of 256, 128, 64 and 32 achieves speedups of 6.80, 12.34, 21.57 and 29.47, respectively. Furthermore, Figure 6.20b shows the parallel efficiency (core utilization) – the ratio between the application’s speedup achieved on some parallel machine and the number of cores in that machine. Assuming that an application efficiently utilizes a machine if the parallel efficiency is higher than 75%, the results show that T9 with block sizes of 256, 128, 64 and 32 can efficiently utilize machines of 8, 15, 26 and 47 cores per node, respectively. Therefore, to efficiently employ a many-core machine with hundreds of cores per node, HPL has to expose even more parallelism, for instance, by making a finer-grain taskification with a further reduction of the block size.

6.2.3 Conclusion and future research directions

Task-based parallel programming languages are promising in exploiting additional parallelism inherent in MPI parallel programs. However, the complexity of this type of execution impedes an MPI programmer from anticipating how much dataflow par-

The two panels plot, for block sizes BS=32, BS=64, BS=128 and BS=256: (a) the speedup of T9 over T0 for a limited number of cores per node (1-512), and (b) the parallel efficiency of T9 for the same range of cores.

Note: Parallel efficiency denotes the ratio between the application’s speedup achieved on some parallel machine and the number of cores of that parallel machine. In fact, the metric presents the overall average core utilization in the whole machine.

Figure 6.20: Speedup and parallel efficiency for T9 for various number of cores.

allelism can be extracted from the application. Moreover, it is nontrivial to determine which parts of the code should be encapsulated into tasks in order to expose the parallelism and still avoid creating unnecessary tasks that increase the runtime overhead. To address this issue, we have developed Tareador – a framework that automatically estimates the potential dataflow parallelism in applications. We show how, using Tareador, one can find an optimal taskification for any application through a trial-and-error iterative approach that requires no knowledge of the studied code. We prove the effectiveness of this approach on case studies in which we explore the taskification of Cholesky and High Performance Linpack. We show that the global partitioning significantly impacts parallel efficiency, and thus, in order to efficiently utilize a higher number of cores, a finer granularity of execution should be used. Our next goal is to explore the potential for automating the iterative approach that uses Tareador. The described iterative approach was successful, but it required significant interaction from the programmer. In the following Section, we try to formalize the programmer’s experience into a set of policies that can autonomously lead the process of finding a good task decomposition.

6.3 Automatic exploration of potential parallelism

The previous Section described the iterative top-down approach that uses Tareador to find a suitable task decomposition of a code. Using very simple annotations, the programmer proposes some task decomposition of the studied sequential code. Then, Tareador automatically identifies the potential parallelism of that decomposition. If the programmer is not satisfied with the achieved parallelism, he refines the last task decomposition and repeats the Tareador instrumentation. The programmer iteratively tests different task decompositions until finding one that provides sufficient parallelism. The described manual iterative approach strongly relies on the programmer. The programmer must have sufficient experience to drive the iterative process of finding the optimal task decomposition. A programmer that has little experience with parallel programming may make wrong decisions in the iterative process of refining task decompositions. These decisions may lead to a final decomposition that is far from optimal, resulting in a less efficient parallel execution. Also, the manual iterative approach requires frequent programmer interaction.

In this Section we go one step further and explore the possibility of automatic exploration of parallelization strategies. Our goal is to formalize good programmers’ experience into simple metrics that can autonomously drive the above iterative top-down process. We believe that automatic exploration of parallelization strategies would bring the following benefits:

1. Fast and autonomous exploration of parallelization strategies based on task decomposition, with no major programmer interaction.

2. Automatic estimation of the potential parallelism with no major understanding of the sequential code, providing the programmer with hints on how to achieve that parallelism.

3. Educating programmers about parallelism, showing them which decisions, and in which order, are taken in order to exploit the inherent parallelism in the application.

6.3.1 The search algorithm

We propose an iterative algorithm to explore different task decomposition strategies and estimate their performance. The inputs are the original unmodified sequential code and the number of cores in the target platform. With that information, the algorithm (Figure 6.21) performs the following steps:

1. Start from the most coarse-grain task decomposition, i.e. the one that considers the whole main function as a single task.

Flowchart: starting from the sequential code, choose the most coarse-grain task decomposition (whole main in one task); identify the potential parallelism of the selected decomposition; if the parallelism is sufficient, report the task decomposition that provides sufficient parallelism; otherwise, identify the parallelization bottleneck, refine the decomposition to get more parallelism, and repeat.

Figure 6.21: Algorithm for exploring possible task decompositions.

2. Perform an estimation of the potential parallelism of the current task decomposition (the speedup with respect to the sequential execution).

3. If the current task decomposition is satisfactory (heuristic 2), report it as final and finish.

4. Else, identify the parallelization bottleneck (heuristic 1) in the current task decomposition, i.e. the task that should be further decomposed into finer-grain tasks.

5. Refine the current task decomposition in order to avoid the identified bottleneck. Go to step 2.

In the following Sections, we further describe the design choices made in designing these two heuristics and the three metrics used. But before that, we must define more precise terminology. First, we need to make a clear distinction between a task type and a task instance. If function compute is encapsulated into a task, we will say that compute is a task type, or just a task. Conversely, each instantiation of compute we will call a task instance, or just an instance. Then, if we can define some metric for each task instance, we can derive a collective metric for the whole task type. Second, we will often use the term breaking a task to refer to the process of transforming one task into more fine-grain tasks. For example, Figure 6.22 illustrates the iterative task decomposition process. The process starts with the most coarse-grain decomposition (D1) in which function A is the only task. By breaking task A, we obtain decomposition D2 in which A is not a task anymore and instead its direct children (B and C) become tasks. If in the next step we break task B, since B contains no children

The figure shows the sequential code with potential tasks A, B and C, and the decompositions D1-D4 obtained by successively breaking A, then B, then C.

Figure 6.22: Iterative refinement of decompositions.

tasks, B will be serialized (i.e. B is not a task anymore and it becomes a part of the sequential execution). Similarly, the next refinement serializes task C and leads to the starting sequential code. At this point, no further refinement is possible, so the iterative process naturally stops.

Heuristic 1: which task to break

In the manual search for a satisfactory decomposition, the programmer himself decides which task is the parallelization bottleneck. Our goal is to formalize this programmer experience into simple metrics that can lead an autonomous algorithm for exploring potential task decompositions.

Metric 1: task length cost

A task type that has long instances is a potential parallelization bottleneck. Thus, based on the duration of instances, we define a metric called length cost of a task type. Length cost of some task type is proportional to the duration of the longest instance of that task. Therefore, if task i has task instances whose lengths are in the array Ti, the length cost of task i is:

l_i = \max(t), \quad t \in T_i    (6.1)

Furthermore, we define a normalized length cost of task i as:

l_i(p) = \frac{(l_i)^p}{\sum_{j=1}^{N} (l_j)^p}    (6.2)

where the control parameter p is used to tune the weight of this metric in the overall cost function.

Metric 2: task dependency cost

A task type that causes many data-dependencies is another potential parallelization bottleneck. Thus, based on the number of dependencies, we define a metric called dependency cost of a task type. Dependency cost of some task is proportional to the maximal number of dependencies caused by some instance of that task. Therefore, if

task i has instances whose numbers of dependencies are in the array Di, the dependency cost of task i is:

d_i = \max(z), \quad z \in D_i    (6.3)

Furthermore, using a control parameter p, we define the normalized dependency cost of task i as:

d_i(p) = \frac{(d_i)^p}{\sum_{j=1}^{N} (d_j)^p}    (6.4)

Metric 3: task concurrency cost

A task type that has low concurrency is another potential parallelization bottleneck. Concurrency of some instance is determined by the number of other instances that execute in parallel with that instance. Thus, we define the concurrency cost of some task to be inversely proportional to the number of instances that run concurrently with that task (number of instances that execute on all cores). Therefore, if task i has task instances which run for time T_{i,j} while there are j instances running concurrently, the concurrency cost of task i is:

c_i = \sum_{j} \frac{T_{i,j}}{j}    (6.5)

Again, using a control parameter p, we define the normalized concurrency cost of task i as:

c_i(p) = \frac{(c_i)^p}{\sum_{j=1}^{N} (c_j)^p}    (6.6)

Overall cost

The cost function for task type i is defined as the sum of the three previous normalized metrics

l_i(p_1) + d_i(p_2) + c_i(p_3)    (6.7)

with different weights p_1, p_2 and p_3 for each metric.

Control parameter p

In all the defined metrics, the normalized cost is calculated using a control parameter p. For each metric separately, the sum of the normalized costs across all tasks is equal to 1. The parameter p additionally controls the mutual distance of the normalized costs of different tasks. For instance, let us assume that the application consists of tasks A and B and that A is two times longer than B. If the control parameter p is equal to 1, task A has a length cost of 0.67, while task B has a length cost of 0.33. However, if the selected control parameter p is equal to 2, task A has a length cost of 0.8, while task B has a length cost of 0.2. Therefore, the control parameter of some metric determines the impact of that metric on the overall cost. For instance, if we select the control parameter for the length cost to be 0, all tasks will have the same normalized length cost, independent of the duration of the instances of these tasks. Thus, the length of task instances would have no impact on the overall cost. On the other hand, if we select the control parameter for the length cost to be infinite, the task with the longest instance will have a normalized length cost of 1, while all other tasks will have a normalized length cost of 0. In this case, the length of task instances would have a huge impact on the overall cost.
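As an illustration, the following Python sketch (ours, not part of the actual tool chain) implements the normalization of Equations 6.2, 6.4 and 6.6 and the overall cost of Equation 6.7, and reproduces the two-task example above:

def normalized_costs(raw, p):
    # raw: list of per-task raw costs (l_i, d_i or c_i); p: control parameter.
    powered = [r ** p for r in raw]
    total = sum(powered)
    return [x / total for x in powered]

def overall_costs(lengths, deps, concurrencies, p1, p2, p3):
    # Overall cost of each task type: sum of the three normalized metrics.
    return [l + d + c for l, d, c in zip(normalized_costs(lengths, p1),
                                         normalized_costs(deps, p2),
                                         normalized_costs(concurrencies, p3))]

# Two tasks, A twice as long as B:
print(normalized_costs([2.0, 1.0], p=1))   # [0.666..., 0.333...]
print(normalized_costs([2.0, 1.0], p=2))   # [0.8, 0.2]

With p = 0 every task gets the same normalized cost, and as p grows toward infinity the task with the largest raw cost approaches a normalized cost of 1, which matches the two limiting cases described above.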

6.3.2 Heuristic 2: When to stop refining the decomposition

At this stage of our study, the goal of the automatic search is not to find the optimal decomposition, but rather to explore the broad space of possible decompositions. Claiming that some decomposition is optimal is dubious in itself, because there is no straightforward and complete way to compare decompositions. For instance, although some decomposition provides the highest parallelism on one machine, it does not necessarily provide the highest parallelism on a different machine. Thus, the goal of our algorithm is to explore various decompositions and find some that provide satisfactory parallelism. As already mentioned, the iterative refinement of decompositions naturally ends in the starting sequential code with no tasks. Having made the complete cycle, the optimal decomposition may be found by comparing all the tested decompositions.

6.3.3 Working environment

The figure shows the tool chain: the CODE TRANSLATOR marks all potential tasks in the legacy sequential code; the FILTER selects one decomposition by realizing a subset of all potential tasks in the INPUT CODE; TAREADOR estimates the parallelism of the specified decomposition; PARAMEDIR derives metrics to evaluate the parallelism; the FILTER UPDATE processes the metrics to select a new decomposition; the DRIVER FOR AUTOMATIC DECOMPOSITION orchestrates the loop.

Figure 6.23: The environment that automatically explores possible task decompositions.

The environment for exploring potential decompositions consists of the Mercurium code translator, Tareador, Paramedir and the search Driver. The Mercurium code translator marks in the code all the potential tasks – code sections that are suitable to become tasks. The Driver sets a filter that decides which of the potential tasks should result in real OmpSs tasks, thus specifying one task decomposition of the code. Tareador evaluates the potential parallelism of the specified task decomposition, generating a Paraver trace of the potential parallel execution. Paramedir processes the obtained Paraver trace to derive metrics that describe the simulated parallel execution. Based on the derived metrics, the Driver decides how to refine the task decomposition, updates the filter and starts a new Tareador instrumentation. The following paragraphs further explain the role of each tool in the environment.

Code translation

The code translator processes a legacy sequential code and marks all potential tasks – all code sections that may represent a task (Figure 6.24). First, the code translation marks the start and the end of the main function with calls to functions tareador_start_main

and tareador_end_main. Furthermore, each function call is wrapped with calls to tareador_start_function and tareador_end_function. Finally, each loop in the original code is wrapped with calls to tareador_start_loop and tareador_end_loop. This way, the translation marks each code section that potentially takes significant time as a potential task in the parallel execution.

Paramedir

Paramedir is a non-graphical user interface to the Paraver analysis module. Paramedir accepts the same trace and configuration files as Paraver. This way, the same information can be captured in both systems. Still, while Paraver processes the input and visualizes the results in a graphical user interface, Paramedir reports the same results in textual format.

The figure shows the original sequential code (left) next to the translated code (right); in the translated version, for example, the loop starting at line 8 of main.c is preceded by a call to tareador_start_loop("main.c:8").

Figure 6.24: Code translation for automatic task decomposition.

Avoiding the graphical user interface makes it possible to translate the detailed human-driven analysis into rules suitable for processing by an expert system.

Driver

The Driver is a Python script that integrates all the previously described tools in a common environment. It receives as input the binary code and a list of the tasks that compose the current task decomposition. Initially, the list contains just the main function of the program. The Driver automates the process of exploring potential decompositions by guiding the environment through the following steps (a sketch of this loop is given after the list):

1. Run Tareador to estimate the potential parallelism of the current decomposition.

2. If the current decomposition is satisfactory (Heuristic 2), finish the process and report the current decomposition as the final one.

3. Else, run Paramedir to derive the metrics and identify the bottleneck task (Heuristic 1).

4. Update the list of potential tasks in the Filter Update, by breaking the bottleneck task into its children tasks, if any.

5. Go to step 1.
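The following minimal Python sketch outlines this loop. The callable arguments (run_tareador, is_satisfactory, run_paramedir, pick_bottleneck, children_of) are placeholders for the actual interfaces to the tools and heuristics, not their real APIs.

def drive(binary, initial_tasks, run_tareador, is_satisfactory,
          run_paramedir, pick_bottleneck, children_of):
    # Iteratively refine the task decomposition, starting from the initial list
    # (which initially contains only the main function).
    decomposition = list(initial_tasks)
    while True:
        trace = run_tareador(binary, decomposition)    # 1. estimate potential parallelism
        if is_satisfactory(trace):                     # 2. Heuristic 2: stop refining
            return decomposition
        metrics = run_paramedir(trace)                 # 3. derive metrics from the trace
        bottleneck = pick_bottleneck(metrics)          #    and find the bottleneck (Heuristic 1)
        decomposition.remove(bottleneck)               # 4. break the bottleneck task into
        decomposition.extend(children_of(bottleneck))  #    its children tasks, if any
        # 5. repeat from step 1 with the updated decomposition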

6.3.4 Experiments

Our experiments explore possible parallelization strategies for four well-known computation kernels (Jacobi, HM transpose, Cholesky and LU factorization). We start from the OmpSs parallel versions and remove all parallelization directives to generate a sequential version. The reference OmpSs task decomposition is used for comparison with the task decompositions found by the proposed environment. We select a homogeneous multicore processor as the target platform. The goal of our experiment is to show that the proposed search algorithm, metrics and heuristics can find decompositions that provide sufficient parallelism. In our experiments, we calculate the cost function as the sum of the duration, dependency and concurrency costs (Section 6.3.1), tuning the control parameters to increase the weight of the concurrency metric.

More specifically, we determine the cost of task type i as:

t_i = l_i(1) + d_i(1) + c_i(3)    (6.8)

The selected control parameters are not the result of an extensive parametric study. Rather, our initial experiments showed that the concurrency criterion prevails very rarely, so we increased the concurrency control parameter as a positive discrimination measure. For each application, we present four plots that illustrate the process of automatic task decomposition. The first plot presents the parallelism of all tested decompositions – the speedup over the sequential execution of the application. The second plot shows the number of task instances generated by each decomposition. The first two plots also show the parallelism and the number of instances for the default decomposition of the original OmpSs code. The third plot presents the cost distribution for the bottleneck task of each iteration. Finally, the fourth plot shows the most dominant cost for the bottleneck task.
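As an illustration of how Equation 6.8 is evaluated (a sketch based on our reading of the cost function; the raw metric values below are invented), each of the three metrics is normalized with its own control parameter and the per-task sums identify the bottleneck task:

def normalized_costs(raw, p):
    # Normalize raw per-task values so they sum to 1; p controls their mutual distance.
    weighted = {t: v ** p for t, v in raw.items()}
    total = sum(weighted.values())
    return {t: w / total for t, w in weighted.items()}

def overall_cost(length, dependency, concurrency):
    # t_i = l_i(1) + d_i(1) + c_i(3), as in Equation 6.8.
    l = normalized_costs(length, p=1)
    d = normalized_costs(dependency, p=1)
    c = normalized_costs(concurrency, p=3)
    return {t: l[t] + d[t] + c[t] for t in length}

# Invented raw metric values for three task types A, B and C.
length      = {"A": 40.0, "B": 10.0, "C": 5.0}
dependency  = {"A": 2.0,  "B": 6.0,  "C": 1.0}
concurrency = {"A": 1.0,  "B": 2.0,  "C": 8.0}

costs = overall_cost(length, dependency, concurrency)
bottleneck = max(costs, key=costs.get)   # the task selected for refinement (Heuristic 1)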

Results

The proposed search algorithm finds decompositions with very high parallelism, often finding the reference decomposition from the original OmpSs code. The algorithm finds the reference decomposition for Jacobi and HM transpose in iterations 4 and 5, respectively (Figures 6.25 and 6.26). In these two applications, the algorithm bases its decisions strictly on the duration criterion. The algorithm also finds the reference decomposition for Cholesky in iteration 7 (Figure 6.27). However, in order to reach this decomposition, the algorithm refines decompositions based on the concurrency metric in iterations 3 and 5. In all three applications, after finding the reference decomposition, the algorithm passes through finer-grain decompositions that provide significantly lower parallelism. Sparse LU (Figure 6.28), as the most complex of the studied applications, demonstrates the power of our search. Compared to the previous codes, Sparse LU forces the algorithm to use various bottleneck criteria throughout the exploration of decompositions. It is interesting to note that the search finds a wide range of decompositions (iterations 18 − 36) that provide higher parallelism than the default decomposition.

Figure 6.25: Jacobi on 4 cores. Plots per iteration: parallelization speedup (automatic vs. default decomposition), number of task instances, cost distribution for the critical task (length, dependencies, concurrency), and dominant cost for the critical task.

In this case, it is unclear which of these decompositions is the optimal one. Quantitative reasoning suggests that the optimal task decomposition is the one that provides the highest parallelism with the lowest number of created task instances. Following this reasoning, the optimal decomposition (iteration 22) achieves a speedup of 3.98 at the cost of 301 instantiated tasks (note the sudden drop in the number of task instances).

Figure 6.26: HM transpose on 4 cores. Plots per iteration: parallelization speedup (automatic vs. default decomposition), number of task instances, cost distribution for the critical task, and dominant cost for the critical task.

On the other hand, qualitative reasoning suggests that, within a set of decompositions that provide similar parallelism and generate a similar number of instances, the optimal decomposition is the one that is easiest to express using the semantics offered by the target parallel programming model. At this stage of development, our algorithm is not capable of estimating how easily a decomposition can be expressed using some programming model. Our future research in this field should especially tackle this aspect of the algorithm. It is also interesting to study how the algorithm adapts to the target parallel machine.

Figure 6.27: Cholesky on 4 cores. Plots per iteration: parallelization speedup (automatic vs. default decomposition), number of task instances, cost distribution for the critical task, and dominant cost for the critical task.

Figures 6.29 and 6.30 illustrate potential decompositions for Sparse LU when executing on machines with 8 and 16 cores. Apparently, changing the configuration of the target machine changes the algorithm's decisions. In the experiments with the 8-core target machine (Figure 6.29), the default decomposition achieves a speedup of 7.1 at the cost of generating 316 task instances. The automatic search finds a wide range of decompositions (iterations 20 − 44) that provide slightly higher parallelism than the default decomposition.

Figure 6.28: Sparse LU on 4 cores. Plots per iteration: parallelization speedup (automatic vs. default decomposition), number of task instances, cost distribution for the critical task, and dominant cost for the critical task.

Again, a sudden drop in the number of task instances makes the quantitatively optimal decomposition the one in iteration 25 (parallelism 7.73, 325 task instances). On the other hand, in the experiments with the 16-core target machine (Figure 6.30), the default decomposition achieves a speedup of 8.85 (316 instances). The algorithm finds only five decompositions (iterations 21 − 25) that provide higher parallelism than the default decomposition.

Figure 6.29: Sparse LU on 8 cores. Plots per iteration: parallelization speedup (automatic vs. default decomposition), number of task instances, cost distribution for the critical task, and dominant cost for the critical task.

The quantitatively optimal decomposition is found in iteration 25 (speedup 9.92, 453 task instances). It is also interesting to note a wide valley of low parallelism caused by the decompositions from iterations 26 − 43. This happens because the algorithm, after finding the peak parallelism in iteration 25, aggressively tries to expose the additional parallelism needed by the target 16-core machine. However, the selected code with the selected execution granularity offers no additional parallelism.

Figure 6.30: Sparse LU on 16 cores. Plots per iteration: parallelization speedup (automatic vs. default decomposition), number of task instances, cost distribution for the critical task, and dominant cost for the critical task.

Still, the algorithm identifies the tasks that are the culprits for the low parallelism and continues refining them (in iterations 24 − 28, the refinement is based on the concurrency criterion). The refinement finds no additional parallelism and, furthermore, diminishes the existing one. It also causes a significant drop in the number of task instances (the transition from iteration 25 to 26 causes a drop from 1285 to 453 instances).

6.3.5 Conclusion

We have explored the potential for the automatic exploration of task decomposition strategies in a sequential code. We have presented an effective search algorithm based on three simple metrics, a parametrizable cost function and a couple of heuristics. The cost function and metrics take into account the length (duration) of the tasks, the dependences among tasks and the tasks' concurrency level. The search algorithm has been implemented leveraging a tool (Tareador) previously developed in the group. In our experiments, we demonstrate that our search algorithm is able to find task decompositions that provide sufficient parallelism, often higher than the parallelism of the decompositions specified by an experienced programmer. As future work, we have identified the need to include a new metric that evaluates the cost of expressing the task decomposition using the syntax (and constraints) offered by the target parallel programming model (for example, traditional fork-join, data flow, ...). The automatic search should be able to quantify how expressible (or viable) a decomposition is and to use this information to guide the iterative exploration process. The result would be the best task decomposition that can be expressed in the target programming model. Our next step in this research is to try our methodology on real-world applications. So far, we have worked in the fashion of reverse engineering – we start from a legacy OmpSs code and remove all OmpSs pragma annotations to obtain the sequential code. However, a sequential application obtained this way inherits a structure that is very favorable for parallelization. Thus, starting from a legacy sequential code, the automatic search for a good task decomposition would be much harder. Still, showing that our environment can explore parallelism in legacy sequential applications is the real proof of our concept; therefore, that is our definite future work.

7 Related Work

In relation to the topics covered in this thesis, we present the related work in four distinct fields of research:

• Methodologies for simulating parallel execution

• Overlapping communication and computation

• Identifying bottlenecks in parallel execution

• Tools for assisted parallelization

7.1 Simulation methodologies for parallel computing

The simulation of parallel computing systems is still an unsolved issue. The simulation can be very computation intensive, since the target machine may consist of numerous processing units. Moreover, the simulation is hard to parallelize, because the separate processing units may have very complex interactions.

Thus, simulating low-level details in a large-scale parallel machine generates a very computation-intensive sequential execution. Consequently, these simulations are often unfeasible due to time or memory constraints. The conventional trace-driven simulators successfully simulate MPI executions, but they fail to simulate multicore systems. Trace-driven simulators, such as Dimemas [45] or MPI-SIM [71], simulate MPI parallel execution. They replay the collected traces and reconstruct the potential parallel time-behavior. However, the conventional traces fail to capture time-dependent executions – executions with dynamic thread scheduling and inter-thread synchronizations. Therefore, with the appearance of multicore systems, the trace-driven simulators failed to provide satisfactory simulations. In order to simulate time-dependent execution, many recent studies turned to execution-driven simulators. These simulators provide cycle-accurate simulations, capturing all possible time-dependent influences. Most of the current execution-driven simulators are based on off-the-shelf simulation infrastructures such as M5 [14], Simics [60], SimpleScalar [8], PTLsim [90], etc. These simulators introduce extremely high overhead. Therefore, for simulating a very large system, the execution-driven approach becomes unfeasible. Finally, the newest simulation proposals tend to find a sweet spot between the execution-driven and trace-driven approaches. COTSon [7] uses an AMD functional emulator together with timing models to achieve a proper combined timing. Compared to execution-driven simulators, COTSon reduces the simulation time, but also reduces simulation flexibility. On the other hand, TaskSim [74] tries to differentiate the application's intrinsic computation from the parallelism-related computation. Then, the application's intrinsic computation is replayed as in conventional trace-driven simulation, while the parallelism-related computation is recomputed during the simulation.

However, a drawback of our techniques is that the influence of the new feature can spread only bottom-up – a change at the low level can change the performance of the parallel execution of MPI processes, but a change in the parallel execution of MPI processes cannot change the performance at the low level. For instance, a change in cache performance can affect the scheduling of tasks, but a change in the scheduling of tasks cannot affect the cache performance. Our methodology deals with a very complicated parallel execution, offers fast and flexible simulation and provides a rich output. To the best of our knowledge, this is the first simulation methodology that can simulate a parallel execution that integrates MPI with a task-based programming model. Although our methodology originally targets dataflow parallelism, it can easily be adapted to simulate fork-join based programming models such as OpenMP or Cilk. Finally, we believe that the biggest contribution of the environment is its flexibility and rich output. The user can easily change the target platform and visually (qualitatively) inspect the effects on the parallel execution.

7.2 Overlapping communication and computation

Previous research in the field of overlapping communication and computation can roughly be categorized into three directions. These are:

• exploring state-of-the-art support for exploiting overlap;

• exploring overlapping techniques; and

• measuring the potential for overlap that is present in applications.

First, several studies evaluated the overlapping capability of different processors, networks and programming languages. Sohn et al. [82] tested various multiprocessors and compared their overlapping efficiency. Furthermore, Brightwell et al. [19] quantified in detail the potential influence of overlap, offload and independent progress. Later work studied many MPI implementations and showed that their overlapping abilities differ [18, 58]. Further research [26, 27] explored the potential of PGAS languages to decouple communication and synchronization and achieve higher overlap. Furthermore, Bell et al. [10] showed the overlapping advantages of lightweight one-sided transfers implemented in UPC.

Throughout our study, we assume that the underlying communication layer is fully capable of overlapping communication and computation; our major goal is to identify the potential overlap inherent in the application. Second, many research efforts explored implementation issues of overlapping techniques. In an effort to hide communication delays, Leu et al. [59] identified overlapping as a technique that can achieve a maximum application speedup of two. Later, Danalis et al. [33] defined general code restructuring approaches that lead to better overlap in applications that exhibit limited dependencies among iterations. Hoefler et al. [50] proved the overlapping potential of non-blocking collective communications in MPI. Furthermore, Das et al. [34] introduced compiler features that postpone receptions (sink waits) in MPI applications, while Iancu et al. [52] extended the UPC runtime library to implement demand-driven synchronization, automatic message strip mining and message scheduling. However, the mentioned efforts fail to clearly determine the potential benefits of their overlapping techniques, because they fail to isolate the overlapping effect from the implementation's side-effects such as changed locality (cache and TLB misses) and non-deterministic events (OS daemons, preemptions, interference in shared resources). On the other hand, our simulation can measure the isolated impact of overlap, since the simulation framework introduces overlapping mechanisms without impacting other execution properties. Third, there has been little effort to identify the potential for overlap in applications. Sancho et al. [76] provided a theoretical estimation of the overlapping potential in scientific codes by modeling an application with one iterative loop and parameters that roughly describe the computation pattern. Our study continues Sancho's work by designing a simulation framework that automatically estimates the potential overlap, without the need to understand the studied application. Moreover, our methodology allows studying overlap on diverse network configurations and provides visualization support to qualitatively inspect the simulated execution. Our framework allowed us to analyze how overlap depends on application properties, the overlapping technique and network resources.

7.3 Identifying parallelization bottlenecks

Tracing is a common technique for identifying performance bottlenecks. In tracing, the environment instruments the parallel execution and collects a vast amount of performance data.

Usually, the environment provides visualization support that helps the programmer explore the performance bottlenecks. Still, tracing rarely focuses the programmer's attention directly on the problem. The most widespread environments for tracing parallel executions are Jumpshot [91] from Argonne National Laboratory, Paraver [55] from Barcelona Supercomputing Center, Vampir [65] from TU Dresden and the ScalaTrace project [67] from North Carolina State University. Profiling is more efficient, as it collects information throughout the execution and then reduces that information into a short report. The traditional profilers proved very useful for analyzing sequential execution. The most popular profiler, gprof [46], reports the percentage of computation time spent in each function. By identifying the most time-consuming code section, gprof automatically finds the critical code section. However, in parallel executions, the most time-consuming code section is not necessarily the critical code section. The most well-known parallel profilers are FPMPI-2 [57] from Argonne National Laboratory, mpiP [88] from Lawrence Livermore National Laboratory, HPCT [23] from IBM, TAU [79] from the University of Oregon and Scalasca [44] from the Jülich Supercomputing Centre. Other techniques explore critical code sections by analyzing the critical path of the execution. These techniques identify the critical code section as the code section that contributes the most to the overall critical path. The most popular representatives are VTune [30] from Intel and SPARTAN [5] from the University of Illinois. However, the analysis of the critical path omits the influence of the target parallel machine on which the application executes. Moreover, optimizing the code may shorten the original critical path, so that some other execution path prevails, becoming the new critical path. The newest approaches identify execution phases of low parallelism and then find the culprit code sections. Quartz [6] and Intel Thread Profiler [17][29] try to quantify the potential parallelism of each section of the code. These tools identify the critical code sections as the ones with the lowest parallelism. Similarly, Tallent et al. [86] try to explore which code sections are responsible for processor stalls. Furthermore, this study considers different policies of spreading the blame for lost performance, from the contexts in which spin-waiting occurs (victims) to directly blaming a lock holder (perpetrators). However, the drawback of all these techniques is that the conclusions obtained for one target platform can hardly be applied to a target platform with a different amount of parallelism.

Using our environment (Section 4.3), we show that the previous approaches fail to capture some influences that are decisive in identifying critical code sections. First, the choice of the critical section depends on the parallel target machine on which the application executes. Second, depending on the factor of acceleration, different sections may be the most beneficial to accelerate. Finally, all the previous techniques work with a "measure-modify" approach – from the profile of one run the technique points to the critical section, the programmer optimizes that section, and then runs the application again, "hoping" that the optimization resulted in an overall speedup. Conversely, our approach provides anticipation of benefits – prior to any optimization effort, the programmer can estimate the potential benefits of the planned optimization.

7.4 Parallelization development tools

The multicore era created a rising interest in tools that help parallelize applications. Multicores introduced the need to re-design (parallelize) applications in order to utilize the increased number of available cores. Despite decades of research efforts [12, 16, 89] on auto-parallelization, and the inclusion of auto-parallelization features in some commercial compilers [13], experience shows very limited applicability. In the current scenario, in which systems (from mobile to desktop/laptop and servers) are based mostly on parallel architectures, programmers must use explicit parallel programming to reach the efficiency and scalability demands of future generations of software. However, years of experience show that manual parallelization without any development support has an inadmissible cost. Therefore, the community started searching for tools that assist the parallelization process. Various academic research efforts tried to build tools that help parallelization. For example, the Alchemist tool [92] identifies parts of the code that are suitable for thread-level speculation. Embla [61] is a Valgrind-based tool that estimates the potential speedup of Cilk parallelization. Kremlin [43] identifies regions of the serial program that need to be parallelized. Kremlin uses hierarchical critical path analysis to detect parallelism across nested regions of the program. Then, its parallelism planner evaluates various potential parallelizations to find the best way for the user to parallelize the target program. Starsscheck [21] checks the correctness of pragma annotations for the StarSs family of programming models.

The biggest drawback of the mentioned tools is that they offer very little qualitative information about the target program. These tools are either checkers for different race conditions, or profiling tools that identify code sections that are good candidates for parallelization. None of the mentioned tools provides visualization support. We believe that this is a major drawback, because in practice, parallelization often requires the programmer to restructure the sequential code in order to enable its parallelization. However, the mentioned tools provide no guidelines that suggest to the programmer what restructuring is needed. Also, various vendors started producing their own solutions for assisted parallelization. These tools are available for trial usage, so we had an opportunity to test them. Intel's Parallel Advisor [28] helps parallelize applications using Threading Building Blocks (TBB) [70]. Parallel Advisor provides profile timing information based on which the programmer can choose which loops should be parallelized. Then, the tool runs an extensive instrumentation to provide additional information on how to complete the TBB code, so the program can run in parallel. The drawback of Parallel Advisor is that the tool initially chooses loops based only on the timing profile, and not based on which loops are easy to parallelize. The Scottish company Critical Blue delivers a parallelization tool called Prism [15]. Prism allows the programmer to do "what if" analysis with the sequential program – to test the potential parallelism of any specified decomposition of the code. However, Prism does not target any specific parallel programming model and gives no hints on how the sequential code can be refactored into parallel code. The Dutch company Vector Fabrics delivers a parallelization tool called Pareon [39]. Pareon also allows "what if" analysis, but only for parallelizing loops. Pareon provides the programmer with additional hints that help parallelize the code using the vfTasks library (a library developed by Vector Fabrics that is a wrapper for both pthread and Win32 threads). All three mentioned tools provide a very rich GUI and visualization of the potential parallelizations. Our work tries to build a tool that is a superset of the previously described efforts. In addition, we believe that the strategic decision to use OmpSs as the targeted programming model gives us a certain advantage, because the semantics of OmpSs matches the type of parallelism identified during the instrumentation. Currently, the main features of Tareador are:

• unbounded “what if” analysis (not limited only to loop iterations)

• visualization of the simulated execution

• estimation of the parallelism on a configurable target platform

• support for MPI applications

• automatic exploration of the potential decompositions

Furthermore, we hope to provide the following new Tareador features in the future:

• automatic generation of correct OmpSs (MPI/OmpSs) code

• visualization of objects usage

• integration of Tareador outputs (dependency graph, Paraver time-plots, objects visualization, Tareador logs)

• highly assisted parallelization (more parallelization hints to the programmer)

8 Conclusion

This thesis proposes techniques to improve performance in parallel applications without increasing the complexity of the parallel code. The starting point is the bulk-synchronous MPI programming model. Although widely used to program very large-scale systems, the execution of bulk-synchronous programs introduces significant stalls in the parallel execution, mainly caused by load imbalance and excessive communication among processes. The causes of that performance degradation can be attacked, but at the expense of reducing programming productivity. The main objective of this thesis is to explore solutions that provide both high parallel performance and low programming complexity. In the first part of this thesis, we target techniques for tuning MPI execution. More specifically, we explore the potential of communication/computation overlap. We proposed speculative dataflow – a hardware-assisted technique that increases the overlap in applications and requires no intervention on the parallel code. Using a set of micro-benchmarks, we demonstrated the feasibility and the effectiveness of this speculative technique. Furthermore, we designed an environment that automatically estimates the potential overlap in an MPI application.

We showed that real-world scientific applications have a significant potential for overlap. However, the potential overlap is very limited by the application's internal computation patterns. If the computation patterns are unprofitable, the overlap brings only marginal benefits. For one of the applications we show that code refactoring can be used to rearrange the computation patterns and increase the application's overlapping potential. However, we conclude that it is not productive, in terms of programming, to manually change the computation patterns for each application. Therefore, to increase the overlap, there is a need for an automatic way to restructure these patterns. Since pure MPI execution showed a significant lack of overlap, we turned to exploring hybrid parallel programming models. We decided to proceed with the MPI/OmpSs programming model, because the potential asynchrony introduced by OmpSs could result in high overlap and deep lookahead. Therefore, the second part of this thesis explores two techniques for tuning MPI/OmpSs parallelism. Namely, we explore how MPI/OmpSs execution can be improved by optimizing some section of the code or by refining the task decomposition of the code. In exploring the optimization opportunities in MPI/OmpSs execution, our goal was to identify the application's critical code section – the code section whose optimization would bring the highest benefit to the overall execution time. Identifying the critical section in sequential applications is trivial, because the critical section is always the most time-consuming section. However, in parallel applications different sections contribute differently to the total parallel execution time. We showed that the choice of the critical section decisively depends on the target machine on which the application executes. For instance, we demonstrate that in HP Linpack, optimizing a task that takes 0.49% of the total computation time yields an overall speedup of less than 1% on a machine with 4 cores per node, and at the same time yields an overall speedup of more than 24% on a machine with 64 cores per node. Moreover, the choice of the critical section also depends on the optimization factor that will be applied. Finally, we designed a tool that predicts, for the targeted parallel machine and the given input, which section of the code should be optimized in order to get the highest overall execution speedup. Second, we explored how MPI/OmpSs parallelism can be tuned by selecting a different task decomposition of the code. Different task decompositions provide very different parallelization potential.

For a programmer without any development support, it is very hard to anticipate which decomposition exposes parallelism and which does not. To that end, we designed Tareador – a tool that estimates the potential parallelism of a task decomposition. Using Tareador, the programmer can easily test various task decompositions and quickly find one that exposes sufficient parallelism to keep the selected target machine efficiently utilized. Furthermore, we designed an autonomous driver that iteratively runs Tareador in the search for a good task decomposition. We show that, using very simple heuristics, the automatic search can find decompositions that provide very high parallelism. Also, throughout the work on this thesis, we designed two development environments that can be useful to other researchers in the field. These environments are:

1. mpisstrace – an environment for replaying MPI/OmpSs parallel execution. The already existing BSC tool-chain allowed replaying MPI execution. We extended this infrastructure in order to provide support for replaying MPI/OmpSs execution. Our changes are already included in the official distribution of the BSC tools.

2. Tareador – a tool to help port MPI applications to MPI/OmpSs. Tareador provides the programmer with a very simple and flexible interface to propose any task decomposition of the code. Then, Tareador dynamically instruments the target code, identifies data dependencies among the annotated tasks and reconstructs the potential parallel time-behavior. If the programmer is satisfied with the obtained parallelization, Tareador can provide further guidelines on how to complete the parallelization process (fill in the pragma annotations). Tareador has already proved itself useful by entering the undergraduate academic program at UPC. Our ongoing work concentrates on developing computer logic that can automatically guide Tareador, so that the required programmer's interaction can be further reduced.

8.1 Future work

For decades, microprocessors have been improving their performance following Moore's law without requiring major changes in the applications. The performance improvements relied on architectural techniques that improve the exploitation of ILP (instruction-level parallelism) and on compilers that optimize the code for each target architecture.

However, due to severe technological constraints, ILP performance gains entered stagnation. Consequently, multicore architectures appeared as the high-performance "promised land". Multicores introduced the need to re-design (parallelize) applications in order to utilize the increasing number of available cores. As the software community struggles to fulfill this demand, the gap between parallel hardware and sequential software keeps growing. In the following subsection, I give my personal view on bridging this gap.

8.1.1 Parallelism for everyone: my view

The problem of the multicore era is not that the existing applications are sequential, but rather that the existing programmers are sequential. The major issue in crossing the chasm between sequential software and parallel hardware is enabling existing programmers to write applications that can efficiently execute in parallel. To that end, it is crucial to make parallel programming similar to sequential programming. Some mainstream parallel programming models require the programmer to change the structure of the sequential application. Thus, for MPI or pthreads, the programmer must make an explicit parallel structure of the code. On the other hand, OpenMP and OmpSs enable parallel execution but seemingly maintain the structure of the sequential application. Maintaining the code structure is very important, because it eases parallel programming for sequential programmers. Maintaining the code structure also eases the task of automatic parallelization. In OpenMP and OmpSs, parallelizing the application consists only of adding pragmas that define the rules of parallel execution. Tareador can automatically search for a decomposition that exposes parallelism. Furthermore, we need to extend Tareador to continue searching for a decomposition that is easy to express with the semantics offered by the targeted parallel programming model. Then, after finding the optimal decomposition, Tareador should output the content of the pragmas needed to generate correct parallel code. It is important to note that during parallelization, Tareador maintains the structure of the sequential code and just suggests where to put the pragmas. However, this type of automatic parallelization would often provide very limited parallelism. The search for a decomposition that provides parallelism could go very deep – resulting in a very fine-grain decomposition.

This decomposition would be useless for a real run, because the execution would be dominated by the runtime overhead. Thus, the structure of the original sequential code must be changed in order to get more coarse-grain tasks that can still execute in parallel. At this point, Tareador should advise the programmer how to change the access patterns of the original sequential code, in order for automatic parallelization to be more efficient. Tareador should provide visualization of the memory usage from the objects' perspective. Using this visualization, the programmer could see how different tasks access different objects. Then, the programmer could learn how to restructure the sequential application, so that the access patterns allow a more coarse-grain task decomposition of the sequential code. In summary, the result of the described approach is that the programmer only writes the sequential code, while Tareador assures the parallel execution of that code. The programmer writes the sequential code and passes that code to Tareador for automatic parallelization. Tareador reports to the programmer the achieved parallelism. If the programmer is satisfied with the achieved parallelism, the parallelization process is finished. However, if the obtained parallelism is insufficient, Tareador provides the programmer with a set of hints on how to rearrange the sequential application. After rearranging the sequential application, the programmer passes the updated code for a new automatic parallelization. This process repeats iteratively until the programmer is satisfied with the achieved parallelism. Once the code is parallelized, throughout further development of the code, the code maintenance responsibilities are again decoupled. The programmer guarantees the correctness of the sequential execution – making sure that code updates do not harm the correctness of the sequential execution. On the other hand, Tareador guarantees the correctness of the parallel execution – making sure that code updates do not harm the concurrency rules among tasks. In conclusion, Tareador should teach the programmer how to write sequential codes that have more potential for automatic parallelization. This type of parallelization is not new. In vector processing, the programmer writes a sequential code while the compiler automatically extracts parallelism. The compiler identifies particular parts of the sequential program and transforms these parts into equivalent vectorized code. The compiler especially targets loop vectorization. However, due to possible dependencies among the loop iterations, the vectorization may lead to low performance gains.

In this case, the compiler suggests to the programmer how to restructure the loop in order to remove these dependencies. After the programmer eliminates the dependencies among loop iterations, the compiler can parallelize the code more efficiently.

9 Publications

Conference papers:

Vladimir Subotić, Jesús Labarta, Mateo Valero. Overlapping MPI Computation and Communication by Enforcing Speculative Dataflow. INA-OCMC-08, Workshop on Interconnection Network Architectures On-Chip, Multi-Chip. Held in conjunction with HiPEAC-2008. Göteborg, Sweden, January 27-29, 2008

Vladimir Subotić, José Carlos Sancho, Jesús Labarta, Mateo Valero. A Simulation Framework to Automatically Analyze the Communication-Computation Overlap in Scientific Applications. Proceedings of the 2010 IEEE International Conference on Cluster Computing, Heraklion, Crete, Greece, 20-24 September, 2010

Vladimir Subotić, José Carlos Sancho, Jesús Labarta, Mateo Valero.

The Impact of Application’s Micro-Imbalance on the Communication-Computation Overlap. Proceedings of the 19th International Euromicro Conference on Parallel, Distributed and Network-based Processing, PDP 2011, Ayia Napa, Cyprus, 9-11 February 2011

Vladimir Subotić, Roger Ferrer, José Carlos Sancho, Jesús Labarta, Mateo Valero. Quantifying the Potential Task-Based Dataflow Parallelism in MPI Applications. Euro-Par 2011 Parallel Processing - 17th International Conference, Euro-Par 2011, Bordeaux, France, 29 August - 2 September, 2011, Proceedings, Part I

Vladimir Subotić, José Carlos Sancho, Jesús Labarta, Mateo Valero. Identifying Critical Code Sections in Dataflow Programming Models. Proceedings of the 21st International Euromicro Conference on Parallel, Distributed and Network-based Processing, PDP 2013, Belfast, Northern Ireland, 27 February - 1 March 2013

Vladimir Subotić, José Carlos Sancho, Eduard Ayguadé, Jesús Labarta, Mateo Valero. Automatic exploration of potential parallelism in sequential applications. Submitted to Euro-Par 2013 Parallel Processing - 19th International Conference, Euro-Par 2013, Aachen, Germany, 2013

Journal papers:

Vladimir Subotić, Steffen Brinkmann, Vladimir Marjanović, Rosa M. Badia, Jose Gracia, Christoph Niethammer, Eduard Ayguadé, Jesús Labarta, Mateo Valero. Programmability and portability for Exascale: Top Down Programming Methodology and Tools with StarSs. Journal of Computational Science, available online 11 February 2013, ISSN 1877-7503

Tutorials:

Eduard Ayguadé, Rosa M. Badia, Daniel Jiménez, Jesús Labarta, Vladimir Subotić. SC12 HPC Educators session: Unveiling parallelization strategies at undergraduate level. ACM/IEEE Conference on High Performance Computing, SC 2012, November 10-16, 2012, Salt Lake City, Utah, USA

Eduard Ayguadé, Rosa M. Badia and Vladimir Subotić. SC13 HPC Educators session: Exploring parallelization strategies at undergraduate level. Submitted to ACM/IEEE Conference on High Performance Computing, SC 2013, November 17-22, 2013, Denver, Colorado, USA

Posters:

Vladimir Subotić, Jesús Labarta, Mateo Valero. Simulation environment for studying overlap of communication and computation. IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2010, www.ispass.org, 28-30 March 2010, White Plains, NY, USA

Bibliography

[1] Graphviz - Graph Visualization Software, web-site confirmed active on 15.03.2013. URL http://www.graphviz.org/. 66

[2] NAS PARALLEL BENCHMARKS, web-site confirmed active on 15.03.2013. URL http://www.nas.nasa.gov/Resources/Software/npb.html. 85

[3] POP: Parallel Ocean Program, web-site confirmed active on 15.03.2013. URL http://climate.lanl.gov/Models/POP/. 85

[4] SWEEP3D: 3D Discrete Ordinates Neutron Transport , web-site confirmed active on 15.03.2013. URL http://wwwc3.lanl.gov/pal/software/sweep3d. 85

[5] Mayank Agarwal and Matthew I. Frank. SPARTAN: A software tool for Par- allelization Bottleneck Analysis. In Proceedings of the 2009 ICSE Work- shop on Multicore Software Engineering, IWMSE ’09, pages 56–63, Wash- ington, DC, USA, 2009. IEEE Computer Society. ISBN 978-1-4244-3718-4. doi: 10.1109/IWMSE.2009.5071384. URL http://dx.doi.org/10.1109/ IWMSE.2009.5071384. 109, 151

[6] Thomas E. Anderson and Edward D. Lazowska. Quartz: a tool for tuning parallel program performance. In Proceedings of the 1990 ACM SIGMETRICS confer- ence on Measurement and modeling of computer systems, SIGMETRICS ’90, pages 115–125, New York, NY, USA, 1990. ACM. ISBN 0-89791-359-0. doi: 10.1145/98457.98518. URL http://doi.acm.org/10.1145/98457.98518. 110, 151


[7] Eduardo Argollo, Ayose Falcón, Paolo Faraboschi, Matteo Monchiero, and Daniel Ortega. COTSon: infrastructure for full system simulation. SIGOPS Oper. Syst. Rev., 43:52–61, January 2009. ISSN 0163-5980. doi: http://doi.acm.org/ 10.1145/1496909.1496921. URL http://doi.acm.org/10.1145/1496909. 1496921. 148

[8] Todd Austin, Eric Larson, and Dan Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. Computer, 35:59–67, February 2002. ISSN 0018- 9162. doi: 10.1109/2.982917. URL http://dl.acm.org/citation.cfm?id= 619072.621910. 148

[9] Eduard Ayguadé, Rosa M. Badia, Francisco D. Igual, Jesús Labarta, Rafael Mayo, and Enrique S. Quintana-Ortí. An Extension of the StarSs Program- ming Model for Platforms with Multiple GPUs. In Proceedings of the 15th In- ternational Euro-Par Conference on Parallel Processing, Euro-Par ’09, pages 851–862, Berlin, Heidelberg, 2009. Springer-Verlag. ISBN 978-3-642-03868-6. doi: 10.1007/978-3-642-03869-3\_79. URL http://dx.doi.org/10.1007/ 978-3-642-03869-3_79. 29

[10] Christian Bell, Dan Bonachea, Rajesh Nishtala, and Katherine A. Yelick. Opti- mizing bandwidth limited problems using one-sided communication and overlap. In IPDPS. 2006. 149

[11] Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta. CellSs: a programming model for the cell BE architecture. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM. ISBN 0-7695-2700-0. doi: 10.1145/1188455.1188546. URL http:// doi.acm.org/10.1145/1188455.1188546. 29

[12] Siegfried Benkner. VFC: The Vienna Fortran Compiler. Scientific Programming, 7(1):67–81, 1999. 152

[13] Aart Bik, Milind Girkar, Paul Grey, and X. Tian. Efficient Exploitation of Paral- lelism on Pentium III and Pentium 4 Processor-Based Systems, 2001. 152

[14] Nathan L. Binkert, Ronald G. Dreslinski, Lisa R. Hsu, Kevin T. Lim, Ali G. Saidi, and Steven K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE Micro, 26:52–60, July 2006. ISSN 0272-1732. doi: 10.1109/MM.2006.82. URL http://dl.acm.org/citation.cfm?id=1158826.1159085. 148

[15] Critical Blue. Prism, web-site confirmed active on 15.03.2013. URL http: //www.criticalblue.com/criticalblue_products/prism.shtml. 153

[16] William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger, Thomas Lawrence, Jaejin Lee, David A. Padua, Yunheung Paek, William M. Pottenger, Lawrence Rauchwerger, and Peng Tu. Parallel Programming with Polaris. IEEE Computer, 29(12):87–81, 1996. 152

[17] Clay Breshears. Using Intel Thread Profiler for Win32 threads: Philosophy and theory. web-site confirmed active on 15.03.2013. URL http://software.intel.com/en-us/articles/ using-intel-thread-profiler-for-win32-threads-philosophy-and-theory/. 110, 151

[18] Ron Brightwell, Keith D. Underwood, and Rolf Riesen. An Initial Analysis of the Impact of Overlap and Independent Progress for MPI. In PVM/MPI, pages 370–377. 2004. 149

[19] Ron Brightwell, Rolf Riesen, and Keith D. Underwood. Analyzing the Impact of Overlap, Offload, and Independent Progress for Message Passing Interface Applications. IJHPCA, 19(2):103–117, 2005. 22, 149

[20] William W. Carlson, Jesse M. Draper, and David E. Culler. S-246, 187 Introduc- tion to UPC and Language Specification. 16

[21] Paul M. Carpenter, Alex Ramírez, and Eduard Ayguadé. Starsscheck: A Tool to Find Errors in Task-Based Parallel Programs. In Euro-Par (1), pages 2–13, 2010. 152

[22] Laura Carrington, Dimitri Komatitsch, Michael Laurenzano, Mustafa M. Tikir, David Michéa, Nicolas Le Goff, Allan Snavely, and Jeroen Tromp. High-frequency simulations of global seismic wave propagation using SPECFEM3D_GLOBE on 62K processors. In SC, page 60. 2008. 85, 116


[23] IBM Advanced Computing Technology Center. High Performance Computing Toolkit., web-site confirmed active on 15.03.2013. URL http://www.ibm.com. 151

[24] Bradford L. Chamberlain, David Callahan, and Hans P. Zima. Parallel Pro- grammability and the Chapel Language. IJHPCA, 21(3):291–312, 2007. 16

[25] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, OOPSLA ’05, pages 519–538, New York, NY, USA, 2005. ACM. ISBN 1-59593-031-0. doi: 10.1145/1094811.1094852. URL http://doi.acm.org/10.1145/1094811.1094852. 16

[26] Wei-Yu Chen, Costin Iancu, and Katherine A. Yelick. Communication Optimiza- tions for Fine-Grained UPC Applications. In IEEE PACT, pages 267–278. 2005. 149

[27] Cristian Coarfa, Yuri Dotsenko, Jason Eckhardt, and John M. Mellor-Crummey. Co-array Fortran Performance and Potential: An NPB Experimental Study. In LCPC, pages 177–193. 2003. 149

[28] Intel Corporation. Intel Parallel Advisor, web-site confirmed ac- tive on 15.03.2013. URL http://software.intel.com/en-us/ intel-advisor-xe. 153

[29] Intel Corporation. Intel thread profiler, web-site confirmed active on 15.03.2013. URL http://software.intel.com/file/17321. 110, 151

[30] Intel Corporation. Intel VTune Amplifier XE, web-site confirmed active on 15.03.2013. URL http://software.intel.com/en-us/articles/ intel-vtune-amplifier-xe/. 109, 151

[31] Leonardo Dagum and Ramesh Menon. OpenMP: An Industry-Standard API for Shared-Memory Programming. Computing in Science and Engineering, 5:46–55, 1998. ISSN 1070-9924. doi: http://doi.ieeecomputersociety.org/10.1109/99.660313. 16, 22, 29

[32] William James Dally. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers, 2004.2

[33] Anthony Danalis, Ki-Yong Kim, Lori L. Pollock, and D. Martin Swany. Trans- formations to Parallel Codes for Communication-Computation Overlap. In SC, page 58. 2005. 150

[34] Dibyendu Das, Manish Gupta, Rajan Ravindran, W. Shivani, P. Sivakeshava, and Rishabh Uppal. Compiler-controlled extraction of computation-communication overlap in MPI applications. In IPDPS, pages 1–8, 2008. 150

[35] Narayan Desai, Pavan Balaji, P. Sadayappan, and Mohammad Islam. Are non- blocking networks really needed for high-end-computing workloads? In CLUS- TER, pages 152–159, 2008.2, 40

[36] Jack Dongarra, Piotr Luszczek, and Antoine Petitet. The LINPACK Benchmark: past, present and future. Concurrency and Computation: Practice and Experi- ence, 15(9):803–820, 2003. 116

[37] Alejandro Duran, Eduard Ayguadé, Rosa M. Badia, Jesús Labarta, Luis Mar- tinell, Xavier Martorell, and Judit Planas. OmpSs: a Proposal for Program- ming Heterogeneous Multi-Core Architectures. Parallel Processing Letters, 21 (2):173–193, 2011. 16, 27

[38] Ryan Eccles, Blair Nonneck, and Deborah A. Stacey. Exploring Parallel Pro- gramming Knowledge in the Novice. In HPCS, pages 97–102, 2005.3

[39] Vector Fabrics. Pareon, web-site confirmed active on 15.03.2013. URL http: //www.vectorfabrics.com/products. 153

[40] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. Memory - Sequoia: programming the mem- ory hierarchy. In SC, page 83, 2006. 105


[41] Felix Freitag, Jordi Caubet, Montse Farreras, Toni Cortes, and Jesús Labarta. Exploring the Predictability of MPI Messages. In IPDPS, page 69. 2003. 78

[42] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. In PLDI, pages 212–223, 1998. 29

[43] Saturnino Garcia, Donghwan Jeon, Christopher M. Louie, and Michael Bedford Taylor. Kremlin: rethinking and rebooting gprof for the multicore age. In PLDI, pages 458–469, 2011. 152

[44] Markus Geimer, Felix Wolf, Brian J. N. Wylie, Erika Ábrahám, Daniel Becker, and Bernd Mohr. The Scalasca performance toolset architecture. Concurr. Comput. : Pract. Exper., 22(6):702–719, April 2010. ISSN 1532-0626. doi: 10.1002/cpe.v22:6. URL http://dx.doi.org/10.1002/cpe.v22:6. 109, 151

[45] Sergi Girona, Jesús Labarta, and Rosa M. Badia. Validation of Dimemas Com- munication Model for MPI Collective Operations. In PVM/MPI, pages 39–46. 2000. 37, 148

[46] Susan L. Graham, Peter B. Kessler, and Marshall K. Mckusick. Gprof: A call graph execution profiler. In Proceedings of the 1982 SIGPLAN symposium on Compiler construction, SIGPLAN ’82, pages 120–126, New York, NY, USA, 1982. ACM. ISBN 0-89791-074-5. doi: 10.1145/800230.806987. URL http: //doi.acm.org/10.1145/800230.806987. 109, 151

[47] John L. Hennessy and David A. Patterson. Computer Architecture, Fourth Edi- tion: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Fran- cisco, CA, USA, 2006. ISBN 0123704901. 10, 11

[48] Jonathan M. D. Hill, Bill McColl, Dan C. Stefanescu, Mark W. Goudreau, Kevin J. Lang, Satish Rao, Torsten Suel, Thanasis Tsantilas, and Rob H. Bis- seling. BSPlib: The BSP programming library. volume 24, pages 1947–1980. 1998. 40


[49] Lorin Hochstein, Jeffrey Carver, Forrest Shull, Sima Asgari, and Victor R. Basili. Parallel Programmer Productivity: A Case Study of Novice Parallel Program- mers. In SC, page 35, 2005.3

[50] Torsten Hoefler, Andrew Lumsdaine, and Wolfgang Rehm. Implementation and performance analysis of non-blocking collective operations for MPI. In SC, page 52, 2007. 17, 150

[51] Guillaume Houzeaux, Beatriz Eguzkitza, and Mariano Vazquez. A variational multiscale model for the advection-diffusion-reaction equation. Communications in Numerical Methods in Engineering, 2007. 85

[52] Costin Iancu, Parry Husbands, and Paul Hargrove. HUNTing the Overlap. In IEEE PACT, pages 279–290. 2005. 150

[53] James Christopher Jenista, Yong Hun Eom, and Brian Demsky. OoOJava: soft- ware out-of-order execution. In PPOPP, pages 57–68, 2011. 105

[54] David J. Kuck, Robert Henry Kuhn, David Padua, Bruce Leasure, and Michael Joseph Wolfe. Dependence graphs and compiler optimizations. In Pro- ceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of pro- gramming languages, POPL ’81, pages 207–218, New York, NY, USA, 1981. ACM. ISBN 0-89791-029-X. doi: 10.1145/567532.567555. URL http: //doi.acm.org/10.1145/567532.567555. 28

[55] Jesús Labarta, Sergi Girona, Vincent Pillet, Toni Cortes, and Luis Gregoris. DiP: A Parallel Program Development Environment. In Proceedings of the Second In- ternational Euro-Par Conference on Parallel Processing-Volume II, Euro-Par ’96, pages 665–674, London, UK, UK, 1996. Springer-Verlag. ISBN 3-540-61627- 6. URL http://dl.acm.org/citation.cfm?id=646669.701233. 36, 108, 151

[56] Vincent Pillet Jesus Labarta, Todi Cortes, and Sergi Girona. PARAVER: A Tool to Visualize and Analyze Parallel Code. In WoTUG-18. 1995. 37


[57] Argonne National Laboratory. The FPMPI-2 MPI profiling library., web-site con- firmed active on 15.03.2013. URL http://www-unix.mcs.anl.gov/fpmpi/. 109, 151

[58] William Lawry, Christopher Wilson, Arthur B. Maccabe, and Ron Brightwell. COMB: A Portable Benchmark Suite for Assessing MPI Overlap. In CLUSTER, pages 472–475. 2002. 149

[59] JaSong Leu, Dharma Prakash Agrawal, and Jon Mauney. Modeling of parallel software for efficient computation-communication overlap. In Fall Joint Computer Conference. IEEE Press, 1987. 150

[60] Peter S. Magnusson, Magnus Christensson, Jesper Eskilson, Daniel Forsgren, Gustav Hållberg, Johan Högberg, Fredrik Larsson, Andreas Moestedt, and Bengt Werner. Simics: A Full System Simulation Platform. IEEE Computer, 35(2): 50–58, 2002. 148

[61] Jonathan Mak, Karl-Filip Faxén, Sverker Janson, and Alan Mycroft. Estimating and Exploiting Potential Parallelism by Source-Level Dependence Profiling. In Euro-Par (1), pages 26–37, 2010. 152

[62] Vladimir Marjanovic, Jose Maria Perez, Eduard Ayguadé, Jesús Labarta, and Mateo Valero. Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach. In UPC-DAC-RR-2009-35, Research Report, Technical University of Catalunya, 2009. 33

[63] Vladimir Marjanovic, Jesús Labarta, Eduard Ayguadé, and Mateo Valero. Overlapping communication and computation by using a hybrid MPI/SMPSs approach. In ICS, pages 5–16, 2010. 30, 31, 105

[64] Shirley Moore, David Cronk, Felix Wolf, Avi Purkayastha, Patricia Teller, Robert Araiza, Maria Gabriela Aguilera, and Jamie Nava. Performance Profiling and Analysis of DoD Applications Using PAPI and TAU. In Proceedings of the 2005 Users Group Conference on 2005 Users Group Conference, DOD_UGC ’05, pages 394–, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2496-6. 36

[65] Wolfgang E. Nagel. VAMPIR: Visualization and Analysis of MPI Resources. 1996. URL http://books.google.es/books?id=LnCmPgAACAAJ. 37, 108, 151

[66] Nicholas Nethercote and Julian Seward. Valgrind, web-site confirmed active on 15.03.2013. URL http://valgrind.org/. 36

[67] Michael Noeth, Jaydeep Marathe, Frank Mueller, Martin Schulz, and Bronis de Supinski. Scalable compression and replay of communication traces in massively parallel environments. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM. ISBN 0-7695-2700-0. doi: 10.1145/1188455.1188605. URL http://doi.acm.org/10.1145/1188455.1188605. 108, 151

[68] David A. Patterson and John L. Hennessy. Computer Organization and Design, Fourth Edition: The Hardware/Software Interface (The Morgan Kaufmann Series in Computer Architecture and Design). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 4th edition, 2008. ISBN 0123744938, 9780123744937. 9

[69] Josep M. Pérez, Rosa M. Badia, and Jesús Labarta. A dependency-aware task- based programming environment for multi-core architectures. In CLUSTER, pages 142–151. 2008. 29, 105

[70] Chuck Pheatt. Intel threading building blocks. J. Comput. Sci. Coll., 23(4):298–298, April 2008. ISSN 1937-4771. URL http://dl.acm.org/citation.cfm?id=1352079.1352134. 153

[71] Sundeep Prakash and Rajive L. Bagrodia. MPI-SIM: Using Parallel Simulation To Evaluate MPI Programs, 1998. 148

[72] SPIRAL project. Software/Hardware Generation for DSP Algorithms, web-site confirmed active on 15.03.2013. URL http://www.spiral.net/problem.html. x, 10

[73] Milos Prvulovic, Josep Torrellas, and Zheng Zhang. ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors. In ISCA, pages 111–122. 2002. 78

[74] Alejandro Rico, Alejandro Duran, Felipe Cabarcas, Yoav Etsion, Alex Ramírez, and Mateo Valero. Trace-driven simulation of multithreaded applications. In ISPASS, pages 87–96, 2011. 148

[75] Luiz De Rose, Bill Homer, Dean Johnson, Steve Kaufmann, and Heidi Poxon. Cray Performance Analysis Tools. In Parallel Tools Workshop, pages 191–199, 2008. 109

[76] José Carlos Sancho, Kevin J. Barker, Darren J. Kerbyson, and Kei Davis. MPI tools and performance studies - Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications. In SC, page 125. 2006. 84, 92, 150

[77] Robert R. Schaller. Moore’s law: past, present, and future. IEEE Spectr., 34(6):52–59, June 1997. ISSN 0018-9235. doi: 10.1109/6.591665. URL http://dx.doi.org/10.1109/6.591665. 9

[78] John Paul Shen and Mikko H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill Higher Education, Boston, 2005. ISBN 0-07-057064-7. URL http://opac.inria.fr/record=b1129703. 15

[79] Sameer S. Shende and Allen D. Malony. The Tau Parallel Performance System. Int. J. High Perform. Comput. Appl., 20(2):287–311, May 2006. ISSN 1094-3420. doi: 10.1177/1094342006064482. URL http://dx.doi.org/10.1177/1094342006064482. 109, 151

[80] James Smith and Gurindar S. Sohi. The Microarchitecture of Superscalar Processors. 1995. URL citeseer.ist.psu.edu/35243.html. 28

[81] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference, web-site confirmed active on 15.03.2013. URL http://www.netlib.org/utk/papers/mpi-book/mpi-book.html. 15, 16

[82] Andrew Sohn, Jui Ku, Yuetsu Kodama, Mitsuhisa Sato, Hirofumi Sakane, Hayato Yamana, Shuichi Sakai, and Yoshinori Yamaguchi. Identifying the capability of overlapping computation with communication, 1996. 149

[83] Fengguang Song, Asim YarKhan, and Jack Dongarra. Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems. In SC, 2009. 105

[84] Vladimir Subotic, Roger Ferrer, José Carlos Sancho, Jesús Labarta, and Mateo Valero. Quantifying the Potential Task-Based Dataflow Parallelism in MPI Ap- plications. In Euro-Par (1), pages 39–51, 2011. 105

[85] Nathan R. Tallent and John M. Mellor-Crummey. Effective performance mea- surement and analysis of multithreaded applications. In PPOPP, pages 229–240, 2009. 110

[86] Nathan R. Tallent, John M. Mellor-Crummey, and Allan Porterfield. Analyzing lock contention in multithreaded applications. In PPOPP, pages 269–280, 2010. 110, 151

[87] top500. Top500 List: List of top 500 supercomputers., web-site confirmed active on 15.03.2013. URL http://www.top500.org/. 2, 116

[88] J. Vetter and C. Chambreau. The mpiP MPI profiling library., web-site confirmed active on 15.03.2013. URL http://mpip.sourceforge.net/. 109, 151

[89] Robert P. Wilson, Robert S. French, Christopher S. Wilson, Saman P. Amarasinghe, Jennifer-Ann M. Anderson, Steven W. K. Tjiang, Shih-Wei Liao, Chau-Wen Tseng, Mary W. Hall, Monica S. Lam, and John L. Hennessy. SUIF: An Infrastructure for Research on Parallelizing and Optimizing Compilers. SIGPLAN Notices, 29(12):31–37, 1994. 152

[90] Matt T. Yourst. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator. In ISPASS 07. 2007. 148

[91] Omer Zaki, Ewing Lusk, William Gropp, and Deborah Swider. Toward Scalable Performance Visualization with Jumpshot. Int. J. High Perform. Comput. Appl., 13(3):277–288, August 1999. ISSN 1094-3420. doi: 10.1177/109434209901300310. URL http://dx.doi.org/10.1177/109434209901300310. 108, 151

[92] Xiangyu Zhang, Armand Navabi, and Suresh Jagannathan. Alchemist: A Transparent Dependence Distance Profiling Infrastructure. In CGO, pages 47–58, 2009. 152

[93] Pin Zhou, Feng Qin, Wei Liu, Yuanyuan Zhou, and Josep Torrellas. iWatcher: Efficient Architectural Support for Software Debugging. In ISCA, pages 224– 237. 2004. 83
