Parallel MPI I/O in Cube: Design & Implementation

Bine Brank

A master's thesis presented for the degree of M.Sc. Computer Simulation in Science

Supervisors: Prof. Dr. Norbert Eicker, Dr. Pavel Saviankou

Bergische Universität Wuppertal in cooperation with Forschungszentrum Jülich

September 2018

Declaration

I affirm that I have written this thesis independently, that I have used no sources or aids other than those indicated, and that all quotations are marked as such. By submitting this thesis I acknowledge that it may be inspected by third parties and quoted in compliance with copyright principles. I further agree that the department may hand the thesis over to third parties for inspection.

Wuppertal, 27.8.2017 Bine Brank

Acknowledgements

Foremost, I would like to express my deepest and sincere gratitude to Dr. Pavel Saviankou. Not only did he introduce me to this interesting topic, but his way of guiding and supporting me was beyond anything I could ever have hoped for. Always happy to discuss ideas and answer any of my questions, he has truly set an example of excellence as a researcher, mentor and friend.

In addition, I would like to thank Prof. Dr. Norbert Eicker for agreeing to supervise my thesis. I am very thankful for all the remarks and corrections that he provided.

I would also like to thank Ilya Zhukov for helping me with the correct installation and configuration of the CP2K software.

Contents

1 Introduction

2 HPC ecosystem
  2.1 Origins of HPC
  2.2 Parallel programming
  2.3 Automatic performance analysis
  2.4 Tools
    2.4.1 Score-P
  2.5 Performance space
    2.5.1 Metrics
    2.5.2 Call paths
    2.5.3 Locations
    2.5.4 Severity values

3 Input/Output in HPC
  3.1 MPI derived datatypes
  3.2 MPI I/O
  3.3 Performance of MPI I/O
  3.4 I/O challenges in HPC community

4 Cube
  4.1 Cube libraries
  4.2 Tar archive
    4.2.1 Tar header
    4.2.2 Tar file
  4.3 Cube4 file format
    4.3.1 Metric data file
    4.3.2 Metric index file
    4.3.3 Anchor file
  4.4 CubeWriter
    4.4.1 Usage
    4.4.2 Library architecture
    4.4.3 How Score-P uses CubeWriter library

5 New writing algorithm
  5.1 Metric data file
    5.1.1 File partition
    5.1.2 MPI steps
  5.2 Metric index file
  5.3 Anchor file

6 Implementation
  6.1 New library architecture
  6.2 Reconfiguring Score-P

7 Results and discussion
  7.1 System
  7.2 Performance measurement with a prototype
    7.2.1 Prototype
    7.2.2 Results
  7.3 Performance measurement with CP2K
  7.4 Discussion

8 Conclusion

References

A Appendix - source code
  A.1 cubew_cube.c
  A.2 cubew_metric.c
  A.3 cubew_parallel_metric_writer.c
  A.4 cubew_tar_writing.c
  A.5 scorep_profile_cube4_writer.c
  A.6 prototype_new.c
  A.7 prototype_old.c

List of Figures

1 Memory architectures
2 Call tree
3 System tree
4 Filetype
5 File partitioning
6 Cube libraries
7 CubeGUI
8 Layout of TAR archive
9 Sequence of files in Cube4 tar archive
10 Inclusive vs. exclusive values
11 Tree ordering
12 Structure of metric data file
13 Simplified UML diagram of CubeWriter
14 Internal steps of CubeWriter
15 Score-P sequence diagram
16 File partition
17 New algorithm
18 Filetypes of processes
19 Internal steps of the new CubeWriter library
20 New Score-P sequence diagram
21 Writing of different files in tar archive
22 Execution time
23 Overall prototype writing speed
24 Writing speed of dense metrics
25 Overall writing time for different call tree sizes
26 Writing time for H2O-64 benchmark
27 Overall writing speed for H2O-64 benchmark

List of Tables

1 Data access routines
2 Write performance of MPI
3 Call path order
4 Metrics in the prototype

1 Introduction

The need for analysis of complex parallel codes in high performance computing has led to a growing number of different performance analysis tools. These tools help the user to write parallel code that is efficient and does not use more computing resources than absolutely necessary. The user is able to measure how his application is behaving and thus gain insight into the problems and bottlenecks it might have.

Measuring the performance of a large-scale application produces a lot of information. This happens because such large applications can be executed on millions of computing cores and one is provided with the measurement data for each of them. Writing all this information into a file can therefore be a very slow process.

This thesis revolves around the work of redesigning and rewriting the CubeWriter library, a part of the Cube software for writing an application's performance report. We propose a new parallel approach, where each process writes its own measurement data synchronously.

The rest of this thesis is organized as follows. Chapter 2 introduces the reader to high performance computing. We describe how the analysis of complex parallel codes led to the development of automatic performance analysis tools. After that, a brief overview of Score-P is given and a description of how measuring an application's performance forms a three-dimensional performance space. Chapter 3 deals with parallel input/output operations in HPC, with a main focus on the I/O part of the Message Passing Interface (MPI). In chapter 4, we go into the details of Cube, which is the main topic of this thesis. We briefly explain the Cube libraries, before giving an explanation of the Cube4 file format, which is a tar archive. We conclude the chapter by explaining how this file is written by the CubeWriter library. We then describe how the library could be rewritten to include parallel writing. This new writing algorithm and its implementation are described in chapters 5 and 6. In chapter 7, we take a look at the results and show how the new CubeWriter library gives a much better performance than the old version. In the last chapter, we conclude the work and give some ideas for future development.

2 HPC ecosystem

2.1 Origins of HPC

HPC or High Performance Computing (sometimes also Supercomputing) is a term used for aggregating computer power to achieve a much higher performance than that of standard desktop computers of their time. Such high performance machines are usually used to solve large linear algebra problems that arise from science and engineering.

Origins of HPC date back to the 1950s, but in the mid-1970s a first big breakthrough occurred with the production of the Cray 1 [1]. Produced in 1975, it is commonly regarded as the first successful supercomputer. At that time, the technology of the advanced devices was based on vector processing. A vector processor is a processing unit in which a set of instructions operates on a vector of data and not on a single data item.

In the 1980s and early 1990s, computers based on vector processors were beginning to be overtaken by massively parallel machines. In such machines, many processors work together on different parts of a larger problem. Contrary to vector-based systems, which were designed to run instructions on a vector of data as quickly as possible, a massively parallel computer instead separates parts of the problem onto entirely different processors and then recombines the results. The research took a major shift towards such machines. New high-speed networks, the availability of low-cost processors and software for distributed computing led to the development of computer clusters, sets of tightly connected computers that work together. The first such system was a Beowulf cluster, built in 1994 by NASA [2]. Today, such computer architecture is prevailing in both industry and the academic community.

In the beginning of the 21st century, another big step occurred with the development of multi-core processors, which are able to run multiple instructions on separate cores at the same time. Almost all computers on the market today are equipped with such processors. The early 2000s were also important for the development of general purpose graphics processing units (GPGPU). With extensive research from the gaming industry, the technology of GPGPUs has advanced and was found to fit scientific computing as well. Nowadays, graphics processing units form a major part in some of the world's fastest supercomputers.

The performance of supercomputers is measured in floating-point operations per second (flops). A floating point operation is a basic arithmetic operation in double precision floating point calculations. The double precision format is a computer number format that occupies 64 bits in computer memory. A common way to measure performance is by the LINPACK benchmark, which solves a dense system of linear equations. The fastest supercomputer currently is Summit at the Department of Energy's Oak Ridge National Laboratory in the USA [3]. Based mostly on NVIDIA Tesla V100 GPUs, Summit reached a maximum speed of 122.3 · 10^15 flops on

the LINPACK benchmark. Many powerful supercomputers around the world are already in the petaflops (10^15 flops) region and the research on new hardware technologies is causing a never-ending growth of computing power. Current research to achieve higher speeds is known as the exascale computing project, which plans to achieve 1 exaflop (10^18 flops) in the near future.

2.2 Parallel programming

Traditionally, computer software has been written for serial computation. Instructions that make up an algorithm are executed in a defined order, only one at a time. When one instruction is finished, the next instruction begins. From a software engineer's point of view, this refers to code being run sequentially (line by line), no two lines being executed at the same time. But with the introduction of the first parallel machines, one should be able to write code that will be executed in a parallel manner. This is known as parallel programming. Many classifications of parallel programming models exist, based on process interaction and problem decomposition. An important aspect is that of memory architecture.

There are two basic models: shared memory and distributed memory architecture. If many processing units can directly access the same location in memory, we are talking about shared memory parallel programming. Typical examples of this paradigm are early symmetric multiprocessing machines, where two or more identical processors are connected to one main memory. Although personal computers nowadays have a memory for each core in a multi-core processor, their smart non-uniform memory access makes them behave like a shared memory architecture. Such architectures must also provide cache coherence, which ensures that the data stored in local caches is updated every time the data from the memory is accessed. If, on the other hand, each processor has its own memory, we are talking about distributed memory. Both architectures are shown in figure 1.

Figure 1: Memory architectures Figure 1 shows an example of two different memory architectures. Each ’P’ block corresponds to one processing unit and ’M’ block to one memory unit.

MPI or the Message Passing Interface has become a de facto standard as a programming paradigm for the distributed memory architecture. It provides widely used library routines that enable both point-to-point and collective communication between different processes running on the same system. By system, we refer to the cluster of machines which are physically connected together.

OpenMP, on the other hand, is the most widely used interface for shared memory architectures. A commonly associated term is "multi-threading", because certain tasks are divided between different threads, which work in a parallel manner. Today's supercomputers are built of many computing nodes, each node having many processing units that share the same memory. Therefore, it is not uncommon to have a hybrid model of a parallel application that uses both MPI and OpenMP. OpenMP is then used for parallelization within a multi-core node, and MPI for parallelization between the nodes. All the work described in this thesis is based on the MPI and OpenMP interfaces.

2.3 Automatic performance analysis

Apart from general bugs that can arise in software engineering, a new set of problems appears in parallel applications. These are caused primarily by the concurrent execution of parallel algorithms. Since we do not have a typical line-by-line execution of the code, we can say that each process/thread performs its own set of instructions. The programmer needs to ensure that these processes work synchronously in an efficient manner. Most of the problems can be put into two groups: communication related and program structure related [4].

Communication related problems are caused by the communication between different processes. For example, we take a look at the most simple MPI routines, MPI_Send() and MPI_Recv(). The process that executes MPI_Send() (process 0) sends data to process 1. Afterwards, process 1 executes a call to MPI_Recv(), to receive and store the sent content in its memory. If process 1 tries to receive the data before process 0 sends it, it cannot do so. It will enter the idle state and wait until process 0 sends the data. Such problems are very common in the HPC community, and can have a severe impact on the efficiency of the application.

Structure related problems are caused by the structure of our application. Parallel programs are most efficient if the workload is evenly distributed among the processes. In real cases, there is always a part of the problem that cannot be parallelized. In this case, Amdahl's law tells us the highest possible speedup, and one should be able to write efficient code that in practice gets close to it. An example of a structure related problem can be easily observed with a call to MPI_Barrier(), which synchronizes processes to the same point in code. If one process takes a very long time compared to the others to reach that call, all other processes will have to enter the idle state and wait within MPI_Barrier().

Because the HPC community deals with good performance and high efficiency on one side, but increasingly complex hardware and software on the other, there is a necessity for analyzing and optimizing complicated parallel behavior. It forces the programmer, apart from designing the application, to analyze its performance. A normal debugger is not useful to detect the above mentioned problems. Consequently, the need for such analysis has led to

the development of tools for automatic performance analysis of parallel programs. These tools have become mandatory for the development of massively parallel applications.
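To make the late-receiver situation from the communication example above concrete, the following minimal MPI sketch shows rank 1 idling inside MPI_Recv() until rank 0 finally sends; the artificial sleep() and the two ranks are illustrative assumptions, not taken from a real measurement.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(5);   /* rank 0 is busy with other work (load imbalance) */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* posted early: rank 1 sits idle here until rank 0 sends the data */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

An automatic analysis tool would attribute the time rank 1 spends blocked in MPI_Recv() to a waiting-time metric rather than to useful computation.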

2.4 Tools

Currently, there exist many tools for performance analysis and optimization of parallel applications. They form an important contribution to the parallel programming community, so let us name a few:

HPCToolkit [5] is a suite of tools that supports measurement, analysis and attribution of application performance for both sequential and parallel programs.

Periscope [6] is a distributed performance analysis tool that can identify a wide range of performance bottlenecks in parallel applications.

Paraver [7] is a flexible tracing and visualization tool, which helps to get a qualitative global picture of the application behavior by visual inspection.

Dimemas [8] is a performance prediction tool for MPI applications that helps the user to perform a 'what if' analysis.

Scalasca [9] performs post-mortem analysis of event traces and automatically detects performance critical situations.

Score-P [10] is a joint measurement infrastructure, jointly developed by some of the leading HPC performance tools groups (Periscope, Scalasca, TAU and Vampir). Because the work in this thesis involves parts of Score-P, we explain it in more detail.

2.4.1 Score-P

The different tools listed above provide different aspects of features for analysis, so it is worthwhile for users to use them in combination. From the user's perspective, this leads to multiple learning curves, different options for equivalent features and redundant effort for installations and updates. At the same time, the tools also share many similarities and overlapping functions. The main goal of Score-P is to provide a joint infrastructure to address these problems. This way, the community has a single platform for performance measurements that supports a rich collection of analysis tools.

Score-P (Scalable Performance Measurement Infrastructure for Parallel Codes) consists of the instrumentation framework and several run-time libraries. It allows users to insert measurement probes into C/C++ and Fortran codes. These measurement probes collect performance-related data like the time of execution, visits or hardware counters. To collect such data, the application needs to be linked against one of the provided run-time libraries. Score-P is used together with regular compilers like gcc, g++ or gfortran. The prefix 'scorep' is added to the usual compile and link commands. The instrumenter then detects the programming paradigm used and adds the appropriate compiler and linker flags, before building the application. Once the application is instrumented, the user just needs to run it to collect the measured

data. Different variants are configurable by options to the 'scorep' command. Score-P includes a wide range of features like profiling and event tracing, direct instrumentation, adaptive measurement, hardware counters and many more. It also has an efficient memory management system that stores data in thread-local chunks of pre-allocated memory. This results in an accurate and non-invasive measurement, where the run-time is increased by no more than 4 percent (usually much less) [10]. Score-P can write performance data in many different formats. In this thesis, we focus on the Cube4 file format, which Score-P uses by default. The Cube4 file format together with the Cube software is explained in more detail in chapter 4.

2.5 Performance space

The performance of an application is described along three dimensions: metrics, call paths and locations. These three dimensions form a performance space. Its size depends on the measurement system, the application, and the computer on which the application was executed. We give a more detailed explanation of the three dimensions, as they are necessary for the further discussion.

2.5.1 Metrics

The first dimension is a set of performance properties called metrics, which give an answer to the question of what is being measured. They provide different insights into the analyzed application. Without extra configuration, Score-P provides data in 6 metrics. This can be extended by enabling different performance counters or metric plugins that Score-P includes. The most important metrics are the Time and Visits metrics. The Time metric measures the time of the execution. By examining times for different threads or processes, we can detect an imbalance in the workload. To account for clock differences of processor-local clocks, the measuring system uses a linear offset interpolation of the timestamps. The Visits metric measures how many times a certain function has been called in the application. Functions which are frequently visited usually have a higher impact on the performance of the application. These two metrics are the most basic and could also be used for the analysis of serial applications. But there are many more, which are specifically used for the analysis of parallel applications. They are typically related to the MPI or OpenMP interfaces. The MPI Time metric measures time spent in MPI calls. The OpenMP Time measures time spent in OpenMP calls and code generated by the OpenMP compiler, which includes thread management and synchronization activities. The MPI Communication Time gives the time spent in MPI communication calls. This includes point-to-point, collective and one-sided communication. The Wait at Barrier metric measures how much time processes spent in MPI_Barrier(). The Late sender metric measures the time a receiving process is waiting for a message in the routine MPI_Recv(). The number of received bytes in MPI point-to-point communications is measured in the P2P bytes received metric. These are

only a few of the most important metrics. For example, Scalasca defines more than 60 metrics; for more information about them, one should refer to the Scalasca documentation [11]. Which metric is more useful for the user depends on the application and what kind of problems are to be expected. The metric dimension is organized in an inclusion hierarchy, where a metric at a lower level is a subset of its parent. For example, MPI time and OpenMP time are subsets of the Execution time metric, which is a subset of the Time metric. For future reference, let us call the number of all metrics provided by the measurement system Nmetrics.

2.5.2 Call paths

The second dimension is the set of call paths (also call nodes). Each call path corresponds to a function call in the code. Metrics are always measured on call paths. The Time metric therefore not only measures the total run time of the program, but also how much time was spent in each invoked function in our code. All call paths together form a call tree. The call tree is a tree data structure. The parent-child relation is defined by the location where a call to a function occurred in the code. If a certain function invokes other functions, then all of those are its children. The root node in a call tree is therefore always associated with the function main() in the C programming language (or the equivalent function in Fortran, where the program execution begins). Leaf nodes correspond to functions that do not call any other functions. An example is shown in figure 2.

Figure 2: Call tree We see an example of how structure of the code forms a call tree. On the left side, we have a pseudo code of a simple program and on the right its corresponding call tree. Each node in the call tree represents one call path. The call node main() has four children, foo() two, and bar() and barbar() only one child.
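One possible C program whose call tree matches the structure described in Figure 2 (and listed later in Table 3) is sketched below; the function bodies are hypothetical placeholders, only the call structure matters.

#include <mpi.h>

/* main() has four children (MPI_Init, foo, bar, MPI_Finalize),
 * foo() has two (foobar, barfoo), bar() has one (barbar),
 * and barbar() has one (MPI_Barrier).                          */

static void foobar(void) { /* leaf call path */ }
static void barfoo(void) { /* leaf call path */ }

static void foo(void)
{
    foobar();
    barfoo();
}

static void barbar(void)
{
    MPI_Barrier(MPI_COMM_WORLD);   /* leaf call path */
}

static void bar(void)
{
    barbar();
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    foo();
    bar();
    MPI_Finalize();
    return 0;
}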

The size of the second dimension is therefore related to the complexity and size of the code. Highly recursive functions can have a severe impact on the depth of the call tree. Let us denote the number of all call paths as Ncallpaths.

2.5.3 Locations

Because the application runs in parallel, a call path itself will not be sufficient to recognize a particular problem if the application is running on different processors. Hence we define the third dimension. The third dimension is the system location and corresponds to a thread or process on which the code ran. We distinguish different possibilities. If the application is parallelized with just the MPI interface, locations will correspond to different MPI processes. If the application uses just the OpenMP interface, locations are different OpenMP threads. In this thesis, we always assume a more general hybrid application that uses both interfaces. Each MPI process spawns OpenMP threads, where the number of threads per process does not need to be the same for all MPI processes. In such a model, locations also correspond to different threads, which are now spawned by different MPI processes. The number of locations is equal to the total number of all threads:

N_{locations} = \sum_{i=1}^{N_p} N_{t_i}

where N_p is the number of MPI processes and N_{t_i} the number of OpenMP threads spawned by the i-th process. If the application does not use OpenMP, we have

N_{locations} = N_p

Similarly, if the application uses just OpenMP and not MPI, the number of locations equals the number of threads spawned in OpenMP. In the case of today's supercomputers, each MPI process is also linked to the compute node on which it was initialized, and all compute nodes are linked to the machine. This structure forms a system tree. A simple example is presented in figure 3.

Figure 3: System tree Example of a system tree, where the machine consists of two compute nodes. The application runs with 4 MPI processes (2 per node), each spawning 2 OpenMP threads. We have a total of 8 locations.

It is equivalent to say that the number of locations is the number of leaf nodes in a system tree. Locations are ordered by MPI ranks. OpenMP threads from rank 0 are followed by those from rank 1 and so on. Because modern HPC systems might have up to a million processors, the third dimension can in practice be very large. As we will later see, the system tree and the call tree are the main reason for big files and thus the need for the improvement of the profile writer.

2.5.4 Severity values

Each coordinate in this three-dimensional space is mapped onto a numeric value called the severity value, which is a result of the measurement. The severity value represents the metric's severity for a corresponding call path at a given system location. Another way of looking at this is to ask ourselves what, when and where we measure in the application. 'What' then refers to the performance metric, 'when' refers to a call path and 'where' to a system location. Let us take an example to better understand how a particular value in the performance space could mean a potential performance problem in our application. We are looking at the metric Wait at barrier at a particular call path. We notice that the value on one process is much smaller than the values on other ranks. This means that all other processes spent a lot more time in MPI_Barrier(), and were all waiting for one process. This is usually bad practice, because all other processes could in this time already be computing other things. The value that represents severity can be of different types, depending on the metric. All metrics that measure time produce results in a floating-point 'double'. How many times a function has been visited can only be an integer, so the Visits metric produces results of type 'unsigned int'. The Cube software also gives the option of metrics producing more complex types, like complex numbers, rationals, or the 'tau atomic' type. The Score-P measurement system produces

metric values in three types: double, unsigned long long (uint64_t), and the tau atomic type. Let us take a closer look at tau atomic. It is composed of 5 different values with a total size of 36 bytes.

struct tau_atomic {
    uint32_t N;             /* byte offset 0  */
    uint64_t min;           /* byte offset 4  */
    uint64_t max;           /* byte offset 12 */
    uint64_t sum;           /* byte offset 20 */
    uint64_t sumsquared;    /* byte offset 28 */
};

Tau atomic serves for the easier computation of statistics. N stands for the number of visits, min and max for the minimal and maximal values, sum and sumsquared for the sum of values and the sum of squared values. Altogether we have a total of Nmetrics · Ncallpaths · Nlocations severity values in the performance space. But because some metrics measure data only on particular call paths, many of the severity values are equal to zero. This is explained in more detail in chapter 4.3.1.
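Because tau_atomic already aggregates the number of visits, the sum and the sum of squares, simple statistics can be derived from it directly. A small sketch follows; the helper functions are ours, not part of Cube, and on disk the fields are packed at the byte offsets listed above.

#include <stdint.h>

struct tau_atomic {        /* in-memory mirror of the 36-byte record above */
    uint32_t N;
    uint64_t min;
    uint64_t max;
    uint64_t sum;
    uint64_t sumsquared;
};

/* arithmetic mean of the recorded values */
static double tau_mean(const struct tau_atomic *t)
{
    return t->N ? (double)t->sum / t->N : 0.0;
}

/* population variance, derived from the sum and the sum of squares */
static double tau_variance(const struct tau_atomic *t)
{
    if (t->N == 0)
        return 0.0;
    double mean = (double)t->sum / t->N;
    return (double)t->sumsquared / t->N - mean * mean;
}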

3 Input/Output in HPC

Parallel input/output (I/O) operations, where many processors write data to a single file at the same time, can technically be achieved within the POSIX interface, but its implementation is typically very inefficient. For high performance I/O, the system needs to provide a high-level interface which supports partitioning of the file data among processes and complete transfers of global data structures between the process memory and the file. MPI [12], being the most widely used interface for parallel programming, supports such file access. All POSIX I/O operations have their replacement functions in the MPI interface. Partitioning of the file among processes is implemented through derived datatypes. Because they are essential for the understanding of this thesis, we describe them in more detail.

3.1 MPI derived datatypes

MPI predefined datatypes (MPI_INT, MPI_DOUBLE, ...), which correspond to the basic C or Fortran data types, are often too constraining. Let us take a look at a simple example. We have an array of data which is composed of alternating characters and integers: (char, int, char, int, ...). We would like to use the function MPI_Send() to send such a structure from one process to another using just predefined datatypes. The first option would be to copy all values into two separate arrays: one of just characters and one of just integers. We would then send both arrays with two calls of MPI_Send(). The process which receives the data would again have to transform the two arrays into one, to obtain the original sequence in memory. This option is slow due to the copying of both arrays. The second option would be to call MPI_Send() for each character and each integer separately. This gets rid of copying, but it uses many MPI calls. In particular, MPI is much more efficient in communicating a small number of large data structures rather than a large number of smaller data structures. A large number of separate calls leads to a significantly bigger communication overhead.

MPI therefore provides mechanisms for more general, mixed and even noncontiguous communication buffers. In a noncontiguous buffer, data is not stored in memory in a single chunk, but is distributed across distant memory locations. It allows one to directly transfer objects of various shapes and sizes. Such general objects are called derived datatypes. They are an extension of the predefined datatypes, similar to structs with stride in the C programming language. MPI offers many constructors to build derived datatypes. Before we look at them, we familiarize ourselves with some definitions.

A derived datatype is an opaque object that requires two things: a sequence of basic datatypes (also called the type signature of the datatype) and a sequence of integer (byte) displacements. A pair of such sequences is called a type map.

Example of a derived datatype:

type signature = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR, MPI_DOUBLE, MPI_CHAR, MPI_DOUBLE}
displacements  = {0, 1, 9, 10, 18, 19}
type map = {(MPI_CHAR, 0), (MPI_DOUBLE, 1), (MPI_CHAR, 9), (MPI_DOUBLE, 10), (MPI_CHAR, 18), (MPI_DOUBLE, 19)}

Such a type map together with a base address in memory specifies a communication buffer. It consists of n entries, where the i-th entry is at the address base_address + displacement(i). Basic predefined datatypes are just a particular case of general datatypes where the type signature consists of one item with its displacement equal to zero. Thus, MPI_DOUBLE is a predefined handle to a datatype with type map {(MPI_DOUBLE, 0)}. The extent of the datatype is defined to be the difference from the first to the last byte occupied by the entries in this datatype, rounded up to satisfy memory alignment requirements. A datatype's lower bound is the minimal byte displacement. Its upper bound is the displacement up to which the datatype has been aligned. Therefore extent = upper bound − lower bound.

We take a look at two examples of derived datatype constructors, which we will later use:

MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype* newtype);

MPI_Type_contiguous(...) is the simplest datatype constructor. It takes a handle to the oldtype and makes count copies at contiguous places. It stores the new datatype in the handle newtype. The old datatype does not need to be a predefined datatype; it can also be a derived datatype. If oldtype were a derived datatype with type map {(MPI_CHAR, 0), (MPI_DOUBLE, 1)} and extent 9, a call to MPI_Type_contiguous(3, oldtype, &newtype) would create a new derived datatype with a type map the same as in the above example.

MPI_Type_create_struct(int count,
                       int array_of_blocklengths[],
                       MPI_Aint array_of_displacements[],
                       MPI_Datatype array_of_types[],
                       MPI_Datatype* newtype);

MPI_Type_create_struct() is the most general datatype constructor. It creates count blocks, where the i-th block is located at array_of_displacements[i] and contains array_of_blocklengths[i] datatypes of type array_of_types[i].

There are many more constructors and for more information, one should refer to the MPI documentation. An important aspect of derived datatypes is the ability to construct noncontiguous buffers. By this we mean a buffer that includes memory locations not associated with the actual data in memory. In other words, the derived datatype has 'holes' (gaps) at particular displacements. This can be easily achieved with MPI_Type_create_struct(), if the value array_of_displacements[i] does not coincide with the location where block i − 1 ended. This would cause holes between different blocks. Another common way is to explicitly define lower and upper bounds that do not coincide with the end of data in the type signature. This is done with a call to MPI_Type_create_resized(). This function takes the oldtype, changes its lower bound and extent, and puts the new datatype in the handle newtype.

MPI_Type_create_resized(MPI_Datatype oldtype, MPI_Aint lower_bound, MPI_Aint extent, MPI_Datatype* newtype);

After constructing the datatype we want, we can use its handle as an argument in all point-to-point and collective communication calls, as well as in input/output operations.
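As a sketch, the example datatype from the beginning of this section (the type map with displacements {0, 1, 9, 10, 18, 19}) can be built with the three constructors just described; the function name is ours, the MPI calls are standard.

#include <mpi.h>

/* Build the example datatype from section 3.1: a (char, double) pair
 * resized to an extent of 9 bytes, then repeated three times.        */
MPI_Datatype build_example_type(void)
{
    MPI_Datatype pair, pair9, newtype;

    int          blocklengths[2]  = { 1, 1 };
    MPI_Aint     displacements[2] = { 0, 1 };
    MPI_Datatype types[2]         = { MPI_CHAR, MPI_DOUBLE };

    /* type map {(MPI_CHAR, 0), (MPI_DOUBLE, 1)} */
    MPI_Type_create_struct(2, blocklengths, displacements, types, &pair);

    /* force lower bound 0 and extent 9 so copies start every 9 bytes */
    MPI_Type_create_resized(pair, 0, 9, &pair9);

    /* three contiguous copies -> displacements {0, 1, 9, 10, 18, 19} */
    MPI_Type_contiguous(3, pair9, &newtype);
    MPI_Type_commit(&newtype);

    MPI_Type_free(&pair);
    MPI_Type_free(&pair9);
    return newtype;
}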

3.2 MPI I/O

MPI has replacement functions for POSIX I/O. Before looking at them, we need to familiarize ourselves with a few terms and explain what they mean.

An MPI file is an ordered collection of typed data items and is opened collectively by a group of processes. All collective calls on this file are executed over this group. An MPI file is a replacement for a file in the POSIX sense.

A file displacement is an absolute byte position relative to the beginning of a file.

An etype (elementary datatype) is the unit of data access and positioning. All data access is performed in etype units. An etype can be either a predefined or a derived datatype.

A filetype is the basis for partitioning a file among processes and defines a template for accessing the file. Like the etype, the filetype is also either a predefined or a derived datatype. The filetype repeated in contiguous locations constructs the view that a process has. An example is seen in figure 4.

Figure 4: Filetype A simple example of how a filetype forms a view of the file.

A view defines the current set of data visible and accessible from an open file as an ordered set of etypes. Each process has its own view of the file, defined by three quantities: a displacement, an etype, and a filetype. The pattern described by the filetype is repeated, beginning at the displacement. If the view is not explicitly defined, MPI will use its default view, with displacement equal to zero, and etype and filetype equal to MPI_BYTE. This corresponds to a view in the POSIX sense, if a file is opened with fopen(). An offset is a position in the file relative to the current view, expressed as a count of etypes.

As seen in the previous subchapter, we can construct datatypes with holes. Using such noncontiguous buffers as a filetype, a group of processes can use complementary views to achieve a global distribution of data in a file. We define a view where each process will be able to write to its part of the file. We look at the following example:

Figure 5: File partitioning The figure shows an example of data distribution using different filetypes for different processes. With a proper configuration of holes in a filetype, we can achieve a contiguous memory access, where no two processors are reading or writing to the same location. In this example, file is partitioned among three MPI processes.

Let us now take a closer look at the procedure for reading and writing files with MPI. Before performing any reading or writing, we open a file with MPI_File_open(). This is equivalent to the function fopen() in the C programming language, except that it is collective and must be called by all processes in the communicator.

MPI_File_open(MPI_Comm communicator,
              char* filename,
              int access_mode,
              MPI_Info info,
              MPI_File* fh);

We then need to construct the appropriate derived datatypes, which we will use as etypes and filetypes. This is done with the datatype constructors from chapter 3.1. To achieve the partitioning seen in figure 5, the filetype must involve 'holes', to skip the parts of the file written by other processes. To satisfy the need for complementary views, the filetypes should be of the same length. After defining the appropriate etype and filetype, we then set the view with:

MPI_File_set_view(MPI_File fh, MPI_Offset displacement, MPI_Datatype etype, MPI_Datatype filetype, char* datarep, MPI_Info info );

This function is also collective; each MPI process creates its own view of the file. We can call this function more than once, if we want to access different parts of the file with a different view. Data is moved between processes and the file by calling read and write functions. There are three orthogonal aspects to data access:

Positioning - we can specify the location of our reading and writing calls either with explicit offsets, individual file pointers or shared file pointers. Data access with explicit offsets stands for the routines where we explicitly provide the offset in bytes from the beginning of the file where we want to read or write data. Individual file pointers are the equivalent of the file pointer in C or C++. Each process has its own individual file pointer and each routine updates the pointer to the next data item after the last one is accessed. Furthermore, MPI maintains one shared file pointer per file.

Synchronism - we can perform blocking and non-blocking input/output routines. Blocking functions do not return until the reading or writing is completed. Non-blocking routines are routines where a function just initiates the process of reading or writing and returns immediately, before the access is actually complete. There is also a possibility of a restricted form of non-blocking access called split collective routines.

Coordination - collective vs. non-collective. Collective calls are called by all processes in the group that opened the file. Non-collective calls can be called by just one process. Table 1 shows the three orthogonal aspects of data access and their MPI routines. POSIX fwrite() and fread() are blocking, non-collective operations which use individual file pointers. Their MPI equivalents are MPI_File_write() and MPI_File_read().

Table 1: Data access routines The table presents the many different routines which exist for different data access. [12]

In this thesis we use blocking calls with individual file pointers (in both collective and non-collective manner). The collective call which will be needed for the parallel writing is:

MPI_File_write_all(MPI_File fh, void* buffer, int count, MPI_Datatype datatype, MPI_Status* status );

where fh is the file handle and buffer is the initial address of the buffer to be written into the file. Each process writes count elements of a buffer, where each element is of type datatype.
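Putting the pieces together, the following minimal sketch follows the procedure described in this section: build a filetype with holes, give each rank its own view, and write collectively with MPI_File_write_all(). The file name, block size and the simple round-robin partitioning are assumptions for the example, not the Cube layout.

#include <mpi.h>

#define BLOCK 4   /* number of doubles each process writes (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* local data: one block per process */
    double buffer[BLOCK];
    for (int i = 0; i < BLOCK; ++i)
        buffer[i] = rank + i / 100.0;

    /* etype: MPI_DOUBLE; filetype: one block of BLOCK doubles followed by a
       hole covering the blocks of all other ranks (extent = nprocs blocks)  */
    MPI_Datatype block_type, filetype;
    MPI_Type_contiguous(BLOCK, MPI_DOUBLE, &block_type);
    MPI_Type_create_resized(block_type, 0,
                            (MPI_Aint)(nprocs * BLOCK * sizeof(double)),
                            &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "partitioned.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* the displacement shifts each rank to the start of its own block */
    MPI_Offset disp = (MPI_Offset)(rank * BLOCK * sizeof(double));
    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* collective, blocking write with individual file pointers */
    MPI_File_write_all(fh, buffer, BLOCK, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&block_type);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Run with three ranks, this reproduces the kind of interleaved, complementary layout shown in figure 5.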

3.3 Performance of MPI I/O

Parallel applications have different I/O access patterns, and each can be presented to the I/O system in several ways, based on which I/O functions the application uses. Generally, we can say that each process needs to access a large number of small pieces of data, which are not contiguously located in the file. A common case is a two-dimensional block distribution of a matrix, where the matrix is written to the file row by row. We can classify this access pattern in four different levels [13].

Level 0 - Each process performs one write request for each local row, independent of other processes.

Level 1 - Same as level 0, except the write requests are collective.

Level 2 - Each process creates a derived datatype to represent the noncontiguous access pattern, defines a file view, and performs one write request, independent of other processes.

Level 3 - Similar to level 2, but now the processes call the write request collectively.

                                 I/O bandwidth (MBytes/s)
machine          processors   Level 0/1   Level 2   Level 3
HP Exemplar          64          0.54       1.25      50.7
IBM SP               64          1.85       1.85      57.6
Intel Paragon       256          1.12       3.33       183
NEC SX-4              8          0.62      75.3        447
SGI Origin200        32          5.06      13.1       66.7

Table 2: Write performance of MPI Table shows different performance for various access levels on different machines.[13]

We see that level 3 offers much higher performance. This is because of two important factors - data sieving and collective calls. Data sieving is a technique that reduces the number of individual calls for noncontiguous accesses. It is much better to make large I/O requests and then extract the data needed, rather than perform many small I/O requests. Collective calls enable the implementation to analyze and merge the requests of different processes. In many cases, the merged request may be large and contiguous, although the individual requests were noncontiguous. The merged request can therefore be serviced more efficiently.

3.4 I/O challenges in HPC community

Scientific applications on large-scale computers can read and write a lot of data. In the year 2015, a study [14] analyzed the I/O behavior of over a million jobs of supercomputing applications over six years across three leading high-performance computing platforms (Intrepid and Mira at the Argonne Leadership Computing Facility (ALCF) and Edison at the National Energy Research Scientific Computing Center (NERSC)). The major results from the study are:

1. Most of the applications still use POSIX related input/output routines and not parallel I/O libraries. Nearly 95% on Intrepid (80% and 50% respectively on the other two machines) used POSIX exclusively. The remaining jobs used MPI I/O directly or used libraries which are built on top of MPI I/O.

2. A small number of applications and jobs dominate the usage of platform I/O resources. 90% of the total I/O time was used by no more than 4% of the applications on Intrepid (3% and 6% on the other machines).

3. Very low I/O performance is the norm for most applications. All three machines had a peak I/O bandwidth close to 1 TB/s. The study showed that almost three quarters of the applications never exceeded 1 GB/s, roughly 0.1% of the peak. One

reason for this is also that most applications write data in text format, which generally does not scale well. Writing data in a native (binary) format is a much better practice. All these results seem very bad at first sight, but it must be noted that most of the applications do not perform large scale I/O operations. In most cases, serial reading and writing of the data is sufficient and there is no need for parallelism. Although in some cases the I/O performance could satisfy its owners, this does not necessarily hold for the platform owners. Therefore, the greatest possibility for saving resources lies in identifying an application's I/O performance before it becomes the top consumer of resources on the platform. Automatic performance analysis is especially useful here.

4 Cube

Cube [15] is an explorer for performance reports generated by automatic performance analysis tools. It is currently being developed by Forschungszentrum Jülich. It was originally a part of the Scalasca toolset. Now it is available as a separate component, and is used by both the Scalasca and Score-P measurement infrastructures.

4.1 Cube libraries

Cube provides a set of libraries and tools to store and analyze performance profiles. The CubeWriter library, which writes measurement data in the Cube4 file format, is the main focus of this thesis. It will be explained in more detail in chapter 4.4. CubeLib is a general purpose C++ library that includes tools for storing and manipulating measured profiles. Both Scalasca and Score-P use CubeLib and CubeWriter to store the performance measurement data in the Cube4 file format. Cube also includes the graphical user interface CubeGUI to visually display the measured performance data and a Java reader library (jCubeR). This can be seen in figure 6. The current version is Cube 4.4.

Figure 6: Cube libraries The figure shows a relation between Cube libraries. Main focus of this thesis is the CubeWriter library, which is used by Score-P to write data in the Cube4 file format.

The Cube4 file is opened by the CubeGUI (with the help of CubeLib) to visually display data. Running applications on large scale machines produces a large number of locations. This usually results in a bad user experience, because the user has to find values that stand out in huge data sets. In CubeGUI, non-leaf nodes in the metrics, call tree, or system tree can be collapsed or expanded, to achieve the desired level of granularity and detail. This makes it easy to identify the source of the problem. In addition, each severity value is displayed with a colored square, where the color depends on the value. This enables an easy identification of nodes of interest. CubeGUI also supports a flat call tree, which is represented as a flat sequence of all call paths. This may be useful if one is interested in the severities of certain methods, independently of the position

of their invocations. Many features can be extended using a set of predefined plugins [16]. Figure 7 shows a screenshot of CubeGUI. The display is composed of three panels which correspond to the described three dimensions.

Figure 7: CubeGUI Above figure shows a screenshot of CubeGUI. Left panel shows metrics, middle panel call tree, and right panel system tree. Each tree node can be collapsed or expanded. (In our case metric node Execution is expanded, as well as ’bt’ method in the call tree and all processes in the system tree.)

The CubeWriter library produces files in the Cube4 file format, which is a tar archive. We explain the structure of the tar archive in more detail, because it is necessary for the understanding of the next chapters.

4.2 Tar archive

Tar is a utility in computer software for archiving many different files into one archive file, also known as a tarball. The name 'TAR' comes from (T)ape (AR)chive, because this utility was originally developed to write data to sequential I/O devices. Due to historic reasons concerning tape drives and storage space on magnetic tapes, tar archives are always written in blocks of many 512-byte records. Each of the files to be stored in the tarball is preceded by a 512-byte header record. A file together with its tar header is called a file object. The tarball is ended with two 512-byte NULL blocks, which equal two empty tar headers. This indicates the end of the tar archive. A graphical representation of the tar archive layout is shown in figure 8.

Figure 8: Layout of TAR archive The figure shows the sequence of structures in a tar archive.

4.2.1 Tar header

The tar header contains descriptive metadata about the file it precedes. It consists of different fields, each of them with a strictly defined location (offset) and size. The original pre-POSIX format of the header from the year 1988 contained 9 fields, but most modern day programs use Ustar (Unix Standard Tar), which was introduced by the POSIX IEEE P1003.1 standard. It provides seven additional header fields, a total of 16, for more information and allows for longer file names. The Ustar tar header is defined as follows:

struct ustar_tar_header {
    char name[100];        /* offset 0:   name of the file        */
    char mode[8];          /* offset 100: file mode               */
    char uid[8];           /* offset 108: user id                 */
    char gid[8];           /* offset 116: group id                */
    char size[12];         /* offset 124: size of the file        */
    char mtime[12];        /* offset 136: last modification time  */
    char chksum[8];        /* offset 148: checksum                */
    char typeflag;         /* offset 156: type of archived file   */
    char linkname[100];    /* offset 157: name of linked file     */
    char magic[6];         /* offset 257: UStar indicator         */
    char version[2];       /* offset 263: UStar version           */
    char uname[32];        /* offset 265: owner user name         */
    char gname[32];        /* offset 297: owner group name        */
    char devmajor[8];      /* offset 329: device major number     */
    char devminor[8];      /* offset 337: device minor number     */
    char prefix[155];      /* offset 345: file name prefix        */
    char pad[12];          /* offset 500: padding                 */
};

The Ustar format is still compatible with the original pre-POSIX format, as the locations and offsets of the first nine fields have remained the same. Older tar programs are therefore still able to read the new tar format, because the additional information is simply ignored. To ensure the portability of tar archives across different architectures, the information in the header file is encoded in ASCII. Furthermore, all numerical values (checksum, size, mtime) are encoded as ASCII digits in octal base with leading zeros. The final character in each field should be either NUL or space.

The checksum field represents the sum of all bytes in the header block. First we initialize all eight checksum bytes to ASCII spaces (decimal number 32). We then sum the unsigned byte values of all fields. The checksum field is stored as a six digit octal number followed by NUL and then a space.

The typeflag field provides information in the case of special files. Some possible values are '0' (ASCII nul) for a normal file, '1' for a hard link, '2' for a symbolic link, '3' for a device file, etc.

The size field has a size of 12 characters. Together with a final NUL character, that leaves 11 characters which describe the file size in bytes. The maximum number that can therefore be written in such a format is 8^11 = 8589934592. Consequently, the biggest file that can be written while preserving the Unix standard tar rules is roughly 8.6 GB. This limitation was once a serious problem with tar files. To overcome this, the POSIX.1-2001 standard defines an extended tar or PAX format. Two additional typeflag values are defined, 'x' and 'g', to signal that this is not a normal tar header, but a pax extended header. A pax extended header is composed of a normal tar header (with a typeflag value 'x' or 'g'), extended header records, and another normal tar header. Such a structure is necessary so that the format is still compatible with earlier versions of tar (the extended header records are treated like an additional file). The extended header records overwrite certain fields of the normal tar header, like size. Therefore if the file size is bigger than 8.6 GB, it will be preceded by a tar header, extended header records and another tar header, in that order. For easier understanding, let us call these three 512-byte block structures also tar headers. So from now on, every time we refer to the tar header, we keep in mind that it can be either 512 bytes or 3 · 512 bytes long.
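The checksum rule above can be written down in a few lines of C; the following is a sketch under those rules, not the CubeWriter code.

#include <stdio.h>
#include <string.h>

/* Compute and store the checksum of a 512-byte tar header block. */
static void tar_header_checksum(unsigned char header[512])
{
    /* the checksum field occupies bytes 148..155 and is treated as spaces */
    memset(header + 148, ' ', 8);

    unsigned int sum = 0;
    for (int i = 0; i < 512; ++i)
        sum += header[i];                 /* unsigned byte values of all fields */

    /* six octal digits, then NUL, then a space */
    snprintf((char *)header + 148, 8, "%06o", sum);
    header[154] = '\0';
    header[155] = ' ';
}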

4.2.2 Tar file

The file itself is written unchanged, as there are no constraints regarding its structure. Tar supports both text and binary files. Because of the 512-byte block structure of tar archives, we must always add padding to a multiple of 512 bytes at the end of the file. In other words, tar headers can only start on multiples of 512 bytes in the archive.
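The padding rule amounts to rounding every file size up to the next multiple of 512 bytes; a one-line helper (the function name is ours) makes this explicit.

#include <stddef.h>

/* round a file size up to the next multiple of the 512-byte tar record size */
static size_t tar_padded_size(size_t file_size)
{
    return (file_size + 511) / 512 * 512;
}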

4.3 Cube4 file format

The Cube4 file format is written as a tar archive. It consists of 2 · Nmetrics + 1 files. For each metric we have two files - a metric data file and a metric index file. At the end we have an additional anchor file. Because the Cube format is written as a tar archive, each of the above mentioned files is preceded by a tar header and padded to a multiple of 512 bytes. Files are ordered by metrics; for each metric we first have a metric data file, followed by a metric index file. The anchor file is the last file. The layout of the files in the tar archive is displayed in figure 9.

Figure 9: Sequence of files in Cube4 tar archive The figure shows how Cube internal files are positioned in the tar archive.

The tar archive is chosen due to easy out-of-order seeking. Because the tar header provides information on the file size, we can skip to the file of interest relatively fast. In addition, a tar archive is convenient for maintenance. With the use of hex editors, we have very low-level access to the file contents, which is useful for studying and debugging. In addition, tar is supported by all Linux installations, even the most basic. Let us now take a closer look at the files inside the archive.

4.3.1 Metric data file

The first file - the metric data file - includes the severity values for a particular metric. The file begins with a magic word - the 'CUBEX.DATA' file signature, which is written as 10 ASCII characters. Then all severity values for the metric are written. These values are written in a native (binary) format.

struct metric_data_file {
    char   metric_data_header[10];   /* byte offset 0:  "CUBEX.DATA" signature */
    double sev_values[size];         /* byte offset 10: severity values        */
};

The order in which severity values are written is very important. Before we explain it, we must take a look at some different aspects of metrics.

Metric format stands for how many severity values are written in the file, because the data for a metric can either be sparse or dense. The reason is that some metrics provide data just for particular call paths and some provide data for all. An example of a sparse metric is the metric MPI Time. For this metric, the time measurement is evaluated only on call paths associated with MPI calls. All other call paths have the MPI time equal to zero. The number of MPI function calls is usually small compared to the number of all function calls in the application. For this reason, the metric MPI Time is considered sparse. On the other hand, the metric Time is measured on every call path. Metrics that include data for all call paths are considered dense. If a metric is sparse, the file contains severity values of only the non-zero call paths. If a metric is dense, the file contains values for all call paths. Therefore, the size of the metric data file depends on the format of the metric. The total size of the metric data file equals:

size = 10 + Nlocations · Ncnodes · sizeof(datatype) bytes,

where Ncnodes is the number of all call paths in the dense case, or the number of non-zero call paths in the sparse case, and sizeof(datatype) is the size of the datatype in which the severity values for this metric are stored.
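For illustration, the formula can be expressed as a small helper (the names are ours); for a sparse metric, n_cnodes counts only the non-zero call paths.

#include <stddef.h>

/* size of one metric data file before tar padding:
   10-byte "CUBEX.DATA" signature plus one value per (call path, location) */
static size_t metric_data_file_size(size_t n_locations, size_t n_cnodes,
                                    size_t value_size)
{
    return 10 + n_locations * n_cnodes * value_size;
}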

Metric type describes the relation between measurements in a call tree. Metrics can store values either in an exclusive or an inclusive manner. In the inclusive value, the information of all sub-elements is aggregated into a single value (the value includes the values of its children). On the other hand, the exclusive value excludes the values of its children and the information cannot be subdivided further. Let's take a look at an example to see how the two cases differ in practice. A simple program consists of a function main() that calls functions foo() and bar(). If we talk about the overall time that main() needed to execute, we are talking about its inclusive value. The inclusive value also includes the time for both foo() and bar() to execute. The exclusive time would be the time that the

program spent in the function main(), without the times spent in foo() and bar(). An inclusive value can never be less than the sum of the inclusive values of its children. An example is shown in figure 10.

Figure 10: Inclusive vs. exclusive values The figure shows an example of two different types of values for the time metric. Both call trees show the same measurement, but in a different type. On the left side we have inclusive time values, and on the right side values in exclusive type. For leaf nodes, exclusive and inclusive values are equal.

Because the measurement system measures time as the difference between when the application stepped into a function and out of it, it is obvious that the Time metric measures values in an inclusive manner and is therefore inclusive. An example of an exclusive metric would be the Visits metric. The Visits metric measures how many times a function has been visited. The value is independent of the values of its children. The order of the call-path severity values written in the file depends on whether a metric is inclusive or exclusive [17].

CubeGUI offers users the option to expand or collapse a call node in a tree. Expanded nodes show exclusive values and collapsed ones inclusive values. Therefore, the library is transforming exclusive values into inclusive and vice versa. The formula to compute exclusive values out of inclusive ones is:

t_{excl} = t_{incl} - \sum_{children} t_{incl}

Similarly we get for the opposite direction:

t_{incl} = t_{excl} + \sum_{children} t_{incl}

There is a fundamental difference in the above formulas. If we want to transform an inclusive value into an exclusive one, we need just the inclusive values of the node's direct children. However, transforming exclusive into inclusive, we must first compute the inclusive values of the node's children. This leads to the recursive computation of inclusive values of the entire sub-tree (a small code sketch of this recursion is shown after Table 3 below). Due to caches in computers, data access is faster if the relevant information in a file is stored close together. The reason for that is that reading from a file

Due to caches in computers, data access is faster if the relevant information in a file is stored close together, because reading from a file requires transferring many chunks of data into memory; if the data is close together, it is transferred in fewer chunks, which results in a higher speed. This is also known as data locality. If we have inclusive values, it is better to keep the data of the children nodes together. This is achieved through the inclusive order, which corresponds to a breadth-first traversal of the tree [18]: the root node is followed by all of its children, then all grandchildren, great-grandchildren and so on. If we have exclusive values, we order the nodes by depth-first search: we recursively descend into a child node until we reach a leaf. Because the construction of both orders is already implemented in CubeWriter, we omit the details of how this is done. What is important for us is that the two orders differ and that which one is chosen depends on the metric type. Figure 11 shows what the two orders look like in our example.

Figure 11: Tree ordering
The figure shows the inclusive and the exclusive order for a simple call tree.

order in the file | inclusive      | exclusive
1                 | main()         | main()
2                 | MPI_Init()     | MPI_Init()
3                 | foo()          | foo()
4                 | bar()          | foobar()
5                 | MPI_Finalize() | barfoo()
6                 | foobar()       | bar()
7                 | barfoo()       | barbar()
8                 | barbar()       | MPI_Barrier()
9                 | MPI_Barrier()  | MPI_Finalize()

Table 3: Call path order

We can now finally describe the order of severity values in the metric data file. The values are primarily sorted by call paths, ordered as described above; for each call path, the severity values of all locations follow. Within a call path, the values are sorted by the rank of the MPI process, with each process contributing the values of all OpenMP threads it spawned. In pseudo code, this can be described with three nested loops: the first loop runs over all call paths in either exclusive or inclusive order, the second loop over all MPI processes, and the third loop over the OpenMP threads:

for call_path = 1, 2, ...
    for MPI_rank = 1, 2, ...
        for OpenMP_thread = 1, 2, ...
            fwrite(sev_value);
        end
    end
end

The order is illustrated in figure 12.

Figure 12: Structure of metric data file
Severity values in the metric data file are ordered in a particular call-path order depending on the metric type. The values of a particular call path are ordered by locations. In the sparse case, the file includes just the non-zero call paths.

After the severity values, null characters are padded up to a multiple of 512 bytes to satisfy the tar block structure. All numerical values are written in binary form.
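To make the layout concrete, the byte offset of a single severity value in a dense metric data file follows directly from the ordering above. The following helper is an illustration only (the function name and the 0-based index convention are ours); the 10 bytes account for the marker at the beginning of the data file.

#include <stddef.h>

/* Offset of the value of call path cp (0-based, in the metric's call-path order)
 * and location loc (0-based: ordered by MPI rank, then by thread) in a dense
 * metric data file. */
size_t
sev_value_offset( size_t cp, size_t loc, size_t n_locations, size_t value_size )
{
    return 10 + ( cp * n_locations + loc ) * value_size;
}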

4.3.2 Metric index file

The second file is the metric index file, which is relevant mainly for sparse metrics. It stores the information about which call paths were written to the metric data file.

struct index_file {                  /* byte offset */
    uint32_t       endian;           /*  0 */
    uint16_t       version;          /*  4 */
    uint8_t        metric_format;    /*  6 */
    uint32_t       size;             /*  7 */
    uint32_t[size] indeces;          /* 11 */
};

uint8_t, uint16_t and uint32_t are the predefined C datatypes for unsigned 8-, 16- and 32-bit integers, respectively. Endian refers to the byte order and is in our case always equal to 1. Version is equal to 0. Metric format stands for the sparse or dense format. Size is the number of call paths written. Indeces is an array of the call-path ids that were written to the metric data file; this field is written only in the sparse case. For a dense metric, size is equal to 0 and the array indeces contains no values.
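For illustration, an index file with this layout could be produced with plain C stdio as sketched below. This is not the CubeWriter code; the helper name is ours, and the numeric encoding of metric_format (1 for sparse) is an assumption.

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: write a sparse metric index file with the layout above. */
void
write_index_file( FILE* f, uint32_t n_written, const uint32_t* cnode_ids )
{
    uint32_t endian        = 1;  /* marks the writer's byte order            */
    uint16_t version       = 0;
    uint8_t  metric_format = 1;  /* assumption: 1 = sparse, 0 = dense        */

    fwrite( &endian,        sizeof( endian ),        1, f );
    fwrite( &version,       sizeof( version ),       1, f );
    fwrite( &metric_format, sizeof( metric_format ), 1, f );
    fwrite( &n_written,     sizeof( n_written ),     1, f );
    fwrite( cnode_ids,      sizeof( uint32_t ), n_written, f ); /* only in the sparse case */
}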

4.3.3 Anchor file

An additional anchor file stores information about the analyzed application. It is written as an .xml file, but we will not go into the details of its structure, because it is not important for the understanding of the next chapters. The anchor file includes information about the analyzed program (name, user, time). Furthermore, it provides the definitions of all metrics (name, type, format, url, ...), the definitions of all regions and a definition of the system tree on which the program was running. It also stores the types of the metrics, so that the CubeGUI can correctly read the data and associate the call paths with the right values. The size of the anchor file depends heavily on the size of the call tree and the system tree, because every call node and every system tree node is explicitly defined in the file.

4.4 CubeWriter

The CubeWriter library is used to write the performance profile in the Cube4 file format described above.

4.4.1 Usage

We briefly explain how the CubeWriter library is used from a user's point of view. Before the actual writing to the file takes place, the CubeWriter library needs to gather information about the measured application, the measurement system and the system on which the application ran. This includes the creation of the main cube_t struct with a call to:

cube_t* cube_create( char*               cube_name,
                     enum CubeFlavours_t cv,
                     enum bool_t         compressed )

cube_name is the name of the Cube4 file to be created. compressed is a boolean value which tells Cube whether to compress the file (details about compression are omitted). cv is a dummy argument which was introduced for future parallel writing but has no meaning in the current version; it will be explained in chapter 6. The function returns a pointer to the cube_t struct. After that, the user must provide the definitions of all metrics, regions, call paths, locations, etc.; let us call these the cube structures. The CubeWriter library provides constructors for all of them: cube_def_metric(), cube_def_region(), cube_def_cnode() and others. A pointer to the cube_t struct is passed to each of these functions as one of the arguments.

The next step for the user is to provide the severity values for each metric. Before doing that, the function cube_set_known_cnodes(...) is called to tell CubeWriter how many and which call paths the metric has values for. The library then provides a set of functions to write the full set of measurement values of a particular call path for some metric. An example for severity values of data type double is:

cube_write_sev_row_of_doubles( cube_t*      cube,
                               cube_metric* metric,
                               cube_cnode*  callnode,
                               double*      sev_row )

Similar functions with the same task exist for other data types (e.g. cube_write_sev_row_of_uint64_t(...) for unsigned 64-bit integers). The arguments cube, metric and callnode are pointers to the structures we previously defined. The array sev_row contains the severity values of a particular call node of a metric. It is the user's task to gather all severity values of a particular call path into the array sev_row. The values in this array must be ordered by locations in the same order in which they are written to the file; otherwise the file is erroneous.

The function cube_write_sev_row_of_doubles(...) needs to be called for every call path that the metric has values for: in the case of a dense metric for all call paths in the call tree, and in the sparse case just for the non-zero call paths, i.e. for all those that were marked in the call to cube_set_known_cnodes(...).

After finishing the above process for all defined metrics, the user must call the function cube_free(). This function writes the additional anchor file (together with its header) and two empty tar headers that end the tar archive. It then closes the file and frees all memory that the library allocated. After this, the user can open the generated Cube4 file with the CubeGUI.

4.4.2 Library architecture

Here, we present the most important parts of the CubeWriter library. It must be noted immediately that the library contains many features that are not covered in this thesis; we stick to those parts that are necessary for the understanding of the writing algorithm. The library is written in the C programming language. Although C is not an object-oriented language, object-oriented paradigms are implemented with the use of structs and pointers. A very simplified UML diagram is presented in figure 13.

Figure 13: Simplified UML diagram of CubeWriter
The figure shows a very simplified UML diagram of the CubeWriter library with the most important structures, fields and methods. Many additional ones have been left out due to the smaller scope of this thesis.

cube_t is the main struct that includes pointers to all call nodes, metrics, regions and locations. It also contains all public methods from the previous sub-chapter that are used by the user. When cube_def_metric(), cube_def_region() or cube_def_cnode() is called, cube_t creates and initializes the associated structs. The tar writer struct handles the proper structure of the files in the tar archive (tar headers and the 512-byte block structure).

We briefly explain the most important internal steps of how the CubeWriter library writes the tar archive; a lot of details are omitted. When the user calls cube_write_sev_row_of_doubles(...), the library first calculates the position of the severity values of this call node in the file. After that, it seeks to that position and writes the severity values into the file. Additionally, when this function is called for the first time for a particular metric, the tar writer struct calls cube_metric_data_start(), which creates the tar header. When it is called for the last time, the tar writer also calls cube_metric_data_finish(), which writes the file padding to a multiple of 512 bytes. CubeWriter then also writes the tar header for the metric index file and the metric index file itself. When this function is called for the next metric, the CubeWriter library has thus already created the correct tar layout as seen in figure 9. The call to cube_free() writes the anchor file (with its tar header and padding) as well as the two empty tar headers that end the tar archive. In addition, it frees all of the allocated memory and closes the Cube4 file, which is then available to the user. The sequence diagram of this entire process is presented in figure 14.

The current CubeWriter library writes the Cube4 file entirely sequentially, with no parallelization. Since the function cube_write_sev_row_of_doubles(...) takes an argument with the severity values of all locations, it assumes that the root process knows all severity values, which is usually not the case. If the application runs with many processes, each process measures the data of its own execution of the application, so the root process only stores its own measurement data (and/or that of the threads it spawned). Let us take a look at how Score-P solves this problem.

Figure 14: Internal steps of CubeWriter
The most important internal steps of the CubeWriter library when the user calls cube_write_sev_row_of_doubles() and cube_free(). All steps are executed only on the root process.

4.4.3 How Score-P uses the CubeWriter library

Score-P uses CubeWriter to write its measurement data into the file. As described in chapter 4.4.1, Score-P first creates the cube_t struct and defines all cube structures. This is done only on the root process. Then, Score-P enters the loop over all metrics. Before writing severity values, the function cube_set_known_cnodes(...) is called, which tells CubeWriter how many call paths have measured values for the current metric.

In Score-P, each MPI process measures its own data (the data of all OpenMP threads it spawns). After the measurement, these values are still stored on each individual MPI process; the root process does not automatically know the values of the other processes. To call the function cube_write_sev_row_of_doubles(), Score-P therefore first has to gather the data on the root process, which is done in three steps:

1. Each process creates a local array that contains the severity values of its threads.
2. The root process gathers the arrays of all other processes into a global array. If the number of threads per process is the same for every process, this is done with MPI_Gather(), otherwise with MPI_Gatherv().
3. The root process calls cube_get_cnode() to get the right call node and then cube_write_sev_row_of_doubles(), which writes the entire row of severity values.

This process is repeated for all call paths of all metrics. As we will show later, the gathering of values on one process is the main reason for the slow performance of the current library. Afterwards, cube_free() is called on the root process to finish writing the file. Except for rank 0, the MPI processes have no contact with the CubeWriter library throughout this procedure. The sequence diagram of Score-P is presented in figure 15.
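Before looking at the diagram, the gathering in step 2 can be sketched as follows. This is a simplified, stand-alone illustration rather than the Score-P source; the variable names are ours, and the counts and displacements per rank are assumed to be known on rank 0.

#include <mpi.h>
#include <stdlib.h>

/* Illustrative only: gather the severity values of one call path onto rank 0.
 * local_vals holds n_local values (one per thread of this process); counts and
 * displs describe the number of threads of every rank. */
double*
gather_row_on_root( const double* local_vals, int n_local,
                    const int* counts, const int* displs, int n_total, int rank )
{
    double* global_row = NULL;
    if ( rank == 0 )
    {
        global_row = malloc( n_total * sizeof( double ) );
    }
    MPI_Gatherv( local_vals, n_local, MPI_DOUBLE,
                 global_row, counts, displs, MPI_DOUBLE,
                 0, MPI_COMM_WORLD );
    return global_row; /* on rank 0: the row for one call node */
}

On rank 0, the returned row could then be handed to cube_write_sev_row_of_doubles() for the corresponding call node.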

Figure 15: Score-P sequence diagram
The figure shows the steps taken by Score-P when writing to the Cube4 file. All calls to CubeWriter are made from rank 0.

5 New writing algorithm

The current writing algorithm is very straightforward, but it does not provide the necessary speed when metric data files become large. We want to construct an algorithm in which all processes write to the file simultaneously with a minimum of communication between them. In addition, we want to remove the need to gather the severity values on the root process.

5.1 Metric data file

5.1.1 File partition

The biggest files in the tar archive are the metric data files. This is true especially for dense metrics, because the severity values of all call paths are written to the file. Measurement systems like Score-P work in such a way that each MPI process measures its own data during the application run, which means that the severity values are already distributed over the processes. Because of this, we want to construct an algorithm in which each MPI process writes the severity values of its own threads. This way, the processes also do not have to send their data to the root process, which is currently required when using the library. Figure 16 shows which severity values in the file belong to which process.

Figure 16: File partition
The figure shows the partitioning of the values in the metric data file. Each MPI process stores the severity values of its threads.

The file needs to be partitioned between the MPI processes. Each process may view only that part of the file where its severity values are located; this way, writing to the file skips the severity values of the other processes and touches only the right locations. Luckily, metric data files possess a well-organized, repetitive structure for all processes, as seen in figure 16. Because the number of locations per rank is the same for all call paths, each process writes chunks of data at strictly defined intervals, which are N_locations values apart. To use MPI_File_write_all(), each process must gather the severity values of all call paths in one array. The data needs to be ordered in the same way as the call paths, depending on the metric type (inclusive/exclusive).

Figure 17: New algorithm
The figure shows the new algorithm and how different ranks write different parts of the Cube4 file.

5.1.2 MPI steps

The new algorithm is implemented with MPI I/O routines. Each process needs to construct an appropriate etype and filetype and then set the right view, as described in chapter 3. We now show how to achieve this with the correct MPI derived datatypes for this particular case.

The etype is the unit of measurement in the file and depends on the datatype of the corresponding metric: double leads to MPI_DOUBLE, unsigned 64-bit integer to MPI_UINT64_T, and so on. In the case of the tau_atomic datatype, the etype needs to be created explicitly with datatype constructors and must have the same structure as tau_atomic.

In figure 16, we see the repetitive pattern for each call path. The filetype must have a length of N_locations etypes, because the pattern repeats after all locations of each call path. It consists of N_threads etypes, which correspond to the severity values of the rank's threads. The severity values of other threads are written by other processes and must not be accessible; therefore the filetype contains N_locations − N_threads holes at the remaining positions. This ensures that each process skips the parts of the file where the severity values of the other processes are written.

To synchronize the filetype with the right offsets of the data in the file, we use the following approach. All etype units are located at the beginning of the filetype, and we set up the view of the file with a different displacement for each process:

    disp(r) = 10 + Σ_{i=1}^{r−1} N_{t_i},

where r is the MPI rank and N_{t_i} the number of threads spawned by the i-th rank; 10 is added due to the magic word at the beginning of the file. In other words, each process begins its view at the offset of the severity values of its own threads for the first call path in the file. This is done by calling MPI_File_set_view(). The partitioning of the file with this approach is presented in figure 18.

Figure 18: Filetypes of processes
The picture shows a simple example of the filetypes of 4 processes. Each white cell represents a hole in the filetype, while the colored cells show where the data is present.

All processes can then call MPI_File_write_all() to write the metric data into the file. Because of the view we set, each process writes only to those locations in the file that it can 'see' through its view. After the collective write, the root process calculates the difference to the next multiple of 512 bytes and writes the additional null characters required by the tar layout rules.
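The steps described in this section can be sketched in a few lines of MPI for the double case. This is an illustration under our own naming, not the CubeWriter implementation; error handling and the tau_atomic case are omitted.

#include <mpi.h>

/* Illustrative only: each rank writes its n_threads values per call path,
 * skipping the values of all other ranks. */
void
write_metric_data( MPI_File fh, const double* sev_rows, int n_callpaths,
                   int n_threads, int n_locations, int thread_offset )
{
    MPI_Datatype etype = MPI_DOUBLE;
    MPI_Datatype contig, filetype;

    /* n_threads contiguous etypes at the start of the pattern ...            */
    MPI_Type_contiguous( n_threads, etype, &contig );
    /* ... resized to an extent of n_locations etypes, so the remaining
     * n_locations - n_threads positions act as holes for the other ranks.    */
    MPI_Type_create_resized( contig, 0,
                             (MPI_Aint)n_locations * (MPI_Aint)sizeof( double ),
                             &filetype );
    MPI_Type_commit( &filetype );

    /* Displacement: 10 bytes of marker plus the values of the lower ranks
     * for the first call path.                                               */
    MPI_Offset disp = 10 + (MPI_Offset)thread_offset * sizeof( double );
    MPI_File_set_view( fh, disp, etype, filetype, "native", MPI_INFO_NULL );

    MPI_File_write_all( fh, sev_rows, n_callpaths * n_threads, etype, MPI_STATUS_IGNORE );

    /* Restore the default byte-stream view for subsequent sequential writes. */
    MPI_File_set_view( fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL );
    MPI_Type_free( &filetype );
    MPI_Type_free( &contig );
}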

5.2 Metric index file

The metric index file is still written sequentially by the root process. The reason is that the metric index file does not possess any repetitive structure, so no reasonable partitioning of the file is possible. Furthermore, the data written to the metric index file is available only on the root process, and the file is small. To avoid skipping any bytes while writing it, the view of the file must again be set to the default view (a linear byte stream, with etype and filetype equal to MPI_BYTE). Writing to the file is done with MPI_File_write().

5.3 Anchor file

The anchor file is also written by the root process, because it includes the information about the cube structures (regions, metrics, call paths, ...), and these structures are also known only to rank 0. Because the anchor file is much bigger than the metric index file, we must be careful about how we use the MPI writing routines. One idea is to write the file line by line, with one call to MPI_File_write() per line. This is not good practice, because calls to MPI_File_write() are relatively expensive and we want to use as few of them as possible. The fastest solution would be to construct one array with the entire anchor file content and use just one writing routine, but this is not possible due to the limited size of the buffer. We compromise between high speed and the limited buffer by writing the anchor file in big chunks: we allocate memory of a certain size, fill it with the content of the file and then call MPI_File_write(); we repeat this process as many times as it takes to write the entire file.
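The chunked writing can be sketched as follows (an illustration only; the buffer size and all names are assumptions, not the CubeWriter internals). Text is appended to a fixed-size buffer, which is flushed with a single MPI_File_write() whenever it is full.

#include <mpi.h>
#include <string.h>

#define ANCHOR_BUF_SIZE ( 4 * 1024 * 1024 )  /* assumed buffer size */

typedef struct
{
    MPI_File fh;
    char     buf[ ANCHOR_BUF_SIZE ];
    size_t   used;
} chunked_writer;

/* Append a piece of text; flush the buffer with one MPI write when it is full. */
void
chunked_append( chunked_writer* w, const char* text, size_t len )
{
    while ( len > 0 )
    {
        size_t space = ANCHOR_BUF_SIZE - w->used;
        size_t n     = ( len < space ) ? len : space;
        memcpy( w->buf + w->used, text, n );
        w->used += n;
        text    += n;
        len     -= n;
        if ( w->used == ANCHOR_BUF_SIZE )
        {
            MPI_File_write( w->fh, w->buf, (int)w->used, MPI_CHAR, MPI_STATUS_IGNORE );
            w->used = 0;
        }
    }
}

/* Write whatever remains in the buffer at the end. */
void
chunked_flush( chunked_writer* w )
{
    if ( w->used > 0 )
    {
        MPI_File_write( w->fh, w->buf, (int)w->used, MPI_CHAR, MPI_STATUS_IGNORE );
        w->used = 0;
    }
}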

6 Implementation

In this chapter, we present the most important aspects of the implementation of the new algorithm.

6.1 New library architecture

Because Score-P and Scalasca are primarily used for parallel applications, they are written on top of MPI and OpenMP. The current CubeWriter library, however, is used in a non-parallel way by just one process and is therefore written without any MPI or OpenMP. This now changes, as all processes will use the library simultaneously. We want to rewrite the library in such a way that its usage in Score-P is modified as little as possible: all public methods keep their meaning and are only adapted to the new algorithm.

For each process to have access to the file, the cube_t struct now has to be created on every MPI process, which leads to a problem: the information about metrics, call tree and system tree (the cube structures) is available in Score-P only on the root process. Instead of broadcasting the entire data about the measurement system and the application to all other processes, we initialize a fully defined cube_t struct with all cube structures only on rank 0. The cube_t structs on the other ranks contain null pointers to these structures. This is not problematic, because the parts of the file where they are necessary (tar headers, metric index files, anchor file) are anyway written only by rank 0. To distinguish the two cases more easily, an enum cube flavour is added to the cube_t struct; its value is cube_master on rank 0 and cube_writer on the other ranks.

After the tar writer struct is created, the function cube_writing_start() now creates an MPI file with MPI_File_open() instead of POSIX fopen(). All functions like cube_write_sev_row_of_doubles() are transformed into their new versions, like cube_write_all_sev_rows_of_doubles():

cube_write_all_sev_rows_of_doubles( cube_t*      cube,
                                    cube_metric* metric,
                                    double*      sev_rows )

This function is now collective and must be called by every process for every metric. Each MPI rank passes its own cube_t struct as the first argument. The second argument is a pointer to the metric and is meaningful only on the root process, because the other ranks do not have these structs initialized; they simply pass a null pointer. The third argument, the array sev_rows, now contains the values of all threads spawned by the process for all call nodes of the metric. These values need to be sorted in the right call path order, depending on the metric type (exclusive vs. inclusive). There is no argument for a call node anymore, because all call nodes are written at once.

As discussed in 5.1.2, the main task of the parallel writer is the initialization of the proper etype and filetype structures for the collective MPI I/O routines, which has to be done on every process.

We gather the functions related to this in a struct parallel_metric_writer, an instance of which is added to cube_t. The function prepare_for_writing() in parallel_metric_writer initializes the proper etype and filetype. For this initialization, the following fields are added to the cube_t struct: the number of the process's own threads (to create the right filetype), the number of all threads (to create a filetype of the appropriate length), the offset of the process's threads (to calculate the right displacement in the file), as well as the rank and the number of all ranks. The constructor of the cube_t struct takes five additional arguments for these fields and now looks as follows:

cube_t* cube_mpi_create( char*               cube_name,
                         enum CubeFlavours_t cv,
                         int                 world_rank,
                         int                 world_size,
                         int                 total_threads_number,
                         int                 threads_number,
                         int                 threads_offset,
                         enum bool_t         compressed )

When the writing of a metric begins, the root process must first broadcast the information about the metric's datatype, so that each process initializes the right etype. This is done through the function prepare_for_writing(), which also constructs the filetype. The function calculate_displacements() is then called, which calculates the displacement at which the view of each process should start; for this, the root process must also broadcast the offset of the metric data file within the tar archive. The view is then set on each process with MPI_File_set_view() with the proper displacement, filetype and etype, to achieve the right partitioning of the file. Afterwards, we write to the file with a call to MPI_File_write_all(). As the communication buffer, we use the pointer to the array sev_rows that was passed to cube_write_all_sev_rows_of_doubles(); because this array already contains the call paths in the right order, the severity values end up in the right places in the file. After writing, we set the view back to the normal sequential view, setting both etype and filetype to MPI_BYTE, with another call to MPI_File_set_view(); this is necessary for writing the tar header and the metric index file.

The other parts of the tar archive, which are not written in a parallel manner, also need to be adapted: because the file was opened with the MPI interface, all POSIX fwrite() calls need to be changed to MPI_File_write() (called only by the root process). The main steps of the implemented algorithm are shown in figure 19.
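The broadcast and the displacement calculation can be sketched as follows. This is a simplified illustration with our own names; exactly how the data-file marker and the offset within the tar archive are combined is an assumption here, not a statement about the CubeWriter internals.

#include <mpi.h>
#include <stdint.h>

/* Illustrative only: the master broadcasts the metric's value size and the
 * start offset of the metric data file; every rank then derives the byte
 * displacement for its view. */
MPI_Offset
broadcast_and_get_displacement( int value_size_on_root, uint64_t file_start_on_root,
                                int thread_offset )
{
    int      value_size = value_size_on_root;   /* meaningful on rank 0 only        */
    uint64_t file_start = file_start_on_root;   /* data file offset in the archive  */

    MPI_Bcast( &value_size, 1, MPI_INT,      0, MPI_COMM_WORLD );
    MPI_Bcast( &file_start, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );

    /* 10 bytes of marker, then the values of the lower-ranked processes. */
    return (MPI_Offset)file_start + 10 + (MPI_Offset)thread_offset * value_size;
}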

The new library also includes the option of a PAX extended header for the case that a metric data file is bigger than 8.6 GB. The function report_metric_data_start() first calculates the size of the metric data file; if it is larger than the old tar header allows, it writes a PAX extended header as defined in the POSIX.1-2001 standard.

Figure 19: Internal steps of the new CubeWriter library
The most important steps of the new CubeWriter library when the functions cube_write_all_sev_rows_of_doubles() and cube_free() are called. One should compare this diagram with the one in figure 14 to see how the steps have changed. Certain functions are now called by all processes, but the ones regarding tar headers, metric index files and the anchor file are still executed only on rank 0.

6.2 Reconfiguring Score-P

Rewriting the CubeWriter library consequently changes how the library is used in Score-P. For easier implementation, a Score-P branch was created and modified to use the proper branch of the new CubeWriter library; this was done through the use of svn:externals. Score-P's interaction with the CubeWriter library is handled in a file called scorep_profile_cube4_writer.c.

The cube_t struct is now initialized on all processes. The new arguments of the cube_t constructor first have to be calculated in Score-P: at the beginning, the root process gathers the number of threads per process from each rank and then calculates both the number of all threads and the thread offsets of every rank. With MPI_Bcast(), this information is passed to the other ranks before cube_mpi_create() is called.

Because cube_write_all_sev_rows_of_doubles() is now called just once per metric (collectively by all processes), Score-P needs to construct, on each process, an array with the severity values of all call paths for all of its threads. The values in this array need to be in the same order as they are written to the file, depending on whether the metric is inclusive or exclusive. Since the metric definitions are initialized only on the root process, the other processes do not know the metric type in advance. For each metric, the root process therefore first informs the CubeWriter library about the non-zero call nodes of the metric and then broadcasts this information to all other processes. Each process then constructs the array of values for all nodes in the right order, and the processes collectively call cube_write_all_sev_rows_of_doubles(). Because all processes created the cube_t struct, they must also all call cube_free(). The new Score-P sequence diagram is shown in figure 20.
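The calculation of the new constructor arguments can be sketched as follows (an illustration with our own names, not the Score-P source).

#include <mpi.h>
#include <stdlib.h>

/* Illustrative only: rank 0 gathers the thread count of every rank, forms a
 * prefix sum as the per-rank offset, and distributes the results. */
void
compute_location_layout( int n_my_threads, int* total_threads, int* my_offset )
{
    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    int* counts  = ( rank == 0 ) ? malloc( size * sizeof( int ) ) : NULL;
    int* offsets = malloc( size * sizeof( int ) );

    MPI_Gather( &n_my_threads, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD );

    int total = 0;
    if ( rank == 0 )
    {
        for ( int i = 0; i < size; ++i )
        {
            offsets[ i ] = total;    /* number of locations of all lower ranks */
            total       += counts[ i ];
        }
    }
    /* pass the layout to all ranks, as described above */
    MPI_Bcast( &total,  1,    MPI_INT, 0, MPI_COMM_WORLD );
    MPI_Bcast( offsets, size, MPI_INT, 0, MPI_COMM_WORLD );

    *total_threads = total;
    *my_offset     = offsets[ rank ];
    free( counts );
    free( offsets );
}

The results, together with the rank and the communicator size, would then be passed as the world_rank, world_size, total_threads_number, threads_number and threads_offset arguments of cube_mpi_create().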

Figure 20: New Score-P sequence diagram
The figure shows the time sequence diagram of Score-P using the new version of the CubeWriter library. One should compare this with the old diagram in figure 15.

7 Results and discussion

7.1 System

The new library was tested on the supercomputer JURECA [19] at Forschungszentrum Jülich. JURECA is equipped with 1872 compute nodes; each compute node contains two Intel Xeon E5-2680 v3 Haswell 2.5 GHz multi-core processors with 12 cores each. Additionally, 75 compute nodes are equipped with two NVIDIA K80 GPUs (four visible devices per node). The system memory is based on the DDR4 memory technology: 1605 compute nodes have a memory of 128 GiB, 128 nodes a memory of 256 GiB, and the remaining nodes a memory of 512 GiB. Altogether, this sums up to 45 216 CPU cores with a peak performance of 1.8 petaflops.

JURECA uses a combination of Slurm (Simple Linux Utility for Resource Management) and ParaStation for managing the workload on the available resources. Users submit batch applications (shell scripts) to send jobs to the compute nodes. A simple example is:

#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch

srun ./prototype_new.c.exe

In this example, the file prototype_new.c.exe is executed on 4 nodes with 96 MPI tasks (24 per node). mpi-out.%j is the name of the file for the output stream, where %j is the job ID (similarly for the error stream). The default batch partition is used and the maximum execution time is limited to 15 minutes. JURECA has a 100 GiB per second storage connection to the Jülich Storage Cluster (JUST), which serves as the central GPFS (General Parallel File System) fileserver for the supercomputing systems (e.g. JURECA, JUWELS, QPACE3).

7.2 Performance measurement with a prototype

7.2.1 Prototype

To analyze the speed of the new CubeWriter library, we write a prototype of the library usage. We do not measure the performance of any application; instead, we mimic the measurement system by defining our own random cube structures and assigning random severity values to the Cube, and we call the appropriate functions of the CubeWriter library to write these values into the file. By constructing a prototype, we have more control over the metrics and the call tree and can construct an arbitrary Cube4 file.

We define seven metrics, two inclusive and five exclusive, to cover both metric types. Furthermore, two of them store values of type double, three of type uint64_t and two of type tau_atomic. The first, second and sixth metrics are dense, the others sparse: the third and fourth metrics include 5% of the call nodes, the fifth and seventh 20%.

metric | datatype   | format
1      | double     | dense
2      | uint64_t   | dense
3      | uint64_t   | sparse: 5%
4      | uint64_t   | sparse: 5%
5      | double     | sparse: 20%
6      | tau_atomic | dense
7      | tau_atomic | sparse: 20%

Table 4: Metrics in the prototype
The table shows the metrics defined in the prototype.

We expect the writing time of the dense metrics to be the largest, because their metric data files are the biggest. To verify that writing behaves correctly for arbitrary system trees, the number of threads per MPI rank is not the same for all ranks; we set it to

    N_{t_r} = (r mod N_max) + 1,

where r is the MPI rank and N_max a constant. This means that rank 0 has one thread, rank 1 two threads, and so on, until rank N_max again starts with one thread. The size of the files scales approximately linearly with the number of MPI processes when that number is large. The call tree is constructed in a pseudo-random way, which ensures that it stays the same across different executions of the prototype. Each call node is assigned between 1 and 13 children, and the names of the individual functions are taken from a predefined list of names.

Apart from the prototype utilizing the new library, an identical prototype is also written for the old library for comparison. Both are written in such a way that they exactly imitate Score-P: each MPI process first

assigns random severity values to its local threads. Then, in the case of the new library, each process calls cube_write_all_sev_rows_of_doubles(); in the old case, MPI_Gatherv() gathers all severity values of one call path on the root process before cube_write_sev_row_of_doubles() is called for each call path. The produced Cube4 files are identical. We measure the timings with calls to MPI_Wtime(), which are placed at the beginning, between the different metrics and at the end.

7.2.2 Results

In order to see whether the prototype behaves correctly, we first check the writing times of the individual files, shown in figure 21. The plots have different vertical scales.

Figure 21: Writing time of different files in the tar archive
The figure shows a stacked column plot of the average times for the individual metrics and the anchor file.

We can see that most time is spent in the first, second and sixth metric, which is expected because these metrics are dense. In addition, the sixth metric takes more time than the other two, because it contains severity values of type tau_atomic, which is bigger than double or uint64_t. The times for the sparse metrics, the anchor file and the definition of the Cube structures are small.

To better see the overall results, we plot the timings on a log-log scale. Apart from the overall run time of the prototype, we also plot the part of that time that was actually spent in the CubeWriter library. In the old case, this time does not include gathering the values on the root process; in the new case, it is the

time after each process has already gathered the values of all call paths into a single array. The prototype was executed n = 8 times; the results can be seen in figure 22.

Figure 22: Execution time
The figure shows the overall prototype execution time when using the old and the new CubeWriter library. In addition, the plot shows how much of that time was spent in the CubeWriter library.

The plot clearly shows that the new library with the MPI I/O implementation is faster than the old version. When the number of processes is small, the writing time is comparable to that of the old library, because only one I/O node is used and the communication between processes is small. We also notice that on a small scale the results are more inconsistent due to the very small times, but we are not interested in these cases anyway. As the number of processes grows, the difference becomes larger.

The difference is caused by two factors, which are greatly improved in the new algorithm. Firstly, in the new algorithm, the processes do not have to communicate before calling the CubeWriter functions. We see that the difference between the overall run time and the CubeWriter time becomes very large in the old case, especially when the number of processes grows. This is caused by the large number of MPI_Gatherv() calls, which become more expensive when the MPI communicator is big (close to 90% of the entire runtime is spent in MPI_Gatherv() when the number of processes exceeds 1024). In the new case, this routine is

no longer necessary; we only have to combine the per-call-path arrays into one array, which is much faster. This is also the reason why the difference between the CubeWriter time and the overall run time is close to zero in the new algorithm: the prototype spends almost its entire time in the CubeWriter library.

In figure 23, we see the overall writing bandwidth achieved by both libraries. The speeds were calculated from the overall time and the sizes of the Cube4 archives. We see that the old algorithm becomes much slower as the number of MPI processes grows, because ever more values have to be gathered, whereas the new algorithm scales much better.

Figure 23: Overall prototype writing speed
The figure shows the overall bandwidth of the prototype writing. The performance becomes very poor in the old case due to the large amount of communication between processes.

The second part of the speed-up is caused by the parallel writing itself. This cannot be clearly seen in the previous plots, so we perform an additional measurement: we measure how much time CubeWriter spends writing just the metric data files of the individual metrics (without tar headers and metric index files). We exclude the time for gathering values in the old case and measure only the time spent inside the CubeWriter library. To avoid small file sizes, we limit ourselves to the dense metrics. We then calculate the writing speed. The results are shown in figure 24.

Figure 24: Writing speed of dense metrics
The figure shows the writing speed for the dense metrics.

We see that the difference between the POSIX file writing and the MPI file writing becomes significant when the number of processes is big. When it is small, the POSIX implementation is a bit faster, which is a result of the MPI communication overhead. Apart from the implementation, the performance of I/O operations (both POSIX and MPI) depends heavily on the hardware. We also notice that the errors are relatively big in both cases, which is a result of the many factors that affect the performance of I/O on supercomputers. Probably the main reason is that the supercomputer (and its input/output nodes) is always shared with other users. In the case of 4096 processes, we allocated 171 nodes on JURECA, which is slightly less than 10% of the machine, and we have no information about the jobs of other users. The speed of writing also depends on which compute nodes are used and on all intermediate layers between the compute nodes and the file system. JURECA's 1884 compute nodes are distributed over 34 racks, and there are two gateway switches for storage connectivity that connect the nodes in a 'fat tree'; the behavior may differ depending on whether all I/O goes through one switch or through both. On JURECA, we have no control over which compute nodes are assigned to our job. In addition, both versions of writing are buffered, which means that the writing to the file may not be executed immediately. A complete understanding of JURECA's I/O behavior unfortunately exceeds the scope of this thesis.

One more test was performed with the prototype to see how the writing time depends on the number of call nodes in the application. This time, we always use 256 MPI processes but change the size of the call tree. Due to the limited computing resources for this project, this test was performed with only one run. The results are shown in figure 25: the file size grows linearly with the number of call nodes, and so does the writing time.

Figure 25: Overall writing time for different call tree sizes
Overall writing time for a test case with 256 MPI processes and different call tree sizes.

7.3 Performance measurement with CP2K

The tests with the prototype already show a clear improvement, but we also want to test the changes to Score-P by measuring the performance of a real-world application and then using the new CubeWriter library to write the performance report. For that reason, we use CP2K [20], an open-source software package for molecular dynamics. CP2K can perform a wide range of simulations on the atomic scale, covering solid state, liquid, crystal and biological systems. This software was chosen because its applications produce a very big call tree. It is written in the Fortran programming language and can be parallelized with MPI and OpenMP.

We perform tests on the benchmark H2O-64, which calculates molecular dynamics using the Born-Oppenheimer approach. The system contains 64 water molecules in a 12.4 Å cell. The benchmark has a call tree of around 370 000 call nodes. The measurements were performed by Score-P's internal runtime measurement system.

We performed tests with four configurations. '1 thread/process' stands for pure MPI parallelization, where one location corresponds to one MPI rank. In the case of '6 threads/process', we use a combination of 6 OpenMP threads per MPI process (similarly for 12 and 24 threads). For all configurations, the number of locations and thus the file size is the same, but the number of MPI processes, i.e. the number of writers, changes. The application was executed three times. The results can be seen in figure 26.

Figure 26: Writing time for H2O-64 benchmark
The figure shows the time needed to write the performance report of the H2O-64 benchmark.

In all cases, the new library performs much better than the old one. We also notice a difference between the configurations: in both cases, fewer MPI ranks lead to faster writing. In the case of serial writing, this happens because fewer processes participate in the communication operations. In the case of the new library, the reason is less clear. A possible explanation is that the file size is still too small to exploit the better parallelization through MPI: when 1 thread/process is used, each rank writes just one severity value per call path at non-contiguous locations (24 values in the case of 24 threads/process), which is very little, so fewer writers use their I/O nodes more efficiently.

Dividing the size of the Cube files by the writing time gives us the overall bandwidth, which can be seen in figure 27. Again, with more locations per MPI process, the performance grows significantly.

Figure 27: Overall writing speed for H2O-64 benchmark
Overall writing speed of Score-P for the H2O-64 benchmark.

7.4 Discussion

The results shown above demonstrate that the new parallel algorithm is a much better alternative to the old one. The tests with the prototype showed that the new algorithm works in all cases, regardless of the metrics, the call tree or the system tree. The main reason for the higher performance is the removal of the communication between processes before the severity values are written. The tests with the H2O-64 benchmark from CP2K showed that the new library works as expected and can already be used for real-world applications.

It must be noted that the new implementation also has some drawbacks. Its dependence on the MPI interface is not suitable for all situations. For example, Score-P can be installed without MPI, which is useful when analyzing the performance of pure OpenMP applications or of applications that use other parallel libraries (SHMEM, OpenCL, CUDA). In such cases, the user would not only need to have MPI installed, but the new algorithm would also not offer any better performance. Another thing the new library lacks is the ability to write compressed data. In the current version of CubeWriter, compression is supported; if it is enabled, the library compresses the severity values of every call path before writing them to the file. This leads to much smaller metric data files, but it also breaks their internal repetitive structure and is therefore not possible in the new algorithm.

Despite that, the new algorithm proves to be the right answer to the slow writing bandwidth. In the majority of HPC applications, the above problems are not an issue, and the advantage of faster writing outweighs the disadvantages. This should encourage the Score-P project leaders to incorporate it into future development.

8 Conclusion

The size of an application's performance report, generated by performance measurement programs, depends heavily on the system and the call tree of the application. The objective of this thesis was to rewrite the CubeWriter library, the part of the Cube software that writes performance reports. We proposed a new algorithm in which the processes write the file together in a parallel manner.

The new algorithm was implemented and tested with various prototype runs and a CP2K benchmark. The results have shown that the new algorithm significantly outperforms the old one, especially at large scale. The main reason for this is that the processes do not have to gather the measured values on one process.

Although the current implementation is a big step forward, there are further ideas for the development of the CubeWriter library. One of the reasons for choosing the old algorithm was its portability to non-MPI environments. Although this is not possible with the new version, one could rewrite the library in a way that removes this dependency. One way to do this is to implement wrapper structs for the MPI routines; these structs would 'hide' the MPI environment, so that an installation without MPI would be possible, and the wrapper functions would then map the MPI functions back to the old algorithm and the POSIX implementation. Another way of overcoming the same problem is to use callback functions to Score-P, which already provides its own internal wrapper functions that enable it to work without MPI.

The CubeWriter library is used by both Score-P and Scalasca. Currently, only Score-P was reconfigured to use the new library, but the same should be done for Scalasca. Because Scalasca produces more metrics than Score-P, the speed-up would be even bigger in that case. After this is done, the new library could go into the production stage and one day be part of a stable release.

References

[1] Wikipedia contributors. (2018, July 8). Cray-1. In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Cray-1&oldid=849334844

[2] Sterling, T., & Becker, D. J., & Savarese, D., & Dorband, J. E., & Ranawake, U. A., & Packer, C. V. (1995). BEOWULF: A Parallel Workstation for Scientific Computation. Proceedings, International Conference on Parallel Processing 95.

[3] List June 2018. (2018, June 25). Top 500. Retrieved from https://www.top500.org/lists/2018/06/

[4] Espinosa, A., & Margalef, T., & Luque, A. (1998). Automatic Performance Evaluation of Parallel Programs. Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing - PDP '98. pp. 43-49. doi:10.1109/EMPDP.1998.647178

[5] Adhianto, L., & Banerjee, S., & Fagan, M., & Krentel, M., & Marin, G., & Mellor-Crummey, J. M., & Tallent, N. R. (2009, January). HPCTOOLKIT: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22. doi:10.1002/cpe.1553

[6] Benedict, S., & Petkov, V., & Gerndt, M. (2009). PERISCOPE: An online-based distributed performance analysis tool. Proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing 2009. pp. 1-16. doi:10.1007/978-3-642-11261-4_1

[7] Pillet, V., & Labarta, J., & Cortes, T., & Girona, S. (1995, March). PARAVER: A tool to visualize and analyze parallel code. WoTUG-18. 44.

[8] Barcelona Supercomputing Centre. (n.d.). Dimemas: predict parallel performance using a single cpu machine. Tools. Retrieved from https://tools.bsc.es/dimemas

[9] Geimer, M., & Wolf, F., & Wylie, B. J. N., & Abraham, E., & Becker, D., & Mohr, B. (2010). The Scalasca performance toolset architecture. Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing, Volume 22, Issue 6. pp. 702-719. doi:10.1002/cpe.v22:6

[10] Knüpfer, A., & Feld, C., & Mey, D., & Biersdorff, S., & Diethelm, K., & Eschweiler, D., & Geimer, M., & Gerndt, M., & Lorenz, D., & Malony, A., & Nagel, W., & Oleynik, Y., & Philippen, P., & Saviankou, P., & Schmidl, D., & Shende, S., & Tschüter, R., & Wagner, M., & Wesarg, B., & Wolf, F. (2012). Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir. Tools for High Performance Computing 2011, Chapter 7. pp. 79-91. doi:10.1007/978-3-642-31476-6_7

[11] Scalasca development team. (n.d.). Performance properties. Retrieved from https://apps.fz-juelich.de/scalasca/releases/scalasca/2.4/help/scalasca_patterns-2.4.html

[12] Message Passing Interface Forum. (1994). MPI: A Message-Passing Interface Standard. Technical Report. University of Tennessee, Knoxville, TN, USA

[13] Thakur, R., & Gropp, W., & Lusk, E. (1998). A Case for Using MPI’s Derived Datatypes to Improve I/O Performance. SC ’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, Orlando, FL, USA. pp. 1. doi: 10.1109/SC.1998.10006

[14] Luu, H., & Winslett, M., & Gropp, W., & Ross, R. B., & Carns, P. H., & Harms, K., & Prabhat, & Byna, S., & Yao, Y. (2015). A Multiplatform Study of I/O Behavior on Petascale Supercomputers. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15). pp. 33-44. doi:10.1145/2749246.2749269

[15] Saviankou, P., & Knobloch, M., & Visser, A., & Mohr, B. (2015). Cube v4: From Performance Report Explorer to Performance Analysis Tool. International Conference On Computational Science, ICCS 2015, Reykjavík, Iceland, 1 Jun 2015 - 3 Jun 2015. Procedia Computer Science 51. pp. 1343-1352. doi:10.1016/j.procs.2015.05.320

[16] Scalasca development team. (2018, May 4). CubeGUI 4.4 — User Guide. Retrieved from https://apps.fz-juelich.de/scalasca/releases/cube/4.4/docs/CubeUserGuide.pdf

[17] Geimer, M., & Saviankou, P., & Strube, A., & Szebenyi, Z., & Wolf, F., & Wylie, B. J. N. (2010). Further Improving the Scalability of the Scalasca Toolset. PARA'10: Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2. pp. 463-473. doi:10.1007/978-3-642-28145-7_45

[18] Wikipedia contributors. (2018, August 3). Tree traversal. In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Tree_traversal&oldid=853310883

[19] Jülich Supercomputing Centre. (2016). JURECA: General purpose supercomputer at Jülich Supercomputing Centre. Journal of large-scale research facilities, 2, A62. doi:10.17815/jlsrf-4-121-1

[20] The CP2K developers group. (2013). CP2K version 6.1. CP2K is freely available from https://www.cp2k.org/

[21] Cube developer community. (2018, May 7). CubeW: High performance C Writer library, Version 4.4. doi: 10.5281/zenodo.1248061 (Source codes are freely available from https://zenodo.org/record/1248061)

A Appendix - source code

Here, the rewritten source code for Score-P and the new CubeWriter library is given. We include only those files that had to be modified or added. Furthermore, to help the reader find the relevant parts more easily, only those functions are shown that are tightly connected to the new algorithm (the parts that were explained in chapter 6). Only the .c files are shown; the .h header files are omitted. Since the CubeWriter library is open-source software, all source files of the released version (with the old algorithm) can be downloaded for free (see [21]).

A.1 cubew_cube.c

/* Creates and returns a data structure cube_t */
/* mpi version */
cube_t*
cube_mpi_create( char* cube_name, enum CubeFlavours_t cv,
                 int world_rank, int world_size,
                 int total_threads_number, int threads_number, int threads_offset,
                 enum bool_t compressed )
{
    if ( cubew_initialized() == 0 ) {
        cubew_init_allocs( NULL, NULL, NULL );
    }

    cubew_compressed = CUBE_FALSE;
#if defined( BACKEND_CUBE_COMPRESSED ) || defined( FRONTEND_CUBE_COMPRESSED )
    // UTILS_WARNING( "Compression not possible in MPI writing. Writing uncompressed.\n" );
#endif

    cube_t* this = NULL;
    cubew_trace  = ( getenv( "CUBEW_TRACE" ) != NULL );
    if ( cubew_trace ) {
        UTILS_WARNING( "CUBEW_TRACE=%d\n", cubew_trace );
    }
    /* allocate memory for cube */
    ALLOC( this, 1, cube_t, MEMORY_TRACING_PREFIX "Allocate cube_t" );
    if ( this == NULL ) {
        return NULL;
    }
    /* construct dynamic arrays */
    cube_construct_arrays( this );
    this->first_call         = CUBE_TRUE;
    this->locked_for_writing = CUBE_FALSE;
    this->cube_flavour       = cv;
    this->metrics_title      = NULL;
    this->calltree_title     = NULL;
    this->systemtree_title   = NULL;
    this->system_tree_writer = cube_system_tree_writer_create();

    this->sev_flag            = 1;
    this->compressed          = compressed;
    this->cubename            = cubew_strdup( cube_name );
    this->size_of_anchor_file = 1;

    /* information for MPI */
    this->world_rank = world_rank;
    this->world_size = world_size;

    /* information about system locations for proper initialization of filetype and displacements */
    this->total_number_of_threads = total_threads_number;
    this->number_of_my_threads    = threads_number;
    this->thread_offset           = threads_offset;

    /* construct a parallel metric writer */
    this->parallel_writer = parallel_metric_writer_create( NULL );
    parallel_metric_writer_init( this->parallel_writer,
                                 this->total_number_of_threads,
                                 this->number_of_my_threads );

    /* create tar layout and MPI file */
    this->layout = cube_mpi_writing_start( this->cubename, this->cube_flavour );
    return this;
}

/**
 * Inform a given metric about cnodes which have non-zero data.
 * The metric transforms the bitstring according to its enumeration of cnodes.
 */
void
cube_set_known_cnodes_for_metric( cube_t* this, cube_metric* metric, char* known_cnodes )
{
    if ( known_cnodes == 0 ) {
        UTILS_FATAL( "Failed to set a bit vector of known cnodes. Received pointer is zero.\n" );
    }
    uint64_t n_flat_locations = cube_get_system_tree_information( this )->number_locations;
    cube_metric_setup_for_writing( metric, this->cnd_ar, this->rcnd_ar,
                                   this->locs_ar->size + n_flat_locations );
    cube_metric_set_known_cnodes( metric, known_cnodes );
}

/**
 * Returns an array of defined call nodes for a particular metric
 */
carray*
cube_get_cnodes_for_metric( cube_t* this, cube_metric* metric )
{
    carray* enm = cube_metric_return_enumeration( metric );
    if ( enm == NULL ) {
        uint64_t n_flat_locations = cube_get_system_tree_information( this )->number_locations;
        cube_metric_setup_for_writing( metric, this->cnd_ar, this->rcnd_ar,
                                       this->locs_ar->size + n_flat_locations );
    }

    enm = cube_metric_return_enumeration( metric );

    return enm;
}

void
cube_write_all_sev_rows_of_doubles( cube_t* this, cube_metric* met, double* sevs )
{
    if ( this->cube_flavour == CUBE_SLAVE ) {
        return; /* CUBE_SLAVE not yet supported */
    }

    /* prepare metrics for writing */
    cube_prepare_metrics_for_writing( this );

    /* get number of call paths */
    int n_callpaths;
    if ( this->cube_flavour == CUBE_MASTER ) {
        if ( met->metric_format == CUBE_INDEX_FORMAT_DENSE ) {
            n_callpaths = met->ncn;
        }
        else if ( met->metric_format == CUBE_INDEX_FORMAT_SPARSE ) {
            n_callpaths = met->number_of_written_cnodes;
        }
    }

    /* Broadcast number of call paths */
    MPI_Bcast( &n_callpaths, 1, MPI_INT, 0, MPI_COMM_WORLD );

    /* set up etype & filetype */
    parallel_metric_writer_prepare_for_writing( this->parallel_writer, CUBE_DATA_TYPE_DOUBLE, n_callpaths );

    /* write tar header */
    if ( this->cube_flavour == CUBE_MASTER ) {
        // Writing of tar header file for the metric
        cube_report_before_mpi_metric_writing( this->layout, met, this->parallel_writer );
    }

    /* Broadcast file start position */
    MPI_Bcast( &this->layout->file_start_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );

    /* calculates displacement of a view */
    calculate_displacements( this->parallel_writer, this->layout->file_start_position, this->thread_offset );

    /* write sev values */
    cube_metric_write_all_rows_of_doubles( this->cube_flavour, this->parallel_writer, met,
                                           this->layout->tar, sevs );

    /* finish writing of a metric */
    if ( this->cube_flavour == CUBE_MASTER ) {
        cube_report_after_mpi_metric_writing( this->layout, this->parallel_writer );
    }

    /* Broadcast position of next metric tar header */
    MPI_Bcast( &this->layout->header_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );
}

void
cube_write_all_sev_rows_of_uint64( cube_t* this, cube_metric* met, uint64_t* sevs )
{
    if ( this->cube_flavour == CUBE_SLAVE ) {
        return; /* CUBE_SLAVE doesn't write anything */
    }

    /* prepare metrics for writing */
    cube_prepare_metrics_for_writing( this );

    /* get number of call paths */
    int n_callpaths;
    if ( this->cube_flavour == CUBE_MASTER ) {
        if ( met->metric_format == CUBE_INDEX_FORMAT_DENSE ) {
            n_callpaths = met->ncn;
        }
        else if ( met->metric_format == CUBE_INDEX_FORMAT_SPARSE ) {
            n_callpaths = met->number_of_written_cnodes;
        }
    }
    /* Broadcast number of call paths */
    MPI_Bcast( &n_callpaths, 1, MPI_INT, 0, MPI_COMM_WORLD );

    /* set up etype & filetype */
    parallel_metric_writer_prepare_for_writing( this->parallel_writer, CUBE_DATA_TYPE_UINT64, n_callpaths );

    /* write tar header */
    if ( this->cube_flavour == CUBE_MASTER ) {
        // Writing of tar header file for the metric
        cube_report_before_mpi_metric_writing( this->layout, met, this->parallel_writer );
    }

    /* Broadcast file start position */
    MPI_Bcast( &this->layout->file_start_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );

    /* calculates displacement of a view */
    calculate_displacements( this->parallel_writer, this->layout->file_start_position, this->thread_offset );

    /* write sev values */
    cube_metric_write_all_rows_of_uint64( this->cube_flavour, this->parallel_writer, met,
                                          this->layout->tar, sevs );

    /* finish writing of a metric */
    if ( this->cube_flavour == CUBE_MASTER ) {
        cube_report_after_mpi_metric_writing( this->layout, this->parallel_writer );
    }

    /* Broadcast position of next metric tar header */
    MPI_Bcast( &this->layout->header_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );
}

void
cube_write_all_sev_rows_of_cube_type_tau_atomic( cube_t* this, cube_metric* met, cube_type_tau_atomic* sevs )
{
    if ( this->cube_flavour == CUBE_SLAVE )
    {
        return; /* CUBE_SLAVE doesn't write anything */
    }

    /* prepare metrics for writing */
    cube_prepare_metrics_for_writing( this );

    /* get number of call paths */
    int n_callpaths;
    if ( this->cube_flavour == CUBE_MASTER )
    {
        if ( met->metric_format == CUBE_INDEX_FORMAT_DENSE )
        {
            n_callpaths = met->ncn;
        }
        else if ( met->metric_format == CUBE_INDEX_FORMAT_SPARSE )
        {
            n_callpaths = met->number_of_written_cnodes;
        }
    }

    /* Broadcast number of call paths */
    MPI_Bcast( &n_callpaths, 1, MPI_INT, 0, MPI_COMM_WORLD );

    /* set up etype and filetype */
    parallel_metric_writer_prepare_for_writing( this->parallel_writer, CUBE_DATA_TYPE_TAU_ATOMIC, n_callpaths );

    /* write tar header */
    if ( this->cube_flavour == CUBE_MASTER )
    {
        /* Writing of the tar header for the metric data file */
        cube_report_before_mpi_metric_writing( this->layout, met, this->parallel_writer );
    }

    /* Broadcast file start position */
    MPI_Bcast( &this->layout->file_start_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );

    /* calculate displacement of the file view */
    calculate_displacements( this->parallel_writer, this->layout->file_start_position, this->thread_offset );

    /* write severity values */
    cube_metric_write_all_rows_of_cube_type_tau_atomic( this->cube_flavour, this->parallel_writer, met, this->layout->tar, sevs );

    /* finish writing of the metric */
    if ( this->cube_flavour == CUBE_MASTER )
    {
        cube_report_after_mpi_metric_writing( this->layout, this->parallel_writer );
    }

    /* Broadcast position of the next metric tar header */
    MPI_Bcast( &this->layout->header_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );
}

A.2 cubew_metric.c

/* ====================== MPI writing calls ====================== */
/* Will be called by MASTER and WRITER.                            */
/* Assumes that the data row contains values for all call paths in the right order. */
void
cube_metric_write_all_rows( enum CubeFlavours_t     flavour,
                            parallel_metric_writer* writer,
                            cube_metric*            metric,
                            MPI_File*               file,
                            void*                   data_row )
{
    if ( flavour == CUBE_MASTER )
    {
        if ( metric->metric_type == CUBE_METRIC_POSTDERIVED ||
             metric->metric_type == CUBE_METRIC_PREDERIVED_INCLUSIVE ||
             metric->metric_type == CUBE_METRIC_PREDERIVED_EXCLUSIVE )
        {
            return; /* derived metrics do not store any data */
        }
    }

    /* Write metric data-file marker */
    if ( flavour == CUBE_MASTER )
    {
        char datafile_marker[ CUBE_DATAFILE_MARKER_SIZE ] = CUBE_DATAFILE_MARKER;
        MPI_File_write( *file, datafile_marker, CUBE_DATAFILE_MARKER_SIZE, MPI_CHAR, MPI_STATUS_IGNORE );
    }
    MPI_Barrier( MPI_COMM_WORLD );

    /* Getting the displacement */
    MPI_Offset disp = writer->writing_displacement;
    MPI_Type_commit( &writer->etype );
    MPI_Type_commit( &writer->filetype );

    /* Setting the correct view */
    MPI_File_set_view( *file, disp, writer->etype, writer->filetype, "native", MPI_INFO_NULL );

    MPI_Status stat;
    int        err;
    double     ts = MPI_Wtime();

    /* Write in parallel */
    err = MPI_File_write_all( *file, data_row, writer->n_written_cnodes * writer->nthreads, writer->etype, &stat );
    if ( err )
    {
        UTILS_WARNING( "[CUBEW Warning]: Parallel metric writing not successful.\n" );
    }
    double te = MPI_Wtime();
    if ( flavour == CUBE_MASTER )
    {
        /* printf( "MPI_File_write_all %f\n", te - ts ); */
    }

    /* Setting the view back to normal */
    disp = writer->file_displacement + CUBE_DATAFILE_MARKER_SIZE + writer->raw_data_size;
    MPI_File_set_view( *file, disp, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL );
    MPI_Barrier( MPI_COMM_WORLD );
}

/**
 * Closes the data file and writes the index file.
 */
void
cube_metric_mpi_finish( cube_metric* this, parallel_metric_writer* writer )
{
    if ( this->im_finished == CUBE_TRUE )
    {
        return;
    }
    if ( this->data_file == NULL )
    {
        this->im_finished = CUBE_TRUE;
        return;
    }

    cube_report_mpi_metric_data_finish( this->layout, this, this->data_file,
                                        writer->raw_data_size + CUBE_DATAFILE_MARKER_SIZE );

    /* First we calculate the size of the index file */
    uint32_t size_index         = 0;
    uint64_t size_of_index_file = 0;

    size_of_index_file += CUBE_INDEXFILE_MARKER_SIZE + sizeof( metric_header );

    if ( this->metric_format == CUBE_INDEX_FORMAT_SPARSE )
    {
        if ( this->layout->cube_flavour == CUBE_MASTER )
        {
            if ( this->known_cnodes != 0 )
            {
                size_index = cube_metric_size_of_index( this->known_cnodes,
                                                        ( unsigned )ceil( ( double )this->ncn / 8. ) );
                size_of_index_file += ( size_index + 1 ) * sizeof( uint32_t );
            }
        }
    }

    /* We write the tar header for the index file */
    MPI_File* ifile = cube_report_mpi_metric_index_start( this->layout, this, size_of_index_file );

    /* Write index header */
    if ( this->layout->cube_flavour == CUBE_MASTER )
    {
        metric_header mheader;

        mheader.named.endian        = 1;
        mheader.named.version       = 0;
        mheader.named.metric_format = ( uint8_t )( this->metric_format );
        char* marker                = CUBE_INDEXFILE_MARKER;

        MPI_Status status;
        MPI_File_write( *ifile, marker, 11, MPI_CHAR, &status );
        MPI_File_write( *ifile, &( mheader.named.endian ), 1, MPI_UINT32_T, &status );
        MPI_File_write( *ifile, &( mheader.named.version ), 1, MPI_UINT16_T, &status );
        MPI_File_write( *ifile, &( mheader.named.metric_format ), 1, MPI_UINT8_T, &status );
    }

    /* In case of a sparse metric: write the indices of the written call paths */
    uint32_t* index_intofile = 0;

    if ( this->metric_format == CUBE_INDEX_FORMAT_SPARSE )
    {
        if ( this->known_cnodes != 0 )
        {
            index_intofile = ( uint32_t* )cube_metric_create_index( this->known_cnodes,
                                                                    ( unsigned )ceil( ( double )this->ncn / 8. ) );
        }

        if ( this->layout->cube_flavour == CUBE_MASTER )
        {
            MPI_Status status;
            MPI_File_write( *ifile, &size_index, 1, MPI_UINT32_T, &status );
            MPI_File_write( *ifile, index_intofile, size_index, MPI_UINT32_T, &status );
            CUBEW_FREE( index_intofile, MEMORY_TRACING_PREFIX "Release index_intofile" );
        }
    }

    /* Finish writing the index file */
    cube_report_mpi_metric_index_finish( this->layout, this, ifile, size_of_index_file );

    this->im_finished = CUBE_TRUE;
}

void
cube_metric_set_known_cnodes( cube_metric* metric, char* known_cnodes /* , unsigned size */ )
{
    metric->metric_format = ( known_cnodes == NULL ) ? CUBE_INDEX_FORMAT_DENSE : CUBE_INDEX_FORMAT_SPARSE;

    if ( known_cnodes == 0 )
    {
        UTILS_WARNING( "Failed to set a bit vector of known cnodes. Received pointer is zero.\n" );
    }
    CUBEW_FREE( metric->known_cnodes, MEMORY_TRACING_PREFIX "Release previous list of known cnodes" );

    known_cnodes         = cube_metric_bit_string_transformation( metric, known_cnodes );
    metric->known_cnodes = known_cnodes;

    metric->number_of_written_cnodes = cube_bit_count( known_cnodes, ( metric->ncn + 7 ) / 8 );
#if defined( BACKEND_CUBE_COMPRESSED ) || defined( FRONTEND_CUBE_COMPRESSED )
    if ( metric->compressed == CUBE_TRUE )
    {
        cube_metric_setup_subindex( metric );
    }
#endif /* HAVE_LIB_Z */
}

void
cube_metric_write_all_rows_of_doubles( enum CubeFlavours_t     flavour,
                                       parallel_metric_writer* writer,
                                       cube_metric*            metric,
                                       MPI_File*               file,
                                       double*                 data_row )
{
    /* cube_metric transforms ... always allocates new memory. If no transformation -> then copy */
    void* target_row = ( void* )cube_metric_transform_row_of_doubles( writer, data_row );

    cube_metric_write_all_rows( flavour, writer, metric, file, target_row );
    /* release allocated memory */
    CUBEW_FREE( target_row, MEMORY_TRACING_PREFIX "Release row of doubles" );
}

void
cube_metric_write_all_rows_of_uint64( enum CubeFlavours_t     flavour,
                                      parallel_metric_writer* writer,
                                      cube_metric*            metric,
                                      MPI_File*               file,
                                      uint64_t*               data_row )
{
    /* cube_metric transforms ... always allocates new memory. If no transformation -> then copy */
    void* target_row = ( void* )cube_metric_transform_row_of_uint64( writer, data_row );

    cube_metric_write_all_rows( flavour, writer, metric, file, target_row );
    /* release allocated memory */
    CUBEW_FREE( target_row, MEMORY_TRACING_PREFIX "Release row of uint64" );
}

void
cube_metric_write_all_rows_of_cube_type_tau_atomic( enum CubeFlavours_t     flavour,
                                                    parallel_metric_writer* writer,
                                                    cube_metric*            metric,
                                                    MPI_File*               file,
                                                    cube_type_tau_atomic*   data_row )
{
    if ( flavour == CUBE_MASTER )
    {
        if ( CUBE_DATA_TYPE_TAU_ATOMIC != metric->dtype_params->type )
        {
            return;
        }
    }
    cube_metric_write_all_rows( flavour, writer, metric, file, data_row );
}

A.3 cubew_parallel_metric_writer.c

/****************************************************************************
**  CUBE        http://www.scalasca.org/                                   **
*****************************************************************************
**  Copyright (c) 1998-2018                                                **
**  Forschungszentrum Juelich GmbH, Juelich Supercomputing Centre          **
**                                                                         **
**  This software may be modified and distributed under the terms of      **
**  a BSD-style license.  See the COPYING file in the base                **
**  directory for details.                                                 **
****************************************************************************/

/**
 * \file cubew_parallel_metric_writer.c
 * \brief Creates the types needed for parallel metric writing.
 */
#include "config.h"
#include <...>

#include "cubew_metric.h"
#include "cubew_parallel_metric_writer.h"
#include "cubew_services.h"
#include "cubew_vector.h"
#include "cubew_types.h"
#include "cubew_memory.h"

#define MEMORY_TRACING_PREFIX "[PARALLEL METRIC WRITER]"

/* Creates the parallel metric writer structure */
parallel_metric_writer*
parallel_metric_writer_create( parallel_metric_writer* writer )
{
    if ( writer == NULL )
    {
        ALLOC( writer, 1, parallel_metric_writer, MEMORY_TRACING_PREFIX "Allocate parallel metric writer" );
    }
    return writer;
}

/* Initializes the parallel metric writer fields */
void
parallel_metric_writer_init( parallel_metric_writer* this,
                             uint64_t                total_number_of_threads,
                             int                     number_of_my_threads )
{
    this->total_nthreads = total_number_of_threads;
    this->nthreads       = number_of_my_threads;

    /* we do not know the datatype of the first metric yet */
    this->datatype = CUBE_DATA_TYPE_UNKNOWN;

    /* setting etype and filetype to the default */
    this->etype    = MPI_BYTE;
    this->filetype = MPI_BYTE;

    this->writing_displacement = 0;
    this->n_written_cnodes     = 0;
    this->etype_size           = 0;
    this->extent               = 0;
    this->raw_data_size        = 0;
}

/* Constructs the right etype and filetype */
void
parallel_metric_writer_prepare_for_writing( parallel_metric_writer* this,
                                            int                     type,
                                            int                     number_of_written_cnodes )
{
    this->datatype = type;

    /*
     * Setting the right etype.
     * Score-P uses just UINT64_T, DOUBLE and TAU_ATOMIC.
     */
    switch ( type )
    {
        case CUBE_DATA_TYPE_INT64:
            this->etype = MPI_INT64_T;
            break;
        case CUBE_DATA_TYPE_UINT64:
            this->etype = MPI_UINT64_T;
            break;
        case CUBE_DATA_TYPE_INT32:
            this->etype = MPI_INT32_T;
            break;
        case CUBE_DATA_TYPE_UINT32:
            this->etype = MPI_UINT32_T;
            break;
        case CUBE_DATA_TYPE_INT16:
            this->etype = MPI_INT16_T;
            break;
        case CUBE_DATA_TYPE_UINT16:
            this->etype = MPI_UINT16_T;
            break;
        case CUBE_DATA_TYPE_INT8:
            this->etype = MPI_INT8_T;
            break;
        case CUBE_DATA_TYPE_UINT8:
            this->etype = MPI_UINT8_T;
            break;
        case CUBE_DATA_TYPE_DOUBLE:
        case CUBE_DATA_TYPE_MIN_DOUBLE:
        case CUBE_DATA_TYPE_MAX_DOUBLE:
            this->etype = MPI_DOUBLE;
            break;
        case CUBE_DATA_TYPE_TAU_ATOMIC:
            create_mpi_tau_atomic( &this->etype );
            break;
        default:
            this->etype = MPI_DOUBLE;
            break;
    }

    /* Commit the etype */
    MPI_Type_commit( &this->etype );
    MPI_Type_size( this->etype, &this->etype_size );

    /* number of call paths in the file */
    this->n_written_cnodes = number_of_written_cnodes;

    /* extent of the filetype */
    this->extent = this->total_nthreads * this->etype_size;

    MPI_Datatype tmp_contiguous;

    /* constructing the right filetype */
    MPI_Type_contiguous( this->nthreads, this->etype, &tmp_contiguous );
    MPI_Type_commit( &tmp_contiguous );
    MPI_Type_create_resized( tmp_contiguous, 0, this->extent, &this->filetype );
    MPI_Type_commit( &this->filetype );

    /* calculating the size of the data written in parallel */
    this->raw_data_size = this->total_nthreads * this->n_written_cnodes * this->etype_size;
}

/* Calculates the right displacements */
void
calculate_displacements( parallel_metric_writer* this,
                         uint64_t                file_start_position,
                         int                     first_thread_position )
{
    this->file_displacement    = file_start_position;
    this->writing_displacement = file_start_position + CUBE_DATAFILE_MARKER_SIZE
                                 + first_thread_position * this->etype_size;
}

/* Frees the allocated data */
void
parallel_metric_writer_free( parallel_metric_writer* this )
{
    /* Release parallel metric writer */
    CUBEW_FREE( this, MEMORY_TRACING_PREFIX "Release parallel metric writer" );
}
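The essential MPI calls behind the writer above can be condensed into a few lines. The following fragment is an illustrative sketch only, not part of the CubeWriter sources; the names fh, file_start_position, nthreads, total_nthreads, offset, n_callpaths and local_rows are assumptions introduced for the illustration. It mirrors how parallel_metric_writer_prepare_for_writing and cube_metric_write_all_rows combine a contiguous, resized filetype with a collective write so that every process deposits its block of each call-path row at the right place (the real writer additionally skips the data-file marker bytes in the displacement).

/* Illustrative sketch, not CubeWriter code: write n_callpaths rows of doubles,
 * where this process owns `nthreads` consecutive values of every row and the
 * rows of all processes together are `total_nthreads` values wide. */
#include <mpi.h>

static void
sketch_write_metric_rows( MPI_File fh, MPI_Offset file_start_position,
                          int nthreads, int total_nthreads, int offset,
                          int n_callpaths, const double* local_rows )
{
    MPI_Datatype tmp, filetype;
    int          etype_size;
    MPI_Type_size( MPI_DOUBLE, &etype_size );

    MPI_Type_contiguous( nthreads, MPI_DOUBLE, &tmp );                 /* my block of one row       */
    MPI_Type_create_resized( tmp, 0, ( MPI_Aint )total_nthreads * etype_size,
                             &filetype );                              /* stride = one full row     */
    MPI_Type_commit( &filetype );

    /* displacement skips the blocks of all lower-ranked locations */
    MPI_Offset disp = file_start_position + ( MPI_Offset )offset * etype_size;
    MPI_File_set_view( fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL );

    /* collective write: n_callpaths blocks of nthreads doubles per process */
    MPI_File_write_all( fh, ( void* )local_rows, n_callpaths * nthreads, MPI_DOUBLE, MPI_STATUS_IGNORE );

    MPI_Type_free( &tmp );
    MPI_Type_free( &filetype );
}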

A.4 cubew_tar_writing.c

/* Initializes the tar writer and opens the file */
report_layout_writer*
cube_mpi_writing_start( char* cubename, enum CubeFlavours_t cf )
{
    report_layout_writer* tar_writer = ( tar_writer_t* )CUBEW_CALLOC( 1, sizeof( tar_writer_t ),
                                                                      MEMORY_TRACING_PREFIX "Allocate tar writer" );
    tar_writer->cubename = cubew_strdup( cubename );
    tar_writer->mode     = cubew_strdup( "0000600" );
    tar_writer->username = cubew_strdup( getenv( "USER" ) );
    if ( tar_writer->username == NULL )
    {
        tar_writer->username = cubew_strdup( getenv( "" ) );
    }
    if ( tar_writer->username == NULL )
    {
        tar_writer->username = cubew_strdup( "nouser" );
    }

    tar_writer->group = ( char* )CUBEW_CALLOC( 32, sizeof( char ), MEMORY_TRACING_PREFIX "Allocate group name" );
    strcpy( tar_writer->group, "users" );
    tar_writer->uid             = getuid();
    tar_writer->gid             = getgid();
    tar_writer->actual_tar_file = cube_get_tared_cube_name( cubename );

    tar_writer->actual_metric       = NULL;
    tar_writer->actual_tar_header   = NULL;
    tar_writer->header_position     = 0;
    tar_writer->file_start_position = 0;
    tar_writer->anchor_writing      = CUBE_FALSE;
    tar_writer->cube_flavour        = cf;
    tar_writer->tar                 = malloc( sizeof( MPI_File ) );
    tar_writer->tar_datatype        = malloc( sizeof( MPI_Datatype ) );

    /* We create a communicator of only the writers which open the file -- FOR FUTURE IMPLEMENTATION

       MPI_Group world_group;
       MPI_Comm_group( MPI_COMM_WORLD, &world_group );

       MPI_Group writers_group;
       MPI_Group_incl( world_group, writers_size, writer_ranks, &writers_group );

       MPI_Comm_create_group( MPI_COMM_WORLD, writers_group, 0, tar_writer->writers_communicator );

       tar_writer->writers_rank = -1;
       if ( MPI_COMM_NULL != *tar_writer->writers_communicator )
       {
           MPI_Comm_rank( *tar_writer->writers_communicator, &tar_writer->writers_rank );
       }
       MPI_Comm_size( *tar_writer->writers_communicator, &tar_writer->writers_size );

       MPI_Group_free( &world_group );
       MPI_Group_free( &writers_group );
     */

    /* MPI_COMM_WORLD opens the file */
    int rc;
    rc = MPI_File_open( MPI_COMM_WORLD, tar_writer->actual_tar_file,
                        MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, tar_writer->tar );
    if ( rc )
    {
        UTILS_WARNING( "Cannot open tared cube file %s.\n", tar_writer->actual_tar_file );
        perror( "The following error occurred" );
        UTILS_WARNING( "Return NULL.\n" );
    }
    MPI_File_set_view( *tar_writer->tar, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL );

    MPI_Barrier( MPI_COMM_WORLD );

    mpi_tar_create_header_type( tar_writer->tar_datatype );
    return tar_writer;
}

void
cube_report_before_mpi_metric_writing( report_layout_writer*   tar_writer,
                                       cube_metric*            met,
                                       parallel_metric_writer* writer )
{
    if ( tar_writer->cube_flavour == CUBE_SLAVE )
    {
        return;
    }

    met->data_file = cube_report_mpi_metric_data_start( tar_writer, met,
                                                        writer->raw_data_size + CUBE_DATAFILE_MARKER_SIZE );
    tar_writer->actual_metric = met;
}

void
cube_report_after_mpi_metric_writing( report_layout_writer* tar_writer, parallel_metric_writer* writer )
{
    cube_metric_mpi_finish( tar_writer->actual_metric, writer );
}

/* Writes the TAR header for a metric data file */
MPI_File*
cube_report_mpi_metric_data_start( report_layout_writer* tar_writer, cube_metric* met, uint64_t metric_size )
{
    if ( tar_writer == NULL )
    {
        UTILS_WARNING( "Non-standard run. Create faked tar writer with temp name of cube \"_NOFILE_\".\n" );
        tar_writer = cube_writing_start( "_NOFILE_", CUBE_MASTER );
    }

    char* dataname = cube_get_path_to_metric_data( tar_writer->cubename, met );

    MPI_Offset header_offset = tar_writer->header_position;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        MPI_Status status;
        if ( metric_size > MAX_FILESIZE_IN_TAR )
        {
            /* 1. tar header */
            char* paxname = calloc( 1, 11 + strlen( dataname ) );
            sprintf( paxname, "Paxheader/%s", dataname );
            tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, dataname, "x", TAR_BLOCKSIZE );

            MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );
            CUBEW_FREE( paxname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric pax name" );

            /* 2. extended pax header records */
            char* paxfile = calloc( 1, TAR_BLOCKSIZE );
            fill_pax_file( paxfile, metric_size );
            MPI_File_write( *tar_writer->tar, paxfile, TAR_BLOCKSIZE, MPI_CHAR, &status );
            CUBEW_FREE( paxfile, MEMORY_TRACING_PREFIX "Release report_layout_writer pax file" );

            /* 3. tar header; the size field gets overwritten by the pax record */
            tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, dataname, "0", 1234 );
            MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );

            /* Updates the file start position; the header size is 3 * 512 */
            tar_writer->file_start_position = tar_writer->header_position + 3 * TAR_BLOCKSIZE;
        }
        else
        {
            tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, dataname, "0", metric_size );
            MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );

            tar_writer->file_start_position = tar_writer->header_position + TAR_BLOCKSIZE;
        }
    }

    CUBEW_FREE( dataname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric data name" );
    return tar_writer->tar;
}

/* Writes the end of a metric data file */
void
cube_report_mpi_metric_data_finish( report_layout_writer* tar_writer, cube_metric* met, MPI_File* file, uint64_t size )
{
    cube_tar_mpi_file_finish( tar_writer, size );
}

/* Writes the TAR header for a metric index file */
MPI_File*
cube_report_mpi_metric_index_start( report_layout_writer* tar_writer, cube_metric* met, uint64_t index_size )
{
    MPI_Offset header_offset = tar_writer->header_position;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        char* indexname = cube_get_path_to_metric_index( tar_writer->cubename, met );
        tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, indexname, "0", index_size );
        CUBEW_FREE( indexname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric data name" );

        MPI_Status status;
        MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );

        /* Updates the file start position */
        tar_writer->file_start_position = tar_writer->header_position + sizeof( tar_gnu_header );
    }

    return tar_writer->tar;
}

/* Writes the end of a metric index file */
void
cube_report_mpi_metric_index_finish( report_layout_writer* tar_writer, cube_metric* met, MPI_File* file, uint64_t size )
{
    cube_tar_mpi_file_finish( tar_writer, size );
}

/* MPI calls for writing the tar layout */
/* Writes the TAR header for the anchor file */
MPI_File*
cube_report_mpi_anchor_start( report_layout_writer* tar_writer )
{
    MPI_Offset header_offset = tar_writer->header_position;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        char* anchorname = cube_get_path_to_anchor( tar_writer->cubename );
        /* Creates the header without the file size; we seek back afterwards */
        tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, anchorname, "0", 1234 );
        CUBEW_FREE( anchorname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric data name" );

        MPI_Status status;
        /* Writes the tar header */
        MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        /* Updates the file start position; assumes the anchor file is smaller than 8.6 GB */
        tar_writer->file_start_position = tar_writer->header_position + sizeof( tar_gnu_header );
    }

    tar_writer->anchor_writing = CUBE_TRUE;

    return tar_writer->tar;
}

/* Writes the end of the anchor file */
void
cube_report_mpi_anchor_finish( report_layout_writer* tar_writer, MPI_File* file, uint64_t anchor_start )
{
    MPI_Offset anchor_end;
    int        size;
    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        MPI_File_get_position( *file, &anchor_end );
    }

    if ( ( uint64_t )anchor_end - anchor_start > MAX_FILESIZE_IN_TAR )
    {
        UTILS_WARNING( "Size of anchor file more than 8589934591, tar corrupted" );
        perror( "The following error occurred" );
    }

    cube_tar_mpi_anchor_finish( tar_writer, ( uint64_t )anchor_end - anchor_start );
}

/* Writes two empty tar blocks marking the end of the tar archive */
void
cube_tar_mpi_finish( report_layout_writer* tar_writer )
{
    MPI_Offset header_offset = tar_writer->header_position;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        tar_empty_block* block = ( tar_empty_block* )CUBEW_CALLOC( 1, sizeof( tar_empty_block ),
                                                                   MEMORY_TRACING_PREFIX "Allocate tar block" );
        MPI_Status status;
        MPI_File_write( *tar_writer->tar, ( char* )block, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        MPI_File_write( *tar_writer->tar, ( char* )block, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        CUBEW_FREE( block, MEMORY_TRACING_PREFIX "Release block" );
    }
}

/* Creates a tar header with pax support */
tar_gnu_header*
cube_create_tar_header_with_pax_support( report_layout_writer* tar_writer, char* filename, char* typeflag, uint64_t size )
{
    tar_gnu_header* tar_header = ( tar_gnu_header* )CUBEW_CALLOC( 1, sizeof( tar_gnu_header ),
                                                                  MEMORY_TRACING_PREFIX "Allocate tar header" );
    memcpy( tar_header->name, filename, strlen( filename ) );

    memcpy( tar_header->mode, tar_writer->mode, strlen( tar_writer->mode ) );
    sprintf( tar_header->uid, "%7.7lo", ( unsigned long )( tar_writer->uid ) );
    sprintf( tar_header->gid, "%7.7lo", ( unsigned long )( tar_writer->gid ) );
    unsigned int mtime = time( NULL );
    sprintf( tar_header->mtime, "%11.11lo", ( unsigned long )mtime );

    /* Support for the pax header */
    memcpy( tar_header->typeflag, typeflag, 1 );

    memcpy( tar_header->uname, tar_writer->username, strlen( tar_writer->username ) );
    memcpy( tar_header->gname, tar_writer->group, strlen( tar_writer->group ) );

    char* magic = "ustar";
    memcpy( tar_header->magic, magic, strlen( magic ) );
    char* version = "00";
    memcpy( tar_header->version, version, strlen( version ) );

    cube_set_size_and_calculate_checksum( tar_header, size );

    return tar_header;
}
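The helper cube_set_size_and_calculate_checksum called above is not reproduced in this appendix. For orientation only, the following sketch shows how a standard POSIX tar header checksum is computed under the usual tar conventions; it is a generic illustration, not the CubeWriter implementation.

/* Generic sketch of a POSIX tar checksum: the 8-byte chksum field at offset 148
 * is treated as spaces, all 512 header bytes are summed as unsigned values, and
 * the result is stored as a zero-padded octal number followed by NUL and space. */
#include <stdio.h>
#include <string.h>

static void
sketch_tar_checksum( unsigned char header[ 512 ] )
{
    memset( header + 148, ' ', 8 );          /* checksum field counts as spaces */

    unsigned long sum = 0;
    for ( int i = 0; i < 512; i++ )
    {
        sum += header[ i ];
    }

    snprintf( ( char* )header + 148, 8, "%06lo", sum );
    header[ 155 ] = ' ';                     /* "dddddd\0 " layout of the field  */
}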

/* Writes the padding to a 512-byte boundary for a file in the tar archive */
void
cube_tar_mpi_file_finish( report_layout_writer* tar_writer, uint64_t size )
{
    uint64_t end_position = tar_writer->file_start_position + size;
    /* Calculating the difference to the next 512-byte boundary */
    uint64_t difference = ( size / sizeof( tar_empty_block ) + 1 ) * sizeof( tar_empty_block ) - size;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        char* tmp = ( char* )CUBEW_CALLOC( difference, sizeof( char ), MEMORY_TRACING_PREFIX "Allocate tail of a tar block" );
        MPI_Status stat;
        MPI_File_write( *tar_writer->tar, tmp, difference, MPI_CHAR, &stat );
        CUBEW_FREE( tmp, MEMORY_TRACING_PREFIX "Release tail in a tar block" );
    }

    CUBEW_FREE( tar_writer->actual_tar_header, MEMORY_TRACING_PREFIX "Release tar header" );

    tar_writer->actual_tar_header = NULL;
    tar_writer->header_position   = end_position + difference;
}

/* Updates the tar header of the anchor file (size, checksum) */
void
cube_tar_mpi_anchor_finish( report_layout_writer* tar_writer, uint64_t size )
{
    uint64_t end_position = tar_writer->file_start_position + size;
    /* Calculating the difference to the next 512-byte boundary */
    uint64_t difference = ( size / sizeof( tar_empty_block ) + 1 ) * sizeof( tar_empty_block ) - size;

    /* Writes the padding to 512 bytes plus two empty tar blocks */
    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        char* tmp = ( char* )CUBEW_CALLOC( difference, sizeof( char ), MEMORY_TRACING_PREFIX "Allocate tail of a tar block" );
        MPI_File_write( *tar_writer->tar, tmp, difference, MPI_CHAR, MPI_STATUS_IGNORE );
        CUBEW_FREE( tmp, MEMORY_TRACING_PREFIX "Release tail in a tar block" );
        tar_empty_block* block = ( tar_empty_block* )CUBEW_CALLOC( 1, sizeof( tar_empty_block ),
                                                                   MEMORY_TRACING_PREFIX "Allocate tar block" );
        MPI_Status status;
        MPI_File_write( *tar_writer->tar, ( char* )block, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        MPI_File_write( *tar_writer->tar, ( char* )block, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        CUBEW_FREE( block, MEMORY_TRACING_PREFIX "Release block" );
    }

    /* Seeks back to the tar header of the anchor file */
    MPI_Offset header_offset = tar_writer->header_position;
    MPI_File_seek( *tar_writer->tar, -size - difference - 512 - 1024, MPI_SEEK_CUR );

    char* anchorname = cube_get_path_to_anchor( tar_writer->cubename );
    tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, anchorname, "0", size );
    MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, MPI_STATUS_IGNORE );

    CUBEW_FREE( anchorname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric data name" );

    CUBEW_FREE( tar_writer->actual_tar_header, MEMORY_TRACING_PREFIX "Release tar header" );

    tar_writer->actual_tar_header = NULL;
    tar_writer->header_position   = end_position + difference;
}

A.5 scorep_profile_cube4_writer.c

/* ****************************************************************************
   Main writer function
 *****************************************************************************/
void
scorep_profile_write_cube4( SCOREP_Profile_OutputFormat format )
{
    /* Variable definition */

    /* Pointer to the Cube 4 metric definition. Only used on rank 0. */
    cube_metric* metric = NULL;

    /* Data set for the Cube write functions */
    scorep_cube_writing_data write_set;

    /* The CUBE layout description */
    scorep_cube_layout layout;

    UTILS_PRINTF( SCOREP_DEBUG_PROFILE, "Writing profile in Cube 4 format ..." );

    /* Initialization, header and definitions */

    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Prepare writing" );

    SCOREP_Ipc_Group* comm = SCOREP_IPC_GROUP_WORLD;
    if ( SCOREP_UseSystemTreeSequence() )
    {
        comm = scorep_system_tree_seq_get_ipc_group();
    }

    if ( !init_cube_writing_data( &write_set, format, comm ) )
    {
        return;
    }
    scorep_profile_init_layout( &write_set, &layout );

    if ( write_set.my_rank == write_set.root_rank )
    {
        /* generate header */
        cube_def_attr( write_set.my_cube, "Creator", "Score-P " PACKAGE_VERSION );
        cube_def_attr( write_set.my_cube, "CUBE_CT_AGGR", "SUM" );
        cube_def_mirror( write_set.my_cube, "file://" DOCDIR "/profile/" );
        cube_def_mirror( write_set.my_cube, "http://www.vi-hps.org/upload/packages/scorep/" );

        if ( SCOREP_IsUnwindingEnabled() )
        {
            /*
             * Record the number of sampling-related definitions in the profile.
             * The names directly correspond to the OTF2 definition names.
             */
            char buffer[ 32 ];
            sprintf( buffer, "%u", scorep_unified_definition_manager->calling_context.counter );
            cube_def_attr( write_set.my_cube, "Score-P::DefinitionCounters::CallingContext", buffer );
            sprintf( buffer, "%u", scorep_unified_definition_manager->interrupt_generator.counter );
            cube_def_attr( write_set.my_cube, "Score-P::DefinitionCounters::InterruptGenerator", buffer );
        }

        add_default_spec_file( write_set.my_cube );
    }

    /* Write definitions to cube */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Writing definitions" );
    scorep_write_definitions_to_cube4( write_set.my_cube, write_set.map, write_set.ranks_number,
                                       write_set.global_items, write_set.items_per_rank, &layout );

    /* Build the mapping from the sequence number in the unified callpath definitions to profile nodes */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Create mappings" );
    add_mapping_to_cube_writing_data( &write_set );

    /* Write clustering mappings */
    scorep_cluster_write_cube4( &write_set );

    /* dense metrics */

    /* Write implicit time and visits */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Writing runtime" );

    if ( layout.dense_metric_type == SCOREP_CUBE_DATA_TUPLE )
    {
        write_cube_cube_type_tau_atomic( &write_set, comm, scorep_get_sum_time_handle(), &get_time_tuple, NULL );

        if ( layout.metric_list & SCOREP_CUBE_METRIC_VISITS )
        {
            write_cube_cube_type_tau_atomic( &write_set, comm, scorep_get_visits_handle(), &get_visits_tuple, NULL );
        }

        if ( SCOREP_IsUnwindingEnabled() )
        {
            write_cube_cube_type_tau_atomic( &write_set, comm, scorep_get_hits_handle(), &get_hits_tuple, NULL );
        }
    }
    else
    {
        write_cube_doubles( &write_set, comm, scorep_get_sum_time_handle(), &get_sum_time_value, NULL );
        write_cube_doubles( &write_set, comm, scorep_get_max_time_handle(), &get_max_time_value, NULL );
        write_cube_doubles( &write_set, comm, scorep_get_min_time_handle(), &get_min_time_value, NULL );

        if ( layout.metric_list & SCOREP_CUBE_METRIC_VISITS )
        {
            write_cube_uint64( &write_set, comm, scorep_get_visits_handle(), &get_visits_value, NULL );
        }

        if ( SCOREP_IsUnwindingEnabled() )
        {
            write_cube_uint64( &write_set, comm, scorep_get_hits_handle(), &get_hits_value, NULL );
        }
    }

    if ( layout.metric_list & SCOREP_CUBE_METRIC_NUM_THREADS )
    {
        write_cube_doubles( &write_set, comm, scorep_get_num_threads_handle(), &get_number_of_threads, NULL );
    }

    /* Write additional dense metrics (e.g. hardware counters) */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Writing dense metrics" );
    for ( uint8_t i = 0; i < SCOREP_Metric_GetNumberOfStrictlySynchronousMetrics(); i++ )
    {
        cube_metric*        metric        = NULL; /* Only used on rank 0 */
        SCOREP_MetricHandle metric_handle = SCOREP_Metric_GetStrictlySynchronousMetricHandle( i );

        if ( write_set.my_rank == write_set.root_rank )
        {
            metric = scorep_get_cube4_metric( write_set.map, SCOREP_MetricHandle_GetUnified( metric_handle ) );
        }

        /* When writing sparse metrics, we skip the time metric handles.
           Thus, invalidate these entries to avoid writing them twice. */
        if ( write_set.metric_map != NULL )
        {
            uint32_t current_number = SCOREP_MetricHandle_GetUnifiedId( metric_handle );
            write_set.metric_map[ current_number ] = SCOREP_PROFILE_DENSE_METRIC;
        }

        if ( layout.dense_metric_type == SCOREP_CUBE_DATA_TUPLE )
        {
            write_cube_cube_type_tau_atomic( &write_set, comm, metric, &get_metric_tuple_from_array, &i );
        }
        else
        {
            write_cube_uint64( &write_set, comm, metric, &get_metrics_value_from_array, &i );
        }
    }

    /* sparse metrics */

    /* Write sparse metrics (e.g. user metrics) */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Writing sparse metrics" );
    if ( write_set.metric_map != NULL )
    {
        cube_metric* metric = NULL; /* Only used on rank 0 */

        for ( uint32_t i = 0; i < write_set.num_unified_metrics; i++ )
        {
            if ( !check_if_metric_shall_be_written( &write_set, write_set.metric_map[ i ] ) )
            {
                continue;
            }

            if ( write_set.my_rank == write_set.root_rank )
            {
                metric = scorep_get_cube4_metric( write_set.map, write_set.unified_metric_map[ i ] );
            }

            if ( write_set.metric_map[ i ] == SCOREP_INVALID_METRIC )
            {
                set_bit_string_for_unknown_metric( &write_set, comm );
                if ( layout.sparse_metric_type == SCOREP_CUBE_DATA_TUPLE )
                {
                    write_cube_cube_type_tau_atomic( &write_set, comm, metric,
                                                     &get_sparse_tuple_value_from_double, &write_set.metric_map[ i ] );
                }
                else
                {
                    write_cube_doubles( &write_set, comm, metric,
                                        &get_sparse_double_value, &write_set.metric_map[ i ] );
                }
                continue;
            }

            switch ( SCOREP_MetricHandle_GetValueType( write_set.metric_map[ i ] ) )
            {
                case SCOREP_METRIC_VALUE_INT64:
                case SCOREP_METRIC_VALUE_UINT64:
                    set_bit_string_for_metric( &write_set, comm, &get_sparse_uint64_value, &write_set.metric_map[ i ] );
                    if ( layout.sparse_metric_type == SCOREP_CUBE_DATA_TUPLE )
                    {
                        write_cube_cube_type_tau_atomic( &write_set, comm, metric,
                                                         &get_sparse_tuple_value_from_uint64, &write_set.metric_map[ i ] );
                    }
                    else
                    {
                        write_cube_uint64( &write_set, comm, metric,
                                           &get_sparse_uint64_value, &write_set.metric_map[ i ] );
                    }
                    break;
                case SCOREP_METRIC_VALUE_DOUBLE:
                    set_bit_string_for_metric( &write_set, comm, &has_sparse_double_value, &write_set.metric_map[ i ] );
                    if ( layout.sparse_metric_type == SCOREP_CUBE_DATA_TUPLE )
                    {
                        write_cube_cube_type_tau_atomic( &write_set, comm, metric,
                                                         &get_sparse_tuple_value_from_double, &write_set.metric_map[ i ] );
                    }
                    else
                    {
                        write_cube_doubles( &write_set, comm, metric,
                                            &get_sparse_double_value, &write_set.metric_map[ i ] );
                    }
                    break;
                default:
                    UTILS_ERROR( SCOREP_ERROR_UNKNOWN_TYPE,
                                 "Metric %s has unknown value type %d",
                                 SCOREP_MetricHandle_GetName( write_set.metric_map[ i ] ),
                                 SCOREP_MetricHandle_GetValueType( write_set.metric_map[ i ] ) );
            }
        }
    }
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Profile writing done" );

    /* Clean up */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Clean up" );
    delete_cube_writing_data( &write_set );
    if ( SCOREP_Env_UseSystemTreeSequence() )
    {
        scorep_system_tree_seq_free_ipc_group( comm );
    }
}

/**
 * Initializes a scorep_cube_writing_data object.
 * @param writeSet  The data structure that is initialized. Must already be allocated.
 * @param comm      Communicator of all ranks in the order they are written to CUBE.
 * @returns whether the initialization was successful.
 */
static bool
init_cube_writing_data( scorep_cube_writing_data*   writeSet,
                        SCOREP_Profile_OutputFormat format,
                        SCOREP_Ipc_Group*           comm )
{
    /* Set all pointers to zero. If a malloc fails, we know how many can be freed. */
    writeSet->my_cube            = NULL;
    writeSet->id_2_node          = NULL;
    writeSet->map                = NULL;
    writeSet->items_per_rank     = NULL;
    writeSet->offsets_per_rank   = NULL;
    writeSet->metric_map         = NULL;
    writeSet->unified_metric_map = NULL;
    writeSet->bit_vector         = NULL;

    /* Start initializing */

    writeSet->format = format;

    /* Get basic MPI data */
    writeSet->my_rank       = SCOREP_IpcGroup_GetRank( comm );
    writeSet->local_threads = scorep_profile_get_number_of_threads();
    writeSet->ranks_number  = SCOREP_IpcGroup_GetSize( comm );

    /* Get root rank */
    writeSet->root_rank = writeSet->my_rank;
    SCOREP_Ipc_Bcast( &writeSet->root_rank, 1, SCOREP_IPC_UINT32_T, 0 );

    /* Calculate the local number of items */
    writeSet->local_items = scorep_profile_get_aggregated_items( writeSet->local_threads, format );

    /* Get the number of unified callpath definitions to all ranks */
    if ( writeSet->my_rank == writeSet->root_rank )
    {
        writeSet->callpath_number = SCOREP_Definitions_GetNumberOfUnifiedCallpathDefinitions();
    }
    SCOREP_IpcGroup_Bcast( comm, &writeSet->callpath_number, 1, SCOREP_IPC_UINT32_T, writeSet->root_rank );
    if ( writeSet->callpath_number == 0 )
    {
        return false;
    }

    /* Calculate the offsets of all ranks and the number of locations per rank */
    if ( writeSet->my_rank == writeSet->root_rank )
    {
        size_t buffer_size = writeSet->ranks_number * sizeof( int );
        writeSet->items_per_rank   = ( int* )malloc( buffer_size );
        writeSet->offsets_per_rank = ( int* )malloc( buffer_size );
    }

    SCOREP_IpcGroup_Gather( comm, &writeSet->local_items, writeSet->items_per_rank,
                            1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    if ( writeSet->my_rank == writeSet->root_rank )
    {
        writeSet->global_items    = 0;
        writeSet->same_thread_num = 1;
        for ( int32_t i = 0; i < writeSet->ranks_number; ++i )
        {
            if ( writeSet->local_items != writeSet->items_per_rank[ i ] )
            {
                writeSet->same_thread_num = 0;
            }
            writeSet->offsets_per_rank[ i ] = writeSet->global_items;
            writeSet->global_items         += writeSet->items_per_rank[ i ];
        }
    }

    /* Distribute the global number of locations */
    SCOREP_IpcGroup_Bcast( comm, &writeSet->global_items, 1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    /* Distribute whether all ranks have the same number of locations */
    SCOREP_IpcGroup_Bcast( comm, &writeSet->same_thread_num, 1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    /* Distribute the per-rank offsets */
    SCOREP_IpcGroup_Scatter( comm, writeSet->offsets_per_rank, &writeSet->offset,
                             1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    /* Get the number of unified metrics to every rank */
    if ( writeSet->my_rank == writeSet->root_rank )
    {
        writeSet->num_unified_metrics = SCOREP_Definitions_GetNumberOfUnifiedMetricDefinitions();
    }
    SCOREP_IpcGroup_Bcast( comm, &writeSet->num_unified_metrics, 1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    /* Create the mappings from Cube to Score-P handles and vice versa */
    writeSet->map = scorep_cube4_create_definitions_map();
    if ( writeSet->map == NULL )
    {
        UTILS_ERROR( SCOREP_ERROR_MEM_ALLOC_FAILED,
                     "Failed to allocate memory for definition mapping\n"
                     "Failed to write Cube 4 profile" );

        delete_cube_writing_data( writeSet );
        return false;
    }

    /* Create the Cube writer object and the Cube object */

    /* Construct the cube file name */
    const char* dirname  = SCOREP_GetExperimentDirName();
    char*       filename = NULL;
    const char* basename = scorep_profile_get_basename();

    filename = ( char* )malloc( strlen( dirname )      /* directory      */
                                + 1                    /* separator '/'  */
                                + strlen( basename )   /* basename       */
                                + 1 );                 /* trailing '\0'  */
    if ( filename == NULL )
    {
        UTILS_ERROR_POSIX( "Failed to allocate memory for filename.\n"
                           "Failed to write Cube 4 profile" );
        delete_cube_writing_data( writeSet );
        return false;
    }
    sprintf( filename, "%s/%s", dirname, basename );

    /* MPI version: create Cube objects for all processes */
    if ( writeSet->my_rank == writeSet->root_rank )
    {
        writeSet->my_cube = cube_mpi_create( filename, CUBE_MASTER, writeSet->my_rank, writeSet->ranks_number,
                                             writeSet->global_items, writeSet->local_items, writeSet->offset, CUBE_FALSE );
    }
    else
    {
        writeSet->my_cube = cube_mpi_create( filename, CUBE_WRITER, writeSet->my_rank, writeSet->ranks_number,
                                             writeSet->global_items, writeSet->local_items, writeSet->offset, CUBE_FALSE );
    }

    free( filename );

    /* Create a bit vector with all bits set. Used for dense metrics. */
    writeSet->bit_vector = ( uint8_t* )malloc( SCOREP_Bitstring_GetByteSize( writeSet->callpath_number ) );
    UTILS_ASSERT( writeSet->bit_vector );
    SCOREP_Bitstring_SetAll( writeSet->bit_vector, writeSet->callpath_number );

    /* Check whether tasks have been used somewhere */
    int32_t has_tasks = scorep_profile_has_tasks();
    writeSet->has_tasks = 0;
    SCOREP_IpcGroup_Allreduce( comm, &has_tasks, &writeSet->has_tasks, 1, SCOREP_IPC_INT32_T, SCOREP_IPC_BOR );
    return true;
}

/**
 * @def SCOREP_PROFILE_WRITE_CUBE_METRIC
 * Code to write metric values in cube format. Used to reduce code replication.
 */
#define SCOREP_PROFILE_WRITE_CUBE_METRIC( type, TYPE, NUMBER, cube_type, zero ) \
    static void \
    write_cube_##cube_type( scorep_cube_writing_data*                 writeSet, \
                            SCOREP_Ipc_Group*                         comm, \
                            cube_metric*                              metric, \
                            scorep_profile_get_##cube_type##_func     getValue, \
                            void*                                     funcData ) \
    { \
        scorep_profile_node* node              = NULL; \
        cube_cnode*          cnode             = NULL; \
        type*                aggregated_values = NULL; \
        type*                local_values      = NULL; \
        type*                global_values     = NULL; \
        int*                 order             = NULL; \
        if ( writeSet->callpath_number == 0 ) \
        { \
            return; \
        } \
 \
        local_values      = ( type* )malloc( writeSet->local_threads * sizeof( type ) ); \
        aggregated_values = ( type* )malloc( writeSet->local_items * sizeof( type ) ); \
        UTILS_ASSERT( local_values ); \
        UTILS_ASSERT( aggregated_values ); \
 \
        /* Array of values of one rank for all call paths */ \
        global_values = ( type* )malloc( writeSet->local_items * writeSet->callpath_number * sizeof( type ) ); \
 \
        /* Array of the cnode order in the file */ \
        order = ( int* )malloc( writeSet->callpath_number * sizeof( int ) ); \
 \
        if ( writeSet->my_rank == writeSet->root_rank ) \
        { \
            /* Initialize writing of a new metric */ \
            cube_set_known_cnodes_for_metric( writeSet->my_cube, metric, ( char* )writeSet->bit_vector ); \
 \
            /* Get the order of the cnodes */ \
            carray*  sequence = cube_get_cnodes_for_metric( writeSet->my_cube, metric ); \
            unsigned i        = 0; \
            for ( i = 0; i < sequence->size; i++ ) \
            { \
                order[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id; \
            } \
        } \
 \
        /* Broadcast the order to the other ranks */ \
        SCOREP_IpcGroup_Bcast( comm, order, writeSet->callpath_number, SCOREP_IPC_INT, writeSet->root_rank ); \
 \
        /* Iterate over all unified call paths */ \
        unsigned written_cnodes = 0; \
        for ( uint64_t cp_index = 0; cp_index < writeSet->callpath_number; cp_index++ ) \
        { \
            if ( !SCOREP_Bitstring_IsSet( writeSet->bit_vector, cp_index ) ) \
            { \
                continue; \
            } \
            for ( uint64_t thread_index = 0; thread_index < writeSet->local_threads; thread_index++ ) \
            { \
                uint64_t node_index = thread_index * writeSet->callpath_number + order[ cp_index ]; \
                node = writeSet->id_2_node[ node_index ]; \
                if ( node != NULL ) \
                { \
                    local_values[ thread_index ] = getValue( node, funcData ); \
                } \
                else \
                { \
                    local_values[ thread_index ] = zero; \
                } \
            } \
            scorep_profile_aggregate_##type( &local_values, &aggregated_values, writeSet ); \
 \
            /* Append the array of one call path to the others */ \
            memcpy( &global_values[ writeSet->local_items * written_cnodes ], aggregated_values, \
                    writeSet->local_items * sizeof( type ) ); \
            written_cnodes = written_cnodes + 1; \
        } \
 \
        SCOREP_IpcGroup_Barrier( comm ); \
        /* Write the data of all call paths for this metric */ \
        cube_write_all_sev_rows_of_##cube_type( writeSet->my_cube, metric, global_values ); \
        /* Clean up */ \
        free( global_values ); \
        free( local_values ); \
        free( aggregated_values ); \
        free( order ); \
    }

/* *INDENT-ON* */

/**
 * @function write_cube_uint64
 * Writes data for the metric @a metric to a cube object.
 * @param writeSet  Structure containing write data.
 * @param metric    The cube metric handle for the written metric.
 * @param getValue  Function pointer which returns the value for a given profile node.
 * @param funcData  Pointer to data that is passed to the @a getValue function.
 */
SCOREP_PROFILE_WRITE_CUBE_METRIC( uint64_t, UINT64_T, 1, uint64, 0 )

/**
 * @function write_cube_doubles
 * Writes data for the metric @a metric to a cube object.
 * @param writeSet  Structure containing write data.
 * @param metric    The cube metric handle for the written metric.
 * @param getValue  Function pointer which returns the value for a given profile node.
 * @param funcData  Pointer to data that is passed to the @a getValue function.
 */
SCOREP_PROFILE_WRITE_CUBE_METRIC( double, DOUBLE, 1, doubles, 0.0 )

/**
 * @function write_cube_cube_type_tau_atomic
 * Writes data for the metric @a metric to a cube object.
 * @param writeSet  Structure containing write data.
 * @param metric    The cube metric handle for the written metric.
 * @param getValue  Function pointer which returns the value for a given profile node.
 * @param funcData  Pointer to data that is passed to the @a getValue function.
 */
SCOREP_PROFILE_WRITE_CUBE_METRIC( cube_type_tau_atomic, BYTE, sizeof( cube_type_tau_atomic ),
                                  cube_type_tau_atomic, scorep_cube_type_tau_atomic_zero )
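For orientation, instantiating the macro with ( double, DOUBLE, 1, doubles, 0.0 ) generates a static function with the following signature; this is reconstructed from the macro definition above, and the uint64 and TAU_ATOMIC writers are generated analogously.

/* Expansion sketch of SCOREP_PROFILE_WRITE_CUBE_METRIC( double, DOUBLE, 1, doubles, 0.0 ):
 * the generated function gathers one row per call path, aggregates it locally and
 * hands the result to cube_write_all_sev_rows_of_doubles(). */
static void
write_cube_doubles( scorep_cube_writing_data*       writeSet,
                    SCOREP_Ipc_Group*               comm,
                    cube_metric*                    metric,
                    scorep_profile_get_doubles_func getValue,
                    void*                           funcData );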

A.6 prototype_new.c

/****************************************************************************
**  CUBE        http://www.scalasca.org/                                   **
*****************************************************************************
**  Copyright (c) 1998-2018                                                **
**  Forschungszentrum Juelich GmbH, Juelich Supercomputing Centre          **
**                                                                         **
**  Copyright (c) 2009-2015                                                **
**  German Research School for Simulation Sciences GmbH,                   **
**  Laboratory for Parallel Programming                                    **
**                                                                         **
**  This software may be modified and distributed under the terms of      **
**  a BSD-style license.  See the COPYING file in the package base        **
**  directory for details.                                                 **
****************************************************************************/

/**
 * \file prototype_new.c
 * \brief Example of using "libcube4w.a" in a parallel way. One creates a cube file
 * "example.cubex", defines the structure of metrics, call tree, machine and Cartesian
 * topology, and fills the cube with data. Calls to cube_write_all_sev_rows_* write the
 * rows in parallel to the cube report directly on disk.
 */

#include <...>
#include "cubew_cube.h"
#include "cubew_services.h"

/* used in static array initialization; see below at the topology definition */
#define NDIMS1 2
#define NDIMS2 2
#define NDIMS3 4
#define NDIMS4 14

#ifndef MAXNTHREADS
#define MAXNTHREADS 64
#endif

#ifndef NMETRICS
#define NMETRICS 7
#endif

#ifndef CNODES
#define CNODES 10000
#endif

/* used to create a stable, random-looking sequence of numbers (used for creating the call tree) */
static int current_random = 0;
static int MAXDEPTH       = 100;

void
create_system_tree( cube_t* cube, cube_process** process, cube_thread** thread, cube_node* node, int size );

void
fill_cube_type_double_row( double* row, unsigned size, double value );
void
create_array_of_doubles( cube_t* cube, cube_metric* met, double* sev, int nthreads, int ncallpaths, char* bitmask );

void
fill_cube_type_int_row( int64_t* row, unsigned size, int value );
void
create_array_of_int( cube_t* cube, cube_metric* met, int64_t* sev, int nthreads, int ncallpaths, char* bitmask );

void
fill_cube_type_tau_atomic_row( cube_type_tau_atomic* row, unsigned size, unsigned value );
void
create_array_of_tau_atomic( cube_t* cube, cube_metric* met, cube_type_tau_atomic* sev, int nthreads, int ncallpaths, char* bitmask );

void
create_calltree( cube_t* cube, cube_region** regn, cube_cnode** codes, int* idx, int* reg_idx, cube_cnode* parent, int level );
int
get_next_random();
void
set_seed_random( int s );
void
calculate_max_depth();
int
calcaulate_sum2( int start, int finish );
char*
create_bitmask( int cnodes, char* bitmask );

// / Start point of this example. i n t main( int argc, char ∗ a rg v [ ] ) { i n t rank ; i n t s i z e ; MPI Init( &argc, &argv ); MPI Comm rank ( MPI COMM WORLD, &rank ) ; MPI Comm size ( MPI COMM WORLD, &s i z e ) ; if ( rank == 0 ) { printf( ”Number of processes: %d\n ” , s i z e ) ; } c a l c u l a t e m a x d e p t h ( ) ;

char cubefile[ 17 ] = ”mpi complex cube”; /∗ Create a cube with name cubefile. Root uses CUBE MASTER a l l o t h e r s CUBE SLAVE . ∗/

int nthreads = ( rank ) % ( MAXNTHREADS )+1; i n t t h r d s i n c y c l e = ( MAXNTHREADS ) ∗ ( MAXNTHREADS + 1 ) / 2 ; i n t o f f s e t = ( rank ) / ( MAXNTHREADS )∗ thrdsincycle + ( nthreads 1 ) ∗ nthreads / 2; int totalnthreads = ( ( size / ( MAXNTHREADS ) ) ∗ thrdsincycle ) + ( ( size % MAXNTHREADS ) ∗ ( ( s i z e % MAXNTHREADS ) + 1 ) ) / 2 ;

    double tstart, t0, t1, t2, t3, t4, t5, t6, t7, tend;
    if ( rank == 0 )
    {
        tstart = MPI_Wtime();
    }
    cube_t* cube;
    if ( rank == 0 )
    {
        cube = cube_mpi_create( cubefile, CUBE_MASTER, rank, size, totalnthreads, nthreads, offset, CUBE_FALSE );
    }
    else
    {
        cube = cube_mpi_create( cubefile, CUBE_WRITER, rank, size, totalnthreads, nthreads, offset, CUBE_FALSE );
    }

    if ( !cube )
    {
        fprintf( stderr, "cube create failed!\n" );
        exit( 1 );
    }
    /* Seven metrics */
    cube_metric** met_vector = ( cube_metric** )calloc( NMETRICS, sizeof( cube_metric* ) );

    if ( rank == 0 )
    {
        /* generate header */
        cube_def_mirror( cube, "http://www.fz-juelich.de/jsc/scalasca/" );
        cube_def_attr( cube, "description", "A simple example of Cube report" );
        cube_def_attr( cube, "experiment time", "November 1st, 2004" );

        printf( "Test file %s.cubex is being generated ...\n", cubefile );

        /* root metric Time */
        met_vector[ 0 ] = cube_def_met( cube, "Time", "time", "FLOAT", "sec", "",
                                        "@mirror@patterns-2.1.html#execution",
                                        "root node", NULL, CUBE_METRIC_INCLUSIVE );
        /* A metric of visits */
        met_vector[ 1 ] = cube_def_met( cube, "Visits", "visits", "INTEGER", "occ", "",
                                        "http://www.cs.utk.edu/usr.html",
                                        "Number of visits", NULL, CUBE_METRIC_EXCLUSIVE );
        /* Another root metric */
        met_vector[ 2 ] = cube_def_met( cube, "P2P bytes sent", "bytes_sent", "INTEGER", "bytes", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of bytes sent in point-to-point operations", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 3 ] = cube_def_met( cube, "P2P bytes received", "bytes_rcvd", "INTEGER", "bytes", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of bytes received in point-to-point operations", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 4 ] = cube_def_met( cube, "Tasks time", "time_in_tasks", "MAXDOUBLE", "sec", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Time spent in tasks.", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 5 ] = cube_def_met( cube, "Counters", "number_of_counters", "TAU_ATOMIC", "occ", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of counters", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 6 ] = cube_def_met( cube, "Memory allocation", "memoryallocation", "TAU_ATOMIC", "tau_statistics", "",
                                        "@mirror@patterns-2.1.html#memoryallocation",
                                        "Total allocated memory", NULL, CUBE_METRIC_EXCLUSIVE );

        /* define call tree */
        /* definition of regions in the call tree */
        char* mod = "/ICL/CUBE/example.c";

        cube_region** regn = ( cube_region** )calloc( 169, sizeof( cube_region* ) );

        char* names[ 169 ] = {
            "UNKNOWN", "TRACING", "MPI_Accumulate", "MPI_Allgather", "MPI_Allgatherv", "MPI_Allreduce",
            "MPI_Alltoall", "MPI_Alltoallv", "MPI_Alltoallw", "MPI_Barrier", "MPI_Bcast", "MPI_Bsend",
            "MPI_Bsend_init", "MPI_Cancel", "MPI_Cart_create", "MPI_Comm_create", "MPI_Comm_dup",
            "MPI_Comm_free", "MPI_Comm_group", "MPI_Comm_remote_group", "MPI_Comm_split", "MPI_Exscan",
            "MPI_File_close", "MPI_File_delete", "MPI_File_get_amode", "MPI_File_get_atomicity",
            "MPI_File_get_byte_offset", "MPI_File_get_group", "MPI_File_get_info", "MPI_File_get_position",
            "MPI_File_get_position_shared", "MPI_File_get_size", "MPI_File_get_type_extent",
            "MPI_File_get_view", "MPI_File_iread", "MPI_File_iread_at", "MPI_File_iread_shared",
            "MPI_File_iwrite", "MPI_File_iwrite_at", "MPI_File_iwrite_shared", "MPI_File_open",
            "MPI_File_preallocated", "MPI_File_read", "MPI_File_read_all", "MPI_File_read_all_begin",
            "MPI_File_real_all_end", "MPI_File_read_at", "MPI_File_read_at_all", "MPI_File_read_at_all_begin",
            "MPI_File_read_at_all_end", "MPI_File_read_ordered", "MPI_File_read_ordered_begin",
            "MPI_File_read_ordered_end", "MPI_File_read_shared", "MPI_File_seek", "MPI_File_seek_shared",
            "MPI_File_set_atomicity", "MPI_File_set_info", "MPI_File_set_size", "MPI_File_set_view",
            "MPI_File_sync", "MPI_file_write", "MPI_File_write_all", "MPI_File_write_all_begin",
            "MPI_File_write_all_end", "MPI_File_write_at_all", "MPI_File_write_at_all_begin",
            "MPI_File_write_at_all_end", "MPI_File_write_ordered", "MPI_File_write_ordered_begin",
            "MPI_File_write_ordered_end", "MPI_File_write_shared", "MPI_Finalize", "MPI_Gather",
            "mPI_Gatherv", "MPI_Get", "MPI_Graph_create", "MPI_Group_difference", "MPI_Group_excl",
            "MPI_Group_free", "MPI_Group_incl", "MPI_Group_intersection", "MPI_Group_range_excl",
            "MPI_Group_range_incl", "MPI_Group_union", "MPI_Ibsend", "MPI_Init", "MPI_Init_thread",
            "MPI_Intercomm_create", "MPI_Intercomm_merge", "MPI_Irecv", "MPI_Irsend", "MPI_Isend",
            "MPI_Issend", "MPI_Probe", "MPI_Put", "MPI_Recv", "MPI_Recv_init", "MPI_Reduce",
            "MPI_Reduce_scatter", "MPI_Request_free", "MPI_Rsend", "MPI_Rsend_init", "MPI_Scan",
            "MPI_Scatter", "MPI_Scatterv", "MPI_Send", "MPI_Send_init", "MPI_Sendrecv",
            "MPI_Sendrecv_replace", "MPI_Ssend", "MPI_Ssend_init", "MPI_Start", "MPI_Startall",
            "MPI_Test", "MPI_Testall", "MPI_Testany", "MPI_Testsome", "MPI_Wait", "MPI_Waitall",
            "MPI_Waitany", "MPI_Waitsome", "MPI_Win_complete", "MPI_Win_create", "MPI_Win_fence",
            "MPI_Win_free", "MPI_Win_lock", "MPI_Win_post", "MPI_Win_start", "MPI_Win_test",
            "MPI_Win_unlock", "MPI_Win_wait", "PARALLEL", "driver", "task_init", "bcast_init",
            "barriere_sync", "read_input", "bcast_real", "decomp", "inner_auto", "inner", "initialize",
            "initxs", "initsnc", "octant", "initgeom", "timers", "its=1", "source", "sweep", "rcv_real",
            "snd_real", "global_init_sum", "flux_err", "global_real_max", "its=2", "its=3", "its=4",
            "its=5", "its=6", "its=7", "its=8", "its=9", "its=10", "its=11", "its=12",
            "global_real_sum", "task_end"
        };

        unsigned i;
        for ( i = 0; i < 169; i++ )
        {
            char* descr = "";
            if ( i > 1 && i < 132 )
            {
                descr = "MPI";
            }
            if ( i == 1 || i == 132 )
            {
                descr = "EPIK";
            }
            if ( i > 132 )
            {
                descr = "USR";
            }
            regn[ i ] = cube_def_region( cube, names[ i ], names[ i ], "mpi", "barrier", -1, -1, "", descr, mod );
        }

        /* actual definition of the call tree */
        srand( 0 );
        cube_cnode** cnodes_vector = ( cube_cnode** )calloc( CNODES, sizeof( cube_cnode* ) );
        int idx        = 0;
        int region_idx = 0;
        create_calltree( cube, regn, cnodes_vector, &idx, &region_idx, NULL, 0 );

        /* define location tree */
        cube_machine* mach = cube_def_mach( cube, "MSC<>", "" );
        cube_node*    node = cube_def_node( cube, "Athena<>", mach );
        cube_process** processes_vector = ( cube_process** )calloc( size, sizeof( cube_process* ) );
        cube_thread**  threads_vector   = ( cube_thread** )calloc( size * totalnthreads, sizeof( cube_thread* ) );
        /* Define processes and threads */
        create_system_tree( cube, processes_vector, threads_vector, node, size );
        free( regn );

        cube_set_statistic_name( cube, "mystat" );

        cube_set_metrics_title( cube, "Metric tree (QMCD App )" );
        cube_set_calltree_title( cube, "Calltree (serial run, #2345)" );
        cube_set_systemtree_title( cube, "System ( Linux cluster < 43 & 23 >)" );

        /* generate header */
        cube_def_mirror( cube, "http://icl.cs.utk.edu/software/kojak/" );
        cube_def_mirror( cube, "http://www.fz-juelich.de/jsc/kojak/" );
        cube_def_attr( cube, "description", "a simple example" );
        cube_def_attr( cube, "experiment time", "November 1st, 2004" );
        cube_enable_flat_tree( cube, CUBE_FALSE );
    }

    if ( rank == 0 )
    {
        t0 = MPI_Wtime();
        printf( "Definitions: %f\n", t0 - tstart );
    }

    MPI_Barrier( MPI_COMM_WORLD );

    int   n_callpaths;
    int   bitmask_size = ( CNODES + 7 ) / 8;
    char* bitmask      = malloc( bitmask_size * sizeof( char ) );

    /* First metric */
    double* sev0 = malloc( nthreads * CNODES * sizeof( double ) );
    // fill_cube_type_double_row( sev0, nthreads * CNODES, 3 );
    create_array_of_doubles( cube, met_vector[ 0 ], sev0, nthreads, CNODES, bitmask );
    cube_write_all_sev_rows_of_doubles( cube, met_vector[ 0 ], sev0 );
    free( sev0 );

    if ( rank == 0 ) { t1 = MPI_Wtime(); printf( "1. metric: %f\n", t1 - t0 ); }

    /* Second metric */
    int64_t* sev1 = malloc( nthreads * CNODES * sizeof( int64_t ) );
    // fill_cube_type_int_row( sev1, nthreads * CNODES, 3 );
    create_array_of_int( cube, met_vector[ 1 ], sev1, nthreads, CNODES, bitmask );
    cube_write_all_sev_rows_of_int64( cube, met_vector[ 1 ], sev1 );
    free( sev1 );

    if ( rank == 0 ) { t2 = MPI_Wtime(); printf( "2. metric: %f\n", t2 - t1 ); }

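    /* Editor's note (illustration, not part of the original prototype): the sparse metrics
     * below only carry data for a subset of the call paths.  create_bitmask() sets the
     * first n_callpaths bits of the CNODES-bit mask, cube_set_known_cnodes_for_metric()
     * announces this subset to the CubeWriter on rank 0, and each rank then allocates and
     * fills only nthreads * n_callpaths values instead of nthreads * CNODES. */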
    /* Third metric - 5% sparse */
    n_callpaths = CNODES * 0.05;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 2 ], bitmask );
    }
    int64_t* sev2 = malloc( nthreads * n_callpaths * sizeof( int64_t ) );
    // fill_cube_type_int_row( sev2, nthreads * n_callpaths, 3 );
    create_array_of_int( cube, met_vector[ 2 ], sev2, nthreads, n_callpaths, bitmask );
    cube_write_all_sev_rows_of_int64( cube, met_vector[ 2 ], sev2 );
    free( sev2 );

    if ( rank == 0 ) { t3 = MPI_Wtime(); printf( "3. metric: %f\n", t3 - t2 ); }

    /* Fourth metric - 5% sparse */
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 3 ], bitmask );
    }
    int64_t* sev3 = malloc( nthreads * n_callpaths * sizeof( int64_t ) );
    // fill_cube_type_int_row( sev3, nthreads * n_callpaths, 3 );
    create_array_of_int( cube, met_vector[ 3 ], sev3, nthreads, n_callpaths, bitmask );
    cube_write_all_sev_rows_of_int64( cube, met_vector[ 3 ], sev3 );
    free( sev3 );

    if ( rank == 0 ) { t4 = MPI_Wtime(); printf( "4. metric: %f\n", t4 - t3 ); }

    /* Fifth metric - 20% sparse */
    n_callpaths = CNODES * 0.2;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 4 ], bitmask );
    }
    double* sev4 = malloc( nthreads * n_callpaths * sizeof( double ) );
    // fill_cube_type_double_row( sev4, nthreads * n_callpaths, 3 );
    create_array_of_doubles( cube, met_vector[ 4 ], sev4, nthreads, n_callpaths, bitmask );
    cube_write_all_sev_rows_of_doubles( cube, met_vector[ 4 ], sev4 );
    free( sev4 );

    if ( rank == 0 ) { t5 = MPI_Wtime(); printf( "5. metric: %f\n", t5 - t4 ); }

    /* Sixth metric */
    cube_type_tau_atomic* sev5 = malloc( nthreads * CNODES * sizeof( cube_type_tau_atomic ) );
    // fill_cube_type_tau_atomic_row( sev5, nthreads * CNODES, 1 );
    create_array_of_tau_atomic( cube, met_vector[ 5 ], sev5, nthreads, CNODES, bitmask );
    cube_write_all_sev_rows_of_cube_type_tau_atomic( cube, met_vector[ 5 ], sev5 );
    free( sev5 );

    if ( rank == 0 ) { t6 = MPI_Wtime(); printf( "6. metric: %f\n", t6 - t5 ); }

    /* Seventh metric - 20% sparse */
    n_callpaths = CNODES * 0.2;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 6 ], bitmask );
    }
    cube_type_tau_atomic* sev6 = malloc( nthreads * n_callpaths * sizeof( cube_type_tau_atomic ) );
    // fill_cube_type_tau_atomic_row( sev6, nthreads * n_callpaths, 1 );
    create_array_of_tau_atomic( cube, met_vector[ 6 ], sev6, nthreads, n_callpaths, bitmask );
    cube_write_all_sev_rows_of_cube_type_tau_atomic( cube, met_vector[ 6 ], sev6 );
    free( sev6 );

    if ( rank == 0 ) { t7 = MPI_Wtime(); printf( "7. metric: %f\n", t7 - t6 ); }

    free( bitmask );

    free( met_vector );
    /* Delete the cube_t structure and release all memory */
    cube_free( cube );

    if ( rank == 0 )
    {
        tend = MPI_Wtime();
        printf( "Anchor file:  %f\n", tend - t7 );
        printf( "All metrics:  %f\n", t7 - t0 );
        printf( "Alltogether:  %f\n", tend - tstart );
    }

    if ( rank == 0 )
    {
        printf( "Test file %s.cubex complete.\n", cubefile );
    }
    MPI_Finalize();
    /* finish */
    return 0;
}

int get_next_random()
{
    current_random = ( current_random + 5 ) % 13;
    return current_random;
}

void set_seed_random( int s )
{
    current_random = s;
}

void create_system_tree( cube_t* cube, cube_process** process, cube_thread** thread, cube_node* node, int size )
{
    unsigned i, j, nthrds, totthreads;
    nthrds     = 1;
    totthreads = 0;
    for ( i = 0; i < size; i++ )
    {
        char proc[] = "Process";
        char thrd[] = "Thread";
        cube_process* proc1 = cube_def_proc( cube, proc, i, node );
        process[ i ] = proc1;
        for ( j = 0; j < nthrds; j++ )
        {
            // printf( "id %d, i %d, j %d, threads %d\n", totthreads, i, j, nthrds );
            cube_thread* thread1 = cube_def_thrd( cube, thrd, totthreads, process[ i ] );
            thread[ totthreads ] = thread1;
            totthreads++;
        }

        nthrds = ( nthrds == MAXNTHREADS ) ? 1 : nthrds + 1;
    }
}

void create_calltree( cube_t* cube, cube_region** regn, cube_cnode** cnodes, int* idx, int* region_idx, cube_cnode* parent, int level )
{
    if ( ( *idx ) >= CNODES )
    {
        return;
    }
    if ( level >= MAXDEPTH )
    {
        return;
    }
    int      num_children = get_next_random();
    unsigned i;
    for ( i = 0; i < num_children; i++ )
    {
        unsigned rand_reg = *region_idx;
        *region_idx = ( ( *region_idx + 1 ) % 169 );  // cyclic selection of regions
        cube_cnode* cnode = cube_def_cnode( cube, regn[ rand_reg ], parent );
        cnodes[ *idx ] = cnode;
        ( *idx )++;
        create_calltree( cube, regn, cnodes, idx, region_idx, cnode, level + 1 );
        if ( ( *idx ) >= CNODES )
        {
            return;  // the end of the array was already reached inside the recursive call
        }
    }
}

void calculate_max_depth()
{
    MAXDEPTH = ( int )( log( CNODES ) / log( 6 ) ) + 3;
}

void fill_cube_type_double_row( double* row, unsigned size, double value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        row[ i ] = value + i;
    }
}

void fill_cube_type_int_row( int64_t* row, unsigned size, int value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        row[ i ] = value + i;
    }
}

void fill_cube_type_tau_atomic_row( cube_type_tau_atomic* row, unsigned size, unsigned value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        double A = ( double )( value + i );

        int a_ = ( int )floor( A / 2. );
        int a  = ( int )ceil( A / 2. );

        row[ i ].N    = ( a_ + a + 1 );
        row[ i ].Min  = -a_;
        row[ i ].Max  = a;
        row[ i ].Sum  = ( a_ != a ) ? a : 0;
        row[ i ].Sum2 = calcaulate_sum2( -a_, a );
    }
}

int calcaulate_sum2( int start, int finish )
{
    int sum = 0;
    int i   = start;
    for ( i = start; i <= finish; i++ )
    {
        sum += i * i;
    }
    return sum;
}

char* create_bitmask( int cnodes, char* bitmask )
{
    memset( bitmask, 0x00, ( CNODES + 7 ) / 8 );
    int i = 0;
    for ( i = 0; i < cnodes; i = i + 1 )
    {
        cube_set_bit( bitmask, i );
    }
    // cube_set_known_cnodes_for_metric( cube, met, bitmask );
    return bitmask;
}

void create_array_of_doubles( cube_t* cube, cube_metric* met, double* sev, int nthreads, int ncallpaths, char* bitmask )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    if ( cube->cube_flavour == CUBE_MASTER )
    {
        carray*  sequence = cube_get_cnodes_for_metric( cube, met );
        unsigned i        = 0;
        for ( i = 0; i < sequence->size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }

    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < ncallpaths; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }

        fill_cube_type_double_row( ( double* )( sev + w * nthreads ), nthreads, idtocid[ ci ] + 1 );
        w++;
    }
    free( idtocid );
}

void create_array_of_int( cube_t* cube, cube_metric* met, int64_t* sev, int nthreads, int ncallpaths, char* bitmask )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    if ( cube->cube_flavour == CUBE_MASTER )
    {
        carray*  sequence = cube_get_cnodes_for_metric( cube, met );
        unsigned i        = 0;
        for ( i = 0; i < sequence->size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }

    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < ncallpaths; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }
        fill_cube_type_int_row( ( int64_t* )( sev + w * nthreads ), nthreads, idtocid[ ci ] + 1 );
        w++;
    }
    free( idtocid );
}

void create_array_of_tau_atomic( cube_t* cube, cube_metric* met, cube_type_tau_atomic* sev, int nthreads, int ncallpaths, char* bitmask )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    if ( cube->cube_flavour == CUBE_MASTER )
    {
        carray*  sequence = cube_get_cnodes_for_metric( cube, met );
        unsigned i        = 0;
        for ( i = 0; i < sequence->size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }

    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    unsigned ci;
    unsigned w = 0;
    for ( ci = 0; ci < ncallpaths; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }
        fill_cube_type_tau_atomic_row( ( cube_type_tau_atomic* )( sev + w * nthreads ), nthreads, idtocid[ ci ] + 1 );
        w++;
    }
    free( idtocid );
}

A.7 prototype_old.c

/****************************************************************************
**  CUBE        http://www.scalasca.org/                                   **
*****************************************************************************
**  Copyright (c) 1998-2018                                                **
**  Forschungszentrum Juelich GmbH, Juelich Supercomputing Centre          **
**                                                                         **
**  Copyright (c) 2009-2015                                                **
**  German Research School for Simulation Sciences GmbH,                   **
**  Laboratory for Parallel Programming                                    **
**                                                                         **
**  This software may be modified and distributed under the terms of       **
**  a BSD-style license.  See the COPYING file in the package base         **
**  directory for details.                                                 **
****************************************************************************/

/**
 * \file  prototype_old.c
 * \brief Example of using "libcube4w.a".  This test case produces the same output as the
 *        one from cubew_example_mpi_complex.c, but is to be used with the old library.
 *        It creates a cube report "mpi_complex_cube_old.cubex", defines the structure of
 *        metrics, call tree, machine and cartesian topology, and fills the cube with data.
 */

/* reconstructed system headers; the listing requires at least these */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <stdint.h>
#include <mpi.h>

#include "cubew_cube.h"
#include "cubew_services.h"
#include "cubew_types.h"

/* used in static array initialization; see the topology definition below */
#define NDIMS1 2
#define NDIMS2 2
#define NDIMS3 4
#define NDIMS4 14

#ifndef MAXNTHREADS
#define MAXNTHREADS 64
#endif

#ifndef NMETRICS
#define NMETRICS 7
#endif

#ifndef CNODES
#define CNODES 10000
#endif

static int current_random = 0;  /* used to create a stable, random-looking sequence of numbers (for building the call tree) */
static int MAXDEPTH       = 100;

void create_system_tree( cube_t* cube, cube_process** process, cube_thread** thread, cube_node* node, int size );

void fill_cube_type_double_row( double* row, unsigned size, double value );
void write_doubles_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                                 char* bitmask, int* counts, int* displs, int rank );

void fill_cube_type_int_row( int64_t* row, unsigned size, int value );
void write_int_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                             char* bitmask, int* counts, int* displs, int rank );

void fill_cube_type_tau_atomic_row( cube_type_tau_atomic* row, unsigned size, unsigned value );
void write_tau_atomic_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                                    char* bitmask, int* counts, int* displs, int rank );

void  create_calltree( cube_t* cube, cube_region** regn, cube_cnode** cnodes, int* idx, int* reg_idx, cube_cnode* parent, int level );
int   get_next_random();
void  set_seed_random( int s );
void  calculate_max_depth();
int   calcaulate_sum2( int start, int finish );
char* create_bitmask( int cnodes, char* bitmask );

/// Start point of this example.
int main( int argc, char* argv[] )
{
    int rank;
    int size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    if ( rank == 0 )
    {
        printf( "Number of processes: %d\n", size );
    }
    calculate_max_depth();

    char cubefile[] = "mpi_complex_cube_old";
    /* Create a cube with the name cubefile. Only the root creates a cube, using CUBE_MASTER. */

    int nthreads      = ( rank ) % ( MAXNTHREADS ) + 1;
    int thrdsincycle  = ( MAXNTHREADS ) * ( MAXNTHREADS + 1 ) / 2;
    int offset        = ( rank ) / ( MAXNTHREADS ) * thrdsincycle + ( nthreads - 1 ) * nthreads / 2;
    int totalnthreads = ( ( size / ( MAXNTHREADS ) ) * thrdsincycle )
                        + ( ( size % MAXNTHREADS ) * ( ( size % MAXNTHREADS ) + 1 ) ) / 2;

    int* counts = malloc( size * sizeof( int ) );
    MPI_Gather( &nthreads, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD );
    int* displs = malloc( size * sizeof( int ) );
    MPI_Gather( &offset, 1, MPI_INT, displs, 1, MPI_INT, 0, MPI_COMM_WORLD );

    double tstart, t0, t1, t2, t3, t4, t5, t6, t7, tend;
    if ( rank == 0 )
    {
        tstart = MPI_Wtime();
    }

    cube_t* cube;
    if ( rank == 0 )
    {
        cube = cube_create( cubefile, CUBE_MASTER, CUBE_FALSE );
        if ( !cube )
        {
            fprintf( stderr, "cube create failed!\n" );
            exit( 1 );
        }
    }

    /* Seven metrics */
    cube_metric** met_vector = ( cube_metric** )calloc( NMETRICS, sizeof( cube_metric* ) );
    carray*       sequence   = malloc( CNODES * sizeof( carray ) );

    if ( rank == 0 )
    {
        /* generate header */
        cube_def_mirror( cube, "http://www.fz-juelich.de/jsc/scalasca/" );
        cube_def_attr( cube, "description", "A simple example of Cube report" );
        cube_def_attr( cube, "experiment time", "November 1st, 2004" );

        printf( "Test file %s.cubex is being generated ...\n", cubefile );

        /* root metric Time */
        met_vector[ 0 ] = cube_def_met( cube, "Time", "time", "FLOAT", "sec", "",
                                        "@mirror@patterns-2.1.html#execution",
                                        "root node", NULL, CUBE_METRIC_INCLUSIVE );
        /* A metric of visits */
        met_vector[ 1 ] = cube_def_met( cube, "Visits", "visits", "INTEGER", "occ", "",
                                        "http://www.cs.utk.edu/usr.html",
                                        "Number of visits", NULL, CUBE_METRIC_EXCLUSIVE );
        /* Another root metric */
        met_vector[ 2 ] = cube_def_met( cube, "P2P bytes sent", "bytes_sent", "INTEGER", "bytes", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of bytes sent in point-to-point operations", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 3 ] = cube_def_met( cube, "P2P bytes received", "bytes_rcvd", "INTEGER", "bytes", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of bytes received in point-to-point operations", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 4 ] = cube_def_met( cube, "Tasks time", "time_in_tasks", "MAXDOUBLE", "sec", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Time spent in tasks.", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 5 ] = cube_def_met( cube, "Counters", "number_of_counters", "TAU_ATOMIC", "occ", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of counters", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 6 ] = cube_def_met( cube, "Memory allocation", "memoryallocation", "TAU_ATOMIC", "tau_statistics", "",
                                        "@mirror@patterns-2.1.html#memoryallocation",
                                        "Total allocated memory", NULL, CUBE_METRIC_EXCLUSIVE );

        /* define call tree */
        /* definition of regions in the call tree */
        char* mod = "/ICL/CUBE/example.c";

        cube_region** regn = ( cube_region** )calloc( 169, sizeof( cube_region* ) );

        char* names[ 169 ] = {
            "UNKNOWN", "TRACING", "MPI_Accumulate", "MPI_Allgather", "MPI_Allgatherv", "MPI_Allreduce",
            "MPI_Alltoall", "MPI_Alltoallv", "MPI_Alltoallw", "MPI_Barrier", "MPI_Bcast", "MPI_Bsend",
            "MPI_Bsend_init", "MPI_Cancel", "MPI_Cart_create", "MPI_Comm_create", "MPI_Comm_dup",
            "MPI_Comm_free", "MPI_Comm_group", "MPI_Comm_remote_group", "MPI_Comm_split", "MPI_Exscan",
            "MPI_File_close", "MPI_File_delete", "MPI_File_get_amode", "MPI_File_get_atomicity",
            "MPI_File_get_byte_offset", "MPI_File_get_group", "MPI_File_get_info", "MPI_File_get_position",
            "MPI_File_get_position_shared", "MPI_File_get_size", "MPI_File_get_type_extent",
            "MPI_File_get_view", "MPI_File_iread", "MPI_File_iread_at", "MPI_File_iread_shared",
            "MPI_File_iwrite", "MPI_File_iwrite_at", "MPI_File_iwrite_shared", "MPI_File_open",
            "MPI_File_preallocated", "MPI_File_read", "MPI_File_read_all", "MPI_File_read_all_begin",
            "MPI_File_real_all_end", "MPI_File_read_at", "MPI_File_read_at_all", "MPI_File_read_at_all_begin",
            "MPI_File_read_at_all_end", "MPI_File_read_ordered", "MPI_File_read_ordered_begin",
            "MPI_File_read_ordered_end", "MPI_File_read_shared", "MPI_File_seek", "MPI_File_seek_shared",
            "MPI_File_set_atomicity", "MPI_File_set_info", "MPI_File_set_size", "MPI_File_set_view",
            "MPI_File_sync", "MPI_file_write", "MPI_File_write_all", "MPI_File_write_all_begin",
            "MPI_File_write_all_end", "MPI_File_write_at_all", "MPI_File_write_at_all_begin",
            "MPI_File_write_at_all_end", "MPI_File_write_ordered", "MPI_File_write_ordered_begin",
            "MPI_File_write_ordered_end", "MPI_File_write_shared", "MPI_Finalize", "MPI_Gather",
            "mPI_Gatherv", "MPI_Get", "MPI_Graph_create", "MPI_Group_difference", "MPI_Group_excl",
            "MPI_Group_free", "MPI_Group_incl", "MPI_Group_intersection", "MPI_Group_range_excl",
            "MPI_Group_range_incl", "MPI_Group_union", "MPI_Ibsend", "MPI_Init", "MPI_Init_thread",
            "MPI_Intercomm_create", "MPI_Intercomm_merge", "MPI_Irecv", "MPI_Irsend", "MPI_Isend",
            "MPI_Issend", "MPI_Probe", "MPI_Put", "MPI_Recv", "MPI_Recv_init", "MPI_Reduce",
            "MPI_Reduce_scatter", "MPI_Request_free", "MPI_Rsend", "MPI_Rsend_init", "MPI_Scan",
            "MPI_Scatter", "MPI_Scatterv", "MPI_Send", "MPI_Send_init", "MPI_Sendrecv",
            "MPI_Sendrecv_replace", "MPI_Ssend", "MPI_Ssend_init", "MPI_Start", "MPI_Startall",
            "MPI_Test", "MPI_Testall", "MPI_Testany", "MPI_Testsome", "MPI_Wait", "MPI_Waitall",
            "MPI_Waitany", "MPI_Waitsome", "MPI_Win_complete", "MPI_Win_create", "MPI_Win_fence",
            "MPI_Win_free", "MPI_Win_lock", "MPI_Win_post", "MPI_Win_start", "MPI_Win_test",
            "MPI_Win_unlock", "MPI_Win_wait", "PARALLEL", "driver", "task_init", "bcast_init",
            "barriere_sync", "read_input", "bcast_real", "decomp", "inner_auto", "inner", "initialize",
            "initxs", "initsnc", "octant", "initgeom", "timers", "its=1", "source", "sweep", "rcv_real",
            "snd_real", "global_init_sum", "flux_err", "global_real_max", "its=2", "its=3", "its=4",
            "its=5", "its=6", "its=7", "its=8", "its=9", "its=10", "its=11", "its=12",
            "global_real_sum", "task_end"
        };

        unsigned i;
        for ( i = 0; i < 169; i++ )
        {
            char* descr = "";
            if ( i > 1 && i < 132 )
            {
                descr = "MPI";
            }
            if ( i == 1 || i == 132 )
            {
                descr = "EPIK";
            }
            if ( i > 132 )
            {
                descr = "USR";
            }
            regn[ i ] = cube_def_region( cube, names[ i ], names[ i ], "mpi", "barrier", -1, -1, "", descr, mod );
        }

        /* actual definition of the call tree */
        srand( 0 );
        cube_cnode** cnodes_vector = ( cube_cnode** )calloc( CNODES, sizeof( cube_cnode* ) );
        int idx        = 0;
        int region_idx = 0;
        create_calltree( cube, regn, cnodes_vector, &idx, &region_idx, NULL, 0 );

        /* define location tree */
        cube_machine* mach = cube_def_mach( cube, "MSC<>", "" );
        cube_node*    node = cube_def_node( cube, "Athena<>", mach );
        cube_process** processes_vector = ( cube_process** )calloc( size, sizeof( cube_process* ) );
        cube_thread**  threads_vector   = ( cube_thread** )calloc( size * totalnthreads, sizeof( cube_thread* ) );
        /* Define processes and threads */
        create_system_tree( cube, processes_vector, threads_vector, node, size );
        free( regn );

        cube_set_statistic_name( cube, "mystat" );

        cube_set_metrics_title( cube, "Metric tree (QMCD App )" );
        cube_set_calltree_title( cube, "Calltree (serial run, #2345)" );
        cube_set_systemtree_title( cube, "System ( Linux cluster < 43 & 23 >)" );

        /* generate header */
        cube_def_mirror( cube, "http://icl.cs.utk.edu/software/kojak/" );
        cube_def_mirror( cube, "http://www.fz-juelich.de/jsc/kojak/" );
        cube_def_attr( cube, "description", "a simple example" );
        cube_def_attr( cube, "experiment time", "November 1st, 2004" );
        cube_enable_flat_tree( cube, CUBE_FALSE );
    }

    if ( rank == 0 )
    {
        t0 = MPI_Wtime();
        printf( "Definitions: %f\n", t0 - tstart );
    }

    MPI_Barrier( MPI_COMM_WORLD );

    int   n_callpaths;
    int   bitmask_size = ( CNODES + 7 ) / 8;
    char* bitmask      = malloc( bitmask_size * sizeof( char ) );

    /* First metric */
    write_doubles_line_by_line( cube, met_vector[ 0 ], nthreads, totalnthreads, CNODES, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t1 = MPI_Wtime(); printf( "1. metric: %f\n", t1 - t0 ); }

    /* Second metric */
    write_int_line_by_line( cube, met_vector[ 1 ], nthreads, totalnthreads, CNODES, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t2 = MPI_Wtime(); printf( "2. metric: %f\n", t2 - t1 ); }

    /* Third metric - 5% sparse */
    n_callpaths = CNODES * 0.05;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 2 ], bitmask );
    }
    write_int_line_by_line( cube, met_vector[ 2 ], nthreads, totalnthreads, n_callpaths, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t3 = MPI_Wtime(); printf( "3. metric: %f\n", t3 - t2 ); }

    /* Fourth metric - 5% sparse */
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 3 ], bitmask );
    }
    write_int_line_by_line( cube, met_vector[ 3 ], nthreads, totalnthreads, n_callpaths, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t4 = MPI_Wtime(); printf( "4. metric: %f\n", t4 - t3 ); }

    /* Fifth metric - 20% sparse */
    n_callpaths = CNODES * 0.2;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 4 ], bitmask );
    }
    write_doubles_line_by_line( cube, met_vector[ 4 ], nthreads, totalnthreads, n_callpaths, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t5 = MPI_Wtime(); printf( "5. metric: %f\n", t5 - t4 ); }

    /* Sixth metric */
    write_tau_atomic_line_by_line( cube, met_vector[ 5 ], nthreads, totalnthreads, CNODES, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t6 = MPI_Wtime(); printf( "6. metric: %f\n", t6 - t5 ); }

    /* Seventh metric - 20% sparse */
    n_callpaths = CNODES * 0.2;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 6 ], bitmask );
    }
    write_tau_atomic_line_by_line( cube, met_vector[ 6 ], nthreads, totalnthreads, n_callpaths, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t7 = MPI_Wtime(); printf( "7. metric: %f\n", t7 - t6 ); }

    free( bitmask );
    free( counts );
    free( displs );

    /* Delete the cube_t structure and release all memory */
    if ( rank == 0 )
    {
        cube_free( cube );
        free( met_vector );
    }

    if ( rank == 0 )
    {
        tend = MPI_Wtime();
        printf( "Anchor file:  %f\n", tend - t7 );
        printf( "All metrics:  %f\n", t7 - t0 );
        printf( "Alltogether:  %f\n", tend - tstart );
    }

    if ( rank == 0 )
    {
        printf( "Test file %s.cubex complete.\n", cubefile );
    }
    MPI_Finalize();
    /* finish */
    return 0;
}

int get_next_random()
{
    current_random = ( current_random + 5 ) % 13;
    return current_random;
}

void set_seed_random( int s )
{
    current_random = s;
}

void create_system_tree( cube_t* cube, cube_process** process, cube_thread** thread, cube_node* node, int size )
{
    unsigned i, j, nthrds, totthreads;
    nthrds     = 1;
    totthreads = 0;
    for ( i = 0; i < size; i++ )
    {
        char proc[] = "Process";
        char thrd[] = "Thread";
        cube_process* proc1 = cube_def_proc( cube, proc, i, node );
        process[ i ] = proc1;
        for ( j = 0; j < nthrds; j++ )
        {
            cube_thread* thread1 = cube_def_thrd( cube, thrd, totthreads, process[ i ] );
            thread[ totthreads ] = thread1;
            totthreads++;
        }

        nthrds = ( nthrds == MAXNTHREADS ) ? 1 : nthrds + 1;
    }
}

void create_calltree( cube_t* cube, cube_region** regn, cube_cnode** cnodes, int* idx, int* region_idx, cube_cnode* parent, int level )
{
    if ( ( *idx ) >= CNODES )
    {
        return;
    }
    if ( level >= MAXDEPTH )
    {
        return;
    }
    int      num_children = get_next_random();
    unsigned i;
    for ( i = 0; i < num_children; i++ )
    {
        unsigned rand_reg = *region_idx;
        *region_idx = ( ( *region_idx + 1 ) % 169 );  // cyclic selection of regions
        cube_cnode* cnode = cube_def_cnode( cube, regn[ rand_reg ], parent );
        cnodes[ *idx ] = cnode;
        ( *idx )++;
        create_calltree( cube, regn, cnodes, idx, region_idx, cnode, level + 1 );
        if ( ( *idx ) >= CNODES )
        {
            return;  // the end of the array was already reached inside the recursive call
        }
    }
}

void calculate_max_depth()
{
    MAXDEPTH = ( int )( log( CNODES ) / log( 6 ) ) + 3;
}

void fill_cube_type_double_row( double* row, unsigned size, double value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        row[ i ] = value + i;
    }
}

void fill_cube_type_int_row( int64_t* row, unsigned size, int value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        row[ i ] = value + i;
    }
}

void fill_cube_type_tau_atomic_row( cube_type_tau_atomic* row, unsigned size, unsigned value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        double A = ( double )( value + i );

        int a_ = ( int )floor( A / 2. );
        int a  = ( int )ceil( A / 2. );

        row[ i ].N    = ( a_ + a + 1 );
        row[ i ].Min  = -a_;
        row[ i ].Max  = a;
        row[ i ].Sum  = ( a_ != a ) ? a : 0;
        row[ i ].Sum2 = calcaulate_sum2( -a_, a );
    }
}

int calcaulate_sum2( int start, int finish )
{
    int sum = 0;
    int i   = start;
    for ( i = start; i <= finish; i++ )
    {
        sum += i * i;
    }
    return sum;
}

char* create_bitmask( int cnodes, char* bitmask )
{
    memset( bitmask, 0x00, ( CNODES + 7 ) / 8 );
    int i = 0;
    for ( i = 0; i < cnodes; i = i + 1 )
    {
        cube_set_bit( bitmask, i );
    }
    // cube_set_known_cnodes_for_metric( cube, met, bitmask );
    return bitmask;
}

void write_doubles_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                                 char* bitmask, int* counts, int* displs, int rank )
{
    carray*   sequence;
    unsigned* idtocid = calloc( CNODES, sizeof( unsigned ) );
    int       size;
    if ( rank == 0 )
    {
        sequence = cube_get_cnodes_for_metric( cube, met );
        size     = sequence->size;
        unsigned i = 0;
        for ( i = 0; i < size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }
    MPI_Bcast( &size, 1, MPI_INT, 0, MPI_COMM_WORLD );
    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    double*  local  = malloc( nthreads * sizeof( double ) );
    double*  global = malloc( totnthreads * sizeof( double ) );
    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < size; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }

        fill_cube_type_double_row( local, nthreads, idtocid[ ci ] + 1 );
        MPI_Gatherv( local, nthreads, MPI_DOUBLE, global, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD );
        if ( rank == 0 )
        {
            cube_write_sev_row_of_doubles( cube, met, ( cube_cnode* )( sequence->data[ ci ] ), global );
        }

        w++;
    }
    // free( sequence );
    free( idtocid );
    free( local );
    free( global );
}

void write_int_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                             char* bitmask, int* counts, int* displs, int rank )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    carray*   sequence;
    int       size;
    if ( rank == 0 )
    {
        sequence = cube_get_cnodes_for_metric( cube, met );
        size     = sequence->size;
        unsigned i = 0;
        for ( i = 0; i < size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }
    MPI_Bcast( &size, 1, MPI_INT, 0, MPI_COMM_WORLD );
    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    int64_t* local  = malloc( nthreads * sizeof( int64_t ) );
    int64_t* global = malloc( totnthreads * sizeof( int64_t ) );
    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < size; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }
        fill_cube_type_int_row( local, nthreads, idtocid[ ci ] + 1 );
        MPI_Gatherv( local, nthreads, MPI_INT64_T, global, counts, displs, MPI_INT64_T, 0, MPI_COMM_WORLD );
        if ( rank == 0 )
        {
            cube_write_sev_row_of_int64( cube, met, ( cube_cnode* )( sequence->data[ ci ] ), global );
        }

        w++;
    }
    free( idtocid );
    // free( sequence );
    free( local );
    free( global );
}

void write_tau_atomic_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                                    char* bitmask, int* counts, int* displs, int rank )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    carray*   sequence;
    int       size;
    if ( rank == 0 )
    {
        sequence = cube_get_cnodes_for_metric( cube, met );
        size     = sequence->size;
        unsigned i = 0;
        for ( i = 0; i < size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }
    MPI_Bcast( &size, 1, MPI_INT, 0, MPI_COMM_WORLD );
    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    cube_type_tau_atomic* local  = malloc( nthreads * sizeof( cube_type_tau_atomic ) );
    cube_type_tau_atomic* global = malloc( totnthreads * sizeof( cube_type_tau_atomic ) );

    MPI_Datatype tau_atomic;
    MPI_Type_contiguous( sizeof( cube_type_tau_atomic ), MPI_BYTE, &tau_atomic );
    MPI_Type_commit( &tau_atomic );

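    /* Editor's note (illustration, not part of the original prototype): cube_type_tau_atomic
     * is a C struct, so a matching MPI datatype is needed for the MPI_Gatherv below.
     * Treating the struct as sizeof( cube_type_tau_atomic ) contiguous bytes is sufficient
     * here because sender and receiver use the identical struct layout on the same machine. */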
    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < size; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }

        fill_cube_type_tau_atomic_row( local, nthreads, idtocid[ ci ] + 1 );
        MPI_Gatherv( local, nthreads, tau_atomic, global, counts, displs, tau_atomic, 0, MPI_COMM_WORLD );
        if ( rank == 0 )
        {
            cube_write_sev_row_of_cube_type_tau_atomic( cube, met, ( cube_cnode* )( sequence->data[ ci ] ), global );
        }

        w++;
    }
    free( idtocid );
    // free( sequence );
    free( local );
    free( global );
}
