Parallel MPI I/O in Cube: Design & Implementation

Bine Brank

A master's thesis presented for the degree of M.Sc. Computer Simulation in Science

Supervisors: Prof. Dr. Norbert Eicker, Dr. Pavel Saviankou

Bergische Universität Wuppertal in cooperation with Forschungszentrum Jülich

September 2018

Declaration

I affirm that I have written this thesis independently, that I have used no sources or aids other than those indicated, and that all quotations are marked as such. By submitting this thesis I acknowledge that it may be inspected by third parties and quoted in compliance with copyright principles. I further agree that the department may hand the thesis over to third parties for inspection.

Wuppertal, 27.8.2017 Bine Brank

Acknowledgements

Foremost, I would like to express my deepest and sincere gratitude to Dr. Pavel Saviankou. Not only did he introduce me to this interesting topic, but his way of guiding and supporting me was beyond anything I could ever have hoped for. Always happy to discuss ideas and answer any of my questions, he has truly set an example of excellence as a researcher, mentor and friend.

In addition, I would like to thank Prof. Dr. Norbert Eicker for agreeing to supervise my thesis. I am very thankful for all the remarks and corrections that he provided.

I would also like to thank Ilya Zhukov for helping me with the correct installation and configuration of the CP2K software.

Contents

1 Introduction

2 HPC ecosystem
  2.1 Origins of HPC
  2.2 Parallel programming
  2.3 Automatic performance analysis
  2.4 Tools
    2.4.1 Score-P
  2.5 Performance space
    2.5.1 Metrics
    2.5.2 Call paths
    2.5.3 Locations
    2.5.4 Severity values

3 Input/Output in HPC
  3.1 MPI derived datatypes
  3.2 MPI I/O
  3.3 Performance of MPI I/O
  3.4 I/O challenges in HPC community

4 Cube
  4.1 Cube libraries
  4.2 Tar archive
    4.2.1 Tar header
    4.2.2 Tar file
  4.3 Cube4 file format
    4.3.1 Metric data file
    4.3.2 Metric index file
    4.3.3 Anchor file
  4.4 CubeWriter
    4.4.1 Usage
    4.4.2 Library architecture
    4.4.3 How Score-P uses CubeWriter library

5 New writing algorithm
  5.1 Metric data file
    5.1.1 File partition
    5.1.2 MPI steps
  5.2 Metric index file
  5.3 Anchor file

6 Implementation
  6.1 New library architecture
  6.2 Reconfiguring Score-P

7 Results and discussion
  7.1 System
  7.2 Performance measurement with a prototype
    7.2.1 Prototype
    7.2.2 Results
  7.3 Performance measurement with CP2K
  7.4 Discussion

8 Conclusion

References

A Appendix - source code
  A.1 cubew_cube.c
  A.2 cubew_metric.c
  A.3 cubew_parallel_metric_writer.c
  A.4 cubew_tar_writing.c
  A.5 scorep_profile_cube4_writer.c
  A.6 prototype_new.c
  A.7 prototype_old.c

List of Figures

1 Memory architectures
2 Call tree
3 System tree
4 Filetype
5 File partitioning
6 Cube libraries
7 CubeGUI
8 Layout of TAR archive
9 Sequence of files in Cube4 tar archive
10 Inclusive vs. exclusive values
11 Tree ordering
12 Structure of metric data file
13 Simplified UML diagram of CubeWriter
14 Internal steps of CubeWriter
15 Score-P sequence diagram
16 File partition
17 New algorithm
18 Filetypes of processes
19 Internal steps of the new CubeWriter library
20 New Score-P sequence diagram
21 Writing of different files in tar archive
22 Execution time
23 Overall prototype writing speed
24 Writing speed of dense metrics
25 Overall writing time for different call tree sizes
26 Writing time for H2O-64 benchmark
27 Overall writing speed for H2O-64 benchmark

List of Tables

1 Data access routines
2 Write performance of MPI
3 Call path order
4 Metrics in the prototype

1 Introduction

The need for analysis of complex parallel codes in high performance computing has led to a growing number of different performance analysis tools. These tools help the user to write parallel code that is efficient and does not use more computing resources than absolutely necessary. The user is able to measure how his application is behaving and thus gain insight into the problems and bottlenecks it might have.

Measuring the performance of a large-scale application produces a lot of information. This happens because such large applications can be executed on millions of computing cores and one is provided with the measurement data for each of them. Writing all this information into a file can therefore be a very slow process.

This thesis revolves around the work of redesigning and rewriting the CubeWriter library, a part of the Cube software for writing an application's performance report. We propose a new parallel approach, where each process writes its own measurement data synchronously.

The rest of this thesis is organized as follows. Chapter 2 introduces the reader to high performance computing. We describe how the analysis of complex parallel codes led to the development of automatic performance analysis tools. After that, a brief overview of Score-P is given and a description of how measuring an application's performance forms a three-dimensional performance space. Chapter 3 deals with parallel input/output operations in HPC, with a main focus on the I/O part of the Message Passing Interface (MPI). In chapter 4, we go into the details of Cube, which is the main topic of this thesis. We briefly explain the Cube libraries, before giving an explanation of the Cube4 file format, which is a tar archive. We conclude the chapter by explaining how this file is written by the CubeWriter library. We then describe how the library could be rewritten to include parallel writing. This new writing algorithm and its implementation are described in chapters 5 and 6. In chapter 7, we take a look at the results and show how the new CubeWriter library gives a much better performance than the old version. In the last chapter, we conclude the work and give some ideas for future development.

2 HPC ecosystem

2.1 Origins of HPC

HPC or High Performance Computing (sometimes also Supercomputing) is a term used for aggregating computer power to achieve a much higher performance than that of standard desktop computers of their time. Such high performance machines are usually used to solve large linear algebra problems that arise from science and engineering.

Origins of HPC date back to the 1950s, but in the mid-1970s a first big breakthrough occurred with the production of the Cray 1 [1]. Produced in 1975, it is commonly regarded as the first successful supercomputer. At that time, the technology of the advanced devices was based on vector processing. A vector processor is a processing unit in which a set of instructions operates on a vector of data and not on a single data item.

In the 1980s and early 1990s, computers based on vector processors were beginning to be overtaken by massively parallel machines. In such machines, many processors work together on different parts of a larger problem. Contrary to vector-based systems, which were designed to run instructions on a vector of data as quickly as possible, a massively parallel computer instead separates parts of the problem onto entirely different processors and then recombines the results. The research took a major shift towards such machines. New high-speed networks, the availability of low-cost processors and software for distributed computing led to the development of computer clusters, sets of tightly connected computers that work together. The first such system was a Beowulf cluster, built in 1994 by NASA [2]. Today, such computer architecture is prevailing in both industry and the academic community.

In the beginning of the 21st century, another big step occurred with the development of multi-core processors, which are able to run multiple instructions on separate cores at the same time. Almost all computers on the market today are equipped with such processors. The early 2000s were also important for the development of general purpose graphics processing units (GPGPU). With extensive research from the gaming industry, the technology of GPGPUs has advanced and was found to fit scientific computing as well. Nowadays, graphics processing units form a major part in some of the world's fastest supercomputers.

The performance of supercomputers is measured in floating-point operations per second (flops). A floating point operation is a basic arithmetic operation in double precision floating point calculations. The double precision format is a computer number format that occupies 64 bits in computer memory. A common way to measure performance is by the LINPACK benchmark, which solves a dense system of linear equations. The fastest supercomputer currently is Summit at the Department of Energy's Oak Ridge National Laboratory in the USA [3]. Based mostly on NVIDIA Tesla V100 GPUs, Summit reached a maximum speed of 122.3 · 10^15 flops on

the LINPACK benchmark. Many powerful supercomputers around the world are already in the petaflops (10^15 flops) region and the research on new hardware technologies is causing a never-ending growth of computing power. Current research to achieve higher speeds is known as the exascale computing project, which plans to achieve 1 exaflop (10^18 flops) in the near future.

2.2 Parallel programming

Traditionally, computer software has been written for serial computation. Instructions that make up an algorithm are executed in a defined order, only one at a time. When one instruction is finished, the next instruction begins. From a software engineer's point of view, this refers to code being run sequentially (line by line), no two lines being executed at the same time. But with the introduction of the first parallel machines, one should be able to write code that will be executed in a parallel manner. This is known as parallel programming. Many classifications of parallel programming models exist, based on process interaction and problem decomposition. An important aspect is that of memory architecture.

There are two basic models: shared memory and distributed memory architecture. If many processing units can directly access the same location in memory, we are talking about shared memory parallel programming. Typical examples of this paradigm are early symmetric multiprocessing machines, where two or more identical processors are connected to one main memory. Although personal computers nowadays have a memory for each core in a multi-core processor, their smart non-uniform memory access makes them behave like a shared memory architecture. Such architectures must also provide cache coherence, which ensures that the data stored in local caches is updated every time the data from the memory is accessed. If, on the other hand, each processor has its own memory, we are talking about distributed memory. Both architectures are shown in figure 1.

Figure 1: Memory architectures Figure 1 shows an example of two different memory architectures. Each ’P’ block corresponds to one processing unit and ’M’ block to one memory unit.

MPI or the Message Passing Interface has become a de facto standard as a programming paradigm for the distributed memory architecture. It provides widely used library routines that enable both point-to-point and collective communication between different processes running on the same system. By system, we refer to the cluster of machines which are physically connected together.

OpenMP, on the other hand, is the most widely used interface for shared memory architectures. A commonly associated term is "multi-threading", because certain tasks are divided between different threads, which work in a parallel manner. Today's supercomputers are built of many computing nodes, each node having many processing units that share the same memory. Therefore, it is not uncommon to have a hybrid model of a parallel application that uses both MPI and OpenMP. OpenMP is then used for parallelization within a multi-core node, and MPI for parallelization between the nodes. All the work described in this thesis is based on the MPI and OpenMP interfaces.

2.3 Automatic performance analysis

Apart from general bugs that can arise in software engineering, a new set of problems appears in parallel applications. These are caused primarily by the concurrent execution of parallel algorithms. Since we do not have a typical line-by-line execution of the code, we can say that each process/thread performs its own set of instructions. The programmer needs to ensure that these processes work synchronously in an efficient manner. Most of the problems can be put into two groups: communication related and program structure related [4].

Communication related problems are caused by the communication between different processes. For example, we take a look at the most simple MPI routines, MPI_Send() and MPI_Recv(). The process that executes MPI_Send() (process 0) sends data to process 1. Afterwards, process 1 executes a call to MPI_Recv(), to receive and store the sent content in its memory. If process 1 tries to receive the data before process 0 sends it, it cannot do so. It will enter the idle state and wait until process 0 sends the data. Such problems are very common in the HPC community, and can have a severe impact on the efficiency of the application.

Structure related problems are caused by the structure of our application. Parallel programs are most efficient if the workload is evenly distributed among the processes. In real cases, there is always a part of the problem that cannot be parallelized. In this case, Amdahl's law tells us the highest possible speedup, and one should be able to write efficient code that in practice gets close to it. An example of a structure related problem can be easily observed with a call to MPI_Barrier(), which synchronizes processes to the same point in code. If one process takes a very long time compared to the others to reach that call, all other processes will have to enter the idle state and wait within MPI_Barrier().

Because the HPC community deals with good performance and high efficiency on one side, but increasingly complex hardware and software on the other, there is a necessity for analyzing and optimizing complicated parallel behavior. It forces the programmer, apart from designing the application, to analyze its performance. A normal debugger is not useful to detect the above mentioned problems. Consequently, the need for such analysis has led to

the development of tools for automatic performance analysis of parallel programs. These tools have become mandatory for the development of massively parallel applications.
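To make the late-receiver situation from the communication example above concrete, the following minimal MPI sketch shows rank 1 idling inside MPI_Recv() until rank 0 finally sends; the artificial sleep() and the two ranks are illustrative assumptions, not taken from a real measurement.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        sleep(5);   /* rank 0 is busy with other work (load imbalance) */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* posted early: rank 1 sits idle here until rank 0 sends the data */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

An automatic analysis tool would attribute the time rank 1 spends blocked in MPI_Recv() to a waiting-time metric rather than to useful computation.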

2.4 Tools

Currently, there exist many tools for performance analysis and optimization of parallel applications. They form an important contribution to the parallel programming community, so let us name a few:

HPCToolkit [5] is a suite of tools that supports measurement, analysis and attribution of application performance for both sequential and parallel programs.

Periscope [6] is a distributed performance analysis tool that can identify a wide range of performance bottlenecks in parallel applications.

Paraver [7] is a flexible tracing and visualization tool, which helps to get a qualitative global picture of the application behavior by visual inspection.

Dimemas [8] is a performance prediction tool for MPI applications that helps the user to perform a 'what if' analysis.

Scalasca [9] performs post-mortem analysis of event traces and automatically detects performance critical situations.

Score-P [10] is a joint measurement infrastructure, jointly developed by some of the leading HPC performance tools groups (Periscope, Scalasca, TAU and Vampir). Because the work in this thesis involves parts of Score-P, we explain it in more detail.

2.4.1 Score-P

The different tools listed above provide different aspects of features for analysis, so it is worthwhile for users to use them in combination. From the user's perspective, this leads to multiple learning curves, different options for equivalent features and redundant effort for installations and updates. At the same time, the tools also share many similarities and overlapping functions. The main goal of Score-P is to provide a joint infrastructure to address these problems. This way, the community has a single platform for performance measurements that supports a rich collection of analysis tools.

Score-P (Scalable Performance Measurement Infrastructure for Parallel Codes) consists of the instrumentation framework and several run-time libraries. It allows users to insert measurement probes into C/C++ and Fortran codes. These measurement probes collect performance-related data like the time of execution, visits or hardware counters. To collect such data, the application needs to be linked against one of the provided run-time libraries. Score-P is used together with regular compilers like gcc, g++ or gfortran. The prefix 'scorep' is added to the usual compile and link commands. The instrumenter then detects the programming paradigm used and adds the appropriate compiler and linker flags, before building the application. Once the application is instrumented, the user just needs to run it to collect the measured

data. Different variants are configurable by options to the 'scorep' command. Score-P includes a wide range of features like profiling and event tracing, direct instrumentation, adaptive measurement, hardware counters and many more. It also has an efficient memory management system that stores data in thread-local chunks of pre-allocated memory. This results in an accurate and non-invasive measurement, where the run-time is increased by no more than 4 percent (usually much less) [10]. Score-P can write performance data in many different formats. In this thesis, we focus on the Cube4 file format, which Score-P uses by default. The Cube4 file format together with the Cube software is explained in more detail in chapter 4.

2.5 Performance space

The performance of an application is described along three dimensions: metrics, call paths and locations. These three dimensions form a performance space. Its size depends on the measurement system, the application, and the computer on which the application was executed. We give a more detailed explanation of the three dimensions, as they are necessary for the further discussion.

2.5.1 Metrics

The first dimension is a set of performance properties called metrics, which give an answer to the question of what is being measured. They provide different insights into the analyzed application. Without extra configuration, Score-P provides data in 6 metrics. This can be extended by enabling different performance counters or metric plugins that Score-P includes. The most important metrics are the Time and Visits metrics. The Time metric measures the time of the execution. By examining times for different threads or processes, we can detect an imbalance in the workload. To account for clock differences of processor-local clocks, the measuring system uses a linear offset interpolation of the timestamps. The Visits metric measures how many times a certain function has been called in the application. Functions which are frequently visited usually have a higher impact on the performance of the application. These two metrics are the most basic and could also be used for the analysis of serial applications. But there are many more, which are specifically used for the analysis of parallel applications. They are typically related to the MPI or OpenMP interfaces. The MPI Time metric measures time spent in MPI calls. The OpenMP Time measures time spent in OpenMP calls and code generated by the OpenMP compiler, which includes thread management and synchronization activities. The MPI Communication Time gives the time spent in MPI communication calls. This includes point-to-point, collective and one-sided communication. The Wait at Barrier metric measures how much time processes spent in MPI_Barrier(). The Late sender metric measures the time a receiving process is waiting for a message in the routine MPI_Recv(). The number of received bytes in MPI point-to-point communications is measured in the P2P bytes received metric. These are

only a few of the most important metrics. For example, Scalasca defines more than 60 metrics; for more information about them, one should refer to the Scalasca documentation [11]. Which metric is more useful for the user depends on the application and what kind of problems are to be expected. The metric dimension is organized in an inclusion hierarchy, where a metric at a lower level is a subset of its parent. For example, MPI time and OpenMP time are subsets of the Execution time metric, which is a subset of the Time metric. For future reference, let us call the number of all metrics provided by the measurement system Nmetrics.

2.5.2 Call paths

The second dimension is the set of call paths (also call nodes). Each call path corresponds to a function call in the code. Metrics are always measured on call paths. The Time metric therefore not only measures the total run time of the program, but also how much time was spent in each invoked function in our code. All call paths together form a call tree. The call tree is a tree data structure. The parent-child relation is defined by the location where a call to a function occurred in the code. If a certain function invokes other functions, then all of those are its children. The root node in a call tree is therefore always associated with the function main() in the C programming language (or the equivalent function in Fortran, where the program execution begins). Leaf nodes correspond to functions that do not call any other functions. An example is shown in figure 2.

Figure 2: Call tree We see an example of how structure of the code forms a call tree. On the left side, we have a pseudo code of a simple program and on the right its corresponding call tree. Each node in the call tree represents one call path. The call node main() has four children, foo() two, and bar() and barbar() only one child.
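One possible C program whose call tree matches the structure described in Figure 2 (and listed later in Table 3) is sketched below; the function bodies are hypothetical placeholders, only the call structure matters.

#include <mpi.h>

/* main() has four children (MPI_Init, foo, bar, MPI_Finalize),
 * foo() has two (foobar, barfoo), bar() has one (barbar),
 * and barbar() has one (MPI_Barrier).                          */

static void foobar(void) { /* leaf call path */ }
static void barfoo(void) { /* leaf call path */ }

static void foo(void)
{
    foobar();
    barfoo();
}

static void barbar(void)
{
    MPI_Barrier(MPI_COMM_WORLD);   /* leaf call path */
}

static void bar(void)
{
    barbar();
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    foo();
    bar();
    MPI_Finalize();
    return 0;
}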

The size of the second dimension is therefore related to the complexity and size of the code. Highly recursive functions can have a severe impact on the depth of the call tree. Let us denote the number of all call paths as Ncallpaths.

2.5.3 Locations

Because the application runs in parallel, a call path itself will not be sufficient to recognize a particular problem if the application is running on different processors. Hence we define the third dimension. The third dimension is the system location and corresponds to a thread or process on which the code ran. We distinguish different possibilities. If the application is parallelized with just the MPI interface, locations will correspond to different MPI processes. If the application uses just the OpenMP interface, locations are different OpenMP threads. In this thesis, we always assume a more general hybrid application that uses both interfaces. Each MPI process spawns OpenMP threads, where the number of threads per process does not need to be the same for all MPI processes. In such a model, locations also correspond to different threads, which are now spawned by different MPI processes. The number of locations is equal to the total number of all threads:

N_{locations} = \sum_{i=1}^{N_p} N_{t_i}

where N_p is the number of MPI processes and N_{t_i} the number of OpenMP threads spawned by the i-th process. If the application does not use OpenMP, we have

N_{locations} = N_p

Similarly, if the application uses just OpenMP and not MPI, the number of locations equals the number of threads spawned in OpenMP. In the case of today's supercomputers, each MPI process is also linked to the compute node on which it was initialized, and all compute nodes are linked to the machine. This structure forms a system tree. A simple example is presented in figure 3.

Figure 3: System tree Example of a system tree, where the machine consists of two compute nodes. The application runs with 4 MPI processes (2 per node), each spawning 2 OpenMP threads. We have a total of 8 locations.

It is equivalent to say that the number of locations is the number of leaf nodes in a system tree. Locations are ordered by MPI ranks. OpenMP threads from rank 0 are followed by those from rank 1 and so on. Because modern HPC systems might have up to a million processors, the third dimension can in practice be very large. As we will later see, the system tree and the call tree are the main reason for big files and thus the need for the improvement of the profile writer.

2.5.4 Severity values

Each coordinate in this three-dimensional space is mapped onto a numeric value called the severity value, which is a result of the measurement. The severity value represents the metric's severity for a corresponding call path at a given system location. Another way of looking at this is to ask ourselves what, when and where we measure in the application. 'What' then refers to the performance metric, 'when' refers to a call path and 'where' to a system location. Let us take an example to better understand how a particular value in the performance space could mean a potential performance problem in our application. We are looking at the metric Wait at barrier at a particular call path. We notice that the value on one process is much smaller than the values on other ranks. This means that all other processes spent a lot more time in MPI_Barrier(), and were all waiting for one process. This is usually bad practice, because all other processes could in this time already be computing other things. The value that represents severity can be of different types, depending on the metric. All metrics that measure time produce results in a floating-point 'double'. How many times a function has been visited can only be an integer, so the Visits metric produces results of type 'unsigned int'. The Cube software also gives the option of metrics producing more complex types, like complex numbers, rationals, or the 'tau atomic' type. The Score-P measurement system produces

metric values in three types: double, unsigned long long (uint64_t), and the tau atomic type. Let us take a closer look at tau atomic. It is composed of 5 different values with a total size of 36 bytes.

struct tau_atomic {
    uint32_t N;             /* byte offset 0  */
    uint64_t min;           /* byte offset 4  */
    uint64_t max;           /* byte offset 12 */
    uint64_t sum;           /* byte offset 20 */
    uint64_t sumsquared;    /* byte offset 28 */
};

Tau atomic serves for the easier computation of statistics. N stands for the number of visits, min and max for the minimal and maximal values, sum and sumsquared for the sum of values and the sum of squared values. Altogether we have a total of Nmetrics · Ncallpaths · Nlocations severity values in the performance space. But because some metrics measure data only on particular call paths, many of the severity values are equal to zero. This is explained in more detail in chapter 4.3.1.
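Because tau_atomic already aggregates the number of visits, the sum and the sum of squares, simple statistics can be derived from it directly. A small sketch follows; the helper functions are ours, not part of Cube, and on disk the fields are packed at the byte offsets listed above.

#include <stdint.h>

struct tau_atomic {        /* in-memory mirror of the 36-byte record above */
    uint32_t N;
    uint64_t min;
    uint64_t max;
    uint64_t sum;
    uint64_t sumsquared;
};

/* arithmetic mean of the recorded values */
static double tau_mean(const struct tau_atomic *t)
{
    return t->N ? (double)t->sum / t->N : 0.0;
}

/* population variance, derived from the sum and the sum of squares */
static double tau_variance(const struct tau_atomic *t)
{
    if (t->N == 0)
        return 0.0;
    double mean = (double)t->sum / t->N;
    return (double)t->sumsquared / t->N - mean * mean;
}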

3 Input/Output in HPC

Parallel input/output (I/O) operations, where many processors write data to a single file at the same time, can technically be achieved within the POSIX interface, but its implementation is typically very inefficient. For high performance I/O, the system needs to provide a high-level interface which supports partitioning of the file data among processes and complete transfers of global data structures between the process memory and the file. MPI [12], being the most widely used interface for parallel programming, supports such file access. All POSIX I/O operations have their replacement functions in the MPI interface. Partitioning of the file among processes is implemented through derived datatypes. Because they are essential for the understanding of this thesis, we describe them in more detail.

3.1 MPI derived datatypes

MPI predefined datatypes (MPI_INT, MPI_DOUBLE, ...), which correspond to the basic C or Fortran data types, are often too constraining. Let us take a look at a simple example. We have an array of data which is composed of alternating characters and integers: (char, int, char, int, ...). We would like to use the function MPI_Send() to send such a structure from one process to another using just predefined datatypes. The first option would be to copy all values into two separate arrays: one of just characters and one of just integers. We would then send both arrays with two calls of MPI_Send(). The process which receives the data would again have to transform the two arrays into one, to obtain the original sequence in memory. This option is slow due to the copying of both arrays. The second option would be to call MPI_Send() for each character and each integer separately. This gets rid of copying, but it uses many MPI calls. In particular, MPI is much more efficient in communicating a small number of large data structures rather than a large number of smaller data structures. A large number of separate calls leads to a significantly bigger communication overhead.

MPI therefore provides mechanisms for more general, mixed and even noncontiguous communication buffers. In a noncontiguous buffer, data is not stored in memory in a single chunk, but is distributed across distant memory locations. It allows one to directly transfer objects of various shapes and sizes. Such general objects are called derived datatypes. They are an extension of the predefined datatypes, similar to structs with stride in the C programming language. MPI offers many constructors to build derived datatypes. Before we look at them, we familiarize ourselves with some definitions.

A derived datatype is an opaque object that requires two things: a sequence of basic datatypes (also called the type signature of the datatype) and a sequence of integer (byte) displacements. A pair of such sequences is called a type map.

Example of a derived datatype:

type signature = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR, MPI_DOUBLE, MPI_CHAR, MPI_DOUBLE}
displacements  = {0, 1, 9, 10, 18, 19}
type map = {(MPI_CHAR, 0), (MPI_DOUBLE, 1), (MPI_CHAR, 9), (MPI_DOUBLE, 10), (MPI_CHAR, 18), (MPI_DOUBLE, 19)}

Such a type map together with a base address in memory specifies a communication buffer. It consists of n entries, where the i-th entry is at the address base_address + displacement(i). Basic predefined datatypes are just a particular case of general datatypes where the type signature consists of one item with its displacement equal to zero. Thus, MPI_DOUBLE is a predefined handle to a datatype with type map {(MPI_DOUBLE, 0)}. The extent of the datatype is defined to be the difference from the first to the last byte occupied by the entries in this datatype, rounded up to satisfy memory alignment requirements. A datatype's lower bound is the minimal byte displacement. Its upper bound is the displacement up to which the datatype has been aligned. Therefore extent = upper bound − lower bound.

We take a look at two examples of derived datatype constructors, which we will later use:

MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype* newtype);

MPI_Type_contiguous(...) is the simplest datatype constructor. It takes a handle to the oldtype and makes count copies at contiguous places. It stores the new datatype in the handle newtype. The old datatype does not need to be a predefined datatype; it can also be a derived datatype. If oldtype were a derived datatype with type map {(MPI_CHAR, 0), (MPI_DOUBLE, 1)} and extent 9, a call to MPI_Type_contiguous(3, oldtype, &newtype) would create a new derived datatype with a type map the same as in the above example.

MPI_Type_create_struct(int count,
                       int array_of_blocklengths[],
                       MPI_Aint array_of_displacements[],
                       MPI_Datatype array_of_types[],
                       MPI_Datatype* newtype);

MPI_Type_create_struct() is the most general datatype constructor. It creates count blocks, where the i-th block is located at array_of_displacements[i] and contains array_of_blocklengths[i] datatypes of type array_of_types[i].

There are many more constructors and for more information, one should refer to the MPI documentation. An important aspect of derived datatypes is the ability to construct noncontiguous buffers. By this we mean a buffer that includes memory locations not associated with the actual data in memory. In other words, the derived datatype has 'holes' (gaps) at particular displacements. This can be easily achieved with MPI_Type_create_struct(), if the value array_of_displacements[i] does not coincide with the location where block i − 1 ended. This would cause holes between different blocks. Another common way is to explicitly define lower and upper bounds that do not coincide with the end of data in the type signature. This is done with a call to MPI_Type_create_resized(). This function takes the oldtype, changes its lower bound and extent, and puts the new datatype in the handle newtype.

MPI_Type_create_resized(MPI_Datatype oldtype, MPI_Aint lower_bound, MPI_Aint extent, MPI_Datatype* newtype);

After constructing the datatype we want, we can use its handle as an argument in all point-to-point and collective communication calls, as well as in input/output operations.
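As a sketch, the example datatype from the beginning of this section (the type map with displacements {0, 1, 9, 10, 18, 19}) can be built with the three constructors just described; the function name is ours, the MPI calls are standard.

#include <mpi.h>

/* Build the example datatype from section 3.1: a (char, double) pair
 * resized to an extent of 9 bytes, then repeated three times.        */
MPI_Datatype build_example_type(void)
{
    MPI_Datatype pair, pair9, newtype;

    int          blocklengths[2]  = { 1, 1 };
    MPI_Aint     displacements[2] = { 0, 1 };
    MPI_Datatype types[2]         = { MPI_CHAR, MPI_DOUBLE };

    /* type map {(MPI_CHAR, 0), (MPI_DOUBLE, 1)} */
    MPI_Type_create_struct(2, blocklengths, displacements, types, &pair);

    /* force lower bound 0 and extent 9 so copies start every 9 bytes */
    MPI_Type_create_resized(pair, 0, 9, &pair9);

    /* three contiguous copies -> displacements {0, 1, 9, 10, 18, 19} */
    MPI_Type_contiguous(3, pair9, &newtype);
    MPI_Type_commit(&newtype);

    MPI_Type_free(&pair);
    MPI_Type_free(&pair9);
    return newtype;
}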

3.2 MPI I/O

MPI has replacement functions for POSIX I/O. Before looking at them, we need to familiarize ourselves with a few terms and explain what they mean.

An MPI file is an ordered collection of typed data items and is opened collectively by a group of processes. All collective calls on this file are executed over this group. An MPI file is a replacement for a file in the POSIX sense.

A file displacement is an absolute byte position relative to the beginning of a file.

An etype (elementary datatype) is the unit of data access and positioning. All data access is performed in etype units. An etype can be either a predefined or a derived datatype.

A filetype is the basis for partitioning a file among processes and defines a template for accessing the file. Like the etype, the filetype is also either a predefined or a derived datatype. The filetype repeated in contiguous locations constructs the view that a process has. An example is seen in figure 4.

Figure 4: Filetype A simple example of how a filetype forms a view of the file.

A view defines the current set of data visible and accessible from an open file as an ordered set of etypes. Each process has its own view of the file, defined by three quantities: a displacement, an etype, and a filetype. The pattern described by the filetype is repeated, beginning at the displacement. If the view is not explicitly defined, MPI will use its default view, with displacement equal to zero, and etype and filetype equal to MPI_BYTE. This corresponds to a view in the POSIX sense, if a file is opened with fopen(). An offset is a position in the file relative to the current view, expressed as a count of etypes.

As seen in the previous subchapter, we can construct datatypes with holes. Using such noncontiguous buffers as a filetype, a group of processes can use complementary views to achieve a global distribution of data in a file. We define a view where each process will be able to write to its part of the file. We look at the following example:

Figure 5: File partitioning The figure shows an example of data distribution using different filetypes for different processes. With a proper configuration of holes in a filetype, we can achieve a contiguous memory access, where no two processors are reading or writing to the same location. In this example, file is partitioned among three MPI processes.

Let us now take a closer look at the procedure for reading and writing files with MPI. Before performing any reading or writing, we open a file with MPI_File_open(). This is equivalent to the function fopen() in the C programming language, except that it is collective and must be called by all processes in the communicator.

MPI_File_open(MPI_Comm communicator,
              char* filename,
              int access_mode,
              MPI_Info info,
              MPI_File* fh);

We then need to construct the appropriate derived datatypes, which we will use as etypes and filetypes. This is done with the datatype constructors from chapter 3.1. To achieve the partitioning seen in figure 5, the filetype must involve 'holes', to skip the parts of the file written by other processes. To satisfy the need for complementary views, the filetypes should be of the same length. After defining the appropriate etype and filetype, we then set the view with:

MPI_File_set_view(MPI_File fh, MPI_Offset displacement, MPI_Datatype etype, MPI_Datatype filetype, char* datarep, MPI_Info info );

This function is also collective; each MPI process creates its own view of the file. We can call this function more than once, if we want to access different parts of the file with a different view. Data is moved between processes and the file by calling read and write functions. There are three orthogonal aspects to data access:

Positioning - we can specify the location of our reading and writing calls either with explicit offsets, individual file pointers or shared file pointers. Data access with explicit offsets stands for the routines where we explicitly provide the offset in bytes from the beginning of the file where we want to read or write data. Individual file pointers are the equivalent of the file pointer in C or C++. Each process has its own individual file pointer and each routine updates the pointer to the next data item after the last one is accessed. Furthermore, MPI maintains one shared file pointer per file.

Synchronism - we can perform blocking and non-blocking input/output routines. Blocking functions do not return until the reading or writing is completed. Non-blocking routines are routines where a function just initiates the process of reading or writing and returns immediately, before the access is actually complete. There is also a possibility of a restricted form of non-blocking access called split collective routines.

Coordination - collective vs. non-collective. Collective calls are called by all processes in the group that opened the file. Non-collective calls can be called by just one process. Table 1 shows the three orthogonal aspects of data access and their MPI routines. POSIX fwrite() and fread() are blocking, non-collective operations which use individual file pointers. Their MPI equivalents are MPI_File_write() and MPI_File_read().

Table 1: Data access routines The table presents the many different routines which exist for different data access. [12]

In this thesis we use blocking calls with individual file pointers (in both collective and non-collective manner). The collective call which will be needed for the parallel writing is:

MPI_File_write_all(MPI_File fh, void* buffer, int count, MPI_Datatype datatype, MPI_Status* status );

where fh is the file handle and buffer is the initial address of the buffer to be written into the file. Each process writes count elements of a buffer, where each element is of type datatype.
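Putting the pieces together, the following minimal sketch follows the procedure described in this section: build a filetype with holes, give each rank its own view, and write collectively with MPI_File_write_all(). The file name, block size and the simple round-robin partitioning are assumptions for the example, not the Cube layout.

#include <mpi.h>

#define BLOCK 4   /* number of doubles each process writes (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* local data: one block per process */
    double buffer[BLOCK];
    for (int i = 0; i < BLOCK; ++i)
        buffer[i] = rank + i / 100.0;

    /* etype: MPI_DOUBLE; filetype: one block of BLOCK doubles followed by a
       hole covering the blocks of all other ranks (extent = nprocs blocks)  */
    MPI_Datatype block_type, filetype;
    MPI_Type_contiguous(BLOCK, MPI_DOUBLE, &block_type);
    MPI_Type_create_resized(block_type, 0,
                            (MPI_Aint)(nprocs * BLOCK * sizeof(double)),
                            &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "partitioned.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* the displacement shifts each rank to the start of its own block */
    MPI_Offset disp = (MPI_Offset)(rank * BLOCK * sizeof(double));
    MPI_File_set_view(fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);

    /* collective, blocking write with individual file pointers */
    MPI_File_write_all(fh, buffer, BLOCK, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&block_type);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Run with three ranks, this reproduces the kind of interleaved, complementary layout shown in figure 5.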

3.3 Performance of MPI I/O

Parallel applications have different I/O access patterns, and each can be presented to the I/O system in several ways, based on which I/O functions the application uses. Generally, we can say that each process needs to access a large number of small pieces of data, which are not contiguously located in the file. A common case is a two-dimensional block distribution of a matrix, where the matrix is written to the file row by row. We can classify this access pattern in four different levels [13].

Level 0 - Each process performs one write request for each local row, independent of other processes.

Level 1 - Same as level 0, except the write requests are collective.

Level 2 - Each process creates a derived datatype to represent the noncontiguous access pattern, defines a file view, and performs one write request, independent of other processes.

Level 3 - Similar to level 2, but now the processes call the write request collectively.

                                 I/O bandwidth (MBytes/s)
machine          processors   Level 0/1   Level 2   Level 3
HP Exemplar          64          0.54       1.25      50.7
IBM SP               64          1.85       1.85      57.6
Intel Paragon       256          1.12       3.33       183
NEC SX-4              8          0.62      75.3        447
SGI Origin200        32          5.06      13.1       66.7

Table 2: Write performance of MPI Table shows different performance for various access levels on different machines.[13]

We see that level 3 offers much higher performance. This is because of two important factors - data sieving and collective calls. Data sieving is a technique that reduces the number of individual calls for noncontiguous accesses. It is much better to make large I/O requests and then extract the data needed, rather than perform many small I/O requests. Collective calls enable the implementation to analyze and merge the requests of different processes. In many cases, the merged request may be large and contiguous, although the individual requests were noncontiguous. The merged request can therefore be serviced more efficiently.

3.4 I/O challenges in HPC community

Scientific applications on large-scale computers can read and write a lot of data. In the year 2015, a study [14] analyzed the I/O behavior of over a million jobs of supercomputing applications over six years across three leading high-performance computing platforms (Intrepid and Mira at the Argonne Leadership Computing Facility (ALCF) and Edison at the National Energy Research Scientific Computing Center (NERSC)). The major results from the study are:

1. Most of the applications still use POSIX related input/output routines and not parallel I/O libraries. Nearly 95% on Intrepid (80% and 50% respectively on the other two machines) used POSIX exclusively. The remaining jobs used MPI I/O directly or used libraries which are built on top of MPI I/O.

2. A small number of applications and jobs dominate the usage of platform I/O resources. 90% of the total I/O time was used by no more than 4% of the applications on Intrepid (3% and 6% on the other machines).

3. Very low I/O performance is the norm for most applications. All three machines had a peak I/O bandwidth close to 1 TB/s. The study showed that almost three quarters of the applications never exceeded 1 GB/s, roughly 0.1% of the peak. One

reason for this is also that most applications write data in text format, which generally does not scale well. Writing data in a native (binary) format is a much better practice. All these results seem very bad at first sight, but it must be noted that most of the applications do not perform large scale I/O operations. In most cases, serial reading and writing of the data is sufficient and there is no need for parallelism. Although in some cases the I/O performance could satisfy its owners, this does not necessarily hold for the platform owners. Therefore, the greatest possibility for saving resources lies in identifying an application's I/O performance before it becomes the top consumer of resources on the platform. Automatic performance analysis is especially useful here.

4 Cube

Cube [15] is an explorer for performance reports generated by automatic performance analysis tools. It is currently being developed by Forschungszentrum Jülich. It was originally a part of the Scalasca toolset. Now it is available as a separate component, and is used by both the Scalasca and Score-P measurement infrastructures.

4.1 Cube libraries

Cube provides a set of libraries and tools to store and analyze performance profiles. The CubeWriter library, which writes measurement data in the Cube4 file format, is the main focus of this thesis. It will be explained in more detail in chapter 4.4. CubeLib is a general purpose C++ library that includes tools for storing and manipulating measured profiles. Both Scalasca and Score-P use CubeLib and CubeWriter to store the performance measurement data in the Cube4 file format. Cube also includes the graphical user interface CubeGUI to visually display the measured performance data and a Java reader library (jCubeR). This can be seen in figure 6. The current version is Cube 4.4.

Figure 6: Cube libraries The figure shows a relation between Cube libraries. Main focus of this thesis is the CubeWriter library, which is used by Score-P to write data in the Cube4 file format.

The Cube4 file is opened by the CubeGUI (with the help of CubeLib) to visually display data. Running applications on large scale machines produces a large number of locations. This usually results in a bad user experience, because the user has to find values that stand out in huge data sets. In CubeGUI, non-leaf nodes in the metrics, call tree, or system tree can be collapsed or expanded, to achieve the desired level of granularity and detail. This makes it easy to identify the source of the problem. In addition, each severity value is displayed with a colored square, where the color depends on the value. This enables an easy identification of nodes of interest. CubeGUI also supports a flat call tree, which is represented as a flat sequence of all call paths. This may be useful if one is interested in the severities of certain methods, independently of the position

of their invocations. Many features can be extended using a set of predefined plugins [16]. Figure 7 shows a screenshot of CubeGUI. The display is composed of three panels which correspond to the described three dimensions.

Figure 7: CubeGUI Above figure shows a screenshot of CubeGUI. Left panel shows metrics, middle panel call tree, and right panel system tree. Each tree node can be collapsed or expanded. (In our case metric node Execution is expanded, as well as ’bt’ method in the call tree and all processes in the system tree.)

The CubeWriter library produces files in the Cube4 file format, which is a tar archive. We explain the structure of the tar archive in more detail, because it is necessary for the understanding of the next chapters.

4.2 Tar archive

Tar is a utility in computer software for archiving many different files into one archive file, also known as a tarball. The name 'TAR' comes from (T)ape (AR)chive, because this utility was originally developed to write data to sequential I/O devices. Due to historic reasons concerning tape drives and storage space on magnetic tapes, tar archives are always written in blocks of many 512-byte records. Each of the files to be stored in the tarball is preceded by a 512-byte header record. A file together with its tar header is called a file object. The tarball is ended with two 512-byte NULL blocks, which equal two empty tar headers. This indicates the end of the tar archive. A graphical representation of the tar archive layout is shown in figure 8.

Figure 8: Layout of TAR archive The figure shows the sequence of structures in a tar archive.

4.2.1 Tar header

The tar header contains descriptive metadata about the file it precedes. It consists of different fields, each of them with a strictly defined location (offset) and size. The original pre-POSIX format of the header from the year 1988 contained 9 fields, but most modern day programs use Ustar (Unix Standard Tar), which was introduced by the POSIX IEEE P1003.1 standard. It provides seven additional header fields, a total of 16, for more information and allows for longer file names. The Ustar tar header is defined as follows:

struct ustar_tar_header {
    char name[100];        /* offset 0:   name of the file        */
    char mode[8];          /* offset 100: file mode               */
    char uid[8];           /* offset 108: user id                 */
    char gid[8];           /* offset 116: group id                */
    char size[12];         /* offset 124: size of the file        */
    char mtime[12];        /* offset 136: last modification time  */
    char chksum[8];        /* offset 148: checksum                */
    char typeflag;         /* offset 156: type of archived file   */
    char linkname[100];    /* offset 157: name of linked file     */
    char magic[6];         /* offset 257: UStar indicator         */
    char version[2];       /* offset 263: UStar version           */
    char uname[32];        /* offset 265: owner user name         */
    char gname[32];        /* offset 297: owner group name        */
    char devmajor[8];      /* offset 329: device major number     */
    char devminor[8];      /* offset 337: device minor number     */
    char prefix[155];      /* offset 345: file name prefix        */
    char pad[12];          /* offset 500: padding                 */
};

The Ustar format is still compatible with the original pre-POSIX format, as the locations and offsets of the first nine fields have remained the same. Older tar programs are therefore still able to read the new tar format, because the additional information is simply ignored. To ensure the portability of tar archives across different architectures, the information in the header file is encoded in ASCII. Furthermore, all numerical values (checksum, size, mtime) are encoded as ASCII digits in octal base with leading zeros. The final character in each field should be either NUL or space.

The checksum field represents the sum of all bytes in the header block. First we initialize all eight checksum bytes to ASCII spaces (decimal number 32). We then sum the unsigned byte values of all fields. The checksum field is stored as a six digit octal number followed by NUL and then a space.

The typeflag field provides information in the case of special files. Some possible values are '0' (ASCII nul) for a normal file, '1' for a hard link, '2' for a symbolic link, '3' for a device file, etc.

The size field has a size of 12 characters. Together with a final NUL character, that leaves 11 characters which describe the file size in bytes. The maximum number that can therefore be written in such a format is 8^11 = 8589934592. Consequently, the biggest file that can be written while preserving the Unix standard tar rules is roughly 8.6 GB. This limitation was once a serious problem with tar files. To overcome this, the POSIX.1-2001 standard defines an extended tar or PAX format. Two additional typeflag values are defined, 'x' and 'g', to signal that this is not a normal tar header, but a pax extended header. A pax extended header is composed of a normal tar header (with a typeflag value 'x' or 'g'), extended header records, and another normal tar header. Such a structure is necessary so that the format is still compatible with earlier versions of tar (the extended header records are treated like an additional file). The extended header records overwrite certain fields of the normal tar header, like size. Therefore if the file size is bigger than 8.6 GB, it will be preceded by a tar header, extended header records and another tar header, in that order. For easier understanding, let us call these three 512-byte block structures also tar headers. So from now on, every time we refer to the tar header, we keep in mind that it can be either 512 bytes or 3 · 512 bytes long.
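The checksum rule above can be written down in a few lines of C; the following is a sketch under those rules, not the CubeWriter code.

#include <stdio.h>
#include <string.h>

/* Compute and store the checksum of a 512-byte tar header block. */
static void tar_header_checksum(unsigned char header[512])
{
    /* the checksum field occupies bytes 148..155 and is treated as spaces */
    memset(header + 148, ' ', 8);

    unsigned int sum = 0;
    for (int i = 0; i < 512; ++i)
        sum += header[i];                 /* unsigned byte values of all fields */

    /* six octal digits, then NUL, then a space */
    snprintf((char *)header + 148, 8, "%06o", sum);
    header[154] = '\0';
    header[155] = ' ';
}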

4.2.2 Tar file

The file itself is written unchanged, as there are no constraints regarding its structure. Tar supports both text and binary files. Because of the 512-byte block structure of tar archives, we must always add padding to a multiple of 512 bytes at the end of the file. In other words, tar headers can only start on multiples of 512 bytes in the archive.
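The padding rule amounts to rounding every file size up to the next multiple of 512 bytes; a one-line helper (the function name is ours) makes this explicit.

#include <stddef.h>

/* round a file size up to the next multiple of the 512-byte tar record size */
static size_t tar_padded_size(size_t file_size)
{
    return (file_size + 511) / 512 * 512;
}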

4.3 Cube4 file format

The Cube4 file format is written as a tar archive. It consists of 2 · Nmetrics + 1 files. For each metric we have two files - a metric data file and a metric index file. At the end we have an additional anchor file. Because the Cube format is written as a tar archive, each of the above mentioned files is preceded by a tar header and padded to a multiple of 512 bytes. Files are ordered by metrics; for each metric we first have a metric data file, followed by a metric index file. The anchor file is the last file. The layout of the files in the tar archive is displayed in figure 9.

Figure 9: Sequence of files in Cube4 tar archive The figure shows how Cube internal files are positioned in the tar archive.

The tar archive is chosen due to easy out-of-order seeking. Because the tar header provides information on the file size, we can skip to the file of interest relatively fast. In addition, a tar archive is convenient for maintenance. With the use of hex editors, we have very low-level access to the file contents, which is useful for studying and debugging. In addition, tar is supported by all Linux installations, even the most basic. Let us now take a closer look at the files inside the archive.

4.3.1 Metric data file

The first file - the metric data file - includes the severity values for a particular metric. The file begins with a magic word - the 'CUBEX.DATA' file signature, which is written as 10 ASCII characters. Then all severity values for the metric are written. These values are written in a native (binary) format.

struct metric_data_file {
    char   metric_data_header[10];   /* byte offset 0:  "CUBEX.DATA" signature */
    double sev_values[size];         /* byte offset 10: severity values        */
};

The order in which severity values are written is very important. Before we explain it, we must take a look at some different aspects of metrics.

Metric format stands for how many severity values are written in the file, because the data for a metric can either be sparse or dense. The reason is that some metrics provide data just for particular call paths and some provide data for all. An example of a sparse metric is the metric MPI Time. For this metric, the time measurement is evaluated only on call paths associated with MPI calls. All other call paths have the MPI time equal to zero. The number of MPI function calls is usually small compared to the number of all function calls in the application. For this reason, the metric MPI Time is considered sparse. On the other hand, the metric Time is measured on every call path. Metrics that include data for all call paths are considered dense. If a metric is sparse, the file contains severity values of only the non-zero call paths. If a metric is dense, the file contains values for all call paths. Therefore, the size of the metric data file depends on the format of the metric. The total size of the metric data file equals:

size = 10 + Nlocations · Ncnodes · sizeof(datatype) bytes,

where Ncnodes is the number of all call paths in the dense case, or the number of non-zero call paths in the sparse case, and sizeof(datatype) is the size of the datatype in which the severity values for this metric are stored.
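For illustration, the formula can be expressed as a small helper (the names are ours); for a sparse metric, n_cnodes counts only the non-zero call paths.

#include <stddef.h>

/* size of one metric data file before tar padding:
   10-byte "CUBEX.DATA" signature plus one value per (call path, location) */
static size_t metric_data_file_size(size_t n_locations, size_t n_cnodes,
                                    size_t value_size)
{
    return 10 + n_locations * n_cnodes * value_size;
}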

Metric type describes the relation between measurements in a call tree. Metrics can store values either in an exclusive or an inclusive manner. In the inclusive value, the information of all sub-elements is aggregated into a single value (the value includes the values of its children). On the other hand, the exclusive value excludes the values of its children and the information cannot be subdivided further. Let's take a look at an example to see how the two cases differ in practice. A simple program consists of a function main() that calls functions foo() and bar(). If we talk about the overall time that main() needed to execute, we are talking about its inclusive value. The inclusive value also includes the time for both foo() and bar() to execute. The exclusive time would be the time that the

program spent in the function main(), without the times spent in foo() and bar(). An inclusive value can never be less than the sum of the inclusive values of its children. An example is shown in figure 10.

Figure 10: Inclusive vs. exclusive values The figure shows an example of two different types of values for the time metric. Both call trees show the same measurement, but in a different type. On the left side we have inclusive time values, and on the right side values in exclusive type. For leaf nodes, exclusive and inclusive values are equal.

Because the measurement system measures time as the difference between when the application stepped into a function and out of it, it is obvious that the Time metric measures values in an inclusive manner and is therefore inclusive. An example of an exclusive metric would be the Visits metric. The Visits metric measures how many times a function has been visited. The value is independent of the values of its children. The order of the call-path severity values written in the file depends on whether a metric is inclusive or exclusive [17].

CubeGUI offers users the option to expand or collapse a call node in a tree. Expanded nodes show exclusive values and collapsed ones inclusive values. Therefore, the library is transforming exclusive values into inclusive and vice versa. The formula to compute exclusive values out of inclusive ones is:

t_{excl} = t_{incl} - \sum_{children} t_{incl}

Similarly we get for the opposite direction:

t_{incl} = t_{excl} + \sum_{children} t_{incl}

There is a fundamental difference in the above formulas. If we want to transform an inclusive value into an exclusive one, we need just the inclusive values of the node's direct children. However, transforming exclusive into inclusive, we must first compute the inclusive values of the node's children. This leads to the recursive computation of inclusive values of the entire sub-tree (a small code sketch of this recursion is shown after Table 3 below). Due to caches in computers, data access is faster if the relevant information in a file is stored close together. The reason for that is that reading from a file

Due to caches in computers, data access is faster if the relevant information in a file is stored close together, because reading from a file requires transferring many chunks of data into memory; if the data is close together, it is transferred in fewer chunks, which results in a higher speed. This is also known as data locality. If we have inclusive values, it is better to keep the data of the children nodes together. This is achieved through the inclusive order, which corresponds to a breadth-first traversal of the tree [18]: the root node is followed by all of its children, then all grandchildren, great-grandchildren and so on. If we have exclusive values, we order the nodes by depth-first search: we recursively descend into a child node until we reach a leaf. Because the construction of both orders is already implemented in CubeWriter, we omit the details of how this is done. What is important for us is that the two orders differ and that which one is chosen depends on the metric type. Figure 11 shows what the two orders look like in our example.

Figure 11: Tree ordering
The figure shows the inclusive and the exclusive order for a simple call tree.

order in the file | inclusive      | exclusive
1                 | main()         | main()
2                 | MPI_Init()     | MPI_Init()
3                 | foo()          | foo()
4                 | bar()          | foobar()
5                 | MPI_Finalize() | barfoo()
6                 | foobar()       | bar()
7                 | barfoo()       | barbar()
8                 | barbar()       | MPI_Barrier()
9                 | MPI_Barrier()  | MPI_Finalize()

Table 3: Call path order

We can now finally describe the order of severity values in the metric data file. The values are primarily sorted by call paths, ordered as described above; for each call path, the severity values of all locations follow. Within a call path, the values are sorted by the rank of the MPI process, with each process contributing the values of all OpenMP threads it spawned. In pseudo code, this can be described with three nested loops: the first loop runs over all call paths in either exclusive or inclusive order, the second loop over all MPI processes, and the third loop over the OpenMP threads:

for call_path = 1, 2, ...
    for MPI_rank = 1, 2, ...
        for OpenMP_thread = 1, 2, ...
            fwrite(sev_value);
        end
    end
end

The order is illustrated in figure 12.

Figure 12: Structure of metric data file
Severity values in the metric data file are ordered in a particular call-path order depending on the metric type. The values of a particular call path are ordered by locations. In the sparse case, the file includes just the non-zero call paths.

After the severity values, null characters are padded up to a multiple of 512 bytes to satisfy the tar block structure. All numerical values are written in binary form.
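To make the layout concrete, the byte offset of a single severity value in a dense metric data file follows directly from the ordering above. The following helper is an illustration only (the function name and the 0-based index convention are ours); the 10 bytes account for the marker at the beginning of the data file.

#include <stddef.h>

/* Offset of the value of call path cp (0-based, in the metric's call-path order)
 * and location loc (0-based: ordered by MPI rank, then by thread) in a dense
 * metric data file. */
size_t
sev_value_offset( size_t cp, size_t loc, size_t n_locations, size_t value_size )
{
    return 10 + ( cp * n_locations + loc ) * value_size;
}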

4.3.2 Metric index file

The second file is the metric index file, which is relevant mainly for sparse metrics. It stores the information about which call paths were written to the metric data file.

struct index_file {                  /* byte offset */
    uint32_t       endian;           /*  0 */
    uint16_t       version;          /*  4 */
    uint8_t        metric_format;    /*  6 */
    uint32_t       size;             /*  7 */
    uint32_t[size] indeces;          /* 11 */
};

uint8_t, uint16_t and uint32_t are the predefined C datatypes for unsigned 8-, 16- and 32-bit integers, respectively. Endian refers to the byte order and is in our case always equal to 1. Version is equal to 0. Metric format stands for the sparse or dense format. Size is the number of call paths written. Indeces is an array of the call-path ids that were written to the metric data file; this field is written only in the sparse case. For a dense metric, size is equal to 0 and the array indeces contains no values.
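For illustration, an index file with this layout could be produced with plain C stdio as sketched below. This is not the CubeWriter code; the helper name is ours, and the numeric encoding of metric_format (1 for sparse) is an assumption.

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: write a sparse metric index file with the layout above. */
void
write_index_file( FILE* f, uint32_t n_written, const uint32_t* cnode_ids )
{
    uint32_t endian        = 1;  /* marks the writer's byte order            */
    uint16_t version       = 0;
    uint8_t  metric_format = 1;  /* assumption: 1 = sparse, 0 = dense        */

    fwrite( &endian,        sizeof( endian ),        1, f );
    fwrite( &version,       sizeof( version ),       1, f );
    fwrite( &metric_format, sizeof( metric_format ), 1, f );
    fwrite( &n_written,     sizeof( n_written ),     1, f );
    fwrite( cnode_ids,      sizeof( uint32_t ), n_written, f ); /* only in the sparse case */
}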

4.3.3 Anchor file

An additional anchor file stores information about the analyzed application. It is written as an .xml file, but we will not go into the details of its structure, because it is not important for the understanding of the next chapters. The anchor file includes information about the analyzed program (name, user, time). Furthermore, it provides the definitions of all metrics (name, type, format, url, ...), the definitions of all regions and a definition of the system tree on which the program was running. It also stores the types of the metrics, so that the CubeGUI can correctly read the data and associate the call paths with the right values. The size of the anchor file depends heavily on the size of the call tree and the system tree, because every call node and every system tree node is explicitly defined in the file.

4.4 CubeWriter

The CubeWriter library is used to write the performance profile in the Cube4 file format described above.

4.4.1 Usage

We briefly explain how the CubeWriter library is used from a user's point of view. Before the actual writing to the file takes place, the CubeWriter library needs to gather information about the measured application, the measurement system and the system on which the application ran. This includes the creation of the main cube_t struct with a call to:

cube_t* cube_create( char*               cube_name,
                     enum CubeFlavours_t cv,
                     enum bool_t         compressed )

cube_name is the name of the Cube4 file to be created. compressed is a boolean value which tells Cube whether to compress the file (details about compression are omitted). cv is a dummy argument which was introduced for future parallel writing but has no meaning in the current version; it will be explained in chapter 6. The function returns a pointer to the cube_t struct. After that, the user must provide the definitions of all metrics, regions, call paths, locations, etc.; let us call these the cube structures. The CubeWriter library provides constructors for all of them: cube_def_metric(), cube_def_region(), cube_def_cnode() and others. A pointer to the cube_t struct is passed to each of these functions as one of the arguments.

The next step for the user is to provide the severity values for each metric. Before doing that, the function cube_set_known_cnodes(...) is called to tell CubeWriter how many and which call paths the metric has values for. The library then provides a set of functions to write the full set of measurement values of a particular call path for some metric. An example for severity values of data type double is:

cube_write_sev_row_of_doubles( cube_t*      cube,
                               cube_metric* metric,
                               cube_cnode*  callnode,
                               double*      sev_row )

Similar functions with the same task exist for other data types (e.g. cube_write_sev_row_of_uint64_t(...) for unsigned 64-bit integers). The arguments cube, metric and callnode are pointers to the structures we previously defined. The array sev_row contains the severity values of a particular call node of a metric. It is the user's task to gather all severity values of a particular call path into the array sev_row. The values in this array must be ordered by locations in the same order in which they are written to the file; otherwise the file is erroneous.

The function cube_write_sev_row_of_doubles(...) needs to be called for every call path that the metric has values for: in the case of a dense metric for all call paths in the call tree, and in the sparse case just for the non-zero call paths, i.e. for all those that were marked in the call to cube_set_known_cnodes(...).

After finishing the above process for all defined metrics, the user must call the function cube_free(). This function writes the additional anchor file (together with its header) and two empty tar headers that end the tar archive. It then closes the file and frees all memory that the library allocated. After this, the user can open the generated Cube4 file with the CubeGUI.

4.4.2 Library architecture

Here, we present the most important parts of the CubeWriter library. It must be noted immediately that the library contains many features that are not covered in this thesis; we stick to those parts that are necessary for the understanding of the writing algorithm. The library is written in the C programming language. Although C is not an object-oriented language, object-oriented paradigms are implemented with the use of structs and pointers. A very simplified UML diagram is presented in figure 13.

Figure 13: Simplified UML diagram of CubeWriter
The figure shows a very simplified UML diagram of the CubeWriter library with the most important structures, fields and methods. Many additional ones have been left out due to the smaller scope of this thesis.

cube_t is the main struct that includes pointers to all call nodes, metrics, regions and locations. It also contains all public methods from the previous sub-chapter that are used by the user. When cube_def_metric(), cube_def_region() or cube_def_cnode() is called, cube_t creates and initializes the associated structs. The tar writer struct handles the proper structure of the files in the tar archive (tar headers and the 512-byte block structure).

We briefly explain the most important internal steps of how the CubeWriter library writes the tar archive; a lot of details are omitted. When the user calls cube_write_sev_row_of_doubles(...), the library first calculates the position of the severity values of this call node in the file. After that, it seeks to that position and writes the severity values into the file. Additionally, when this function is called for the first time for a particular metric, the tar writer struct calls cube_metric_data_start(), which creates the tar header. When it is called for the last time, the tar writer also calls cube_metric_data_finish(), which writes the file padding to a multiple of 512 bytes. CubeWriter then also writes the tar header for the metric index file and the metric index file itself. When this function is called for the next metric, the CubeWriter library has thus already created the correct tar layout as seen in figure 9. The call to cube_free() writes the anchor file (with its tar header and padding) as well as the two empty tar headers that end the tar archive. In addition, it frees all of the allocated memory and closes the Cube4 file, which is then available to the user. The sequence diagram of this entire process is presented in figure 14.

The current CubeWriter library writes the Cube4 file entirely sequentially, with no parallelization. Since the function cube_write_sev_row_of_doubles(...) takes an argument with the severity values of all locations, it assumes that the root process knows all severity values, which is usually not the case. If the application runs with many processes, each process measures the data of its own execution of the application, so the root process only stores its own measurement data (and/or that of the threads it spawned). Let us take a look at how Score-P solves this problem.

Figure 14: Internal steps of CubeWriter
The most important internal steps of the CubeWriter library when the user calls cube_write_sev_row_of_doubles() and cube_free(). All steps are executed only on the root process.

4.4.3 How Score-P uses the CubeWriter library

Score-P uses CubeWriter to write its measurement data into the file. As described in chapter 4.4.1, Score-P first creates the cube_t struct and defines all cube structures. This is done only on the root process. Then, Score-P enters the loop over all metrics. Before writing severity values, the function cube_set_known_cnodes(...) is called, which tells CubeWriter how many call paths have measured values for the current metric.

In Score-P, each MPI process measures its own data (the data of all OpenMP threads it spawns). After the measurement, these values are still stored on each individual MPI process; the root process does not automatically know the values of the other processes. To call the function cube_write_sev_row_of_doubles(), Score-P therefore first has to gather the data on the root process, which is done in three steps:

1. Each process creates a local array that contains the severity values of its threads.
2. The root process gathers the arrays of all other processes into a global array. If the number of threads per process is the same for every process, this is done with MPI_Gather(), otherwise with MPI_Gatherv().
3. The root process calls cube_get_cnode() to get the right call node and then cube_write_sev_row_of_doubles(), which writes the entire row of severity values.

This process is repeated for all call paths of all metrics. As we will show later, the gathering of values on one process is the main reason for the slow performance of the current library. Afterwards, cube_free() is called on the root process to finish writing the file. Except for rank 0, the MPI processes have no contact with the CubeWriter library throughout this procedure. The sequence diagram of Score-P is presented in figure 15.
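Before looking at the diagram, the gathering in step 2 can be sketched as follows. This is a simplified, stand-alone illustration rather than the Score-P source; the variable names are ours, and the counts and displacements per rank are assumed to be known on rank 0.

#include <mpi.h>
#include <stdlib.h>

/* Illustrative only: gather the severity values of one call path onto rank 0.
 * local_vals holds n_local values (one per thread of this process); counts and
 * displs describe the number of threads of every rank. */
double*
gather_row_on_root( const double* local_vals, int n_local,
                    const int* counts, const int* displs, int n_total, int rank )
{
    double* global_row = NULL;
    if ( rank == 0 )
    {
        global_row = malloc( n_total * sizeof( double ) );
    }
    MPI_Gatherv( local_vals, n_local, MPI_DOUBLE,
                 global_row, counts, displs, MPI_DOUBLE,
                 0, MPI_COMM_WORLD );
    return global_row; /* on rank 0: the row for one call node */
}

On rank 0, the returned row could then be handed to cube_write_sev_row_of_doubles() for the corresponding call node.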

Figure 15: Score-P sequence diagram
The figure shows the steps taken by Score-P when writing to the Cube4 file. All calls to CubeWriter are made from rank 0.

5 New writing algorithm

The current writing algorithm is very straightforward, but it does not provide the necessary speed when metric data files become large. We want to construct an algorithm in which all processes write to the file simultaneously with a minimum of communication between them. In addition, we want to remove the need to gather the severity values on the root process.

5.1 Metric data file

5.1.1 File partition

The biggest files in the tar archive are the metric data files. This is true especially for dense metrics, because the severity values of all call paths are written to the file. Measurement systems like Score-P work in such a way that each MPI process measures its own data during the application run, which means that the severity values are already distributed over the processes. Because of this, we want to construct an algorithm in which each MPI process writes the severity values of its own threads. This way, the processes also do not have to send their data to the root process, which is currently required when using the library. Figure 16 shows which severity values in the file belong to which process.

Figure 16: File partition
The figure shows the partitioning of the values in the metric data file. Each MPI process stores the severity values of its threads.

The file needs to be partitioned between the MPI processes. Each process may view only that part of the file where its severity values are located; this way, writing to the file skips the severity values of the other processes and touches only the right locations. Luckily, metric data files possess a well-organized, repetitive structure for all processes, as seen in figure 16. Because the number of locations per rank is the same for all call paths, each process writes chunks of data at strictly defined intervals, which are N_locations values apart. To use MPI_File_write_all(), each process must gather the severity values of all call paths in one array. The data needs to be ordered in the same way as the call paths, depending on the metric type (inclusive/exclusive).

Figure 17: New algorithm
The figure shows the new algorithm and how different ranks write different parts of the Cube4 file.

5.1.2 MPI steps

The new algorithm is implemented with MPI I/O routines. Each process needs to construct an appropriate etype and filetype and then set the right view, as described in chapter 3. We now show how to achieve this with the correct MPI derived datatypes for this particular case.

The etype is the unit of measurement in the file and depends on the datatype of the corresponding metric: double leads to MPI_DOUBLE, unsigned 64-bit integer to MPI_UINT64_T, and so on. In the case of the tau_atomic datatype, the etype needs to be created explicitly with datatype constructors and must have the same structure as tau_atomic.

In figure 16, we see the repetitive pattern for each call path. The filetype must have a length of N_locations etypes, because the pattern repeats after all locations of each call path. It consists of N_threads etypes, which correspond to the severity values of the rank's threads. The severity values of other threads are written by other processes and must not be accessible; therefore the filetype contains N_locations − N_threads holes at the remaining positions. This ensures that each process skips the parts of the file where the severity values of the other processes are written.

To synchronize the filetype with the right offsets of the data in the file, we use the following approach. All etype units are located at the beginning of the filetype, and we set up the view of the file with a different displacement for each process:

    disp(r) = 10 + Σ_{i=1}^{r−1} N_{t_i},

where r is the MPI rank and N_{t_i} the number of threads spawned by the i-th rank; 10 is added due to the magic word at the beginning of the file. In other words, each process begins its view at the offset of the severity values of its own threads for the first call path in the file. This is done by calling MPI_File_set_view(). The partitioning of the file with this approach is presented in figure 18.

Figure 18: Filetypes of processes
The picture shows a simple example of the filetypes of 4 processes. Each white cell represents a hole in the filetype, while the colored cells show where the data is present.

All processes can then call MPI_File_write_all() to write the metric data into the file. Because of the view we set, each process writes only to those locations in the file that it can 'see' through its view. After the collective write, the root process calculates the difference to the next multiple of 512 bytes and writes the additional null characters required by the tar layout rules.
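The steps described in this section can be sketched in a few lines of MPI for the double case. This is an illustration under our own naming, not the CubeWriter implementation; error handling and the tau_atomic case are omitted.

#include <mpi.h>

/* Illustrative only: each rank writes its n_threads values per call path,
 * skipping the values of all other ranks. */
void
write_metric_data( MPI_File fh, const double* sev_rows, int n_callpaths,
                   int n_threads, int n_locations, int thread_offset )
{
    MPI_Datatype etype = MPI_DOUBLE;
    MPI_Datatype contig, filetype;

    /* n_threads contiguous etypes at the start of the pattern ...            */
    MPI_Type_contiguous( n_threads, etype, &contig );
    /* ... resized to an extent of n_locations etypes, so the remaining
     * n_locations - n_threads positions act as holes for the other ranks.    */
    MPI_Type_create_resized( contig, 0,
                             (MPI_Aint)n_locations * (MPI_Aint)sizeof( double ),
                             &filetype );
    MPI_Type_commit( &filetype );

    /* Displacement: 10 bytes of marker plus the values of the lower ranks
     * for the first call path.                                               */
    MPI_Offset disp = 10 + (MPI_Offset)thread_offset * sizeof( double );
    MPI_File_set_view( fh, disp, etype, filetype, "native", MPI_INFO_NULL );

    MPI_File_write_all( fh, sev_rows, n_callpaths * n_threads, etype, MPI_STATUS_IGNORE );

    /* Restore the default byte-stream view for subsequent sequential writes. */
    MPI_File_set_view( fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL );
    MPI_Type_free( &filetype );
    MPI_Type_free( &contig );
}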

5.2 Metric index file

The metric index file is still written sequentially by the root process. The reason is that the metric index file does not possess any repetitive structure, so no reasonable partitioning of the file is possible. Furthermore, the data written to the metric index file is available only on the root process, and the file is small. To avoid skipping any bytes while writing it, the view of the file must again be set to the default view (a linear byte stream, with etype and filetype equal to MPI_BYTE). Writing to the file is done with MPI_File_write().

5.3 Anchor file

The anchor file is also written by the root process, because it includes the information about the cube structures (regions, metrics, call paths, ...), and these structures are also known only to rank 0. Because the anchor file is much bigger than the metric index file, we must be careful about how we use the MPI writing routines. One idea is to write the file line by line, with one call to MPI_File_write() per line. This is not good practice, because calls to MPI_File_write() are relatively expensive and we want to use as few of them as possible. The fastest solution would be to construct one array with the entire anchor file content and use just one writing routine, but this is not possible due to the limited size of the buffer. We compromise between high speed and the limited buffer by writing the anchor file in big chunks: we allocate memory of a certain size, fill it with the content of the file and then call MPI_File_write(); we repeat this process as many times as it takes to write the entire file.
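The chunked writing can be sketched as follows (an illustration only; the buffer size and all names are assumptions, not the CubeWriter internals). Text is appended to a fixed-size buffer, which is flushed with a single MPI_File_write() whenever it is full.

#include <mpi.h>
#include <string.h>

#define ANCHOR_BUF_SIZE ( 4 * 1024 * 1024 )  /* assumed buffer size */

typedef struct
{
    MPI_File fh;
    char     buf[ ANCHOR_BUF_SIZE ];
    size_t   used;
} chunked_writer;

/* Append a piece of text; flush the buffer with one MPI write when it is full. */
void
chunked_append( chunked_writer* w, const char* text, size_t len )
{
    while ( len > 0 )
    {
        size_t space = ANCHOR_BUF_SIZE - w->used;
        size_t n     = ( len < space ) ? len : space;
        memcpy( w->buf + w->used, text, n );
        w->used += n;
        text    += n;
        len     -= n;
        if ( w->used == ANCHOR_BUF_SIZE )
        {
            MPI_File_write( w->fh, w->buf, (int)w->used, MPI_CHAR, MPI_STATUS_IGNORE );
            w->used = 0;
        }
    }
}

/* Write whatever remains in the buffer at the end. */
void
chunked_flush( chunked_writer* w )
{
    if ( w->used > 0 )
    {
        MPI_File_write( w->fh, w->buf, (int)w->used, MPI_CHAR, MPI_STATUS_IGNORE );
        w->used = 0;
    }
}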

6 Implementation

In this chapter, we present the most important aspects of the implementation of the new algorithm.

6.1 New library architecture

Because Score-P and Scalasca are primarily used for parallel applications, they are written on top of MPI and OpenMP. The current CubeWriter library, however, is used in a non-parallel way by just one process and is therefore written without any MPI or OpenMP. This now changes, as all processes will use the library simultaneously. We want to rewrite the library in such a way that its usage in Score-P is modified as little as possible: all public methods keep their meaning and are only adapted to the new algorithm.

For each process to have access to the file, the cube_t struct now has to be created on every MPI process, which leads to a problem: the information about metrics, call tree and system tree (the cube structures) is available in Score-P only on the root process. Instead of broadcasting the entire data about the measurement system and the application to all other processes, we initialize a fully defined cube_t struct with all cube structures only on rank 0. The cube_t structs on the other ranks contain null pointers to these structures. This is not problematic, because the parts of the file where they are necessary (tar headers, metric index files, anchor file) are anyway written only by rank 0. To distinguish the two cases more easily, an enum cube flavour is added to the cube_t struct; its value is cube_master on rank 0 and cube_writer on the other ranks.

After the tar writer struct is created, the function cube_writing_start() now creates an MPI file with MPI_File_open() instead of POSIX fopen(). All functions like cube_write_sev_row_of_doubles() are transformed into their new versions, like cube_write_all_sev_rows_of_doubles():

cube_write_all_sev_rows_of_doubles( cube_t*      cube,
                                    cube_metric* metric,
                                    double*      sev_rows )

This function is now collective and must be called by every process for every metric. Each MPI rank passes its own cube_t struct as the first argument. The second argument is a pointer to the metric and is meaningful only on the root process, because the other ranks do not have these structs initialized; they simply pass a null pointer. The third argument, the array sev_rows, now contains the values of all threads spawned by the process for all call nodes of the metric. These values need to be sorted in the right call path order, depending on the metric type (exclusive vs. inclusive). There is no argument for a call node anymore, because all call nodes are written at once.

As discussed in 5.1.2, the main task of the parallel writer is the initialization of the proper etype and filetype structures for the collective MPI I/O routines, which has to be done on every process.

We gather the functions related to this in a struct parallel_metric_writer, an instance of which is added to cube_t. The function prepare_for_writing() in parallel_metric_writer initializes the proper etype and filetype. For this initialization, the following fields are added to the cube_t struct: the number of the process's own threads (to create the right filetype), the number of all threads (to create a filetype of the appropriate length), the offset of the process's threads (to calculate the right displacement in the file), as well as the rank and the number of all ranks. The constructor of the cube_t struct takes five additional arguments for these fields and now looks as follows:

cube_t* cube_mpi_create( char*               cube_name,
                         enum CubeFlavours_t cv,
                         int                 world_rank,
                         int                 world_size,
                         int                 total_threads_number,
                         int                 threads_number,
                         int                 threads_offset,
                         enum bool_t         compressed )

When the writing of a metric begins, the root process must first broadcast the information about the metric's datatype, so that each process initializes the right etype. This is done through the function prepare_for_writing(), which also constructs the filetype. The function calculate_displacements() is then called, which calculates the displacement at which the view of each process should start; for this, the root process must also broadcast the offset of the metric data file within the tar archive. The view is then set on each process with MPI_File_set_view() with the proper displacement, filetype and etype, to achieve the right partitioning of the file. Afterwards, we write to the file with a call to MPI_File_write_all(). As the communication buffer, we use the pointer to the array sev_rows that was passed to cube_write_all_sev_rows_of_doubles(); because this array already contains the call paths in the right order, the severity values end up in the right places in the file. After writing, we set the view back to the normal sequential view, setting both etype and filetype to MPI_BYTE, with another call to MPI_File_set_view(); this is necessary for writing the tar header and the metric index file.

The other parts of the tar archive, which are not written in a parallel manner, also need to be adapted: because the file was opened with the MPI interface, all POSIX fwrite() calls need to be changed to MPI_File_write() (called only by the root process). The main steps of the implemented algorithm are shown in figure 19.
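The broadcast and the displacement calculation can be sketched as follows. This is a simplified illustration with our own names; exactly how the data-file marker and the offset within the tar archive are combined is an assumption here, not a statement about the CubeWriter internals.

#include <mpi.h>
#include <stdint.h>

/* Illustrative only: the master broadcasts the metric's value size and the
 * start offset of the metric data file; every rank then derives the byte
 * displacement for its view. */
MPI_Offset
broadcast_and_get_displacement( int value_size_on_root, uint64_t file_start_on_root,
                                int thread_offset )
{
    int      value_size = value_size_on_root;   /* meaningful on rank 0 only        */
    uint64_t file_start = file_start_on_root;   /* data file offset in the archive  */

    MPI_Bcast( &value_size, 1, MPI_INT,      0, MPI_COMM_WORLD );
    MPI_Bcast( &file_start, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );

    /* 10 bytes of marker, then the values of the lower-ranked processes. */
    return (MPI_Offset)file_start + 10 + (MPI_Offset)thread_offset * value_size;
}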

The new library also includes the option of a PAX extended header for the case that a metric data file is bigger than 8.6 GB. The function report_metric_data_start() first calculates the size of the metric data file; if it is larger than the old tar header allows, it writes a PAX extended header as defined in the POSIX.1-2001 standard.

Figure 19: Internal steps of the new CubeWriter library
The most important steps of the new CubeWriter library when the functions cube_write_all_sev_rows_of_doubles() and cube_free() are called. One should compare this diagram with the one in figure 14 to see how the steps have changed. Certain functions are now called by all processes, but the ones regarding tar headers, metric index files and the anchor file are still executed only on rank 0.

6.2 Reconfiguring Score-P

Rewriting the CubeWriter library consequently changes how the library is used in Score-P. For easier implementation, a Score-P branch was created and modified to use the proper branch of the new CubeWriter library; this was done through the use of svn:externals. Score-P's interaction with the CubeWriter library is handled in a file called scorep_profile_cube4_writer.c.

The cube_t struct is now initialized on all processes. The new arguments of the cube_t constructor first have to be calculated in Score-P: at the beginning, the root process gathers the number of threads per process from each rank and then calculates both the number of all threads and the thread offsets of every rank. With MPI_Bcast(), this information is passed to the other ranks before cube_mpi_create() is called.

Because cube_write_all_sev_rows_of_doubles() is now called just once per metric (collectively by all processes), Score-P needs to construct, on each process, an array with the severity values of all call paths for all of its threads. The values in this array need to be in the same order as they are written to the file, depending on whether the metric is inclusive or exclusive. Since the metric definitions are initialized only on the root process, the other processes do not know the metric type in advance. For each metric, the root process therefore first informs the CubeWriter library about the non-zero call nodes of the metric and then broadcasts this information to all other processes. Each process then constructs the array of values for all nodes in the right order, and the processes collectively call cube_write_all_sev_rows_of_doubles(). Because all processes created the cube_t struct, they must also all call cube_free(). The new Score-P sequence diagram is shown in figure 20.
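The calculation of the new constructor arguments can be sketched as follows (an illustration with our own names, not the Score-P source).

#include <mpi.h>
#include <stdlib.h>

/* Illustrative only: rank 0 gathers the thread count of every rank, forms a
 * prefix sum as the per-rank offset, and distributes the results. */
void
compute_location_layout( int n_my_threads, int* total_threads, int* my_offset )
{
    int rank, size;
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    int* counts  = ( rank == 0 ) ? malloc( size * sizeof( int ) ) : NULL;
    int* offsets = malloc( size * sizeof( int ) );

    MPI_Gather( &n_my_threads, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD );

    int total = 0;
    if ( rank == 0 )
    {
        for ( int i = 0; i < size; ++i )
        {
            offsets[ i ] = total;    /* number of locations of all lower ranks */
            total       += counts[ i ];
        }
    }
    /* pass the layout to all ranks, as described above */
    MPI_Bcast( &total,  1,    MPI_INT, 0, MPI_COMM_WORLD );
    MPI_Bcast( offsets, size, MPI_INT, 0, MPI_COMM_WORLD );

    *total_threads = total;
    *my_offset     = offsets[ rank ];
    free( counts );
    free( offsets );
}

The results, together with the rank and the communicator size, would then be passed as the world_rank, world_size, total_threads_number, threads_number and threads_offset arguments of cube_mpi_create().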

Figure 20: New Score-P sequence diagram
The figure shows the time sequence diagram of Score-P using the new version of the CubeWriter library. One should compare this with the old diagram in figure 15.

7 Results and discussion

7.1 System

The new library was tested on the supercomputer JURECA [19] at Forschungszentrum Jülich. JURECA is equipped with 1872 compute nodes; each compute node contains two Intel Xeon E5-2680 v3 Haswell 2.5 GHz multi-core processors with 12 cores each. Additionally, 75 compute nodes are equipped with two NVIDIA K80 GPUs (four visible devices per node). The system memory is based on the DDR4 memory technology: 1605 compute nodes have a memory of 128 GiB, 128 nodes a memory of 256 GiB, and the remaining nodes a memory of 512 GiB. Altogether, this sums up to 45 216 CPU cores with a peak performance of 1.8 petaflops.

JURECA uses a combination of Slurm (Simple Linux Utility for Resource Management) and ParaStation for managing the workload on the available resources. Users submit batch applications (shell scripts) to send jobs to the compute nodes. A simple example is:

#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=24
#SBATCH --output=mpi-out.%j
#SBATCH --error=mpi-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=batch

srun ./prototype_new.c.exe

In this example, the file prototype_new.c.exe is executed on 4 nodes with 96 MPI tasks (24 per node). mpi-out.%j is the name of the file for the output stream, where %j is the job ID (similarly for the error stream). The default batch partition is used and the maximum execution time is limited to 15 minutes. JURECA has a 100 GiB per second storage connection to the Jülich Storage Cluster (JUST), which serves as the central GPFS (General Parallel File System) fileserver for the supercomputing systems (e.g. JURECA, JUWELS, QPACE3).

7.2 Performance measurement with a prototype

7.2.1 Prototype

To analyze the speed of the new CubeWriter library, we write a prototype of the library usage. We do not measure the performance of any application; instead, we mimic the measurement system by defining our own random cube structures and assigning random severity values to the Cube, and we call the appropriate functions of the CubeWriter library to write these values into the file. By constructing a prototype, we have more control over the metrics and the call tree and can construct an arbitrary Cube4 file.

We define seven metrics, two inclusive and five exclusive, to cover both metric types. Furthermore, two of them store values of type double, three of type uint64_t and two of type tau_atomic. The first, second and sixth metrics are dense, the others sparse: the third and fourth metrics include 5% of the call nodes, the fifth and seventh 20%.

metric | datatype   | format
1      | double     | dense
2      | uint64_t   | dense
3      | uint64_t   | sparse: 5%
4      | uint64_t   | sparse: 5%
5      | double     | sparse: 20%
6      | tau_atomic | dense
7      | tau_atomic | sparse: 20%

Table 4: Metrics in the prototype
The table shows the metrics defined in the prototype.

We expect the writing time of the dense metrics to be the largest, because their metric data files are the biggest. To verify that writing behaves correctly for arbitrary system trees, the number of threads per MPI rank is not the same for all ranks; we set it to

    N_{t_r} = (r mod N_max) + 1,

where r is the MPI rank and N_max a constant. This means that rank 0 has one thread, rank 1 two threads, and so on, until rank N_max again starts with one thread. The size of the files scales approximately linearly with the number of MPI processes when that number is large. The call tree is constructed in a pseudo-random way, which ensures that it stays the same across different executions of the prototype. Each call node is assigned between 1 and 13 children, and the names of the individual functions are taken from a predefined list of names.

Apart from the prototype utilizing the new library, an identical prototype is also written for the old library for comparison. Both are written in such a way that they exactly imitate Score-P: each MPI process first

assigns random severity values to its local threads. Then, in the case of the new library, each process calls cube_write_all_sev_rows_of_doubles(); in the old case, MPI_Gatherv() gathers all severity values of one call path on the root process before cube_write_sev_row_of_doubles() is called for each call path. The produced Cube4 files are identical. We measure the timings with calls to MPI_Wtime(), which are placed at the beginning, between the different metrics and at the end.

7.2.2 Results

In order to see whether the prototype behaves correctly, we first check the writing times of the individual files, shown in figure 21. The plots have different vertical scales.

Figure 21: Writing time of different files in the tar archive
The figure shows a stacked column plot of the average times for the individual metrics and the anchor file.

We can see that most time is spent in the first, second and sixth metric, which is expected because these metrics are dense. In addition, the sixth metric takes more time than the other two, because it contains severity values of type tau_atomic, which is bigger than double or uint64_t. The times for the sparse metrics, the anchor file and the definition of the Cube structures are small.

To better see the overall results, we plot the timings on a log-log scale. Apart from the overall run time of the prototype, we also plot the part of that time that was actually spent in the CubeWriter library. In the old case, this time does not include gathering the values on the root process; in the new case, it is the

time after each process has already gathered the values of all call paths into a single array. The prototype was executed n = 8 times; the results can be seen in figure 22.

Figure 22: Execution time
The figure shows the overall prototype execution time when using the old and the new CubeWriter library. In addition, the plot shows how much of that time was spent in the CubeWriter library.

The plot clearly shows that the new library with the MPI I/O implementation is faster than the old version. When the number of processes is small, the writing time is comparable to that of the old library, because only one I/O node is used and the communication between processes is small. We also notice that on a small scale the results are more inconsistent due to the very small times, but we are not interested in these cases anyway. As the number of processes grows, the difference becomes larger.

The difference is caused by two factors, which are greatly improved in the new algorithm. Firstly, in the new algorithm, the processes do not have to communicate before calling the CubeWriter functions. We see that the difference between the overall run time and the CubeWriter time becomes very large in the old case, especially when the number of processes grows. This is caused by the large number of MPI_Gatherv() calls, which become more expensive when the MPI communicator is big (close to 90% of the entire runtime is spent in MPI_Gatherv() when the number of processes exceeds 1024). In the new case, this routine is

no longer necessary; we only have to combine the per-call-path arrays into one array, which is much faster. This is also the reason why the difference between the CubeWriter time and the overall run time is close to zero in the new algorithm: the prototype spends almost its entire time in the CubeWriter library.

In figure 23, we see the overall writing bandwidth achieved by both libraries. The speeds were calculated from the overall time and the sizes of the Cube4 archives. We see that the old algorithm becomes much slower as the number of MPI processes grows, because ever more values have to be gathered, whereas the new algorithm scales much better.

Figure 23: Overall prototype writing speed
The figure shows the overall bandwidth of the prototype writing. The performance becomes very poor in the old case due to the large amount of communication between processes.

The second part of the speed-up is caused by the parallel writing itself. This cannot be clearly seen in the previous plots, so we perform an additional measurement: we measure how much time CubeWriter spends writing just the metric data files of the individual metrics (without tar headers and metric index files). We exclude the time for gathering values in the old case and measure only the time spent inside the CubeWriter library. To avoid small file sizes, we limit ourselves to the dense metrics. We then calculate the writing speed. The results are shown in figure 24.

Figure 24: Writing speed of dense metrics
The figure shows the writing speed for the dense metrics.

We see that the difference between the POSIX file writing and the MPI file writing becomes significant when the number of processes is big. When it is small, the POSIX implementation is a bit faster, which is a result of the MPI communication overhead. Apart from the implementation, the performance of I/O operations (both POSIX and MPI) depends heavily on the hardware. We also notice that the errors are relatively big in both cases, which is a result of the many factors that affect the performance of I/O on supercomputers. Probably the main reason is that the supercomputer (and its input/output nodes) is always shared with other users. In the case of 4096 processes, we allocated 171 nodes on JURECA, which is slightly less than 10% of the machine, and we have no information about the jobs of other users. The speed of writing also depends on which compute nodes are used and on all intermediate layers between the compute nodes and the file system. JURECA's 1884 compute nodes are distributed over 34 racks, and there are two gateway switches for storage connectivity that connect the nodes in a 'fat tree'; the behavior may differ depending on whether all I/O goes through one switch or through both. On JURECA, we have no control over which compute nodes are assigned to our job. In addition, both versions of writing are buffered, which means that the writing to the file may not be executed immediately. A complete understanding of JURECA's I/O behavior unfortunately exceeds the scope of this thesis.

One more test was performed with the prototype to see how the writing time depends on the number of call nodes in the application. This time, we always use 256 MPI processes but change the size of the call tree. Due to the limited computing resources for this project, this test was performed with only one run. The results are shown in figure 25: the file size grows linearly with the number of call nodes, and so does the writing time.

Figure 25: Overall writing time for different call tree sizes
Overall writing time for a test case with 256 MPI processes and different call tree sizes.

7.3 Performance measurement with CP2K

The tests with the prototype already show a clear improvement, but we also want to test the changes to Score-P by measuring the performance of a real-world application and then using the new CubeWriter library to write the performance report. For that reason, we use CP2K [20], an open-source software package for molecular dynamics. CP2K can perform a wide range of simulations on the atomic scale, covering solid state, liquid, crystal and biological systems. This software was chosen because its applications produce a very big call tree. It is written in the Fortran programming language and can be parallelized with MPI and OpenMP.

We perform tests on the benchmark H2O-64, which calculates molecular dynamics using the Born-Oppenheimer approach. The system contains 64 water molecules in a 12.4 Å cell. The benchmark has a call tree of around 370 000 call nodes. The measurements were performed by Score-P's internal runtime measurement system.

We performed tests with four configurations. '1 thread/process' stands for pure MPI parallelization, where one location corresponds to one MPI rank. In the case of '6 threads/process', we use a combination of 6 OpenMP threads per MPI process (similarly for 12 and 24 threads). For all configurations, the number of locations and thus the file size is the same, but the number of MPI processes, i.e. the number of writers, changes. The application was executed three times. The results can be seen in figure 26.

Figure 26: Writing time for H2O-64 benchmark
The figure shows the time needed to write the performance report of the H2O-64 benchmark.

In all cases, the new library performs much better than the old one. We also notice a difference between the configurations: in both cases, fewer MPI ranks lead to faster writing. In the case of serial writing, this happens because fewer processes participate in the communication operations. In the case of the new library, the reason is less clear. A possible explanation is that the file size is still too small to exploit the better parallelization through MPI: when 1 thread/process is used, each rank writes just one severity value per call path at non-contiguous locations (24 values in the case of 24 threads/process), which is very little, so fewer writers use their I/O nodes more efficiently.

Dividing the size of the Cube files by the writing time gives us the overall bandwidth, which can be seen in figure 27. Again, with more locations per MPI process, the performance grows significantly.

Figure 27: Overall writing speed for H2O-64 benchmark
Overall writing speed of Score-P for the H2O-64 benchmark.

7.4 Discussion

The results shown above demonstrate that the new parallel algorithm is a much better alternative to the old one. The tests with the prototype showed that the new algorithm works in all cases, regardless of the metrics, the call tree or the system tree. The main reason for the higher performance is the removal of the communication between processes before the severity values are written. The tests with the H2O-64 benchmark from CP2K showed that the new library works as expected and can already be used for real-world applications.

It must be noted that the new implementation also has some drawbacks. Its dependence on the MPI interface is not suitable for all situations. For example, Score-P can be installed without MPI, which is useful when analyzing the performance of pure OpenMP applications or of applications that use other parallel libraries (SHMEM, OpenCL, CUDA). In such cases, the user would not only need to have MPI installed, but the new algorithm would also not offer any better performance. Another thing the new library lacks is the ability to write compressed data. In the current version of CubeWriter, compression is supported; if it is enabled, the library compresses the severity values of every call path before writing them to the file. This leads to much smaller metric data files, but it also breaks their internal repetitive structure and is therefore not possible in the new algorithm.

Despite that, the new algorithm proves to be the right answer to the slow writing bandwidth. In the majority of HPC applications, the above problems are not an issue, and the advantage of faster writing outweighs the disadvantages. This should encourage the Score-P project leaders to incorporate it into future development.

8 Conclusion

The size of an application's performance report, generated by performance measurement programs, depends heavily on the system and the call tree of the application. The objective of this thesis was to rewrite the CubeWriter library, the part of the Cube software that writes performance reports. We proposed a new algorithm in which the processes write the file together in a parallel manner.

The new algorithm was implemented and tested with various prototype runs and a CP2K benchmark. The results have shown that the new algorithm significantly outperforms the old one, especially at large scale. The main reason for this is that the processes do not have to gather the measured values on one process.

Although the current implementation is a big step forward, there are further ideas for the development of the CubeWriter library. One of the reasons for choosing the old algorithm was its portability to non-MPI environments. Although this is not possible with the new version, one could rewrite the library in a way that removes this dependency. One way to do this is to implement wrapper structs for the MPI routines; these structs would 'hide' the MPI environment, so that an installation without MPI would be possible, and the wrapper functions would then map the MPI functions back to the old algorithm and the POSIX implementation. Another way of overcoming the same problem is to use callback functions to Score-P, which already provides its own internal wrapper functions that enable it to work without MPI.

The CubeWriter library is used by both Score-P and Scalasca. Currently, only Score-P was reconfigured to use the new library, but the same should be done for Scalasca. Because Scalasca produces more metrics than Score-P, the speed-up would be even bigger in that case. After this is done, the new library could go into the production stage and one day be part of a stable release.

References

[1] Wikipedia contributors. (2018, July 8). Cray-1. In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Cray-1&oldid=849334844

[2] Sterling, T., & Becker, D. J., & Savarese, D., & Dorband, J. E., & Ranawake, U. A., & Packer, C. V. (1995). BEOWULF: A Parallel Workstation for Scientific Computation. Proceedings, International Conference on Parallel Processing 95.

[3] List June 2018. (2018, June 25). Top 500. Retrieved from https://www.top500.org/lists/2018/06/

[4] Espinosa, A., & Margalef, T., & Luque, A. (1998). Automatic Performance Evaluation of Parallel Programs. Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing - PDP '98. pp. 43-49. doi:10.1109/EMPDP.1998.647178

[5] Adhianto, L., & Banerjee, S., & Fagan, M., & Krentel, M., & Marin, G., & Mellor-Crummey, J. M., & Tallent, N. R. (2009, January). HPCTOOLKIT: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22. doi:10.1002/cpe.1553

[6] Benedict, S., & Petkov, V., & Gerndt, M. (2009). PERISCOPE: An online-based distributed performance analysis tool. Proceedings of the 3rd International Workshop on Parallel Tools for High Performance Computing 2009. pp. 1-16. doi:10.1007/978-3-642-11261-4_1

[7] Pillet, V., & Labarta, J., & Cortes, T., & Girona, S. (1995, March). PARAVER: A tool to visualize and analyze parallel code. WoTUG-18. 44.

[8] Barcelona Supercomputing Centre. (n.d.). Dimemas: predict parallel performance using a single cpu machine. Tools. Retrieved from https://tools.bsc.es/dimemas

[9] Geimer, M., & Wolf, F., & Wylie, B. J. N., & Abraham, E., & Becker, D., & Mohr, B. (2010). The Scalasca performance toolset architecture. Concurrency and Computation: Practice & Experience - Scalable Tools for High-End Computing, Volume 22, Issue 6. pp. 702-719. doi:10.1002/cpe.v22:6

[10] Knüpfer, A., & Feld, C., & Mey, D., & Biersdorff, S., & Diethelm, K., & Eschweiler, D., & Geimer, M., & Gerndt, M., & Lorenz, D., & Malony, A., & Nagel, W., & Oleynik, Y., & Philippen, P., & Saviankou, P., & Schmidl, D., & Shende, S., & Tschüter, R., & Wagner, M., & Wesarg, B., & Wolf, F. (2012). Score-P: A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir. Tools for High Performance Computing 2011, Chapter 7. pp. 79-91. doi:10.1007/978-3-642-31476-6_7

[11] Scalasca development team. (n.d.). Performance properties. Retrieved from https://apps.fz-juelich.de/scalasca/releases/scalasca/2.4/help/scalasca_patterns-2.4.html

[12] Message Passing Interface Forum. (1994). MPI: A Message-Passing Interface Standard. Technical Report. University of Tennessee, Knoxville, TN, USA

[13] Thakur, R., & Gropp, W., & Lusk, E. (1998). A Case for Using MPI’s Derived Datatypes to Improve I/O Performance. SC ’98: Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, Orlando, FL, USA. pp. 1. doi: 10.1109/SC.1998.10006

[14] Luu, H., & Winslett, M., & Gropp, W., & Ross, R. B., & Carns, P. H., & Harms, K., & Prabhat, & Byna, S., & Yao, Y. (2015). A Multiplatform Study of I/O Behavior on Petascale Supercomputers. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15). pp. 33-44. doi:10.1145/2749246.2749269

[15] Saviankou, P., & Knobloch, M., & Visser, A., & Mohr, B. (2015). Cube v4: From Performance Report Explorer to Performance Analysis Tool. International Conference On Computational Science, ICCS 2015, Reykjavík, Iceland, 1 Jun 2015 - 3 Jun 2015. Procedia Computer Science 51. pp. 1343-1352. doi:10.1016/j.procs.2015.05.320

[16] Scalasca development team. (2018, May 4). CubeGUI 4.4 — User Guide. Retrieved from https://apps.fz-juelich.de/scalasca/releases/cube/4.4/docs/CubeUserGuide.pdf

[17] Geimer, M., & Saviankou, P., & Strube, A., & Szebenyi, Z., & Wolf, F., & Wylie, B. J. N. (2010). Further Improving the Scalability of the Scalasca Toolset. PARA'10: Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume 2. pp. 463-473. doi:10.1007/978-3-642-28145-7_45

[18] Wikipedia contributors. (2018, August 3). Tree traversal. In Wikipedia, The Free Encyclopedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Tree_traversal&oldid=853310883

[19] Jülich Supercomputing Centre. (2016). JURECA: General purpose supercomputer at Jülich Supercomputing Centre. Journal of large-scale research facilities, 2, A62. doi:10.17815/jlsrf-4-121-1

[20] The CP2K developers group. (2013). CP2K version 6.1. CP2K is freely available from https://www.cp2k.org/

[21] Cube developer community. (2018, May 7). CubeW: High performance C Writer library, Version 4.4. doi: 10.5281/zenodo.1248061 (Source codes are freely available from https://zenodo.org/record/1248061)

A Appendix - source code

Here, the rewritten source code for Score-P and the new CubeWriter library is given. We include only those files that had to be modified or added. Furthermore, to help the reader find the relevant parts more easily, only those functions are shown that are tightly connected to the new algorithm (the parts that were explained in chapter 6). Only the .c files are shown; the .h header files are omitted. Since the CubeWriter library is open-source software, all source files of the released version (with the old algorithm) can be downloaded for free (see [21]).

A.1 cubew_cube.c

/* Creates and returns a data structure cube_t */
/* mpi version */
cube_t*
cube_mpi_create( char* cube_name, enum CubeFlavours_t cv,
                 int world_rank, int world_size,
                 int total_threads_number, int threads_number, int threads_offset,
                 enum bool_t compressed )
{
    if ( cubew_initialized() == 0 ) {
        cubew_init_allocs( NULL, NULL, NULL );
    }

    cubew_compressed = CUBE_FALSE;
#if defined( BACKEND_CUBE_COMPRESSED ) || defined( FRONTEND_CUBE_COMPRESSED )
    // UTILS_WARNING( "Compression not possible in MPI writing. Writing uncompressed.\n" );
#endif

    cube_t* this = NULL;
    cubew_trace  = ( getenv( "CUBEW_TRACE" ) != NULL );
    if ( cubew_trace ) {
        UTILS_WARNING( "CUBEW_TRACE=%d\n", cubew_trace );
    }
    /* allocate memory for cube */
    ALLOC( this, 1, cube_t, MEMORY_TRACING_PREFIX "Allocate cube_t" );
    if ( this == NULL ) {
        return NULL;
    }
    /* construct dynamic arrays */
    cube_construct_arrays( this );
    this->first_call         = CUBE_TRUE;
    this->locked_for_writing = CUBE_FALSE;
    this->cube_flavour       = cv;
    this->metrics_title      = NULL;
    this->calltree_title     = NULL;
    this->systemtree_title   = NULL;
    this->system_tree_writer = cube_system_tree_writer_create();

    this->sev_flag            = 1;
    this->compressed          = compressed;
    this->cubename            = cubew_strdup( cube_name );
    this->size_of_anchor_file = 1;

    /* information for MPI */
    this->world_rank = world_rank;
    this->world_size = world_size;

    /* information about system locations for proper initialization of filetype and displacements */
    this->total_number_of_threads = total_threads_number;
    this->number_of_my_threads    = threads_number;
    this->thread_offset           = threads_offset;

    /* construct a parallel metric writer */
    this->parallel_writer = parallel_metric_writer_create( NULL );
    parallel_metric_writer_init( this->parallel_writer,
                                 this->total_number_of_threads,
                                 this->number_of_my_threads );

    /* create tar layout and MPI file */
    this->layout = cube_mpi_writing_start( this->cubename, this->cube_flavour );
    return this;
}

/**
 * Inform a given metric about cnodes which have non-zero data.
 * The metric transforms the bitstring according to its enumeration of cnodes.
 */
void
cube_set_known_cnodes_for_metric( cube_t* this, cube_metric* metric, char* known_cnodes )
{
    if ( known_cnodes == 0 ) {
        UTILS_FATAL( "Failed to set a bit vector of known cnodes. Received pointer is zero.\n" );
    }
    uint64_t n_flat_locations = cube_get_system_tree_information( this )->number_locations;
    cube_metric_setup_for_writing( metric, this->cnd_ar, this->rcnd_ar,
                                   this->locs_ar->size + n_flat_locations );
    cube_metric_set_known_cnodes( metric, known_cnodes );
}

/**
 * Returns an array of defined call nodes for a particular metric
 */
carray*
cube_get_cnodes_for_metric( cube_t* this, cube_metric* metric )
{
    carray* enm = cube_metric_return_enumeration( metric );
    if ( enm == NULL ) {
        uint64_t n_flat_locations = cube_get_system_tree_information( this )->number_locations;
        cube_metric_setup_for_writing( metric, this->cnd_ar, this->rcnd_ar,
                                       this->locs_ar->size + n_flat_locations );
    }

    enm = cube_metric_return_enumeration( metric );

    return enm;
}

void
cube_write_all_sev_rows_of_doubles( cube_t* this, cube_metric* met, double* sevs )
{
    if ( this->cube_flavour == CUBE_SLAVE ) {
        return; /* CUBE_SLAVE not yet supported */
    }

    /* prepare metrics for writing */
    cube_prepare_metrics_for_writing( this );

    /* get number of call paths */
    int n_callpaths;
    if ( this->cube_flavour == CUBE_MASTER ) {
        if ( met->metric_format == CUBE_INDEX_FORMAT_DENSE ) {
            n_callpaths = met->ncn;
        }
        else if ( met->metric_format == CUBE_INDEX_FORMAT_SPARSE ) {
            n_callpaths = met->number_of_written_cnodes;
        }
    }

    /* Broadcast number of call paths */
    MPI_Bcast( &n_callpaths, 1, MPI_INT, 0, MPI_COMM_WORLD );

    /* set up etype & filetype */
    parallel_metric_writer_prepare_for_writing( this->parallel_writer, CUBE_DATA_TYPE_DOUBLE, n_callpaths );

    /* write tar header */
    if ( this->cube_flavour == CUBE_MASTER ) {
        // Writing of tar header file for the metric
        cube_report_before_mpi_metric_writing( this->layout, met, this->parallel_writer );
    }

    /* Broadcast file start position */
    MPI_Bcast( &this->layout->file_start_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );

    /* calculates displacement of a view */
    calculate_displacements( this->parallel_writer, this->layout->file_start_position, this->thread_offset );

    /* write sev values */
    cube_metric_write_all_rows_of_doubles( this->cube_flavour, this->parallel_writer, met,
                                           this->layout->tar, sevs );

    /* finish writing of a metric */
    if ( this->cube_flavour == CUBE_MASTER ) {
        cube_report_after_mpi_metric_writing( this->layout, this->parallel_writer );
    }

    /* Broadcast position of next metric tar header */
    MPI_Bcast( &this->layout->header_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );
}

void
cube_write_all_sev_rows_of_uint64( cube_t* this, cube_metric* met, uint64_t* sevs )
{
    if ( this->cube_flavour == CUBE_SLAVE ) {
        return; /* CUBE_SLAVE doesn't write anything */
    }

    /* prepare metrics for writing */
    cube_prepare_metrics_for_writing( this );

    /* get number of call paths */
    int n_callpaths;
    if ( this->cube_flavour == CUBE_MASTER ) {
        if ( met->metric_format == CUBE_INDEX_FORMAT_DENSE ) {
            n_callpaths = met->ncn;
        }
        else if ( met->metric_format == CUBE_INDEX_FORMAT_SPARSE ) {
            n_callpaths = met->number_of_written_cnodes;
        }
    }
    /* Broadcast number of call paths */
    MPI_Bcast( &n_callpaths, 1, MPI_INT, 0, MPI_COMM_WORLD );

    /* set up etype & filetype */
    parallel_metric_writer_prepare_for_writing( this->parallel_writer, CUBE_DATA_TYPE_UINT64, n_callpaths );

    /* write tar header */
    if ( this->cube_flavour == CUBE_MASTER ) {
        // Writing of tar header file for the metric
        cube_report_before_mpi_metric_writing( this->layout, met, this->parallel_writer );
    }

    /* Broadcast file start position */
    MPI_Bcast( &this->layout->file_start_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );

    /* calculates displacement of a view */
    calculate_displacements( this->parallel_writer, this->layout->file_start_position, this->thread_offset );

    /* write sev values */
    cube_metric_write_all_rows_of_uint64( this->cube_flavour, this->parallel_writer, met,
                                          this->layout->tar, sevs );

    /* finish writing of a metric */
    if ( this->cube_flavour == CUBE_MASTER ) {
        cube_report_after_mpi_metric_writing( this->layout, this->parallel_writer );
    }

    /* Broadcast position of next metric tar header */
    MPI_Bcast( &this->layout->header_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );
}

void
cube_write_all_sev_rows_of_cube_type_tau_atomic( cube_t* this, cube_metric* met, cube_type_tau_atomic* sevs )
{
    if ( this->cube_flavour == CUBE_SLAVE )
    {
        return; /* CUBE_SLAVE doesn't write anything */
    }

    /* prepare metrics for writing */
    cube_prepare_metrics_for_writing( this );

    /* get number of call paths */
    int n_callpaths;
    if ( this->cube_flavour == CUBE_MASTER )
    {
        if ( met->metric_format == CUBE_INDEX_FORMAT_DENSE )
        {
            n_callpaths = met->ncn;
        }
        else if ( met->metric_format == CUBE_INDEX_FORMAT_SPARSE )
        {
            n_callpaths = met->number_of_written_cnodes;
        }
    }

    /* Broadcast number of call paths */
    MPI_Bcast( &n_callpaths, 1, MPI_INT, 0, MPI_COMM_WORLD );

    /* set up etype and filetype */
    parallel_metric_writer_prepare_for_writing( this->parallel_writer, CUBE_DATA_TYPE_TAU_ATOMIC, n_callpaths );

    /* write tar header */
    if ( this->cube_flavour == CUBE_MASTER )
    {
        /* Writing of the tar header for the metric data file */
        cube_report_before_mpi_metric_writing( this->layout, met, this->parallel_writer );
    }

    /* Broadcast file start position */
    MPI_Bcast( &this->layout->file_start_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );

    /* calculate displacement of the file view */
    calculate_displacements( this->parallel_writer, this->layout->file_start_position, this->thread_offset );

    /* write severity values */
    cube_metric_write_all_rows_of_cube_type_tau_atomic( this->cube_flavour, this->parallel_writer, met, this->layout->tar, sevs );

    /* finish writing of the metric */
    if ( this->cube_flavour == CUBE_MASTER )
    {
        cube_report_after_mpi_metric_writing( this->layout, this->parallel_writer );
    }

    /* Broadcast position of the next metric tar header */
    MPI_Bcast( &this->layout->header_position, 1, MPI_UINT64_T, 0, MPI_COMM_WORLD );
}

A.2 cubew_metric.c

/* ====================== MPI writing calls ====================== */
/* Will be called by MASTER and WRITER.                            */
/* Assumes that the data row contains values for all call paths in the right order. */
void
cube_metric_write_all_rows( enum CubeFlavours_t     flavour,
                            parallel_metric_writer* writer,
                            cube_metric*            metric,
                            MPI_File*               file,
                            void*                   data_row )
{
    if ( flavour == CUBE_MASTER )
    {
        if ( metric->metric_type == CUBE_METRIC_POSTDERIVED ||
             metric->metric_type == CUBE_METRIC_PREDERIVED_INCLUSIVE ||
             metric->metric_type == CUBE_METRIC_PREDERIVED_EXCLUSIVE )
        {
            return; /* derived metrics do not store any data */
        }
    }

    /* Write metric data-file marker */
    if ( flavour == CUBE_MASTER )
    {
        char datafile_marker[ CUBE_DATAFILE_MARKER_SIZE ] = CUBE_DATAFILE_MARKER;
        MPI_File_write( *file, datafile_marker, CUBE_DATAFILE_MARKER_SIZE, MPI_CHAR, MPI_STATUS_IGNORE );
    }
    MPI_Barrier( MPI_COMM_WORLD );

    /* Getting the displacement */
    MPI_Offset disp = writer->writing_displacement;
    MPI_Type_commit( &writer->etype );
    MPI_Type_commit( &writer->filetype );

    /* Setting the correct view */
    MPI_File_set_view( *file, disp, writer->etype, writer->filetype, "native", MPI_INFO_NULL );

    MPI_Status stat;
    int        err;
    double     ts = MPI_Wtime();

    /* Write in parallel */
    err = MPI_File_write_all( *file, data_row, writer->n_written_cnodes * writer->nthreads, writer->etype, &stat );
    if ( err )
    {
        UTILS_WARNING( "[CUBEW Warning]: Parallel metric writing not successful.\n" );
    }
    double te = MPI_Wtime();
    if ( flavour == CUBE_MASTER )
    {
        /* printf( "MPI_File_write_all %f\n", te - ts ); */
    }

    /* Setting the view back to normal */
    disp = writer->file_displacement + CUBE_DATAFILE_MARKER_SIZE + writer->raw_data_size;
    MPI_File_set_view( *file, disp, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL );
    MPI_Barrier( MPI_COMM_WORLD );
}

/**
 * Closes the data file and writes the index file.
 */
void
cube_metric_mpi_finish( cube_metric* this, parallel_metric_writer* writer )
{
    if ( this->im_finished == CUBE_TRUE )
    {
        return;
    }
    if ( this->data_file == NULL )
    {
        this->im_finished = CUBE_TRUE;
        return;
    }

    cube_report_mpi_metric_data_finish( this->layout, this, this->data_file,
                                        writer->raw_data_size + CUBE_DATAFILE_MARKER_SIZE );

    /* First we calculate the size of the index file */
    uint32_t size_index         = 0;
    uint64_t size_of_index_file = 0;

    size_of_index_file += CUBE_INDEXFILE_MARKER_SIZE + sizeof( metric_header );

    if ( this->metric_format == CUBE_INDEX_FORMAT_SPARSE )
    {
        if ( this->layout->cube_flavour == CUBE_MASTER )
        {
            if ( this->known_cnodes != 0 )
            {
                size_index = cube_metric_size_of_index( this->known_cnodes,
                                                        ( unsigned )ceil( ( double )this->ncn / 8. ) );
                size_of_index_file += ( size_index + 1 ) * sizeof( uint32_t );
            }
        }
    }

    /* We write the tar header for the index file */
    MPI_File* ifile = cube_report_mpi_metric_index_start( this->layout, this, size_of_index_file );

    /* Write index header */
    if ( this->layout->cube_flavour == CUBE_MASTER )
    {
        metric_header mheader;

        mheader.named.endian        = 1;
        mheader.named.version       = 0;
        mheader.named.metric_format = ( uint8_t )( this->metric_format );
        char* marker                = CUBE_INDEXFILE_MARKER;

        MPI_Status status;
        MPI_File_write( *ifile, marker, 11, MPI_CHAR, &status );
        MPI_File_write( *ifile, &( mheader.named.endian ), 1, MPI_UINT32_T, &status );
        MPI_File_write( *ifile, &( mheader.named.version ), 1, MPI_UINT16_T, &status );
        MPI_File_write( *ifile, &( mheader.named.metric_format ), 1, MPI_UINT8_T, &status );
    }

    /* In case of a sparse metric: write the indices of the written call paths */
    uint32_t* index_intofile = 0;

    if ( this->metric_format == CUBE_INDEX_FORMAT_SPARSE )
    {
        if ( this->known_cnodes != 0 )
        {
            index_intofile = ( uint32_t* )cube_metric_create_index( this->known_cnodes,
                                                                    ( unsigned )ceil( ( double )this->ncn / 8. ) );
        }

        if ( this->layout->cube_flavour == CUBE_MASTER )
        {
            MPI_Status status;
            MPI_File_write( *ifile, &size_index, 1, MPI_UINT32_T, &status );
            MPI_File_write( *ifile, index_intofile, size_index, MPI_UINT32_T, &status );
            CUBEW_FREE( index_intofile, MEMORY_TRACING_PREFIX "Release index_intofile" );
        }
    }

    /* Finish writing the index file */
    cube_report_mpi_metric_index_finish( this->layout, this, ifile, size_of_index_file );

    this->im_finished = CUBE_TRUE;
}

void
cube_metric_set_known_cnodes( cube_metric* metric, char* known_cnodes /* , unsigned size */ )
{
    metric->metric_format = ( known_cnodes == NULL ) ? CUBE_INDEX_FORMAT_DENSE : CUBE_INDEX_FORMAT_SPARSE;

    if ( known_cnodes == 0 )
    {
        UTILS_WARNING( "Failed to set a bit vector of known cnodes. Received pointer is zero.\n" );
    }
    CUBEW_FREE( metric->known_cnodes, MEMORY_TRACING_PREFIX "Release previous list of known cnodes" );

    known_cnodes         = cube_metric_bit_string_transformation( metric, known_cnodes );
    metric->known_cnodes = known_cnodes;

    metric->number_of_written_cnodes = cube_bit_count( known_cnodes, ( metric->ncn + 7 ) / 8 );
#if defined( BACKEND_CUBE_COMPRESSED ) || defined( FRONTEND_CUBE_COMPRESSED )
    if ( metric->compressed == CUBE_TRUE )
    {
        cube_metric_setup_subindex( metric );
    }
#endif /* HAVE_LIB_Z */
}

void
cube_metric_write_all_rows_of_doubles( enum CubeFlavours_t     flavour,
                                       parallel_metric_writer* writer,
                                       cube_metric*            metric,
                                       MPI_File*               file,
                                       double*                 data_row )
{
    /* cube_metric transforms ... always allocates new memory. If no transformation -> then copy */
    void* target_row = ( void* )cube_metric_transform_row_of_doubles( writer, data_row );

    cube_metric_write_all_rows( flavour, writer, metric, file, target_row );
    /* release allocated memory */
    CUBEW_FREE( target_row, MEMORY_TRACING_PREFIX "Release row of doubles" );
}

void
cube_metric_write_all_rows_of_uint64( enum CubeFlavours_t     flavour,
                                      parallel_metric_writer* writer,
                                      cube_metric*            metric,
                                      MPI_File*               file,
                                      uint64_t*               data_row )
{
    /* cube_metric transforms ... always allocates new memory. If no transformation -> then copy */
    void* target_row = ( void* )cube_metric_transform_row_of_uint64( writer, data_row );

    cube_metric_write_all_rows( flavour, writer, metric, file, target_row );
    /* release allocated memory */
    CUBEW_FREE( target_row, MEMORY_TRACING_PREFIX "Release row of uint64" );
}

void
cube_metric_write_all_rows_of_cube_type_tau_atomic( enum CubeFlavours_t     flavour,
                                                    parallel_metric_writer* writer,
                                                    cube_metric*            metric,
                                                    MPI_File*               file,
                                                    cube_type_tau_atomic*   data_row )
{
    if ( flavour == CUBE_MASTER )
    {
        if ( CUBE_DATA_TYPE_TAU_ATOMIC != metric->dtype_params->type )
        {
            return;
        }
    }
    cube_metric_write_all_rows( flavour, writer, metric, file, data_row );
}

A.3 cubew_parallel_metric_writer.c

/****************************************************************************
**  CUBE        http://www.scalasca.org/                                   **
*****************************************************************************
**  Copyright (c) 1998-2018                                                **
**  Forschungszentrum Juelich GmbH, Juelich Supercomputing Centre          **
**                                                                         **
**  This software may be modified and distributed under the terms of      **
**  a BSD-style license.  See the COPYING file in the base                **
**  directory for details.                                                 **
****************************************************************************/

/**
 * \file cubew_parallel_metric_writer.c
 * \brief Creates the types needed for parallel metric writing.
 */
#include "config.h"
#include <...>

#include "cubew_metric.h"
#include "cubew_parallel_metric_writer.h"
#include "cubew_services.h"
#include "cubew_vector.h"
#include "cubew_types.h"
#include "cubew_memory.h"

#define MEMORY_TRACING_PREFIX "[PARALLEL METRIC WRITER]"

/* Creates the parallel metric writer structure */
parallel_metric_writer*
parallel_metric_writer_create( parallel_metric_writer* writer )
{
    if ( writer == NULL )
    {
        ALLOC( writer, 1, parallel_metric_writer, MEMORY_TRACING_PREFIX "Allocate parallel metric writer" );
    }
    return writer;
}

/* Initializes the parallel metric writer fields */
void
parallel_metric_writer_init( parallel_metric_writer* this,
                             uint64_t                total_number_of_threads,
                             int                     number_of_my_threads )
{
    this->total_nthreads = total_number_of_threads;
    this->nthreads       = number_of_my_threads;

    /* we do not know the datatype of the first metric yet */
    this->datatype = CUBE_DATA_TYPE_UNKNOWN;

    /* setting etype and filetype to the default */
    this->etype    = MPI_BYTE;
    this->filetype = MPI_BYTE;

    this->writing_displacement = 0;
    this->n_written_cnodes     = 0;
    this->etype_size           = 0;
    this->extent               = 0;
    this->raw_data_size        = 0;
}

/* Constructs the right etype and filetype */
void
parallel_metric_writer_prepare_for_writing( parallel_metric_writer* this,
                                            int                     type,
                                            int                     number_of_written_cnodes )
{
    this->datatype = type;

    /*
     * Setting the right etype.
     * Score-P uses just UINT64_T, DOUBLE and TAU_ATOMIC.
     */
    switch ( type )
    {
        case CUBE_DATA_TYPE_INT64:
            this->etype = MPI_INT64_T;
            break;
        case CUBE_DATA_TYPE_UINT64:
            this->etype = MPI_UINT64_T;
            break;
        case CUBE_DATA_TYPE_INT32:
            this->etype = MPI_INT32_T;
            break;
        case CUBE_DATA_TYPE_UINT32:
            this->etype = MPI_UINT32_T;
            break;
        case CUBE_DATA_TYPE_INT16:
            this->etype = MPI_INT16_T;
            break;
        case CUBE_DATA_TYPE_UINT16:
            this->etype = MPI_UINT16_T;
            break;
        case CUBE_DATA_TYPE_INT8:
            this->etype = MPI_INT8_T;
            break;
        case CUBE_DATA_TYPE_UINT8:
            this->etype = MPI_UINT8_T;
            break;
        case CUBE_DATA_TYPE_DOUBLE:
        case CUBE_DATA_TYPE_MIN_DOUBLE:
        case CUBE_DATA_TYPE_MAX_DOUBLE:
            this->etype = MPI_DOUBLE;
            break;
        case CUBE_DATA_TYPE_TAU_ATOMIC:
            create_mpi_tau_atomic( &this->etype );
            break;
        default:
            this->etype = MPI_DOUBLE;
            break;
    }

    /* Commit the etype */
    MPI_Type_commit( &this->etype );
    MPI_Type_size( this->etype, &this->etype_size );

    /* number of call paths in the file */
    this->n_written_cnodes = number_of_written_cnodes;

    /* extent of the filetype */
    this->extent = this->total_nthreads * this->etype_size;

    MPI_Datatype tmp_contiguous;

    /* constructing the right filetype */
    MPI_Type_contiguous( this->nthreads, this->etype, &tmp_contiguous );
    MPI_Type_commit( &tmp_contiguous );
    MPI_Type_create_resized( tmp_contiguous, 0, this->extent, &this->filetype );
    MPI_Type_commit( &this->filetype );

    /* calculating the size of the data written in parallel */
    this->raw_data_size = this->total_nthreads * this->n_written_cnodes * this->etype_size;
}

/* Calculates the right displacements */
void
calculate_displacements( parallel_metric_writer* this,
                         uint64_t                file_start_position,
                         int                     first_thread_position )
{
    this->file_displacement    = file_start_position;
    this->writing_displacement = file_start_position + CUBE_DATAFILE_MARKER_SIZE
                                 + first_thread_position * this->etype_size;
}

/* Frees the allocated data */
void
parallel_metric_writer_free( parallel_metric_writer* this )
{
    /* Release parallel metric writer */
    CUBEW_FREE( this, MEMORY_TRACING_PREFIX "Release parallel metric writer" );
}
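The essential MPI calls behind the writer above can be condensed into a few lines. The following fragment is an illustrative sketch only, not part of the CubeWriter sources; the names fh, file_start_position, nthreads, total_nthreads, offset, n_callpaths and local_rows are assumptions introduced for the illustration. It mirrors how parallel_metric_writer_prepare_for_writing and cube_metric_write_all_rows combine a contiguous, resized filetype with a collective write so that every process deposits its block of each call-path row at the right place (the real writer additionally skips the data-file marker bytes in the displacement).

/* Illustrative sketch, not CubeWriter code: write n_callpaths rows of doubles,
 * where this process owns `nthreads` consecutive values of every row and the
 * rows of all processes together are `total_nthreads` values wide. */
#include <mpi.h>

static void
sketch_write_metric_rows( MPI_File fh, MPI_Offset file_start_position,
                          int nthreads, int total_nthreads, int offset,
                          int n_callpaths, const double* local_rows )
{
    MPI_Datatype tmp, filetype;
    int          etype_size;
    MPI_Type_size( MPI_DOUBLE, &etype_size );

    MPI_Type_contiguous( nthreads, MPI_DOUBLE, &tmp );                 /* my block of one row       */
    MPI_Type_create_resized( tmp, 0, ( MPI_Aint )total_nthreads * etype_size,
                             &filetype );                              /* stride = one full row     */
    MPI_Type_commit( &filetype );

    /* displacement skips the blocks of all lower-ranked locations */
    MPI_Offset disp = file_start_position + ( MPI_Offset )offset * etype_size;
    MPI_File_set_view( fh, disp, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL );

    /* collective write: n_callpaths blocks of nthreads doubles per process */
    MPI_File_write_all( fh, ( void* )local_rows, n_callpaths * nthreads, MPI_DOUBLE, MPI_STATUS_IGNORE );

    MPI_Type_free( &tmp );
    MPI_Type_free( &filetype );
}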

A.4 cubew_tar_writing.c

/* Initializes the tar writer and opens the file */
report_layout_writer*
cube_mpi_writing_start( char* cubename, enum CubeFlavours_t cf )
{
    report_layout_writer* tar_writer = ( tar_writer_t* )CUBEW_CALLOC( 1, sizeof( tar_writer_t ),
                                                                      MEMORY_TRACING_PREFIX "Allocate tar writer" );
    tar_writer->cubename = cubew_strdup( cubename );
    tar_writer->mode     = cubew_strdup( "0000600" );
    tar_writer->username = cubew_strdup( getenv( "USER" ) );
    if ( tar_writer->username == NULL )
    {
        tar_writer->username = cubew_strdup( getenv( "" ) );
    }
    if ( tar_writer->username == NULL )
    {
        tar_writer->username = cubew_strdup( "nouser" );
    }

    tar_writer->group = ( char* )CUBEW_CALLOC( 32, sizeof( char ), MEMORY_TRACING_PREFIX "Allocate group name" );
    strcpy( tar_writer->group, "users" );
    tar_writer->uid             = getuid();
    tar_writer->gid             = getgid();
    tar_writer->actual_tar_file = cube_get_tared_cube_name( cubename );

    tar_writer->actual_metric       = NULL;
    tar_writer->actual_tar_header   = NULL;
    tar_writer->header_position     = 0;
    tar_writer->file_start_position = 0;
    tar_writer->anchor_writing      = CUBE_FALSE;
    tar_writer->cube_flavour        = cf;
    tar_writer->tar                 = malloc( sizeof( MPI_File ) );
    tar_writer->tar_datatype        = malloc( sizeof( MPI_Datatype ) );

    /* We create a communicator of only the writers which open the file -- FOR FUTURE IMPLEMENTATION

       MPI_Group world_group;
       MPI_Comm_group( MPI_COMM_WORLD, &world_group );

       MPI_Group writers_group;
       MPI_Group_incl( world_group, writers_size, writer_ranks, &writers_group );

       MPI_Comm_create_group( MPI_COMM_WORLD, writers_group, 0, tar_writer->writers_communicator );

       tar_writer->writers_rank = -1;
       if ( MPI_COMM_NULL != *tar_writer->writers_communicator )
       {
           MPI_Comm_rank( *tar_writer->writers_communicator, &tar_writer->writers_rank );
       }
       MPI_Comm_size( *tar_writer->writers_communicator, &tar_writer->writers_size );

       MPI_Group_free( &world_group );
       MPI_Group_free( &writers_group );
     */

    /* MPI_COMM_WORLD opens the file */
    int rc;
    rc = MPI_File_open( MPI_COMM_WORLD, tar_writer->actual_tar_file,
                        MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, tar_writer->tar );
    if ( rc )
    {
        UTILS_WARNING( "Cannot open tared cube file %s.\n", tar_writer->actual_tar_file );
        perror( "The following error occurred" );
        UTILS_WARNING( "Return NULL.\n" );
    }
    MPI_File_set_view( *tar_writer->tar, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL );

    MPI_Barrier( MPI_COMM_WORLD );

    mpi_tar_create_header_type( tar_writer->tar_datatype );
    return tar_writer;
}

void
cube_report_before_mpi_metric_writing( report_layout_writer*   tar_writer,
                                       cube_metric*            met,
                                       parallel_metric_writer* writer )
{
    if ( tar_writer->cube_flavour == CUBE_SLAVE )
    {
        return;
    }

    met->data_file = cube_report_mpi_metric_data_start( tar_writer, met,
                                                        writer->raw_data_size + CUBE_DATAFILE_MARKER_SIZE );
    tar_writer->actual_metric = met;
}

void
cube_report_after_mpi_metric_writing( report_layout_writer* tar_writer, parallel_metric_writer* writer )
{
    cube_metric_mpi_finish( tar_writer->actual_metric, writer );
}

/* Writes the TAR header for a metric data file */
MPI_File*
cube_report_mpi_metric_data_start( report_layout_writer* tar_writer, cube_metric* met, uint64_t metric_size )
{
    if ( tar_writer == NULL )
    {
        UTILS_WARNING( "Non-standard run. Create faked tar writer with temp name of cube \"_NOFILE_\".\n" );
        tar_writer = cube_writing_start( "_NOFILE_", CUBE_MASTER );
    }

    char* dataname = cube_get_path_to_metric_data( tar_writer->cubename, met );

    MPI_Offset header_offset = tar_writer->header_position;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        MPI_Status status;
        if ( metric_size > MAX_FILESIZE_IN_TAR )
        {
            /* 1. tar header */
            char* paxname = calloc( 1, 11 + strlen( dataname ) );
            sprintf( paxname, "Paxheader/%s", dataname );
            tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, dataname, "x", TAR_BLOCKSIZE );

            MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );
            CUBEW_FREE( paxname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric pax name" );

            /* 2. extended pax header records */
            char* paxfile = calloc( 1, TAR_BLOCKSIZE );
            fill_pax_file( paxfile, metric_size );
            MPI_File_write( *tar_writer->tar, paxfile, TAR_BLOCKSIZE, MPI_CHAR, &status );
            CUBEW_FREE( paxfile, MEMORY_TRACING_PREFIX "Release report_layout_writer pax file" );

            /* 3. tar header; the size field gets overwritten by the pax record */
            tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, dataname, "0", 1234 );
            MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );

            /* Updates the file start position; the header size is 3 * 512 */
            tar_writer->file_start_position = tar_writer->header_position + 3 * TAR_BLOCKSIZE;
        }
        else
        {
            tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, dataname, "0", metric_size );
            MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );

            tar_writer->file_start_position = tar_writer->header_position + TAR_BLOCKSIZE;
        }
    }

    CUBEW_FREE( dataname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric data name" );
    return tar_writer->tar;
}

/* Writes the end of a metric data file */
void
cube_report_mpi_metric_data_finish( report_layout_writer* tar_writer, cube_metric* met, MPI_File* file, uint64_t size )
{
    cube_tar_mpi_file_finish( tar_writer, size );
}

/* Writes the TAR header for a metric index file */
MPI_File*
cube_report_mpi_metric_index_start( report_layout_writer* tar_writer, cube_metric* met, uint64_t index_size )
{
    MPI_Offset header_offset = tar_writer->header_position;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        char* indexname = cube_get_path_to_metric_index( tar_writer->cubename, met );
        tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, indexname, "0", index_size );
        CUBEW_FREE( indexname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric data name" );

        MPI_Status status;
        MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );

        /* Updates the file start position */
        tar_writer->file_start_position = tar_writer->header_position + sizeof( tar_gnu_header );
    }

    return tar_writer->tar;
}

/* Writes the end of a metric index file */
void
cube_report_mpi_metric_index_finish( report_layout_writer* tar_writer, cube_metric* met, MPI_File* file, uint64_t size )
{
    cube_tar_mpi_file_finish( tar_writer, size );
}

/* MPI calls for writing the tar layout */
/* Writes the TAR header for the anchor file */
MPI_File*
cube_report_mpi_anchor_start( report_layout_writer* tar_writer )
{
    MPI_Offset header_offset = tar_writer->header_position;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        char* anchorname = cube_get_path_to_anchor( tar_writer->cubename );
        /* Creates the header without the file size; we seek back afterwards */
        tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, anchorname, "0", 1234 );
        CUBEW_FREE( anchorname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric data name" );

        MPI_Status status;
        /* Writes the tar header */
        MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        /* Updates the file start position; assumes the anchor file is smaller than 8.6 GB */
        tar_writer->file_start_position = tar_writer->header_position + sizeof( tar_gnu_header );
    }

    tar_writer->anchor_writing = CUBE_TRUE;

    return tar_writer->tar;
}

/* Writes the end of the anchor file */
void
cube_report_mpi_anchor_finish( report_layout_writer* tar_writer, MPI_File* file, uint64_t anchor_start )
{
    MPI_Offset anchor_end;
    int        size;
    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        MPI_File_get_position( *file, &anchor_end );
    }

    if ( ( uint64_t )anchor_end - anchor_start > MAX_FILESIZE_IN_TAR )
    {
        UTILS_WARNING( "Size of anchor file more than 8589934591, tar corrupted" );
        perror( "The following error occurred" );
    }

    cube_tar_mpi_anchor_finish( tar_writer, ( uint64_t )anchor_end - anchor_start );
}

/* Writes two empty tar blocks marking the end of the tar archive */
void
cube_tar_mpi_finish( report_layout_writer* tar_writer )
{
    MPI_Offset header_offset = tar_writer->header_position;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        tar_empty_block* block = ( tar_empty_block* )CUBEW_CALLOC( 1, sizeof( tar_empty_block ),
                                                                   MEMORY_TRACING_PREFIX "Allocate tar block" );
        MPI_Status status;
        MPI_File_write( *tar_writer->tar, ( char* )block, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        MPI_File_write( *tar_writer->tar, ( char* )block, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        CUBEW_FREE( block, MEMORY_TRACING_PREFIX "Release block" );
    }
}

/* Creates a tar header with pax support */
tar_gnu_header*
cube_create_tar_header_with_pax_support( report_layout_writer* tar_writer, char* filename, char* typeflag, uint64_t size )
{
    tar_gnu_header* tar_header = ( tar_gnu_header* )CUBEW_CALLOC( 1, sizeof( tar_gnu_header ),
                                                                  MEMORY_TRACING_PREFIX "Allocate tar header" );
    memcpy( tar_header->name, filename, strlen( filename ) );

    memcpy( tar_header->mode, tar_writer->mode, strlen( tar_writer->mode ) );
    sprintf( tar_header->uid, "%7.7lo", ( unsigned long )( tar_writer->uid ) );
    sprintf( tar_header->gid, "%7.7lo", ( unsigned long )( tar_writer->gid ) );
    unsigned int mtime = time( NULL );
    sprintf( tar_header->mtime, "%11.11lo", ( unsigned long )mtime );

    /* Support for the pax header */
    memcpy( tar_header->typeflag, typeflag, 1 );

    memcpy( tar_header->uname, tar_writer->username, strlen( tar_writer->username ) );
    memcpy( tar_header->gname, tar_writer->group, strlen( tar_writer->group ) );

    char* magic = "ustar";
    memcpy( tar_header->magic, magic, strlen( magic ) );
    char* version = "00";
    memcpy( tar_header->version, version, strlen( version ) );

    cube_set_size_and_calculate_checksum( tar_header, size );

    return tar_header;
}
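The helper cube_set_size_and_calculate_checksum called above is not reproduced in this appendix. For orientation only, the following sketch shows how a standard POSIX tar header checksum is computed under the usual tar conventions; it is a generic illustration, not the CubeWriter implementation.

/* Generic sketch of a POSIX tar checksum: the 8-byte chksum field at offset 148
 * is treated as spaces, all 512 header bytes are summed as unsigned values, and
 * the result is stored as a zero-padded octal number followed by NUL and space. */
#include <stdio.h>
#include <string.h>

static void
sketch_tar_checksum( unsigned char header[ 512 ] )
{
    memset( header + 148, ' ', 8 );          /* checksum field counts as spaces */

    unsigned long sum = 0;
    for ( int i = 0; i < 512; i++ )
    {
        sum += header[ i ];
    }

    snprintf( ( char* )header + 148, 8, "%06lo", sum );
    header[ 155 ] = ' ';                     /* "dddddd\0 " layout of the field  */
}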

/* Writes the padding to a 512-byte boundary for a file in the tar archive */
void
cube_tar_mpi_file_finish( report_layout_writer* tar_writer, uint64_t size )
{
    uint64_t end_position = tar_writer->file_start_position + size;
    /* Calculating the difference to the next 512-byte boundary */
    uint64_t difference = ( size / sizeof( tar_empty_block ) + 1 ) * sizeof( tar_empty_block ) - size;

    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        char* tmp = ( char* )CUBEW_CALLOC( difference, sizeof( char ), MEMORY_TRACING_PREFIX "Allocate tail of a tar block" );
        MPI_Status stat;
        MPI_File_write( *tar_writer->tar, tmp, difference, MPI_CHAR, &stat );
        CUBEW_FREE( tmp, MEMORY_TRACING_PREFIX "Release tail in a tar block" );
    }

    CUBEW_FREE( tar_writer->actual_tar_header, MEMORY_TRACING_PREFIX "Release tar header" );

    tar_writer->actual_tar_header = NULL;
    tar_writer->header_position   = end_position + difference;
}

/* Updates the tar header of the anchor file (size, checksum) */
void
cube_tar_mpi_anchor_finish( report_layout_writer* tar_writer, uint64_t size )
{
    uint64_t end_position = tar_writer->file_start_position + size;
    /* Calculating the difference to the next 512-byte boundary */
    uint64_t difference = ( size / sizeof( tar_empty_block ) + 1 ) * sizeof( tar_empty_block ) - size;

    /* Writes the padding to 512 bytes plus two empty tar blocks */
    if ( tar_writer->cube_flavour == CUBE_MASTER )
    {
        char* tmp = ( char* )CUBEW_CALLOC( difference, sizeof( char ), MEMORY_TRACING_PREFIX "Allocate tail of a tar block" );
        MPI_File_write( *tar_writer->tar, tmp, difference, MPI_CHAR, MPI_STATUS_IGNORE );
        CUBEW_FREE( tmp, MEMORY_TRACING_PREFIX "Release tail in a tar block" );
        tar_empty_block* block = ( tar_empty_block* )CUBEW_CALLOC( 1, sizeof( tar_empty_block ),
                                                                   MEMORY_TRACING_PREFIX "Allocate tar block" );
        MPI_Status status;
        MPI_File_write( *tar_writer->tar, ( char* )block, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        MPI_File_write( *tar_writer->tar, ( char* )block, sizeof( tar_gnu_header ), MPI_CHAR, &status );
        CUBEW_FREE( block, MEMORY_TRACING_PREFIX "Release block" );
    }

    /* Seeks back to the tar header of the anchor file */
    MPI_Offset header_offset = tar_writer->header_position;
    MPI_File_seek( *tar_writer->tar, -size - difference - 512 - 1024, MPI_SEEK_CUR );

    char* anchorname = cube_get_path_to_anchor( tar_writer->cubename );
    tar_writer->actual_tar_header = cube_create_tar_header_with_pax_support( tar_writer, anchorname, "0", size );
    MPI_File_write( *tar_writer->tar, tar_writer->actual_tar_header, sizeof( tar_gnu_header ), MPI_CHAR, MPI_STATUS_IGNORE );

    CUBEW_FREE( anchorname, MEMORY_TRACING_PREFIX "Release report_layout_writer metric data name" );

    CUBEW_FREE( tar_writer->actual_tar_header, MEMORY_TRACING_PREFIX "Release tar header" );

    tar_writer->actual_tar_header = NULL;
    tar_writer->header_position   = end_position + difference;
}

A.5 scorep_profile_cube4_writer.c

/* ****************************************************************************
   Main writer function
 *****************************************************************************/
void
scorep_profile_write_cube4( SCOREP_Profile_OutputFormat format )
{
    /* Variable definition */

    /* Pointer to the Cube 4 metric definition. Only used on rank 0. */
    cube_metric* metric = NULL;

    /* Data set for the Cube write functions */
    scorep_cube_writing_data write_set;

    /* The CUBE layout description */
    scorep_cube_layout layout;

    UTILS_PRINTF( SCOREP_DEBUG_PROFILE, "Writing profile in Cube 4 format ..." );

    /* Initialization, header and definitions */

    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Prepare writing" );

    SCOREP_Ipc_Group* comm = SCOREP_IPC_GROUP_WORLD;
    if ( SCOREP_UseSystemTreeSequence() )
    {
        comm = scorep_system_tree_seq_get_ipc_group();
    }

    if ( !init_cube_writing_data( &write_set, format, comm ) )
    {
        return;
    }
    scorep_profile_init_layout( &write_set, &layout );

    if ( write_set.my_rank == write_set.root_rank )
    {
        /* generate header */
        cube_def_attr( write_set.my_cube, "Creator", "Score-P " PACKAGE_VERSION );
        cube_def_attr( write_set.my_cube, "CUBE_CT_AGGR", "SUM" );
        cube_def_mirror( write_set.my_cube, "file://" DOCDIR "/profile/" );
        cube_def_mirror( write_set.my_cube, "http://www.vi-hps.org/upload/packages/scorep/" );

        if ( SCOREP_IsUnwindingEnabled() )
        {
            /*
             * Record the number of sampling-related definitions in the profile.
             * The names directly correspond to the OTF2 definition names.
             */
            char buffer[ 32 ];
            sprintf( buffer, "%u", scorep_unified_definition_manager->calling_context.counter );
            cube_def_attr( write_set.my_cube, "Score-P::DefinitionCounters::CallingContext", buffer );
            sprintf( buffer, "%u", scorep_unified_definition_manager->interrupt_generator.counter );
            cube_def_attr( write_set.my_cube, "Score-P::DefinitionCounters::InterruptGenerator", buffer );
        }

        add_default_spec_file( write_set.my_cube );
    }

    /* Write definitions to cube */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Writing definitions" );
    scorep_write_definitions_to_cube4( write_set.my_cube, write_set.map, write_set.ranks_number,
                                       write_set.global_items, write_set.items_per_rank, &layout );

    /* Build the mapping from the sequence number in the unified callpath definitions to profile nodes */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Create mappings" );
    add_mapping_to_cube_writing_data( &write_set );

    /* Write clustering mappings */
    scorep_cluster_write_cube4( &write_set );

    /* dense metrics */

    /* Write implicit time and visits */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Writing runtime" );

    if ( layout.dense_metric_type == SCOREP_CUBE_DATA_TUPLE )
    {
        write_cube_cube_type_tau_atomic( &write_set, comm, scorep_get_sum_time_handle(), &get_time_tuple, NULL );

        if ( layout.metric_list & SCOREP_CUBE_METRIC_VISITS )
        {
            write_cube_cube_type_tau_atomic( &write_set, comm, scorep_get_visits_handle(), &get_visits_tuple, NULL );
        }

        if ( SCOREP_IsUnwindingEnabled() )
        {
            write_cube_cube_type_tau_atomic( &write_set, comm, scorep_get_hits_handle(), &get_hits_tuple, NULL );
        }
    }
    else
    {
        write_cube_doubles( &write_set, comm, scorep_get_sum_time_handle(), &get_sum_time_value, NULL );
        write_cube_doubles( &write_set, comm, scorep_get_max_time_handle(), &get_max_time_value, NULL );
        write_cube_doubles( &write_set, comm, scorep_get_min_time_handle(), &get_min_time_value, NULL );

        if ( layout.metric_list & SCOREP_CUBE_METRIC_VISITS )
        {
            write_cube_uint64( &write_set, comm, scorep_get_visits_handle(), &get_visits_value, NULL );
        }

        if ( SCOREP_IsUnwindingEnabled() )
        {
            write_cube_uint64( &write_set, comm, scorep_get_hits_handle(), &get_hits_value, NULL );
        }
    }

    if ( layout.metric_list & SCOREP_CUBE_METRIC_NUM_THREADS )
    {
        write_cube_doubles( &write_set, comm, scorep_get_num_threads_handle(), &get_number_of_threads, NULL );
    }

    /* Write additional dense metrics (e.g. hardware counters) */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Writing dense metrics" );
    for ( uint8_t i = 0; i < SCOREP_Metric_GetNumberOfStrictlySynchronousMetrics(); i++ )
    {
        cube_metric*        metric        = NULL; /* Only used on rank 0 */
        SCOREP_MetricHandle metric_handle = SCOREP_Metric_GetStrictlySynchronousMetricHandle( i );

        if ( write_set.my_rank == write_set.root_rank )
        {
            metric = scorep_get_cube4_metric( write_set.map, SCOREP_MetricHandle_GetUnified( metric_handle ) );
        }

        /* When writing sparse metrics, we skip the time metric handles.
           Thus, invalidate these entries to avoid writing them twice. */
        if ( write_set.metric_map != NULL )
        {
            uint32_t current_number = SCOREP_MetricHandle_GetUnifiedId( metric_handle );
            write_set.metric_map[ current_number ] = SCOREP_PROFILE_DENSE_METRIC;
        }

        if ( layout.dense_metric_type == SCOREP_CUBE_DATA_TUPLE )
        {
            write_cube_cube_type_tau_atomic( &write_set, comm, metric, &get_metric_tuple_from_array, &i );
        }
        else
        {
            write_cube_uint64( &write_set, comm, metric, &get_metrics_value_from_array, &i );
        }
    }

    /* sparse metrics */

    /* Write sparse metrics (e.g. user metrics) */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Writing sparse metrics" );
    if ( write_set.metric_map != NULL )
    {
        cube_metric* metric = NULL; /* Only used on rank 0 */

        for ( uint32_t i = 0; i < write_set.num_unified_metrics; i++ )
        {
            if ( !check_if_metric_shall_be_written( &write_set, write_set.metric_map[ i ] ) )
            {
                continue;
            }

            if ( write_set.my_rank == write_set.root_rank )
            {
                metric = scorep_get_cube4_metric( write_set.map, write_set.unified_metric_map[ i ] );
            }

            if ( write_set.metric_map[ i ] == SCOREP_INVALID_METRIC )
            {
                set_bit_string_for_unknown_metric( &write_set, comm );
                if ( layout.sparse_metric_type == SCOREP_CUBE_DATA_TUPLE )
                {
                    write_cube_cube_type_tau_atomic( &write_set, comm, metric,
                                                     &get_sparse_tuple_value_from_double, &write_set.metric_map[ i ] );
                }
                else
                {
                    write_cube_doubles( &write_set, comm, metric,
                                        &get_sparse_double_value, &write_set.metric_map[ i ] );
                }
                continue;
            }

            switch ( SCOREP_MetricHandle_GetValueType( write_set.metric_map[ i ] ) )
            {
                case SCOREP_METRIC_VALUE_INT64:
                case SCOREP_METRIC_VALUE_UINT64:
                    set_bit_string_for_metric( &write_set, comm, &get_sparse_uint64_value, &write_set.metric_map[ i ] );
                    if ( layout.sparse_metric_type == SCOREP_CUBE_DATA_TUPLE )
                    {
                        write_cube_cube_type_tau_atomic( &write_set, comm, metric,
                                                         &get_sparse_tuple_value_from_uint64, &write_set.metric_map[ i ] );
                    }
                    else
                    {
                        write_cube_uint64( &write_set, comm, metric,
                                           &get_sparse_uint64_value, &write_set.metric_map[ i ] );
                    }
                    break;
                case SCOREP_METRIC_VALUE_DOUBLE:
                    set_bit_string_for_metric( &write_set, comm, &has_sparse_double_value, &write_set.metric_map[ i ] );
                    if ( layout.sparse_metric_type == SCOREP_CUBE_DATA_TUPLE )
                    {
                        write_cube_cube_type_tau_atomic( &write_set, comm, metric,
                                                         &get_sparse_tuple_value_from_double, &write_set.metric_map[ i ] );
                    }
                    else
                    {
                        write_cube_doubles( &write_set, comm, metric,
                                            &get_sparse_double_value, &write_set.metric_map[ i ] );
                    }
                    break;
                default:
                    UTILS_ERROR( SCOREP_ERROR_UNKNOWN_TYPE,
                                 "Metric %s has unknown value type %d",
                                 SCOREP_MetricHandle_GetName( write_set.metric_map[ i ] ),
                                 SCOREP_MetricHandle_GetValueType( write_set.metric_map[ i ] ) );
            }
        }
    }
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Profile writing done" );

    /* Clean up */
    UTILS_DEBUG_PRINTF( SCOREP_DEBUG_PROFILE, "Clean up" );
    delete_cube_writing_data( &write_set );
    if ( SCOREP_Env_UseSystemTreeSequence() )
    {
        scorep_system_tree_seq_free_ipc_group( comm );
    }
}

/**
 * Initializes a scorep_cube_writing_data object.
 * @param writeSet  The data structure that is initialized. Must already be allocated.
 * @param comm      Communicator of all ranks in the order they are written to CUBE.
 * @returns whether the initialization was successful.
 */
static bool
init_cube_writing_data( scorep_cube_writing_data*   writeSet,
                        SCOREP_Profile_OutputFormat format,
                        SCOREP_Ipc_Group*           comm )
{
    /* Set all pointers to zero. If a malloc fails, we know how many can be freed. */
    writeSet->my_cube            = NULL;
    writeSet->id_2_node          = NULL;
    writeSet->map                = NULL;
    writeSet->items_per_rank     = NULL;
    writeSet->offsets_per_rank   = NULL;
    writeSet->metric_map         = NULL;
    writeSet->unified_metric_map = NULL;
    writeSet->bit_vector         = NULL;

    /* Start initializing */

    writeSet->format = format;

    /* Get basic MPI data */
    writeSet->my_rank       = SCOREP_IpcGroup_GetRank( comm );
    writeSet->local_threads = scorep_profile_get_number_of_threads();
    writeSet->ranks_number  = SCOREP_IpcGroup_GetSize( comm );

    /* Get root rank */
    writeSet->root_rank = writeSet->my_rank;
    SCOREP_Ipc_Bcast( &writeSet->root_rank, 1, SCOREP_IPC_UINT32_T, 0 );

    /* Calculate the local number of items */
    writeSet->local_items = scorep_profile_get_aggregated_items( writeSet->local_threads, format );

    /* Get the number of unified callpath definitions to all ranks */
    if ( writeSet->my_rank == writeSet->root_rank )
    {
        writeSet->callpath_number = SCOREP_Definitions_GetNumberOfUnifiedCallpathDefinitions();
    }
    SCOREP_IpcGroup_Bcast( comm, &writeSet->callpath_number, 1, SCOREP_IPC_UINT32_T, writeSet->root_rank );
    if ( writeSet->callpath_number == 0 )
    {
        return false;
    }

    /* Calculate the offsets of all ranks and the number of locations per rank */
    if ( writeSet->my_rank == writeSet->root_rank )
    {
        size_t buffer_size = writeSet->ranks_number * sizeof( int );
        writeSet->items_per_rank   = ( int* )malloc( buffer_size );
        writeSet->offsets_per_rank = ( int* )malloc( buffer_size );
    }

    SCOREP_IpcGroup_Gather( comm, &writeSet->local_items, writeSet->items_per_rank,
                            1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    if ( writeSet->my_rank == writeSet->root_rank )
    {
        writeSet->global_items    = 0;
        writeSet->same_thread_num = 1;
        for ( int32_t i = 0; i < writeSet->ranks_number; ++i )
        {
            if ( writeSet->local_items != writeSet->items_per_rank[ i ] )
            {
                writeSet->same_thread_num = 0;
            }
            writeSet->offsets_per_rank[ i ] = writeSet->global_items;
            writeSet->global_items         += writeSet->items_per_rank[ i ];
        }
    }

    /* Distribute the global number of locations */
    SCOREP_IpcGroup_Bcast( comm, &writeSet->global_items, 1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    /* Distribute whether all ranks have the same number of locations */
    SCOREP_IpcGroup_Bcast( comm, &writeSet->same_thread_num, 1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    /* Distribute the per-rank offsets */
    SCOREP_IpcGroup_Scatter( comm, writeSet->offsets_per_rank, &writeSet->offset,
                             1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    /* Get the number of unified metrics to every rank */
    if ( writeSet->my_rank == writeSet->root_rank )
    {
        writeSet->num_unified_metrics = SCOREP_Definitions_GetNumberOfUnifiedMetricDefinitions();
    }
    SCOREP_IpcGroup_Bcast( comm, &writeSet->num_unified_metrics, 1, SCOREP_IPC_UINT32_T, writeSet->root_rank );

    /* Create the mappings from Cube to Score-P handles and vice versa */
    writeSet->map = scorep_cube4_create_definitions_map();
    if ( writeSet->map == NULL )
    {
        UTILS_ERROR( SCOREP_ERROR_MEM_ALLOC_FAILED,
                     "Failed to allocate memory for definition mapping\n"
                     "Failed to write Cube 4 profile" );

        delete_cube_writing_data( writeSet );
        return false;
    }

    /* Create the Cube writer object and the Cube object */

    /* Construct the cube file name */
    const char* dirname  = SCOREP_GetExperimentDirName();
    char*       filename = NULL;
    const char* basename = scorep_profile_get_basename();

    filename = ( char* )malloc( strlen( dirname )      /* directory      */
                                + 1                    /* separator '/'  */
                                + strlen( basename )   /* basename       */
                                + 1 );                 /* trailing '\0'  */
    if ( filename == NULL )
    {
        UTILS_ERROR_POSIX( "Failed to allocate memory for filename.\n"
                           "Failed to write Cube 4 profile" );
        delete_cube_writing_data( writeSet );
        return false;
    }
    sprintf( filename, "%s/%s", dirname, basename );

    /* MPI version: create Cube objects for all processes */
    if ( writeSet->my_rank == writeSet->root_rank )
    {
        writeSet->my_cube = cube_mpi_create( filename, CUBE_MASTER, writeSet->my_rank, writeSet->ranks_number,
                                             writeSet->global_items, writeSet->local_items, writeSet->offset, CUBE_FALSE );
    }
    else
    {
        writeSet->my_cube = cube_mpi_create( filename, CUBE_WRITER, writeSet->my_rank, writeSet->ranks_number,
                                             writeSet->global_items, writeSet->local_items, writeSet->offset, CUBE_FALSE );
    }

    free( filename );

    /* Create a bit vector with all bits set. Used for dense metrics. */
    writeSet->bit_vector = ( uint8_t* )malloc( SCOREP_Bitstring_GetByteSize( writeSet->callpath_number ) );
    UTILS_ASSERT( writeSet->bit_vector );
    SCOREP_Bitstring_SetAll( writeSet->bit_vector, writeSet->callpath_number );

    /* Check whether tasks have been used somewhere */
    int32_t has_tasks = scorep_profile_has_tasks();
    writeSet->has_tasks = 0;
    SCOREP_IpcGroup_Allreduce( comm, &has_tasks, &writeSet->has_tasks, 1, SCOREP_IPC_INT32_T, SCOREP_IPC_BOR );
    return true;
}

/**
 * @def SCOREP_PROFILE_WRITE_CUBE_METRIC
 * Code to write metric values in cube format. Used to reduce code replication.
 */
#define SCOREP_PROFILE_WRITE_CUBE_METRIC( type, TYPE, NUMBER, cube_type, zero ) \
    static void \
    write_cube_##cube_type( scorep_cube_writing_data*                 writeSet, \
                            SCOREP_Ipc_Group*                         comm, \
                            cube_metric*                              metric, \
                            scorep_profile_get_##cube_type##_func     getValue, \
                            void*                                     funcData ) \
    { \
        scorep_profile_node* node              = NULL; \
        cube_cnode*          cnode             = NULL; \
        type*                aggregated_values = NULL; \
        type*                local_values      = NULL; \
        type*                global_values     = NULL; \
        int*                 order             = NULL; \
        if ( writeSet->callpath_number == 0 ) \
        { \
            return; \
        } \
 \
        local_values      = ( type* )malloc( writeSet->local_threads * sizeof( type ) ); \
        aggregated_values = ( type* )malloc( writeSet->local_items * sizeof( type ) ); \
        UTILS_ASSERT( local_values ); \
        UTILS_ASSERT( aggregated_values ); \
 \
        /* Array of values of one rank for all call paths */ \
        global_values = ( type* )malloc( writeSet->local_items * writeSet->callpath_number * sizeof( type ) ); \
 \
        /* Array of the cnode order in the file */ \
        order = ( int* )malloc( writeSet->callpath_number * sizeof( int ) ); \
 \
        if ( writeSet->my_rank == writeSet->root_rank ) \
        { \
            /* Initialize writing of a new metric */ \
            cube_set_known_cnodes_for_metric( writeSet->my_cube, metric, ( char* )writeSet->bit_vector ); \
 \
            /* Get the order of the cnodes */ \
            carray*  sequence = cube_get_cnodes_for_metric( writeSet->my_cube, metric ); \
            unsigned i        = 0; \
            for ( i = 0; i < sequence->size; i++ ) \
            { \
                order[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id; \
            } \
        } \
 \
        /* Broadcast the order to the other ranks */ \
        SCOREP_IpcGroup_Bcast( comm, order, writeSet->callpath_number, SCOREP_IPC_INT, writeSet->root_rank ); \
 \
        /* Iterate over all unified call paths */ \
        unsigned written_cnodes = 0; \
        for ( uint64_t cp_index = 0; cp_index < writeSet->callpath_number; cp_index++ ) \
        { \
            if ( !SCOREP_Bitstring_IsSet( writeSet->bit_vector, cp_index ) ) \
            { \
                continue; \
            } \
            for ( uint64_t thread_index = 0; thread_index < writeSet->local_threads; thread_index++ ) \
            { \
                uint64_t node_index = thread_index * writeSet->callpath_number + order[ cp_index ]; \
                node = writeSet->id_2_node[ node_index ]; \
                if ( node != NULL ) \
                { \
                    local_values[ thread_index ] = getValue( node, funcData ); \
                } \
                else \
                { \
                    local_values[ thread_index ] = zero; \
                } \
            } \
            scorep_profile_aggregate_##type( &local_values, &aggregated_values, writeSet ); \
 \
            /* Append the array of one call path to the others */ \
            memcpy( &global_values[ writeSet->local_items * written_cnodes ], aggregated_values, \
                    writeSet->local_items * sizeof( type ) ); \
            written_cnodes = written_cnodes + 1; \
        } \
 \
        SCOREP_IpcGroup_Barrier( comm ); \
        /* Write the data of all call paths for this metric */ \
        cube_write_all_sev_rows_of_##cube_type( writeSet->my_cube, metric, global_values ); \
        /* Clean up */ \
        free( global_values ); \
        free( local_values ); \
        free( aggregated_values ); \
        free( order ); \
    }

/* *INDENT-ON* */

/**
 * @function write_cube_uint64
 * Writes data for the metric @a metric to a cube object.
 * @param writeSet  Structure containing write data.
 * @param metric    The cube metric handle for the written metric.
 * @param getValue  Function pointer which returns the value for a given profile node.
 * @param funcData  Pointer to data that is passed to the @a getValue function.
 */
SCOREP_PROFILE_WRITE_CUBE_METRIC( uint64_t, UINT64_T, 1, uint64, 0 )

/**
 * @function write_cube_doubles
 * Writes data for the metric @a metric to a cube object.
 * @param writeSet  Structure containing write data.
 * @param metric    The cube metric handle for the written metric.
 * @param getValue  Function pointer which returns the value for a given profile node.
 * @param funcData  Pointer to data that is passed to the @a getValue function.
 */
SCOREP_PROFILE_WRITE_CUBE_METRIC( double, DOUBLE, 1, doubles, 0.0 )

/**
 * @function write_cube_cube_type_tau_atomic
 * Writes data for the metric @a metric to a cube object.
 * @param writeSet  Structure containing write data.
 * @param metric    The cube metric handle for the written metric.
 * @param getValue  Function pointer which returns the value for a given profile node.
 * @param funcData  Pointer to data that is passed to the @a getValue function.
 */
SCOREP_PROFILE_WRITE_CUBE_METRIC( cube_type_tau_atomic, BYTE, sizeof( cube_type_tau_atomic ),
                                  cube_type_tau_atomic, scorep_cube_type_tau_atomic_zero )
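For orientation, instantiating the macro with ( double, DOUBLE, 1, doubles, 0.0 ) generates a static function with the following signature; this is reconstructed from the macro definition above, and the uint64 and TAU_ATOMIC writers are generated analogously.

/* Expansion sketch of SCOREP_PROFILE_WRITE_CUBE_METRIC( double, DOUBLE, 1, doubles, 0.0 ):
 * the generated function gathers one row per call path, aggregates it locally and
 * hands the result to cube_write_all_sev_rows_of_doubles(). */
static void
write_cube_doubles( scorep_cube_writing_data*       writeSet,
                    SCOREP_Ipc_Group*               comm,
                    cube_metric*                    metric,
                    scorep_profile_get_doubles_func getValue,
                    void*                           funcData );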

A.6 prototype_new.c

/****************************************************************************
**  CUBE        http://www.scalasca.org/                                   **
*****************************************************************************
**  Copyright (c) 1998-2018                                                **
**  Forschungszentrum Juelich GmbH, Juelich Supercomputing Centre          **
**                                                                         **
**  Copyright (c) 2009-2015                                                **
**  German Research School for Simulation Sciences GmbH,                   **
**  Laboratory for Parallel Programming                                    **
**                                                                         **
**  This software may be modified and distributed under the terms of      **
**  a BSD-style license.  See the COPYING file in the package base        **
**  directory for details.                                                 **
****************************************************************************/

/**
 * \file prototype_new.c
 * \brief Example of using "libcube4w.a" in a parallel way. One creates a cube file
 * "example.cubex", defines the structure of metrics, call tree, machine and Cartesian
 * topology, and fills the cube with data. Calls to cube_write_all_sev_rows_* write the
 * rows in parallel to the cube report directly on disk.
 */

#include <...>
#include "cubew_cube.h"
#include "cubew_services.h"

/* used in static array initialization; see below at the topology definition */
#define NDIMS1 2
#define NDIMS2 2
#define NDIMS3 4
#define NDIMS4 14

#ifndef MAXNTHREADS
#define MAXNTHREADS 64
#endif

#ifndef NMETRICS
#define NMETRICS 7
#endif

#ifndef CNODES
#define CNODES 10000
#endif

/* used to create a stable, random-looking sequence of numbers (used for creating the call tree) */
static int current_random = 0;
static int MAXDEPTH       = 100;

void
create_system_tree( cube_t* cube, cube_process** process, cube_thread** thread, cube_node* node, int size );

void
fill_cube_type_double_row( double* row, unsigned size, double value );
void
create_array_of_doubles( cube_t* cube, cube_metric* met, double* sev, int nthreads, int ncallpaths, char* bitmask );

void
fill_cube_type_int_row( int64_t* row, unsigned size, int value );
void
create_array_of_int( cube_t* cube, cube_metric* met, int64_t* sev, int nthreads, int ncallpaths, char* bitmask );

void
fill_cube_type_tau_atomic_row( cube_type_tau_atomic* row, unsigned size, unsigned value );
void
create_array_of_tau_atomic( cube_t* cube, cube_metric* met, cube_type_tau_atomic* sev, int nthreads, int ncallpaths, char* bitmask );

void
create_calltree( cube_t* cube, cube_region** regn, cube_cnode** codes, int* idx, int* reg_idx, cube_cnode* parent, int level );
int
get_next_random();
void
set_seed_random( int s );
void
calculate_max_depth();
int
calcaulate_sum2( int start, int finish );
char*
create_bitmask( int cnodes, char* bitmask );

// / Start point of this example. i n t main( int argc, char ∗ a rg v [ ] ) { i n t rank ; i n t s i z e ; MPI Init( &argc, &argv ); MPI Comm rank ( MPI COMM WORLD, &rank ) ; MPI Comm size ( MPI COMM WORLD, &s i z e ) ; if ( rank == 0 ) { printf( ”Number of processes: %d\n ” , s i z e ) ; } c a l c u l a t e m a x d e p t h ( ) ;

char cubefile[ 17 ] = ”mpi complex cube”; /∗ Create a cube with name cubefile. Root uses CUBE MASTER a l l o t h e r s CUBE SLAVE . ∗/

int nthreads = ( rank ) % ( MAXNTHREADS )+1; i n t t h r d s i n c y c l e = ( MAXNTHREADS ) ∗ ( MAXNTHREADS + 1 ) / 2 ; i n t o f f s e t = ( rank ) / ( MAXNTHREADS )∗ thrdsincycle + ( nthreads 1 ) ∗ nthreads / 2; int totalnthreads = ( ( size / ( MAXNTHREADS ) ) ∗ thrdsincycle ) + ( ( size % MAXNTHREADS ) ∗ ( ( s i z e % MAXNTHREADS ) + 1 ) ) / 2 ;

    double tstart, t0, t1, t2, t3, t4, t5, t6, t7, tend;
    if ( rank == 0 )
    {
        tstart = MPI_Wtime();
    }
    cube_t* cube;
    if ( rank == 0 )
    {
        cube = cube_mpi_create( cubefile, CUBE_MASTER, rank, size, totalnthreads, nthreads, offset, CUBE_FALSE );
    }
    else
    {
        cube = cube_mpi_create( cubefile, CUBE_WRITER, rank, size, totalnthreads, nthreads, offset, CUBE_FALSE );
    }

    if ( !cube )
    {
        fprintf( stderr, "cube create failed!\n" );
        exit( 1 );
    }
    /* Seven metrics */
    cube_metric** met_vector = ( cube_metric** )calloc( NMETRICS, sizeof( cube_metric* ) );

    if ( rank == 0 )
    {
        /* generate header */
        cube_def_mirror( cube, "http://www.fz-juelich.de/jsc/scalasca/" );
        cube_def_attr( cube, "description", "A simple example of Cube report" );
        cube_def_attr( cube, "experiment time", "November 1st, 2004" );

        printf( "Test file %s.cubex is being generated ...\n", cubefile );

        /* root metric Time */
        met_vector[ 0 ] = cube_def_met( cube, "Time", "time", "FLOAT", "sec", "",
                                        "@mirror@patterns-2.1.html#execution",
                                        "root node", NULL, CUBE_METRIC_INCLUSIVE );
        /* A metric of visits */
        met_vector[ 1 ] = cube_def_met( cube, "Visits", "visits", "INTEGER", "occ", "",
                                        "http://www.cs.utk.edu/usr.html",
                                        "Number of visits", NULL, CUBE_METRIC_EXCLUSIVE );
        /* Another root metric */
        met_vector[ 2 ] = cube_def_met( cube, "P2P bytes sent", "bytes_sent", "INTEGER", "bytes", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of bytes sent in point-to-point operations", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 3 ] = cube_def_met( cube, "P2P bytes received", "bytes_rcvd", "INTEGER", "bytes", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of bytes received in point-to-point operations", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 4 ] = cube_def_met( cube, "Tasks time", "time_in_tasks", "MAXDOUBLE", "sec", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Time spent in tasks.", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 5 ] = cube_def_met( cube, "Counters", "number_of_counters", "TAU_ATOMIC", "occ", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of counters", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 6 ] = cube_def_met( cube, "Memory allocation", "memoryallocation", "TAU_ATOMIC", "tau_statistics", "",
                                        "@mirror@patterns-2.1.html#memoryallocation",
                                        "Total allocated memory", NULL, CUBE_METRIC_EXCLUSIVE );

        /* define call tree */
        /* definition of regions in the call tree */
        char* mod = "/ICL/CUBE/example.c";

        cube_region** regn = ( cube_region** )calloc( 169, sizeof( cube_region* ) );

        char* names[ 169 ] = {
            "UNKNOWN", "TRACING", "MPI_Accumulate", "MPI_Allgather", "MPI_Allgatherv", "MPI_Allreduce",
            "MPI_Alltoall", "MPI_Alltoallv", "MPI_Alltoallw", "MPI_Barrier", "MPI_Bcast", "MPI_Bsend",
            "MPI_Bsend_init", "MPI_Cancel", "MPI_Cart_create", "MPI_Comm_create", "MPI_Comm_dup",
            "MPI_Comm_free", "MPI_Comm_group", "MPI_Comm_remote_group", "MPI_Comm_split", "MPI_Exscan",
            "MPI_File_close", "MPI_File_delete", "MPI_File_get_amode", "MPI_File_get_atomicity",
            "MPI_File_get_byte_offset", "MPI_File_get_group", "MPI_File_get_info", "MPI_File_get_position",
            "MPI_File_get_position_shared", "MPI_File_get_size", "MPI_File_get_type_extent",
            "MPI_File_get_view", "MPI_File_iread", "MPI_File_iread_at", "MPI_File_iread_shared",
            "MPI_File_iwrite", "MPI_File_iwrite_at", "MPI_File_iwrite_shared", "MPI_File_open",
            "MPI_File_preallocated", "MPI_File_read", "MPI_File_read_all", "MPI_File_read_all_begin",
            "MPI_File_real_all_end", "MPI_File_read_at", "MPI_File_read_at_all", "MPI_File_read_at_all_begin",
            "MPI_File_read_at_all_end", "MPI_File_read_ordered", "MPI_File_read_ordered_begin",
            "MPI_File_read_ordered_end", "MPI_File_read_shared", "MPI_File_seek", "MPI_File_seek_shared",
            "MPI_File_set_atomicity", "MPI_File_set_info", "MPI_File_set_size", "MPI_File_set_view",
            "MPI_File_sync", "MPI_file_write", "MPI_File_write_all", "MPI_File_write_all_begin",
            "MPI_File_write_all_end", "MPI_File_write_at_all", "MPI_File_write_at_all_begin",
            "MPI_File_write_at_all_end", "MPI_File_write_ordered", "MPI_File_write_ordered_begin",
            "MPI_File_write_ordered_end", "MPI_File_write_shared", "MPI_Finalize", "MPI_Gather",
            "mPI_Gatherv", "MPI_Get", "MPI_Graph_create", "MPI_Group_difference", "MPI_Group_excl",
            "MPI_Group_free", "MPI_Group_incl", "MPI_Group_intersection", "MPI_Group_range_excl",
            "MPI_Group_range_incl", "MPI_Group_union", "MPI_Ibsend", "MPI_Init", "MPI_Init_thread",
            "MPI_Intercomm_create", "MPI_Intercomm_merge", "MPI_Irecv", "MPI_Irsend", "MPI_Isend",
            "MPI_Issend", "MPI_Probe", "MPI_Put", "MPI_Recv", "MPI_Recv_init", "MPI_Reduce",
            "MPI_Reduce_scatter", "MPI_Request_free", "MPI_Rsend", "MPI_Rsend_init", "MPI_Scan",
            "MPI_Scatter", "MPI_Scatterv", "MPI_Send", "MPI_Send_init", "MPI_Sendrecv",
            "MPI_Sendrecv_replace", "MPI_Ssend", "MPI_Ssend_init", "MPI_Start", "MPI_Startall",
            "MPI_Test", "MPI_Testall", "MPI_Testany", "MPI_Testsome", "MPI_Wait", "MPI_Waitall",
            "MPI_Waitany", "MPI_Waitsome", "MPI_Win_complete", "MPI_Win_create", "MPI_Win_fence",
            "MPI_Win_free", "MPI_Win_lock", "MPI_Win_post", "MPI_Win_start", "MPI_Win_test",
            "MPI_Win_unlock", "MPI_Win_wait", "PARALLEL", "driver", "task_init", "bcast_init",
            "barriere_sync", "read_input", "bcast_real", "decomp", "inner_auto", "inner", "initialize",
            "initxs", "initsnc", "octant", "initgeom", "timers", "its=1", "source", "sweep", "rcv_real",
            "snd_real", "global_init_sum", "flux_err", "global_real_max", "its=2", "its=3", "its=4",
            "its=5", "its=6", "its=7", "its=8", "its=9", "its=10", "its=11", "its=12",
            "global_real_sum", "task_end"
        };

        unsigned i;
        for ( i = 0; i < 169; i++ )
        {
            char* descr = "";
            if ( i > 1 && i < 132 )
            {
                descr = "MPI";
            }
            if ( i == 1 || i == 132 )
            {
                descr = "EPIK";
            }
            if ( i > 132 )
            {
                descr = "USR";
            }
            regn[ i ] = cube_def_region( cube, names[ i ], names[ i ], "mpi", "barrier", -1, -1, "", descr, mod );
        }

        /* actual definition of the call tree */
        srand( 0 );
        cube_cnode** cnodes_vector = ( cube_cnode** )calloc( CNODES, sizeof( cube_cnode* ) );
        int idx        = 0;
        int region_idx = 0;
        create_calltree( cube, regn, cnodes_vector, &idx, &region_idx, NULL, 0 );

        /* define location tree */
        cube_machine* mach = cube_def_mach( cube, "MSC<>", "" );
        cube_node*    node = cube_def_node( cube, "Athena<>", mach );
        cube_process** processes_vector = ( cube_process** )calloc( size, sizeof( cube_process* ) );
        cube_thread**  threads_vector   = ( cube_thread** )calloc( size * totalnthreads, sizeof( cube_thread* ) );
        /* Define processes and threads */
        create_system_tree( cube, processes_vector, threads_vector, node, size );
        free( regn );

        cube_set_statistic_name( cube, "mystat" );

        cube_set_metrics_title( cube, "Metric tree (QMCD App )" );
        cube_set_calltree_title( cube, "Calltree (serial run, #2345)" );
        cube_set_systemtree_title( cube, "System ( Linux cluster < 43 & 23 >)" );

        /* generate header */
        cube_def_mirror( cube, "http://icl.cs.utk.edu/software/kojak/" );
        cube_def_mirror( cube, "http://www.fz-juelich.de/jsc/kojak/" );
        cube_def_attr( cube, "description", "a simple example" );
        cube_def_attr( cube, "experiment time", "November 1st, 2004" );
        cube_enable_flat_tree( cube, CUBE_FALSE );
    }

    if ( rank == 0 )
    {
        t0 = MPI_Wtime();
        printf( "Definitions: %f\n", t0 - tstart );
    }

    MPI_Barrier( MPI_COMM_WORLD );

    int   n_callpaths;
    int   bitmask_size = ( CNODES + 7 ) / 8;
    char* bitmask      = malloc( bitmask_size * sizeof( char ) );

    /* First metric */
    double* sev0 = malloc( nthreads * CNODES * sizeof( double ) );
    // fill_cube_type_double_row( sev0, nthreads * CNODES, 3 );
    create_array_of_doubles( cube, met_vector[ 0 ], sev0, nthreads, CNODES, bitmask );
    cube_write_all_sev_rows_of_doubles( cube, met_vector[ 0 ], sev0 );
    free( sev0 );

    if ( rank == 0 ) { t1 = MPI_Wtime(); printf( "1. metric: %f\n", t1 - t0 ); }

    /* Second metric */
    int64_t* sev1 = malloc( nthreads * CNODES * sizeof( int64_t ) );
    // fill_cube_type_int_row( sev1, nthreads * CNODES, 3 );
    create_array_of_int( cube, met_vector[ 1 ], sev1, nthreads, CNODES, bitmask );
    cube_write_all_sev_rows_of_int64( cube, met_vector[ 1 ], sev1 );
    free( sev1 );

    if ( rank == 0 ) { t2 = MPI_Wtime(); printf( "2. metric: %f\n", t2 - t1 ); }

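    /* Editor's note (illustration, not part of the original prototype): the sparse metrics
     * below only carry data for a subset of the call paths.  create_bitmask() sets the
     * first n_callpaths bits of the CNODES-bit mask, cube_set_known_cnodes_for_metric()
     * announces this subset to the CubeWriter on rank 0, and each rank then allocates and
     * fills only nthreads * n_callpaths values instead of nthreads * CNODES. */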
    /* Third metric - 5% sparse */
    n_callpaths = CNODES * 0.05;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 2 ], bitmask );
    }
    int64_t* sev2 = malloc( nthreads * n_callpaths * sizeof( int64_t ) );
    // fill_cube_type_int_row( sev2, nthreads * n_callpaths, 3 );
    create_array_of_int( cube, met_vector[ 2 ], sev2, nthreads, n_callpaths, bitmask );
    cube_write_all_sev_rows_of_int64( cube, met_vector[ 2 ], sev2 );
    free( sev2 );

    if ( rank == 0 ) { t3 = MPI_Wtime(); printf( "3. metric: %f\n", t3 - t2 ); }

    /* Fourth metric - 5% sparse */
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 3 ], bitmask );
    }
    int64_t* sev3 = malloc( nthreads * n_callpaths * sizeof( int64_t ) );
    // fill_cube_type_int_row( sev3, nthreads * n_callpaths, 3 );
    create_array_of_int( cube, met_vector[ 3 ], sev3, nthreads, n_callpaths, bitmask );
    cube_write_all_sev_rows_of_int64( cube, met_vector[ 3 ], sev3 );
    free( sev3 );

    if ( rank == 0 ) { t4 = MPI_Wtime(); printf( "4. metric: %f\n", t4 - t3 ); }

    /* Fifth metric - 20% sparse */
    n_callpaths = CNODES * 0.2;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 4 ], bitmask );
    }
    double* sev4 = malloc( nthreads * n_callpaths * sizeof( double ) );
    // fill_cube_type_double_row( sev4, nthreads * n_callpaths, 3 );
    create_array_of_doubles( cube, met_vector[ 4 ], sev4, nthreads, n_callpaths, bitmask );
    cube_write_all_sev_rows_of_doubles( cube, met_vector[ 4 ], sev4 );
    free( sev4 );

    if ( rank == 0 ) { t5 = MPI_Wtime(); printf( "5. metric: %f\n", t5 - t4 ); }

    /* Sixth metric */
    cube_type_tau_atomic* sev5 = malloc( nthreads * CNODES * sizeof( cube_type_tau_atomic ) );
    // fill_cube_type_tau_atomic_row( sev5, nthreads * CNODES, 1 );
    create_array_of_tau_atomic( cube, met_vector[ 5 ], sev5, nthreads, CNODES, bitmask );
    cube_write_all_sev_rows_of_cube_type_tau_atomic( cube, met_vector[ 5 ], sev5 );
    free( sev5 );

    if ( rank == 0 ) { t6 = MPI_Wtime(); printf( "6. metric: %f\n", t6 - t5 ); }

    /* Seventh metric - 20% sparse */
    n_callpaths = CNODES * 0.2;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 6 ], bitmask );
    }
    cube_type_tau_atomic* sev6 = malloc( nthreads * n_callpaths * sizeof( cube_type_tau_atomic ) );
    // fill_cube_type_tau_atomic_row( sev6, nthreads * n_callpaths, 1 );
    create_array_of_tau_atomic( cube, met_vector[ 6 ], sev6, nthreads, n_callpaths, bitmask );
    cube_write_all_sev_rows_of_cube_type_tau_atomic( cube, met_vector[ 6 ], sev6 );
    free( sev6 );

    if ( rank == 0 ) { t7 = MPI_Wtime(); printf( "7. metric: %f\n", t7 - t6 ); }

    free( bitmask );

    free( met_vector );
    /* Delete the cube_t structure and release all memory */
    cube_free( cube );

    if ( rank == 0 )
    {
        tend = MPI_Wtime();
        printf( "Anchor file:  %f\n", tend - t7 );
        printf( "All metrics:  %f\n", t7 - t0 );
        printf( "Alltogether:  %f\n", tend - tstart );
    }

    if ( rank == 0 )
    {
        printf( "Test file %s.cubex complete.\n", cubefile );
    }
    MPI_Finalize();
    /* finish */
    return 0;
}

int get_next_random()
{
    current_random = ( current_random + 5 ) % 13;
    return current_random;
}

void set_seed_random( int s )
{
    current_random = s;
}

void create_system_tree( cube_t* cube, cube_process** process, cube_thread** thread, cube_node* node, int size )
{
    unsigned i, j, nthrds, totthreads;
    nthrds     = 1;
    totthreads = 0;
    for ( i = 0; i < size; i++ )
    {
        char proc[] = "Process";
        char thrd[] = "Thread";
        cube_process* proc1 = cube_def_proc( cube, proc, i, node );
        process[ i ] = proc1;
        for ( j = 0; j < nthrds; j++ )
        {
            // printf( "id %d, i %d, j %d, threads %d\n", totthreads, i, j, nthrds );
            cube_thread* thread1 = cube_def_thrd( cube, thrd, totthreads, process[ i ] );
            thread[ totthreads ] = thread1;
            totthreads++;
        }

        nthrds = ( nthrds == MAXNTHREADS ) ? 1 : nthrds + 1;
    }
}

void create_calltree( cube_t* cube, cube_region** regn, cube_cnode** cnodes, int* idx, int* region_idx, cube_cnode* parent, int level )
{
    if ( ( *idx ) >= CNODES )
    {
        return;
    }
    if ( level >= MAXDEPTH )
    {
        return;
    }
    int      num_children = get_next_random();
    unsigned i;
    for ( i = 0; i < num_children; i++ )
    {
        unsigned rand_reg = *region_idx;
        *region_idx = ( ( *region_idx + 1 ) % 169 );  // cyclic selection of regions
        cube_cnode* cnode = cube_def_cnode( cube, regn[ rand_reg ], parent );
        cnodes[ *idx ] = cnode;
        ( *idx )++;
        create_calltree( cube, regn, cnodes, idx, region_idx, cnode, level + 1 );
        if ( ( *idx ) >= CNODES )
        {
            return;  // the end of the array was already reached inside the recursive call
        }
    }
}

void calculate_max_depth()
{
    MAXDEPTH = ( int )( log( CNODES ) / log( 6 ) ) + 3;
}

void fill_cube_type_double_row( double* row, unsigned size, double value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        row[ i ] = value + i;
    }
}

void fill_cube_type_int_row( int64_t* row, unsigned size, int value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        row[ i ] = value + i;
    }
}

void fill_cube_type_tau_atomic_row( cube_type_tau_atomic* row, unsigned size, unsigned value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        double A = ( double )( value + i );

        int a_ = ( int )floor( A / 2. );
        int a  = ( int )ceil( A / 2. );

        row[ i ].N    = ( a_ + a + 1 );
        row[ i ].Min  = -a_;
        row[ i ].Max  = a;
        row[ i ].Sum  = ( a_ != a ) ? a : 0;
        row[ i ].Sum2 = calcaulate_sum2( -a_, a );
    }
}

int calcaulate_sum2( int start, int finish )
{
    int sum = 0;
    int i   = start;
    for ( i = start; i <= finish; i++ )
    {
        sum += i * i;
    }
    return sum;
}

char* create_bitmask( int cnodes, char* bitmask )
{
    memset( bitmask, 0x00, ( CNODES + 7 ) / 8 );
    int i = 0;
    for ( i = 0; i < cnodes; i = i + 1 )
    {
        cube_set_bit( bitmask, i );
    }
    // cube_set_known_cnodes_for_metric( cube, met, bitmask );
    return bitmask;
}

void create_array_of_doubles( cube_t* cube, cube_metric* met, double* sev, int nthreads, int ncallpaths, char* bitmask )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    if ( cube->cube_flavour == CUBE_MASTER )
    {
        carray*  sequence = cube_get_cnodes_for_metric( cube, met );
        unsigned i        = 0;
        for ( i = 0; i < sequence->size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }

    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < ncallpaths; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }

        fill_cube_type_double_row( ( double* )( sev + w * nthreads ), nthreads, idtocid[ ci ] + 1 );
        w++;
    }
    free( idtocid );
}

void create_array_of_int( cube_t* cube, cube_metric* met, int64_t* sev, int nthreads, int ncallpaths, char* bitmask )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    if ( cube->cube_flavour == CUBE_MASTER )
    {
        carray*  sequence = cube_get_cnodes_for_metric( cube, met );
        unsigned i        = 0;
        for ( i = 0; i < sequence->size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }

    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < ncallpaths; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }
        fill_cube_type_int_row( ( int64_t* )( sev + w * nthreads ), nthreads, idtocid[ ci ] + 1 );
        w++;
    }
    free( idtocid );
}

void create_array_of_tau_atomic( cube_t* cube, cube_metric* met, cube_type_tau_atomic* sev, int nthreads, int ncallpaths, char* bitmask )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    if ( cube->cube_flavour == CUBE_MASTER )
    {
        carray*  sequence = cube_get_cnodes_for_metric( cube, met );
        unsigned i        = 0;
        for ( i = 0; i < sequence->size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }

    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    unsigned ci;
    unsigned w = 0;
    for ( ci = 0; ci < ncallpaths; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }
        fill_cube_type_tau_atomic_row( ( cube_type_tau_atomic* )( sev + w * nthreads ), nthreads, idtocid[ ci ] + 1 );
        w++;
    }
    free( idtocid );
}

A.7 prototype_old.c

/****************************************************************************
**  CUBE        http://www.scalasca.org/                                   **
*****************************************************************************
**  Copyright (c) 1998-2018                                                **
**  Forschungszentrum Juelich GmbH, Juelich Supercomputing Centre          **
**                                                                         **
**  Copyright (c) 2009-2015                                                **
**  German Research School for Simulation Sciences GmbH,                   **
**  Laboratory for Parallel Programming                                    **
**                                                                         **
**  This software may be modified and distributed under the terms of       **
**  a BSD-style license.  See the COPYING file in the package base         **
**  directory for details.                                                 **
****************************************************************************/

/**
 * \file  prototype_old.c
 * \brief Example of using "libcube4w.a".  This test case produces the same output as the
 *        one from cubew_example_mpi_complex.c, but is to be used with the old library.
 *        It creates a cube report "mpi_complex_cube_old.cubex", defines the structure of
 *        metrics, call tree, machine and cartesian topology, and fills the cube with data.
 */

/* reconstructed system headers; the listing requires at least these */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <stdint.h>
#include <mpi.h>

#include "cubew_cube.h"
#include "cubew_services.h"
#include "cubew_types.h"

/* used in static array initialization; see the topology definition below */
#define NDIMS1 2
#define NDIMS2 2
#define NDIMS3 4
#define NDIMS4 14

#ifndef MAXNTHREADS
#define MAXNTHREADS 64
#endif

#ifndef NMETRICS
#define NMETRICS 7
#endif

#ifndef CNODES
#define CNODES 10000
#endif

static int current_random = 0;  /* used to create a stable, random-looking sequence of numbers (for building the call tree) */
static int MAXDEPTH       = 100;

void create_system_tree( cube_t* cube, cube_process** process, cube_thread** thread, cube_node* node, int size );

void fill_cube_type_double_row( double* row, unsigned size, double value );
void write_doubles_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                                 char* bitmask, int* counts, int* displs, int rank );

void fill_cube_type_int_row( int64_t* row, unsigned size, int value );
void write_int_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                             char* bitmask, int* counts, int* displs, int rank );

void fill_cube_type_tau_atomic_row( cube_type_tau_atomic* row, unsigned size, unsigned value );
void write_tau_atomic_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                                    char* bitmask, int* counts, int* displs, int rank );

void  create_calltree( cube_t* cube, cube_region** regn, cube_cnode** cnodes, int* idx, int* reg_idx, cube_cnode* parent, int level );
int   get_next_random();
void  set_seed_random( int s );
void  calculate_max_depth();
int   calcaulate_sum2( int start, int finish );
char* create_bitmask( int cnodes, char* bitmask );

/// Start point of this example.
int main( int argc, char* argv[] )
{
    int rank;
    int size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    if ( rank == 0 )
    {
        printf( "Number of processes: %d\n", size );
    }
    calculate_max_depth();

    char cubefile[] = "mpi_complex_cube_old";
    /* Create a cube with the name cubefile. Only the root creates a cube, using CUBE_MASTER. */

    int nthreads      = ( rank ) % ( MAXNTHREADS ) + 1;
    int thrdsincycle  = ( MAXNTHREADS ) * ( MAXNTHREADS + 1 ) / 2;
    int offset        = ( rank ) / ( MAXNTHREADS ) * thrdsincycle + ( nthreads - 1 ) * nthreads / 2;
    int totalnthreads = ( ( size / ( MAXNTHREADS ) ) * thrdsincycle )
                        + ( ( size % MAXNTHREADS ) * ( ( size % MAXNTHREADS ) + 1 ) ) / 2;

    int* counts = malloc( size * sizeof( int ) );
    MPI_Gather( &nthreads, 1, MPI_INT, counts, 1, MPI_INT, 0, MPI_COMM_WORLD );
    int* displs = malloc( size * sizeof( int ) );
    MPI_Gather( &offset, 1, MPI_INT, displs, 1, MPI_INT, 0, MPI_COMM_WORLD );

    double tstart, t0, t1, t2, t3, t4, t5, t6, t7, tend;
    if ( rank == 0 )
    {
        tstart = MPI_Wtime();
    }

    cube_t* cube;
    if ( rank == 0 )
    {
        cube = cube_create( cubefile, CUBE_MASTER, CUBE_FALSE );
        if ( !cube )
        {
            fprintf( stderr, "cube create failed!\n" );
            exit( 1 );
        }
    }

    /* Seven metrics */
    cube_metric** met_vector = ( cube_metric** )calloc( NMETRICS, sizeof( cube_metric* ) );
    carray*       sequence   = malloc( CNODES * sizeof( carray ) );

    if ( rank == 0 )
    {
        /* generate header */
        cube_def_mirror( cube, "http://www.fz-juelich.de/jsc/scalasca/" );
        cube_def_attr( cube, "description", "A simple example of Cube report" );
        cube_def_attr( cube, "experiment time", "November 1st, 2004" );

        printf( "Test file %s.cubex is being generated ...\n", cubefile );

        /* root metric Time */
        met_vector[ 0 ] = cube_def_met( cube, "Time", "time", "FLOAT", "sec", "",
                                        "@mirror@patterns-2.1.html#execution",
                                        "root node", NULL, CUBE_METRIC_INCLUSIVE );
        /* A metric of visits */
        met_vector[ 1 ] = cube_def_met( cube, "Visits", "visits", "INTEGER", "occ", "",
                                        "http://www.cs.utk.edu/usr.html",
                                        "Number of visits", NULL, CUBE_METRIC_EXCLUSIVE );
        /* Another root metric */
        met_vector[ 2 ] = cube_def_met( cube, "P2P bytes sent", "bytes_sent", "INTEGER", "bytes", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of bytes sent in point-to-point operations", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 3 ] = cube_def_met( cube, "P2P bytes received", "bytes_rcvd", "INTEGER", "bytes", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of bytes received in point-to-point operations", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 4 ] = cube_def_met( cube, "Tasks time", "time_in_tasks", "MAXDOUBLE", "sec", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Time spent in tasks.", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 5 ] = cube_def_met( cube, "Counters", "number_of_counters", "TAU_ATOMIC", "occ", "",
                                        "http://www.cs.utk.edu/sys.html",
                                        "Number of counters", NULL, CUBE_METRIC_EXCLUSIVE );

        met_vector[ 6 ] = cube_def_met( cube, "Memory allocation", "memoryallocation", "TAU_ATOMIC", "tau_statistics", "",
                                        "@mirror@patterns-2.1.html#memoryallocation",
                                        "Total allocated memory", NULL, CUBE_METRIC_EXCLUSIVE );

        /* define call tree */
        /* definition of regions in the call tree */
        char* mod = "/ICL/CUBE/example.c";

        cube_region** regn = ( cube_region** )calloc( 169, sizeof( cube_region* ) );

        char* names[ 169 ] = {
            "UNKNOWN", "TRACING", "MPI_Accumulate", "MPI_Allgather", "MPI_Allgatherv", "MPI_Allreduce",
            "MPI_Alltoall", "MPI_Alltoallv", "MPI_Alltoallw", "MPI_Barrier", "MPI_Bcast", "MPI_Bsend",
            "MPI_Bsend_init", "MPI_Cancel", "MPI_Cart_create", "MPI_Comm_create", "MPI_Comm_dup",
            "MPI_Comm_free", "MPI_Comm_group", "MPI_Comm_remote_group", "MPI_Comm_split", "MPI_Exscan",
            "MPI_File_close", "MPI_File_delete", "MPI_File_get_amode", "MPI_File_get_atomicity",
            "MPI_File_get_byte_offset", "MPI_File_get_group", "MPI_File_get_info", "MPI_File_get_position",
            "MPI_File_get_position_shared", "MPI_File_get_size", "MPI_File_get_type_extent",
            "MPI_File_get_view", "MPI_File_iread", "MPI_File_iread_at", "MPI_File_iread_shared",
            "MPI_File_iwrite", "MPI_File_iwrite_at", "MPI_File_iwrite_shared", "MPI_File_open",
            "MPI_File_preallocated", "MPI_File_read", "MPI_File_read_all", "MPI_File_read_all_begin",
            "MPI_File_real_all_end", "MPI_File_read_at", "MPI_File_read_at_all", "MPI_File_read_at_all_begin",
            "MPI_File_read_at_all_end", "MPI_File_read_ordered", "MPI_File_read_ordered_begin",
            "MPI_File_read_ordered_end", "MPI_File_read_shared", "MPI_File_seek", "MPI_File_seek_shared",
            "MPI_File_set_atomicity", "MPI_File_set_info", "MPI_File_set_size", "MPI_File_set_view",
            "MPI_File_sync", "MPI_file_write", "MPI_File_write_all", "MPI_File_write_all_begin",
            "MPI_File_write_all_end", "MPI_File_write_at_all", "MPI_File_write_at_all_begin",
            "MPI_File_write_at_all_end", "MPI_File_write_ordered", "MPI_File_write_ordered_begin",
            "MPI_File_write_ordered_end", "MPI_File_write_shared", "MPI_Finalize", "MPI_Gather",
            "mPI_Gatherv", "MPI_Get", "MPI_Graph_create", "MPI_Group_difference", "MPI_Group_excl",
            "MPI_Group_free", "MPI_Group_incl", "MPI_Group_intersection", "MPI_Group_range_excl",
            "MPI_Group_range_incl", "MPI_Group_union", "MPI_Ibsend", "MPI_Init", "MPI_Init_thread",
            "MPI_Intercomm_create", "MPI_Intercomm_merge", "MPI_Irecv", "MPI_Irsend", "MPI_Isend",
            "MPI_Issend", "MPI_Probe", "MPI_Put", "MPI_Recv", "MPI_Recv_init", "MPI_Reduce",
            "MPI_Reduce_scatter", "MPI_Request_free", "MPI_Rsend", "MPI_Rsend_init", "MPI_Scan",
            "MPI_Scatter", "MPI_Scatterv", "MPI_Send", "MPI_Send_init", "MPI_Sendrecv",
            "MPI_Sendrecv_replace", "MPI_Ssend", "MPI_Ssend_init", "MPI_Start", "MPI_Startall",
            "MPI_Test", "MPI_Testall", "MPI_Testany", "MPI_Testsome", "MPI_Wait", "MPI_Waitall",
            "MPI_Waitany", "MPI_Waitsome", "MPI_Win_complete", "MPI_Win_create", "MPI_Win_fence",
            "MPI_Win_free", "MPI_Win_lock", "MPI_Win_post", "MPI_Win_start", "MPI_Win_test",
            "MPI_Win_unlock", "MPI_Win_wait", "PARALLEL", "driver", "task_init", "bcast_init",
            "barriere_sync", "read_input", "bcast_real", "decomp", "inner_auto", "inner", "initialize",
            "initxs", "initsnc", "octant", "initgeom", "timers", "its=1", "source", "sweep", "rcv_real",
            "snd_real", "global_init_sum", "flux_err", "global_real_max", "its=2", "its=3", "its=4",
            "its=5", "its=6", "its=7", "its=8", "its=9", "its=10", "its=11", "its=12",
            "global_real_sum", "task_end"
        };

        unsigned i;
        for ( i = 0; i < 169; i++ )
        {
            char* descr = "";
            if ( i > 1 && i < 132 )
            {
                descr = "MPI";
            }
            if ( i == 1 || i == 132 )
            {
                descr = "EPIK";
            }
            if ( i > 132 )
            {
                descr = "USR";
            }
            regn[ i ] = cube_def_region( cube, names[ i ], names[ i ], "mpi", "barrier", -1, -1, "", descr, mod );
        }

        /* actual definition of the call tree */
        srand( 0 );
        cube_cnode** cnodes_vector = ( cube_cnode** )calloc( CNODES, sizeof( cube_cnode* ) );
        int idx        = 0;
        int region_idx = 0;
        create_calltree( cube, regn, cnodes_vector, &idx, &region_idx, NULL, 0 );

        /* define location tree */
        cube_machine* mach = cube_def_mach( cube, "MSC<>", "" );
        cube_node*    node = cube_def_node( cube, "Athena<>", mach );
        cube_process** processes_vector = ( cube_process** )calloc( size, sizeof( cube_process* ) );
        cube_thread**  threads_vector   = ( cube_thread** )calloc( size * totalnthreads, sizeof( cube_thread* ) );
        /* Define processes and threads */
        create_system_tree( cube, processes_vector, threads_vector, node, size );
        free( regn );

        cube_set_statistic_name( cube, "mystat" );

        cube_set_metrics_title( cube, "Metric tree (QMCD App )" );
        cube_set_calltree_title( cube, "Calltree (serial run, #2345)" );
        cube_set_systemtree_title( cube, "System ( Linux cluster < 43 & 23 >)" );

        /* generate header */
        cube_def_mirror( cube, "http://icl.cs.utk.edu/software/kojak/" );
        cube_def_mirror( cube, "http://www.fz-juelich.de/jsc/kojak/" );
        cube_def_attr( cube, "description", "a simple example" );
        cube_def_attr( cube, "experiment time", "November 1st, 2004" );
        cube_enable_flat_tree( cube, CUBE_FALSE );
    }

    if ( rank == 0 )
    {
        t0 = MPI_Wtime();
        printf( "Definitions: %f\n", t0 - tstart );
    }

    MPI_Barrier( MPI_COMM_WORLD );

    int   n_callpaths;
    int   bitmask_size = ( CNODES + 7 ) / 8;
    char* bitmask      = malloc( bitmask_size * sizeof( char ) );

    /* First metric */
    write_doubles_line_by_line( cube, met_vector[ 0 ], nthreads, totalnthreads, CNODES, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t1 = MPI_Wtime(); printf( "1. metric: %f\n", t1 - t0 ); }

    /* Second metric */
    write_int_line_by_line( cube, met_vector[ 1 ], nthreads, totalnthreads, CNODES, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t2 = MPI_Wtime(); printf( "2. metric: %f\n", t2 - t1 ); }

    /* Third metric - 5% sparse */
    n_callpaths = CNODES * 0.05;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 2 ], bitmask );
    }
    write_int_line_by_line( cube, met_vector[ 2 ], nthreads, totalnthreads, n_callpaths, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t3 = MPI_Wtime(); printf( "3. metric: %f\n", t3 - t2 ); }

    /* Fourth metric - 5% sparse */
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 3 ], bitmask );
    }
    write_int_line_by_line( cube, met_vector[ 3 ], nthreads, totalnthreads, n_callpaths, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t4 = MPI_Wtime(); printf( "4. metric: %f\n", t4 - t3 ); }

    /* Fifth metric - 20% sparse */
    n_callpaths = CNODES * 0.2;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 4 ], bitmask );
    }
    write_doubles_line_by_line( cube, met_vector[ 4 ], nthreads, totalnthreads, n_callpaths, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t5 = MPI_Wtime(); printf( "5. metric: %f\n", t5 - t4 ); }

    /* Sixth metric */
    write_tau_atomic_line_by_line( cube, met_vector[ 5 ], nthreads, totalnthreads, CNODES, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t6 = MPI_Wtime(); printf( "6. metric: %f\n", t6 - t5 ); }

    /* Seventh metric - 20% sparse */
    n_callpaths = CNODES * 0.2;
    create_bitmask( n_callpaths, bitmask );
    if ( rank == 0 )
    {
        cube_set_known_cnodes_for_metric( cube, met_vector[ 6 ], bitmask );
    }
    write_tau_atomic_line_by_line( cube, met_vector[ 6 ], nthreads, totalnthreads, n_callpaths, bitmask, counts, displs, rank );

    if ( rank == 0 ) { t7 = MPI_Wtime(); printf( "7. metric: %f\n", t7 - t6 ); }

    free( bitmask );
    free( counts );
    free( displs );

    /* Delete the cube_t structure and release all memory */
    if ( rank == 0 )
    {
        cube_free( cube );
        free( met_vector );
    }

    if ( rank == 0 )
    {
        tend = MPI_Wtime();
        printf( "Anchor file:  %f\n", tend - t7 );
        printf( "All metrics:  %f\n", t7 - t0 );
        printf( "Alltogether:  %f\n", tend - tstart );
    }

    if ( rank == 0 )
    {
        printf( "Test file %s.cubex complete.\n", cubefile );
    }
    MPI_Finalize();
    /* finish */
    return 0;
}

int get_next_random()
{
    current_random = ( current_random + 5 ) % 13;
    return current_random;
}

void set_seed_random( int s )
{
    current_random = s;
}

void create_system_tree( cube_t* cube, cube_process** process, cube_thread** thread, cube_node* node, int size )
{
    unsigned i, j, nthrds, totthreads;
    nthrds     = 1;
    totthreads = 0;
    for ( i = 0; i < size; i++ )
    {
        char proc[] = "Process";
        char thrd[] = "Thread";
        cube_process* proc1 = cube_def_proc( cube, proc, i, node );
        process[ i ] = proc1;
        for ( j = 0; j < nthrds; j++ )
        {
            cube_thread* thread1 = cube_def_thrd( cube, thrd, totthreads, process[ i ] );
            thread[ totthreads ] = thread1;
            totthreads++;
        }

        nthrds = ( nthrds == MAXNTHREADS ) ? 1 : nthrds + 1;
    }
}

void create_calltree( cube_t* cube, cube_region** regn, cube_cnode** cnodes, int* idx, int* region_idx, cube_cnode* parent, int level )
{
    if ( ( *idx ) >= CNODES )
    {
        return;
    }
    if ( level >= MAXDEPTH )
    {
        return;
    }
    int      num_children = get_next_random();
    unsigned i;
    for ( i = 0; i < num_children; i++ )
    {
        unsigned rand_reg = *region_idx;
        *region_idx = ( ( *region_idx + 1 ) % 169 );  // cyclic selection of regions
        cube_cnode* cnode = cube_def_cnode( cube, regn[ rand_reg ], parent );
        cnodes[ *idx ] = cnode;
        ( *idx )++;
        create_calltree( cube, regn, cnodes, idx, region_idx, cnode, level + 1 );
        if ( ( *idx ) >= CNODES )
        {
            return;  // the end of the array was already reached inside the recursive call
        }
    }
}

void calculate_max_depth()
{
    MAXDEPTH = ( int )( log( CNODES ) / log( 6 ) ) + 3;
}

void fill_cube_type_double_row( double* row, unsigned size, double value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        row[ i ] = value + i;
    }
}

void fill_cube_type_int_row( int64_t* row, unsigned size, int value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        row[ i ] = value + i;
    }
}

void fill_cube_type_tau_atomic_row( cube_type_tau_atomic* row, unsigned size, unsigned value )
{
    unsigned i = 0;
    for ( i = 0; i < size; i++ )
    {
        double A = ( double )( value + i );

        int a_ = ( int )floor( A / 2. );
        int a  = ( int )ceil( A / 2. );

        row[ i ].N    = ( a_ + a + 1 );
        row[ i ].Min  = -a_;
        row[ i ].Max  = a;
        row[ i ].Sum  = ( a_ != a ) ? a : 0;
        row[ i ].Sum2 = calcaulate_sum2( -a_, a );
    }
}

int calcaulate_sum2( int start, int finish )
{
    int sum = 0;
    int i   = start;
    for ( i = start; i <= finish; i++ )
    {
        sum += i * i;
    }
    return sum;
}

char* create_bitmask( int cnodes, char* bitmask )
{
    memset( bitmask, 0x00, ( CNODES + 7 ) / 8 );
    int i = 0;
    for ( i = 0; i < cnodes; i = i + 1 )
    {
        cube_set_bit( bitmask, i );
    }
    // cube_set_known_cnodes_for_metric( cube, met, bitmask );
    return bitmask;
}

void write_doubles_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                                 char* bitmask, int* counts, int* displs, int rank )
{
    carray*   sequence;
    unsigned* idtocid = calloc( CNODES, sizeof( unsigned ) );
    int       size;
    if ( rank == 0 )
    {
        sequence = cube_get_cnodes_for_metric( cube, met );
        size     = sequence->size;
        unsigned i = 0;
        for ( i = 0; i < size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }
    MPI_Bcast( &size, 1, MPI_INT, 0, MPI_COMM_WORLD );
    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    double*  local  = malloc( nthreads * sizeof( double ) );
    double*  global = malloc( totnthreads * sizeof( double ) );
    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < size; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }

        fill_cube_type_double_row( local, nthreads, idtocid[ ci ] + 1 );
        MPI_Gatherv( local, nthreads, MPI_DOUBLE, global, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD );
        if ( rank == 0 )
        {
            cube_write_sev_row_of_doubles( cube, met, ( cube_cnode* )( sequence->data[ ci ] ), global );
        }

        w++;
    }
    // free( sequence );
    free( idtocid );
    free( local );
    free( global );
}

void write_int_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                             char* bitmask, int* counts, int* displs, int rank )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    carray*   sequence;
    int       size;
    if ( rank == 0 )
    {
        sequence = cube_get_cnodes_for_metric( cube, met );
        size     = sequence->size;
        unsigned i = 0;
        for ( i = 0; i < size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }
    MPI_Bcast( &size, 1, MPI_INT, 0, MPI_COMM_WORLD );
    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    int64_t* local  = malloc( nthreads * sizeof( int64_t ) );
    int64_t* global = malloc( totnthreads * sizeof( int64_t ) );
    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < size; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }
        fill_cube_type_int_row( local, nthreads, idtocid[ ci ] + 1 );
        MPI_Gatherv( local, nthreads, MPI_INT64_T, global, counts, displs, MPI_INT64_T, 0, MPI_COMM_WORLD );
        if ( rank == 0 )
        {
            cube_write_sev_row_of_int64( cube, met, ( cube_cnode* )( sequence->data[ ci ] ), global );
        }

        w++;
    }
    free( idtocid );
    // free( sequence );
    free( local );
    free( global );
}

void write_tau_atomic_line_by_line( cube_t* cube, cube_metric* met, int nthreads, int totnthreads, int ncallpaths,
                                    char* bitmask, int* counts, int* displs, int rank )
{
    unsigned* idtocid = malloc( CNODES * sizeof( unsigned ) );
    carray*   sequence;
    int       size;
    if ( rank == 0 )
    {
        sequence = cube_get_cnodes_for_metric( cube, met );
        size     = sequence->size;
        unsigned i = 0;
        for ( i = 0; i < size; i++ )
        {
            idtocid[ i ] = ( ( cube_cnode* )( sequence->data )[ i ] )->id;
        }
    }
    MPI_Bcast( &size, 1, MPI_INT, 0, MPI_COMM_WORLD );
    MPI_Bcast( idtocid, CNODES, MPI_UNSIGNED, 0, MPI_COMM_WORLD );

    cube_type_tau_atomic* local  = malloc( nthreads * sizeof( cube_type_tau_atomic ) );
    cube_type_tau_atomic* global = malloc( totnthreads * sizeof( cube_type_tau_atomic ) );

    MPI_Datatype tau_atomic;
    MPI_Type_contiguous( sizeof( cube_type_tau_atomic ), MPI_BYTE, &tau_atomic );
    MPI_Type_commit( &tau_atomic );

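    /* Editor's note (illustration, not part of the original prototype): cube_type_tau_atomic
     * is a C struct, so a matching MPI datatype is needed for the MPI_Gatherv below.
     * Treating the struct as sizeof( cube_type_tau_atomic ) contiguous bytes is sufficient
     * here because sender and receiver use the identical struct layout on the same machine. */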
    unsigned ci;
    int      w = 0;
    for ( ci = 0; ci < size; ci++ )
    {
        if ( CNODES != ncallpaths && !cube_is_bit_set( bitmask, ci ) )
        {
            continue;
        }

        fill_cube_type_tau_atomic_row( local, nthreads, idtocid[ ci ] + 1 );
        MPI_Gatherv( local, nthreads, tau_atomic, global, counts, displs, tau_atomic, 0, MPI_COMM_WORLD );
        if ( rank == 0 )
        {
            cube_write_sev_row_of_cube_type_tau_atomic( cube, met, ( cube_cnode* )( sequence->data[ ci ] ), global );
        }

        w++;
    }
    free( idtocid );
    // free( sequence );
    free( local );
    free( global );
}
