Transferring User-Defined Types in OpenACC

James Beyer, David Oehmke, Jeff Sandoval
Cray, Inc.
Saint Paul, USA
(beyerj|doehmke|sandoval)@cray.com

Abstract: A preeminent problem blocking the adoption of OpenACC by many programmers is support for user-defined types: classes and structures in C/C++ and derived types in Fortran. This problem is particularly challenging for data structures that involve pointer indirection, since transferring these data structures between the disjoint host and accelerator memories found on most modern accelerators requires deep-copy semantics. This paper will look at the mechanisms available in OpenACC 2.0 to allow the programmer to design transfer routines for OpenACC programs. Once these mechanisms have been explored, a new directive-based solution will be presented. Code examples will be used to compare the current state-of-the-art and the new proposed solution.

Keywords: OpenACC; user-defined types; accelerators; compiler directives; heterogeneous programming

I. INTRODUCTION

A preeminent problem blocking the adoption of OpenACC by many programmers is support for user-defined types: classes and structures in C/C++ and derived types in Fortran. User-defined types are an important part of many user applications; for example, in C and C++ something as simple as a complex number is really a structure of two floats. This problem is particularly challenging for data structures that involve pointer indirection, since transferring these data structures between the disjoint host and accelerator memories found on most modern accelerators requires deep-copy semantics. Unfortunately, the current OpenACC specification only allows for shaping the top level of an array; there is no mechanism for shaping a pointer inside of a structure. This problem is manageable for Fortran, where most objects are self-describing, but unavoidable for C and C++, where all pointers are unshaped.

There are really only two solutions to this problem available in OpenACC 2.0 [1]: (1) refactor the code to stop using pointers in structures (i.e., use arrays of structures to replace structures of arrays) and (2) use an API to move the objects to the device.

The second solution is preferable because it affects only code that performs data transfers rather than all code that makes use of the user-defined type. This paper will look at the mechanisms available in OpenACC 2.0 to allow the programmer to design transfer routines for OpenACC programs. Using a simple example we will show that this approach requires writing large amounts of low-level code to manage device memory, which defeats the goal of a high-level directive-based programming model. In fact, the resulting code is very similar to that required when writing a CUDA program [2]. For OpenACC, these details should instead be handled by the compiler.

The Cray compiler provides a deep-copy command line option for Fortran [3]. However, experience has shown that users often have extremely large data sets contained in their derived types and only want to transfer the parts that they need for a given kernel. As a result, this paper will present a more flexible directive-based solution. Code examples will be used to compare the current state-of-the-art and the new proposed solution. The directives we propose provide a mechanism to shape an array contained inside of a structure, which will bring C and C++ programs to the same level as Fortran concerning deep copies. This same mechanism can also be used to limit which parts of a user-defined object are transferred, avoiding the all-or-nothing problem encountered with the Cray deep-copy option. Along with elegantly solving the shaping problem, this mechanism also reduces the level of difficulty for the programmer since they only need to express the shape of a pointer rather than program how to move the memory behind the pointer.

II. BACKGROUND AND PROBLEM

OpenACC has limited support for user-defined types: a variable of user-defined type must be a flat object – that is, it may not contain pointer indirection (including Fortran allocatable members). This restriction simplifies implementations, ensuring that all variables are identical on both host and device and may be transferred as contiguous blocks of memory. However, feedback from the OpenACC user community indicates that this restriction is a significant impediment to porting interesting data structures and algorithms to OpenACC. This section describes the challenges to relaxing this restriction.

Variables and data structures that contain pure data are interpreted the same on the host and device – therefore, an implementation may transfer such a variable to and from device memory as a simple memory transfer, without regard for the contents of that block of memory.¹ On the other hand, data structures that contain pointers are interpreted differently on the host and device. A host pointer targets a location in host memory and a device pointer targets a location in device memory; either pointer may be legally dereferenced only on its corresponding device.² If an implementation transfers a data structure using a simple transfer, then pointers in that data structure will refer to invalid locations – this transfer policy is referred to as shallow copy, which is illustrated in Figure 1. For comparison, shallow copy is acceptable for shared-memory programming models like OpenMP because it is always legal to dereference a pointer (although shallow copy does imply that the target of member pointers will be shared among all copies). In contrast, shallow copy is less useful in OpenACC because dereferencing the resulting pointer may not be legal. Instead, users often require deep-copy semantics, as illustrated in Figure 2, where every object in a data structure is transferred. Deep copy requires recursively traversing pointer members in a data structure, transferring all disjoint memory locations, and translating the pointers to refer to the appropriate device location. This technique is also known as object serialization or marshalling, which is commonly used for sending complex data structures across networks or writing them to disk. For example, Cray's first deep-copy prototype implementation was inspired by MPI, which has basic support for describing the layout of user-defined data types and sending user-defined objects between ranks [4].

[Figures 1 and 2 both show the following code and the resulting memory layout.]

    struct {
        int *x;   // dynamic size 2
    } *A;         // dynamic size 2
    #pragma acc data copy(A[0:2])

    Host memory:                   A[0].x, A[1].x, and the two x[0] x[1] arrays they target
    Device memory (shallow copy):  dA[0].x, dA[1].x only
    Device memory (deep copy):     dA[0].x, dA[1].x, and device copies of both x[0] x[1] arrays

Figure 1. Shallow copy
Figure 2. Deep copy

A Fortran compiler could automatically perform deep copy, since Fortran pointers are self-describing dope vectors. In fact, deep copy is the language-specified behavior for intrinsic assignment to a derived type containing allocatable members. Unfortunately, a C/C++ compiler cannot automatically perform deep copy, since C/C++ pointers are raw pointers that do not contain shape information. Further, even for self-describing dope vectors in Fortran, a user may desire to copy only some subset of deep members. Addressing these problems requires additional information from the programmer.

Finally, throughout the paper we will distinguish between two types of members for user-defined types: direct members and indirect members. Direct members store their contents directly in the contiguous memory of the user-defined type. A scalar or statically-sized array member is a direct member. In contrast, indirect members store their contents outside of the contiguous memory of the user-defined type, where they are accessed through pointer indirection. Thus, an indirect member always has a corresponding direct member that stores the address or access mechanism of the indirect part. For example, a pointer member in C/C++ is a direct member, but the target of the pointer is an indirect member. In Fortran, an allocatable member is an indirect member, but the underlying dope vector for the allocatable data is a direct member. Since direct members are contained directly in an object, they are automatically transferred as part of that object. But indirect members are not contained directly in an object, and thus require deep-copy semantics to correctly transfer them.

A. Desired capabilities

There are several distinct capabilities that we can attempt to offer users to improve support for user-defined types with indirect members. These techniques are all various forms of deep copy, where every sub-object on the device must be accessible through a chain of parent objects originating at a single base object – this ensures that addressing calculations and access mechanisms are identical for both host and device code. Additionally, these techniques all assume that individual objects are transferred in whole and have an identical layout in both host and device memory.

Member shapes: Allowing users to explicitly shape member pointers puts C/C++ on the same footing as Fortran: pointers become self-describing. This first step is important because it makes automatic deep copy possible. However, member shapes aren't useful alone, since they don't imply

¹ Although it is technically possible for a host and device to interpret pure data differently, architectures that follow common conventions and standards (e.g., equivalent endianness, two's complement and IEEE floating point) are compatible.

² Pointers are functionally equivalent if a host and device share common memory. Even so, dereferencing such a non-local pointer will likely affect performance due to memory locality; thus, on these devices it may still be desirable to physically transfer memory targeted by pointers.
