Inner Array Inlining for Structure of Arrays Layout

Matthias Springer, Yaozhu Sun, and Hidehiko Masuhara
Tokyo Institute of Technology, Japan
[email protected] [email protected] [email protected]

Abstract

Previous work has shown how the well-studied and SIMD-friendly Structure of Arrays (SOA) data layout strategy can speed up applications in high-performance computing compared to a traditional Array of Structures (AOS) data layout. However, a standard SOA layout cannot handle structures with inner arrays; such structures appear frequently in graph-based applications and object-oriented designs with associations of high multiplicity.

This work extends the SOA data layout to structures with array-typed fields. We present different techniques for inlining (embedding) inner arrays into an AOS or SOA layout, as well as the design and implementation of an embedded C++/CUDA DSL that lets programmers write such layouts in a notation close to standard C++. We evaluate several layout strategies with a traffic flow simulation, an important real-world application in transport planning.

CCS Concepts • Software and its engineering → Object oriented languages; Data types and structures; Parallel programming languages;

Keywords CUDA, Data Inlining, Inner Arrays, Object-oriented Programming, SIMD, Structure of Arrays

ACM Reference Format:
Matthias Springer, Yaozhu Sun, and Hidehiko Masuhara. 2018. Inner Array Inlining for Structure of Arrays Layout. In Proceedings of 5th ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming (ARRAY'18). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3219753.3219760

ARRAY'18, June 19, 2018, Philadelphia, PA, USA. © 2018 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery. ACM ISBN 978-1-4503-5852-1/18/06. https://doi.org/10.1145/3219753.3219760

1 Introduction

In recent years a variety of applications [19] have been implemented on GPUs in order to utilize their massive parallelism, in areas such as database systems, machine learning, mathematics or real-world simulations (e.g., traffic simulations [23, 24]). Our recent focus is on applications that are amenable to object-oriented programming such as agent-based simulations (e.g., traffic flow simulations [21]) or graph processing algorithms. Due to conceptual differences in the programming models and hardware architectures between CPUs and GPUs, developing performant GPU programs is not straightforward for most programmers. There are a number of best practices for achieving good performance on GPUs, ranging from control flow optimizations to data layout optimizations; such optimizations are often tedious to implement, lead to less readable code and interfere with programming abstractions. One well-studied best practice for CPUs and GPUs is Structure of Arrays (SOA).

AOS and SOA Well-organized programs make frequent use of structs or classes. Array of Structures (AOS) and Structure of Arrays (SOA) describe techniques for organizing multiple structs/objects in memory. In AOS, the standard technique in most compilers, all fields of a struct are stored together. In SOA, all values of a field are stored together. SOA can benefit cache utilization [4] and is useful in SIMD programs: Two values of the same field but different objects can be loaded into a vector register in a single instruction. Similarly, GPUs can combine multiple simultaneous global memory requests within a warp (group of 32 consecutive threads) into fewer coalesced requests, if data is spatially local [6, 18].

SOA works well with simple data structures, but cannot be applied easily to structs that contain arrays or other collections of different size. Such arrays must be allocated at separate locations, requiring an additional pointer indirection (Fig. 1).

In this work, we study two important real-world

Figure 1. Example: Vertex class in SOA memory layout with 1000 objects. Values for distance and num_neighbors are stored in SOA arrays, but neighbor arrays must be stored in a different location, because every array may have a different size. This requires an additional pointer indirection and simultaneous accesses into neighbor arrays are unlikely to be coalesced.

50 ARRAY’18, June 19, 2018, Philadelphia, PA, USA Matthias Springer, Yaozhu Sun, and Hidehiko Masuhara

applications: breadth-first search (BFS) and an agent-based, object-oriented traffic flow simulation. In graph processing, most vertices have multiple neighbors and adjacency lists are the preferred representation for BFS on GPUs [5]. The traffic flow simulation exhibits graph-based features for representing street networks and utilizes array-based data structures within the simulation logic. Algorithms that iterate over arrays one-by-one or in a fashion that is uniform among all objects are particularly interesting, because their memory access can be optimized. While a standard SOA layout groups together all elements of an array per object, a different, more SIMD-friendly layout can group elements by array index and benefit from memory coalescing.

Ikra-Cpp Even though SOA can lead to better memory performance, it cannot be combined with object-oriented programming (OOP) in CUDA and many other programming languages while maintaining OOP abstractions [22]. C++ libraries like SoAx [7] and ASX [25], or the ispc compiler [20] lay out data as SOA while providing an AOS-style programming interface, but they do not support object-oriented programming. Neither do they support SIMD-friendly array-typed fields. The goal of the Ikra-Cpp project is to provide a set of language abstractions and optimizations for object-oriented high-performance computing on CPUs and GPUs. The focus of this paper is on array-typed fields.

Figure 2. No Inlining: AOS (left side) and SOA (right side). Low memory footprint, but inner arrays must be accessed through a pointer indirection.

Figure 3. Full Inlining: AOS (left side) and SOA (middle, right side). High memory footprint if inner arrays have different sizes, but arrays can be accessed without pointer indirection.

Contributions and Outline This paper makes the following contributions.

• An analysis and performance evaluation of data layout strategies for array-typed fields in a SOA data layout.
• The design and implementation of Ikra-Cpp, an embedded C++/CUDA DSL that allows programmers to design object-oriented programs with array-typed fields in AOS-style notation, while storing data in SOA layout with inner array inlining.
• A parallel implementation of a traffic flow simulation, an important real-world application in transport planning, based on cellular automata on GPUs.

Figure 4. Partial Inlining: AOS (left side) and SOA (middle, right side). The first two inner array elements can be accessed without pointer indirection, but a conditional branch is required to check if an element is stored in the object (inlined) or in the external storage.

In the remainder of this paper, Sections 2 and 3 discuss layout strategies for array-typed fields and their API/implementation in Ikra-Cpp. Section 4 shows how they can be applied in a larger project, a traffic flow simulation. Section 5 presents a performance evaluation with various layout strategies. Finally, Sections 6 and 7 discuss related work and conclude.

2 Data Layout Strategies for Inner Arrays

This section describes seven inner array layout strategies, focusing on C++-style arrays with fixed size after allocation. Figs. 2–5 visualize these strategies, using the Vertex class of a breadth-first search algorithm (Fig. 6(a)) as an example.

Figure 5. SOA, Array as Object. High memory footprint if inner arrays have different sizes, but arrays can be accessed without pointer indirection.


2.1 Data Layouts

We consider three kinds of layout strategies: AOS, SOA and Array as Object (i.e., SOA without handling inner arrays specially). For each one, inner arrays can either be fully inlined, partially inlined or not inlined at all (without inlining).

AOS without Inlining (Fig. 2 (left)) This is the default data layout that many programmers choose intuitively. Objects are laid out as AOS, i.e., all fields of an object are stored together. Since inner arrays can have different sizes, they are allocated at different locations and referenced with pointers. This can either be done manually (using malloc/new) or with a container like std::vector. Depending on the structure, compilers may have to add padding to ensure that all fields are aligned properly. "AOS without Inlining" is identical to "AOS with Partial Inlining" with an inlining size of zero; the conditional branch is optimized away.

AOS with Full Inlining (Fig. 3 (left)) This layout strategy still lays out objects as AOS, but inlines inner arrays fully into objects, such that they can be accessed more efficiently without a pointer indirection. Furthermore, small inner arrays may be able to share cache lines with other fields, and thus benefit cache utilization. However, this approach wastes memory; all inner arrays must have the same size, i.e., the largest size among all inner arrays, to be able to hold all elements. The amount of wasted memory depends on the variance among inner array sizes. This layout can be implemented with a helper class like std::array.

AOS with Partial Inlining (Fig. 4 (left)) This layout is a mixture of the previous two strategies. Up to N many inner array elements are inlined into objects, where N is a compile-time constant. Elements with an index ≥ N are stored externally at a different location. The benefit of this approach is efficient access to the first N elements. Access to elements on the external storage is as expensive as with "No Inlining". Furthermore, this strategy requires an additional conditional branch to determine whether an element is stored in the inline storage or on the external storage. This imposes almost no overhead on GPUs, which generally do not execute instructions speculatively. This layout can be implemented with a helper class like absl::InlinedVector¹.

SOA without Inlining (Fig. 2 (right)) This layout stores objects as SOA (all values of a field together, sorted by object ID). This approach has a number of benefits: First, if not all fields are used all the time, it can improve cache utilization because those fields will not occupy cache entries. Second, it allows for efficient loads/stores from/into vector registers if objects with consecutive IDs are processed. Third, it allows for memory coalescing on GPUs if objects with consecutive IDs are processed within a warp. Finally, no memory is wasted for object padding; only the field arrays themselves must be aligned. With respect to inner arrays, the same advantages and disadvantages as in the first strategy apply. This layout is supported by Ikra as inlined_array_(T, 0).

SOA with Full Inlining (Fig. 3 (right)) This layout treats every inner array slot as a separate field and allocates values for every array index in a separate SOA array. It provides opportunities for vectorized operations and memory coalescing not only for primitive fields but also when accessing inner array elements. This is possible if objects with consecutive IDs are processed in the same warp and inner array elements with the same indices are accessed. Many parallel graph algorithms [15] on GPUs do not benefit much from additional memory coalescing in this layout: Even though reading the pointers in an adjacency list can be coalesced, data reads/writes on neighboring vertices are still uncoalesced, because vertex IDs of neighbors are usually random and not consecutive. Similar to the second strategy, this layout provides more efficient array access without a pointer indirection but can waste memory. It is supported by Ikra as fully_inlined_array_(T, N).

SOA with Partial Inlining (Fig. 4 (right)) Similar to the third strategy, this layout is a mixture between no and full inlining. The first N inner array elements can be accessed efficiently and may benefit from vectorized operations and memory coalescing. This layout is supported by Ikra as inlined_array_(T, N).

SOA with Array as Object (Fig. 5) This layout treats inner arrays as normal C++ objects and does not perform any data layout transformations on them. There is one SOA array for every field, including inner arrays. This layout is useful if the total data size is small and all elements of an inner array are always used together. In that case, all inner arrays can fit into the cache and accessing the first inner array element will prefetch the next ones. Furthermore, if an application exhibits nested parallelism with respect to inner arrays, those accesses can be vectorized and coalesced. This layout is supported by Ikra as field_(std::array).

From an inlining perspective, this layout inlines the inner array fully. It could easily be adapted to partial inlining (field_(absl::InlinedVector)) or without inlining (field_(std::vector)), but we do not analyze such layouts any further in this work.

2.2 Choosing a Layout Strategy

Programmers have a variety of layout strategies to choose from. Which strategy is best depends on the hardware architecture, the data access patterns of the application and the characteristics of the dataset. With Ikra-Cpp, programmers still have to take these factors into account, but switching between strategies is now much easier. As a rule of thumb, we suggest starting to experiment with a partial inlining size that ensures that 80% of all inner array elements are inlined.

¹Part of the Abseil library. See https://github.com/abseil/abseil-cpp.


3 Design and Usage of Ikra-Cpp

To make GPU programming easier and more productive, we developed Ikra-Cpp, an embedded C++/CUDA DSL [22]. It focuses on data layout optimizations, including strategies for inner array inlining. In the next paragraph, we describe the frontier-based BFS algorithm for GPUs. This algorithm serves as an example when describing the API of Ikra-Cpp in Section 3.1. Finally, Section 3.2 describes the main implementation ideas behind Ikra-Cpp; for details, refer to our implementation paper [22].

Frontier-based Breadth-first Search BFS is a fundamental algorithm in graph processing. A variety of implementation strategies have been proposed for GPUs, some based on advanced techniques such as hierarchical queues [13] or virtual warp-centric programming [8]. The frontier-based BFS algorithm [16] is among the simplest ones and provides a reasonable speedup compared to CPU execution.

BFS computes the distance of every vertex from a designated start vertex. At first, the start vertex has distance zero and all other vertices have distance infinity. The algorithm then proceeds iteratively. In iteration i, all vertices with distance i (i.e., the frontier) are processed in parallel: For every vertex in the frontier, all of its neighbors are updated with distance i + 1, unless they already have a smaller distance value. The algorithm terminates if no updates are performed or, in this example², after a fixed number of iterations. BFS is an interesting example for Ikra-Cpp because the adjacency lists are arrays of different sizes.

²Such a termination criterion can be implemented with Ikra-Cpp.

```cpp
 1 class Vertex : public SoaLayout<Vertex,
 2     /* maximum number of objects */,
 3     /* storage strategy */> {
 4  public: IKRA_INITIALIZE_CLASS
 5   __host__ Vertex(int num_neighbors) : neighbors_(num_neighbors),
 6       num_neighbors_(num_neighbors) {}
 7   __device__ void visit(int frontier) {
 8     if (distance_ == frontier)
 9       for (int i = 0; i < num_neighbors_; ++i)
10         neighbors_[i]->update(distance_ + 1);
11   }
12   __device__ void update(int new_distance) {
13     if (new_distance < distance_) distance_ = new_distance;
14   }
15   int_ distance_ = std::numeric_limits<int>::max();
16   int_ num_neighbors_; inlined_array_(Vertex*, 3) neighbors_;
17 }; IKRA_DEVICE_STORAGE(Vertex);
```

(a) Vertex class for BFS with frontiers. Regardless of the runtime size of a neighbors_ array, 3 array slots are allocated (inlined) in the object. This number (partial inlining size) was chosen based on dataset characteristics, such that more than 80% of inner array elements are inlined. Additional elements are stored in an arena, whose size was calculated from dataset characteristics and the partial inlining size. The constructor has no __device__ annotation because all objects are created from host code in this example.

```cpp
 1 void load_vertices_from_stream(std::fstream str) {
 2   int num_neighbors, next;
 3   while (str >> num_neighbors) {  // read integer into variable
 4     Vertex* vertex = new Vertex(num_neighbors);
 5     for (int i = 0; i < num_neighbors; ++i) {
 6       str >> next; Vertex* n = Vertex::get_uninitialized(next);
 7       vertex->neighbors_[i] = n;
 8 }}}
 9 void run_bfs(int start_vertex) {
10   Vertex::get(start_vertex)->update(0);
11   for (int i = 0; i < 100; ++i) cuda_execute(&Vertex::visit, i);
12 }
```

(b) Creating new vertices on the host and running BFS on the device, assuming the input stream consists of a concatenation of adjacency lists (vertex IDs), preceded by their lengths. For performance reasons, array elements should not be set one by one (Line 7), but transferred as an entire array.

Figure 6. Example: Breadth-first search on a directed graph in Ikra-Cpp.

3.1 Programming Interface

C++ classes with SOA layout must inherit from the special class SoaLayout (Fig. 6(a), Line 1). The first template argument is the class itself and the second one is the maximum number of objects of that class, a compile-time constant. Objects must be created with the new keyword. They cannot be stack-allocated or stored in local variables. They are always referred to with pointers or references.
A SOA class must be initialized with two macros (Fig. 6(a), Lines 4, 17).

Ikra-Cpp has storage strategies for allocating memory. A strategy is specified as the third template argument to SoaLayout (Fig. 6(a), Line 3). The default strategy (if no third argument is provided) is the static storage strategy, which statically allocates a chunk of data. The benefit of this approach is that many address computations can be constant folded and are more efficient [11]. However, this also means that the maximum number of objects of a class is now a compile-time constant. If a partially inlined array is used, programmers must use the static storage strategy with fixed-size arena. This strategy allocates an arena of given size, within which external storage of additional inner array elements is allocated efficiently with bump pointer allocation. The ideal strategy for a given application depends heavily on its runtime data access patterns. Section 5 discusses guidelines for choosing a good strategy.

Field Types Fields of primitive type are declared with special SOA field types ending with an underscore (Fig. 6(a), Lines 15, 16). For example, an int variable is declared with int_. All other types are declared with field_(type), where type is any C++ type. Finally, arrays are declared as follows.

• fully_inlined_array_(T, N): Stores N elements of base type T fully inlined.
• inlined_array_(T, N): Stores N elements of base type T partially inlined. The full size of the array must be specified during field initialization (Fig. 6(a), Line 5).

Creating Objects In Ikra-Cpp, every object has an implicit ID, accessible with the function id() and starting from zero. New objects are created with the new keyword both on the host and on the device, regardless of where data is

stored physically, as long as the constructor has the required __device__ or __host__ qualifiers (Fig. 6(a), Line 5). In general, performance-critical code should run where the data is located; however, in our experiments it was often more convenient to run setup code (e.g., loading and parsing data from

an external source, and creating the necessary objects) on the host. Ikra-Cpp takes care of memory transfers automatically.

The static function get() (Fig. 6(b), Line 10) returns a pointer to an object by ID. get_uninitialized() can be used to get a pointer to a not-yet-created object, which is useful during data imports (Fig. 6(b), Line 6).

Parallel Execution The Ikra-Cpp executor API can run member functions or create multiple objects in parallel on the GPU, without having to define or launch CUDA kernels explicitly. cuda_execute (Fig. 6(b), Line 11) runs a __device__ member function in parallel for all objects of a class. Programmers can also restrict execution to an ID range or an array/vector of pointers. Multiple objects can be created in parallel with cuda_construct, Ikra-Cpp's parallel version of the C++ operator new[]. Since only one set of arguments can be provided for all new objects, constructors must use the id() function if objects should be initialized differently.

3.2 Implementation

Ikra-Cpp is implemented with advanced C++ techniques such as template metaprogramming and operator overloading. However, the fundamental idea behind its implementation is simple and centered around fake pointers: Pointers to SOA objects do not point to valid, allocated memory but encode object IDs. Fields such as distance_ (Fig. 6(a), Line 15) are declared with special types like int_. When they are used in a context that requires an integer (e.g., Fig. 6(a), Line 10), C++ will convert the value from int_ to int. This conversion procedure is overloaded statically by Ikra-Cpp (via the implicit conversion operator): The object ID is extracted from the this pointer and the actual memory location of that field value is calculated and accessed. After constant folding and function inlining, the generated binary code is as efficient as and often identical to a hand-written SOA layout [22].

4 Example: Traffic Flow Simulation

Traffic flow simulations are important tools in transportation planning [14] and used to guide the design of city street networks. In this section, we present an agent-based microsimulation of single vehicles (agents) which move on a street network. It is based on the Nagel-Schreckenberg model [17], a simple cellular automaton, which can reproduce real-world traffic phenomena [26, 27] such as traffic jams.

Figure 7. Example: Cells and their connections on a small street network with a single intersection. The intersection itself has no cell. (Figure annotations include per-cell maximum velocities, e.g., max_vel = 5 and max_vel = 3; a 3-cell lookahead for the smart traffic light; and a temporary speed limit controlled by the traffic light over a 2-cell distance, so that only one cell can have velocity > 0 and collisions are avoided.)

4.1 Simulation Model

In the Nagel-Schreckenberg model, a street (link) is divided into equally-sized cells, each of which can contain up to one agent. An agent can move onto a neighboring cell only if it is free. Every agent has a velocity, measured in cells per iteration. Both agents and cells have a maximum velocity; the latter can be used to model speed limits on links. An iteration in our simulation consists of the following steps, each of which is executed for every agent a.

1. Increase the agent's velocity unless it is already driving at its maximum velocity: v_a ← min(v_a + 1, vmax_a)
2. Determine the agent's path of movement of length v_a, i.e., the next v_a many cells it will pass through. A navigation strategy determines the next cell at an intersection with multiple outgoing cells. This simulation uses a random walk, biased towards large streets.
3. Avoid collisions with other agents and enforce speed limits. To avoid collisions, reduce the speed v_a to the largest possible value such that the first v_a many cells on the calculated path are free. To follow speed limits, reduce v_a to the largest possible value, such that v_a ≤ vmax_c for each cell c among the first v_a many cells on the calculated path.
4. Reduce the agent's velocity by one unit with a probability of 20% (randomization).
5. Move the agent from its current location by v_a many cells according to the calculated path.

Intersections and Traffic Lights Every cell has zero, one, or multiple outgoing cells, forming a directed graph. In the first case, the cell is a sink, i.e., a link leaves the simulation area. Agents entering a sink will be randomly redistributed. The second case is most common and represents a regular link cell. The third case appears at intersections, where a cell is connected to the first cell of every outgoing link³.

It is important to ensure that only one agent enters an outgoing link (i.e., its first cell) at an intersection in one iteration, even if multiple agents from different incoming links are waiting. To that end, a traffic controller can impose

³Two-way links consist of one incoming and one outgoing link.


(Fig. 8 is a UML class diagram with the classes Car, Cell, TrafficLight, SmartTrafficLight, SharedSignalGroup and YieldController; their array-typed fields — Car::path, Cell::incoming_cells, Cell::outgoing_cells, TrafficLight::groups, YieldController::groups, SharedSignalGroup::cells — and the Car step functions step1_increase_velocity() through step5_move().)

Figure 8. Simplified Architecture and Classes of Traffic Flow Simulation. For every array (highlighted in gray), an additional array size field is allocated.

temporary speed limits on cells, e.g., a speed limit of zero, corresponding to a red light [3]. Traffic controllers set and remove speed limits for the last cells of all incoming links such that only one incoming link has a green light at a time⁴. We implemented three kinds of controllers.

• Traffic lights impose a temporary speed limit of zero on all but one incoming link. A green phase is scheduled round-robin for phase_length many iterations among all incoming links.
• Smart traffic lights work like normal traffic lights but assign a green phase to an incoming link immediately if it is the only one with a waiting agent. Real traffic lights have sensors/cameras to provide such behavior. All traffic lights in the benchmark section are smart.
• Yield controllers model yield traffic signs, which are often found at the end of merge lanes of highway entrances. Given n incoming links, a temporary speed limit of zero is assigned to all links i > s if link s has an agent, i.e., incoming links/cells in the controller should be ordered by priority.

The last two controllers are implemented such that traffic does not have to stop/slow down in front of an intersection, by checking all cells from which an agent could cross an intersection in one iteration (not only the closest incoming cell), as indicated by the maximum allowed speed limit on a link (lookahead). This is done with graph traversal on back edges (incoming cells), which terminates when an agent was found within lookahead range.

Parallelization The Nagel-Schreckenberg model is suitable for parallel execution on GPUs [10] because it does not require any special parallel data structures such as queues or arrays with concurrent access. Parallelism is expressed in terms of agents and traffic controllers. The first four steps are read-only on cells and serve as a preparation phase, building up a temporary data structure (path, speed limit) within agents. The final step starts when the previous ones were completed for all agents and writes updates to cells. A cell is never updated by multiple agents within an iteration.

```cpp
for (int i = 0; i < 100; ++i) {
  cuda_execute(&SmartTrafficLight::step);
  cuda_execute(&YieldController::step);
  cuda_execute(&Car::step1_increase_velocity);
  cuda_execute(&Car::step2_extend_path);
  cuda_execute(&Car::step3_constraint_velocity);
  cuda_execute(&Car::step4_randomize);
  cuda_execute(&Car::step5_move);
}
```

Figure 9. Running the simulation for 100 iterations.

4.2 Implementation in Ikra-Cpp

This simulation has a class for every real-world entity (Fig. 8). Shared signal groups impose temporary speed limits (zero = stop signal/red light or infinity = go signal/green light) on a group of cells. They are used to implement turn lanes at intersections; all turn lanes of an incoming link have the same signal. Furthermore, they are useful for more realistic setups, where more than one link has a go signal. Real-world street networks can be imported from OpenStreetMap (OSM) dumps in GraphML file format. Such dumps include link properties such as position, shape, connections to other streets, speed limits and OSM street type (e.g., "highway").

Arrays The classes in this simulation contain 6 arrays and they are used as follows.

• Car::path: Cleared at the beginning of an iteration, filled sequentially in Step 2, read sequentially in Step 3 and Step 5 (loops may break and not read/write the entire array).
• Cell::outgoing_cells: Read sequentially in Step 2 to determine the next Car::velocity many cells. The entire array and pointed-to cells are read to calculate

⁴This simple model can easily be extended to a more realistic one where agents can enter an intersection from multiple incoming links.

Inner Array Inlining for Structure of Arrays Layout. ARRAY'18, June 19, 2018, Philadelphia, PA, USA

probabilities of biased random walk, favoring large links according to OSM link type.
• Cell::incoming_cells: Read during the graph traversal of yield controllers and smart traffic lights, for every cell (and reachable cells up to a max. distance/lookahead) within every signal group of that controller. Traversal stops when an agent was found. Traffic controllers traverse the graph with recursive DFS⁵.
• TrafficLight/YieldController::groups: Read sequentially in the respective class's step function, but the loop may break early. See Cell::incoming_cells.
• SharedSignalGroup::cells: Read sequentially during graph traversal. See Cell::incoming_cells.

Figure 10. Running time for synthetic benchmark and BFS. The x-axis shows different inner array layout strategies and inlining sizes. (a) Synthetic Benchmark (milliseconds); (b) Frontier-based BFS (seconds). Legend: AOS (full), SOA (object), SOA (full), SOA (partial), AOS (partial). [Chart data omitted.]

5 Performance Evaluation

All experiments were performed on a desktop machine with an Intel i7-5960X CPU (8x 3.00 GHz), 32 GB main memory, an Nvidia GeForce GTX 980 GPU (4 GB memory), Ubuntu 16.04, and the nvcc compiler from the CUDA Toolkit 9.1. We focus on GPU execution, but similar effects can be observed on CPUs with an auto-vectorizing compiler. Our previous work has shown that the performance overhead of Ikra-Cpp compared to a handwritten SOA layout is negligible [22].

5.1 Synthetic Benchmark

To isolate the effect of array inlining, we created a synthetic benchmark with a test class containing one int data field, one int array and one int field storing the array size. The array has between 32 and 64 elements (chosen randomly). The benchmark adds the data field value to all array elements in a loop, i.e., all inner array elements are read and written. Fig. 10(a) shows the running time for 262,144 test objects. The "Array as Object" SOA version is more than 30% faster than the AOS version. The fully inlined SOA version is an order of magnitude faster than the AOS version. Partial array inlining sizes larger than 32 do not improve the performance in SOA mode anymore, because inner array accesses are unlikely to be coalesced from that point.

5.2 Breadth-first Search

Fig. 10(b) shows the running time for the frontier-based BFS algorithm with different layout strategies on the Pennsylvania road network (1,088,092 vertices, 3,083,796 directed edges, avg. degree 2.83) [12]. The graph clearly shows the benefit of SOA over AOS. The performance of AOS degrades with a growing inlining size, because more cache entries for non-existing neighbors array slots are wasted. BFS does not benefit from memory coalescing when inlining inner arrays in SOA mode (compare "SOA (object)" and "SOA (full)"), because the neighbors accessed in Fig. 6(a), Line 10 have "random" IDs; for memory coalescing, consecutive ID ranges for any given array index are required among vertices within a warp. Nested parallelism can speed up such algorithms [8] and would benefit from an Array as Object layout.

5.3 Traffic Flow Simulation

We ran the traffic flow simulation with 229,376 agents on a real-world OSM street network of the Denver-Aurora urbanized area [1] (8,339,802 cells, 62,819 smart traffic lights, 4,792 yield controllers, 205,129 shared signal groups). Maximum cell velocities are µ = 6.16 on average (σ = 1.06). After 1000 iterations, agents had an average velocity of µ = 1.57 (σ = 2.56). Fig. 11 shows the running times and device memory requirements for AOS/SOA mode and various inner array inlining strategies. In every subfigure, various inner array layout strategies for a single field are evaluated; all other arrays are fully inlined.

AOS vs. SOA  The benefit of a SOA layout over an AOS layout is clearly visible in this benchmark. Regardless of which inner array inlining strategy is chosen, the running time spent on processing agents is always significantly smaller in SOA (any subfigure, upper chart, e.g., first bar vs. second bar). We can observe a similar behavior for traffic controllers.

Inner Array Inlining in SOA  The benefit of array inlining can be observed best in SubFig. (c) when comparing the running time of a fully inlined Car::path (third bar) with one that is stored "as object". The fully inlined version is 15% faster. This is because our implementation of the Nagel-Schreckenberg algorithm iterates over the path array in every thread; first to precompute a path, then to adjust the agent's velocity. Because all agents are processed in order (i.e., thread i processes the agent with ID i), memory accesses are coalesced. Furthermore, a slightly better speedup can be achieved with partial array inlining, starting from an inlining size of 6, because few agents (<20%) have a velocity higher than 6. In that case, accesses to Car::path at indices higher than 6 are unlikely to be coalesced because very few threads access these array slots; we believe that the additional speedup is due to better cache utilization and prefetching: a cache line in the external arena storage contains multiple elements of the same inner array.

⁵ BFS with nested parallelism should be used for better performance.


Figure 11. Running Time and Memory Requirement for Traffic Simulation. The upper part of every subfigure shows the running time (seconds, y-axis) in AOS/SOA data layout with various inner array inlining strategies and inlining sizes (x-axis). The lower part shows the memory requirement in AOS data layout (GB or MB, y-axis), grouped by arena/standard allocation. Subfigures: (a) Cell::incoming_cells, (b) Cell::outgoing_cells, (c) Car::path, (d) SharedSignalGroup::cells, (e) TrafficLight::groups. [Chart data omitted.]

Cells (Subfig. (a), (b)) do not benefit much from inner array inlining when comparing running times. This is because parallelization in this program is never expressed in terms of cells, but always in terms of agents and traffic controllers. Those entities access fields and inner array elements of cells at random, because they contain pointers to "random" Cell objects. Consequently, the CUDA threads within a warp access data from cells with totally different IDs, i.e., the data locality required for memory coalescing is missing.

Memory Footprint  The lower part of every subgraph shows the memory requirement for all objects of the respective class. The white part represents inlined allocation (e.g., regular fields or inlined data of arrays). The darker part represents external storage (arena allocation). Subfig. (a) and (b) show that the vast amount of cell data is used for regular fields (e.g., speed limits, array sizes, etc.) and only around 60 MB each are required for incoming and outgoing cell arrays (see arena allocation for inlining size 0). If more than one array element is inlined, the memory footprint increases gradually, because 95% of all cells have only one neighbor. The memory footprint for agents increases for inlining sizes above 15, because all agents have a maximum speed limit of at least 15; an array of size at least 15 must be allocated for every agent. An inlining size of 15 results in a good tradeoff between memory footprint and performance; no performance is lost compared to a fully inlined inner array.

Other Observations  Compared to fully inlined inner arrays, accessing elements of partially inlined inner arrays requires an additional if check for determining whether an element is located within the external storage. This overhead is negligible on GPUs (e.g., compare "SOA (partial), Cars" at a high inlining size with "SOA (full), Cars" (third bar)). We attribute this to the fact that GPUs do not execute speculatively, i.e., no performance can be lost by mispredicting a branch target. In AOS mode, the performance of agent computation decreases with a higher inlining size (SubFig. (c), "AOS (partial), Cars"). This is because the likelihood of a path element being brought into the cache together with another field/element (prefetching/cache line) increases, even though that element might never be accessed (few cars have a velocity > 6).

6 Related Work

SOA is a well-studied data layout technique for CPUs and GPUs and a best practice for GPU programmers [6]. C++ libraries like SoAx [7] and ASX [25] allow programmers to use SOA as a data layout strategy while maintaining an AOS-like programming style. The Intel SPMD Program Compiler [20] extends the C programming language with additional keywords for SOA and minimizes notation/API overhead. Moreover, previous work has investigated how to apply layouts like SOA or AoSoA (Array of Structure of Arrays) automatically, without programmer intervention [9]. Furthermore, there has been work on applying such layouts to complex structures with multi-dimensional arrays [28]. The focus of Ikra-Cpp is on object-oriented programming (i.e., classes, member functions, constructors, etc.), C++ features that are not supported by such libraries/compilers. Complex structures like AoSoA or multi-dimensional arrays could be supported in Ikra-Cpp in the future.

Object inlining has been proposed for Java-like object-oriented languages for better cache performance and for reducing overheads due to allocation and pointer indirections [2]. Later work applied the idea of data inlining to arrays, which simplifies the address arithmetic of array accesses and eliminates load instructions in the assembly [29]. Ikra-Cpp applies the same idea to arrays in SOA layout, under simplified assumptions: arrays are of fixed size and inlining is controlled manually by the programmer.


7 Summary

In this work, we presented an overview of various data layout strategies for inner arrays in a Structure of Arrays layout. Such arrays can be split and regrouped by array index (inlined into the SOA layout) to take advantage of SIMD speedups through vectorized instructions or memory coalescing. Since writing and maintaining such low-level code is tedious, we extended Ikra-Cpp, a C++/CUDA DSL for high-performance object-oriented programming on CPUs and GPUs, with additional types that lay out inner arrays in a more SIMD-friendly format. In the future, we will focus on control flow optimizations for SIMD programs such as nested parallelism or control flow divergence avoidance mechanisms.

Acknowledgments

We would like to thank Toyotaro Suzumura from IBM Research for valuable discussions about traffic simulations and our implementation. This work was supported by a JSPS Research Fellowship for Young Scientists and JSPS KAKENHI Grant Number 18J14726.

References

[1] G. Boeing. 2017. OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems 65 (2017), 126–139.
[2] J. Dolby and A. Chien. 2000. An Automatic Object Inlining Optimization and Its Evaluation (PLDI '00). ACM, 345–357.
[3] J. Esser and M. Schreckenberg. 1997. Microscopic Simulation of Urban Traffic Based on Cellular Automata. Int. J. Mod. Phys. C 08, 05 (1997), 1025–1036.
[4] N. Faria, R. Silva, and J. L. Sobral. 2013. Impact of Data Structure Layout on Performance. In 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. IEEE, 116–120.
[5] P. Harish and P. J. Narayanan. 2007. Accelerating Large Graph Algorithms on the GPU Using CUDA. In High Performance Computing, HiPC 2007. Springer Berlin Heidelberg, 197–208.
[6] M. Harris. 2007. Optimizing CUDA. (2007). Supercomputing Tutorial.
[7] H. Homann and F. Laenen. 2018. SoAx: A generic C++ Structure of Arrays for handling particles in HPC codes. Comput. Phys. Commun. 224 (2018), 325–332.
[8] S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun. 2011. Accelerating CUDA Graph Algorithms at Maximum Warp (PPoPP '11). ACM, 267–276.
[9] K. Kofler, B. Cosenza, and T. Fahringer. 2015. Automatic Data Layout Optimizations for GPUs (Euro-Par 2015). Springer, 263–274.
[10] P. Korček, L. Sekanina, and O. Fučík. 2011. Cellular automata based traffic simulation accelerated on GPU (MENDEL 2011). Institute of Automation and FME BUT, 395–402.
[11] A. S. D. Lee and T. S. Abdelrahman. 2017. Launch-Time Optimization of OpenCL GPU Kernels. In Proceedings of the General Purpose GPUs (GPGPU-10). ACM, 32–41.
[12] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. 2008. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. CoRR (Oct. 2008).
[13] L. Luo, M. Wong, and W. Hwu. 2010. An Effective GPU Implementation of Breadth-first Search (DAC '10). ACM, 52–55.
[14] S. Maerivoet and B. De Moor. 2005. Transportation Planning and Traffic Flow Models. ArXiv Physics e-prints (July 2005).
[15] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. 2010. Pregel: A System for Large-scale Graph Processing (SIGMOD '10). ACM, 135–146.
[16] D. Merrill, M. Garland, and A. Grimshaw. 2015. High-Performance and Scalable GPU Graph Traversal. ACM Trans. Parallel Comput. 1, 2, Article 14 (Feb. 2015), 30 pages.
[17] K. Nagel and M. Schreckenberg. 1992. A cellular automaton model for freeway traffic. J. Phys. I France 2, 12 (1992), 2221–2229.
[18] J. Nickolls, I. Buck, M. Garland, and K. Skadron. 2008. Scalable Parallel Programming with CUDA. Queue 6, 2 (March 2008), 40–53.
[19] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. 2008. GPU Computing. Proc. IEEE 96, 5 (May 2008), 879–899.
[20] M. Pharr and W. R. Mark. 2012. ispc: A SPMD compiler for high-performance CPU programming. In Innovative Parallel Computing (InPar). IEEE, 1–13.
[21] V. K. Proulx. 1998. Traffic Simulation: A Case Study for Teaching Object Oriented Design (SIGCSE '98). ACM, 48–52.
[22] M. Springer and H. Masuhara. 2018. Ikra-Cpp: A C++/CUDA DSL for Object-Oriented Programming with Structure-of-Arrays Layout (WPMVP '18). ACM, Article 6, 9 pages.
[23] D. Strippgen and K. Nagel. 2009. Multi-agent traffic simulation with CUDA (HPCS '09). IEEE, 106–114.
[24] D. Strippgen and K. Nagel. 2009. Using Common Graphics Hardware for Multi-agent Traffic Simulation with CUDA (Simutools '09). ICST, Article 62, 8 pages.
[25] R. Strzodka. 2012. Ch. 31 - Abstraction for AoS and SoA Layout in C++. In GPU Computing Gems Jade Edition. Morgan Kaufmann, 429–441.
[26] J. Wahle, J. Esser, L. Neubert, and M. Schreckenberg. 1998. A Cellular Automaton Traffic Flow Model for Online-Simulation of Urban Traffic. In Cellular Automata: Research Towards Industry. Springer, 185–193.
[27] J. Wahle, L. Neubert, J. Esser, and M. Schreckenberg. 2001. A cellular automaton traffic flow model for online simulation of traffic. Parallel Comput. 27, 5 (2001), 719–735. Special issue: Cellular automata: From modeling to applications.
[28] N. Weber and M. Goesele. 2014. Auto-tuning Complex Array Layouts for GPUs (PGV '14). Eurographics Association, 57–64.
[29] C. Wimmer and H. Mössenböck. 2008. Automatic Array Inlining in Java Virtual Machines (CGO '08). ACM, 14–23.
