SAMS Multi-Layout Memory: Providing Multiple Views of Data to Boost SIMD Performance

SAMS Multi-Layout Memory: Providing Multiple Views of Data to Boost SIMD Performance Chunyang Gou, Georgi Kuzmanov, Georgi N. Gaydadjiev Computer Engineering Lab Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology, The Netherlands {C.Gou, G.K.Kuzmanov, G.N.Gaydadjiev}@tudelft.nl ABSTRACT layout and its continuity in the linear memory address space. We propose to bridge the discrepancy between data represen- It has been found that most SIMDized applications are in fa- tations in memory and those favored by the SIMD proces- vor of operating on SoA format for better performance[18, sor by customizing the low-level address mapping.Toachieve 20]. However, data representation in the system memory is this, we employ the extended Single-Affiliation Multiple-Stride mostly in the form of AoS because of two reasons. First, AoS (SAMS) parallel memory scheme at an appropriate level in the is the natural data representation in many scientific and en- memory hierarchy. This level of memory provides both Ar- gineering applications. Secondly, indirections to structured ray of Structures (AoS) and Structure of Arrays (SoA) views data, such as pointer or indexed array accesses, are in favor for the structured data to the processor, appearing to have of the AoS format. Therefore, a pragmatic problem in the maintained multiple layouts for the same data.Withsuch SIMDization arises: the need for dynamic data format trans- multi-layout memory, optimal SIMDization can be achieved. form between AoS and SoA, which results in significant perfor- Our synthesis results using TSMC 90nm CMOS technology mance degradation. To our best knowledge, no trivial solution indicate that the SAMS Multi-Layout Memory system has ef- for this problem has been previously proposed. Our SAMS ficient hardware implementation, with a critical path delay of Multi-Layout Memory system, presented in this paper, sup- less than 1ns and moderate hardware overhead. Experimen- ports contiguous data access for both AoS and SoA formats. tal evaluation based on a modified IBM Cell processor model The specific contributions of our proposal are: suggests that our approach is able to decrease the dynamic in- • Custom, low-level address mapping logic to manage indi- struction count by up to 49% for a selection of real applications vidual internal data layout and provide efficient memory and kernels. Under the same conditions, the total execution accesses for both AoS and SoA views; time can be reduced by up to 37%. • Categories and Subject Descriptors: C.1.2[Multiple Data Novel hardware/software interface for improved program- Stream Architectures (Multiprocessors)]:SIMD; B.3.2[Design mer productivity and additional performance gains; Styles]:Interleaved memories • The SAMS scheme implementation in TSMC 90nm CMOS General Terms: Design, Performance technology with affordable critical path (<1ns) and its integration into the IBM Cell SPE model; 1. INTRODUCTION • One of the most critical challenges in SIMD processing is Up to 49% improvement in dynamic instruction counts imposed by the data representation. By exploiting explicitly for real applications and kernels, which is translated into expressed data parallelism, SIMD processors tend to provide a 37% reduction of the overall execution time. higher performance for computationally intensive applications The remainder of the paper is organized as follows. In Sec- with lower control overhead compared to superscalar micro- tion 2, we provide the motivation for this work. In Section 3, processors. However, SIMDization suffers from the notorious theoriginalSAMSschemeandtheproposedextensionsare problems of difficult data alignment and arrangement, which briefly described. The hardware implementation and synthe- greatly undermine its potential performance[4, 26, 31]. sis results of the SAMS Multi-Layout Memory system and its In both scientific and commercial applications, data is usu- integration to the IBM Cell SPE are presented in Section 4. ally organized in a structured way. A sequence of structured Simulated performance of the SAMS memory in applications data units could be represented either in AoS (Array of Struc- and kernels is evaluated in Section 5. The major differences be- tures) or in SoA (Structure of Arrays) format. Such data rep- tween our proposal and related art are described in Section 6. resentation predetermines, at the application level, the data Finally, Section 7 concludes the paper. Permission to make digital or hard copies of all or part of this work for 2. MOTIVATION personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies Motivation Example: We shall examine the SIMDization of bear this notice and the full citation on the first page. To copy otherwise, to vector-matrix multiplication, c=a*B,wherea and c are 1x3 republish, to post on servers or to redistribute to lists, requires prior specific vectors and B is a 3x3 matrix with column-major storage. Al- permission and/or a fee. ICS’10, June 2–4, 2010, Tsukuba, Ibaraki, Japan. though the involved computations are quite simple, SIMDizing Copyright 2010 ACM 978-1-4503-0018-6/10/06 ...$10.00. them to achieve optimal speedup is very difficult. Assuming 1 struct vec3 {float a0, float a1, float a2}; struct matrix3 {float b0, float b3, float b6, float b1, float b4, float b7, struct vec3 {float a0, float a1, float a2}; float b2, float b5, float b8}; struct vec3 {float a0[4], float a1[4], struct matrix3 {float b0, float b3, float b6, float b1, float b4, … float a2[4]}; float b7, float b2, float b5, float b8}; struct vec3 a[4], c[4]; struct matrix3 {float b0[4], float … struct matrix3 b[4]; b3[4], float b6[4], float b1[4], float struct vec3 a[4], c[4]; vector float a0, a1, a2, b0, b1, b2, b3, b4, b5, b6, b7, b8, c0, c1, c2; b4[4], float b7[4], float b2[4], float struct matrix3 b[4]; // global addresses in main memory: gaa (for a), gab (b), gac (c) b5[4], float b8[4]}; vector float a0, a1, a2, b0, b1, b2, b3, b4, b5, b6, b7, b8, … … c0, c1, c2; // reading data from main memory to multi-layout memory struct vec3 a, c; … AOS_DMA_GET(a, gaa, 4*sizeof(vec3), tag, vec3); struct matrix3 b; // load a with AoS format, transform it to SoA with shuffles AOS_DMA_GET(b, gab, 4*sizeof(matrix3), tag, matrix3); vector float a0, a1, a2, b0, b1, b2, vector float tmp0 = *(vector float*)a; // load a with SoA view b3, b4, b5, b6, b7, b8, c0, c1, c2; vector float tmp1 = *(vector float*)(&a[1].a1); BEGIN_MULTI_VIEW(vec3); … vector float tmp2 = *(vector float*)(&a[2].a2); a0 = SOA_GET(&a[0].a0); // load a with SoA format a0 = spu_shuffle(tmp0, tmp1, pattern_0360); a1 = SOA_GET(&a[0].a1); a0 = *(vector float*)a.a0; a0 = spu_shuffle(a0, tmp2, pattern_0125); a2 = SOA_GET(&a[0].a2); a1 = *(vector float*)a.a1; a1 = spu_shuffle(tmp1, tmp2, pattern_0360); // load b with SoA view a2 = *(vector float*)a.a2; a1 = spu_shuffle(a0, tmp0, pattern_5012); BEGIN_MULTI_VIEW(matrix3); ... a2 = spu_shuffle(tmp2, tmp0, pattern_0360); b0 = SOA_GET(&b[0].b0); // load b with SoA format a2 = spu_shuffle(a0, tmp1, pattern_2501); b1 = SOA_GET(&b[0].b1); … // load b with AoS format, transform it to SoA with shuffles … // do computation … b8 = SOA_GET(&b[0].b8); c0 = a0*b0+a1*b3+a2*b6; // do computation // do computation c1 = a0*b1+a1*b4+a2*b7; c0=a0*b0+a1*b3+a2*b6 c0=a0*b0+a1*b3+a2*b6 c2 = a0*b2+a1*b5+a2*b8; c1=a0*b1+a1*b4+a2*b7 c1=a0*b1+a1*b4+a2*b7 // store results to c with SoA format c2=a0*b2+a1*b5+a2*b8 c2=a0*b2+a1*b5+a2*b8 *(vector float*)c.a0 = c0; // transform results from SoA to AoS format with shuffles, // store c with SoA view *(vector float*)c.a1 = c1; store them to c BEGIN_MULTI_VIEW(vec3); *(vector float*)c.a2 = c2; ... SOA_PUT(&c[0].a0, c0); SOA_PUT(&c[0].a1, c1); SOA_PUT(&c[0].a2, c2); // writing data from multi-layout memory to main memory AOS_DMA_PUT(c, gac, 4*sizeof(vec3), tag, vec3); SoA storage + AoS storage + SoA SIMDization SoA SIMDization AoS storage + SoA SIMDization + Multi-layout memory Figure 2: Sample vector-matrix multiplication code B[4] 3 3 3 1 a[4] b0 b1 b2 used in some applications to alleviate the data alignment prob- 2 2 2 c[4] = a[4] * B[4] b0 b1 b2 1 1 1 where 3 b0 3 b1b53 b2 i i i i i i i lem, but at the cost of wasted memory space and additional 3 3 3 b3 b4 0 3 0 0 c0 =a0 *b0 +a1 *b3 +a2 *b6 a0 a1 a2 2 b0 2 b1 2 b2 i i i i i i i 2 2 2 b3 b4 b5 c1 =a0 *b1 +a1 *b4 +a2 *b7 a0 a1 a2 1 1 1 i i i i i i i memory bandwidth which can become prohibitively expensive parallelism exploited 1 1 1 3 3 3 c2 =a0 *b2 +a1 *b5 +a2 *b8 a0 a1 a2 b3 b4 b5 (S 0 0 0 b6 b7 0 b8 0 0 (i=0,1,2,3) o 2 b3 2 b4 2 b5 s A SIM a0 a1 a2 b b b for some applications (e.g. the Wilson-Dirac kernel). ch 6 7 8 e 1 1 1 me) b b b D 6 0 7 0 8 0 iz Fortunately, it is common in applications with high data a b6 b7 b8 tio n vector data used in SIMD parallelism that the processing is to be operated upon mul- SoA SIMDization scheme 1 processor with SoA SIMDization scheme tiple independent data sets, just as Figure 1a suggests.

SAMS Multi-Layout Memory: Providing Multiple Views of Data to Boost SIMD Performance

Efficient Parallel Algorithms for 3D Laplacian Smoothing on The

Understanding the Performance of HPC Applications

Best Practice Guide Modern Processors

Kernel Optimization by Layout Restructuring

Soft Vector Processors with Streaming Pipelines

LLAMA: the Low Level Abstraction for Memory Access

Intel(R) 64 and IA-32 Architectures Optimization Reference Manual

Broadening the Applicability of FPGA-Based Soft Vector Processors

Intel 64 and IA-32 Architectures Optimization Reference Manual [PDF]

Execution Unit ISA (Ivy Bridge)

Impact of Data Layouts on the Efficiency of GPU-Accelerated IDW Interpolation