SAMS Multi-Layout Memory: Providing Multiple Views of Data to Boost SIMD Performance


SAMS Multi-Layout Memory: Providing Multiple Views of Data to Boost SIMD Performance

Chunyang Gou, Georgi Kuzmanov, Georgi N. Gaydadjiev
Computer Engineering Lab, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, The Netherlands
{C.Gou, G.K.Kuzmanov, G.N.Gaydadjiev}@tudelft.nl

ABSTRACT

We propose to bridge the discrepancy between data representations in memory and those favored by the SIMD processor by customizing the low-level address mapping. To achieve this, we employ the extended Single-Affiliation Multiple-Stride (SAMS) parallel memory scheme at an appropriate level in the memory hierarchy. This level of memory provides both Array of Structures (AoS) and Structure of Arrays (SoA) views for the structured data to the processor, appearing to have maintained multiple layouts for the same data. With such multi-layout memory, optimal SIMDization can be achieved. Our synthesis results using TSMC 90nm CMOS technology indicate that the SAMS Multi-Layout Memory system has an efficient hardware implementation, with a critical path delay of less than 1ns and moderate hardware overhead. Experimental evaluation based on a modified IBM Cell processor model suggests that our approach is able to decrease the dynamic instruction count by up to 49% for a selection of real applications and kernels. Under the same conditions, the total execution time can be reduced by up to 37%.

Categories and Subject Descriptors: C.1.2 [Multiple Data Stream Architectures (Multiprocessors)]: SIMD; B.3.2 [Design Styles]: Interleaved memories

General Terms: Design, Performance

1. INTRODUCTION

One of the most critical challenges in SIMD processing is imposed by the data representation. By exploiting explicitly expressed data parallelism, SIMD processors tend to provide higher performance for computationally intensive applications with lower control overhead compared to superscalar microprocessors. However, SIMDization suffers from the notorious problems of difficult data alignment and arrangement, which greatly undermine its potential performance [4, 26, 31].

In both scientific and commercial applications, data is usually organized in a structured way. A sequence of structured data units could be represented either in AoS (Array of Structures) or in SoA (Structure of Arrays) format. Such data representation predetermines, at the application level, the data layout and its continuity in the linear memory address space. It has been found that most SIMDized applications are in favor of operating on the SoA format for better performance [18, 20]. However, data representation in the system memory is mostly in the form of AoS, for two reasons. First, AoS is the natural data representation in many scientific and engineering applications. Secondly, indirections to structured data, such as pointer or indexed array accesses, favor the AoS format. Therefore, a pragmatic problem in SIMDization arises: the need for dynamic data format transformation between AoS and SoA, which results in significant performance degradation. To the best of our knowledge, no trivial solution for this problem has been previously proposed.

Our SAMS Multi-Layout Memory system, presented in this paper, supports contiguous data access for both AoS and SoA formats. The specific contributions of our proposal are:

• Custom, low-level address mapping logic to manage individual internal data layout and provide efficient memory accesses for both AoS and SoA views;
• Novel hardware/software interface for improved programmer productivity and additional performance gains;
• The SAMS scheme implementation in TSMC 90nm CMOS technology with affordable critical path (<1ns) and its integration into the IBM Cell SPE model;
• Up to 49% improvement in dynamic instruction counts for real applications and kernels, which is translated into a 37% reduction of the overall execution time.

The remainder of the paper is organized as follows. In Section 2, we provide the motivation for this work. In Section 3, the original SAMS scheme and the proposed extensions are briefly described. The hardware implementation and synthesis results of the SAMS Multi-Layout Memory system and its integration into the IBM Cell SPE are presented in Section 4. Simulated performance of the SAMS memory in applications and kernels is evaluated in Section 5. The major differences between our proposal and related art are described in Section 6. Finally, Section 7 concludes the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'10, June 2–4, 2010, Tsukuba, Ibaraki, Japan. Copyright 2010 ACM 978-1-4503-0018-6/10/06 ...$10.00.

2. MOTIVATION

Motivation Example: We shall examine the SIMDization of vector-matrix multiplication, c = a*B, where a and c are 1x3 vectors and B is a 3x3 matrix with column-major storage. Although the involved computations are quite simple, SIMDizing them to achieve optimal speedup is very difficult.
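The AoS/SoA mismatch described above can be made concrete in plain scalar C. This is a minimal illustrative sketch, not the paper's code; the struct and function names are ours, and on a real SIMD target the per-load transform below is what costs extra shuffle/permute instructions:

```c
#include <assert.h>

/* AoS: the natural in-memory representation of structured data. */
struct vec3_aos { float a0, a1, a2; };

/* SoA over a batch of 4 elements: the layout SIMD lanes prefer,
   since each member array maps directly onto one vector register. */
struct vec3_soa { float a0[4], a1[4], a2[4]; };

/* Software AoS -> SoA transform over a batch of 4 elements. */
static void aos_to_soa(const struct vec3_aos in[4], struct vec3_soa *out)
{
    for (int i = 0; i < 4; i++) {
        out->a0[i] = in[i].a0;
        out->a1[i] = in[i].a1;
        out->a2[i] = in[i].a2;
    }
}
```

The multi-layout memory removes this software transform by letting the processor read either view of the same bytes directly.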
Figure 2: Sample vector-matrix multiplication code.

(a) SoA storage + SoA SIMDization:

struct vec3 {float a0[4], float a1[4], float a2[4]};
struct matrix3 {float b0[4], float b3[4], float b6[4], float b1[4], float b4[4], float b7[4], float b2[4], float b5[4], float b8[4]};
...
struct vec3 a, c;
struct matrix3 b;
vector float a0, a1, a2, b0, b1, b2, b3, b4, b5, b6, b7, b8, c0, c1, c2;
...
// load a with SoA format
a0 = *(vector float*)a.a0;
a1 = *(vector float*)a.a1;
a2 = *(vector float*)a.a2;
// load b with SoA format
...
// do computation
c0 = a0*b0 + a1*b3 + a2*b6;
c1 = a0*b1 + a1*b4 + a2*b7;
c2 = a0*b2 + a1*b5 + a2*b8;
// store results to c with SoA format
*(vector float*)c.a0 = c0;
*(vector float*)c.a1 = c1;
*(vector float*)c.a2 = c2;

(b) AoS storage + SoA SIMDization:

struct vec3 {float a0, float a1, float a2};
struct matrix3 {float b0, float b3, float b6, float b1, float b4, float b7, float b2, float b5, float b8};
...
struct vec3 a[4], c[4];
struct matrix3 b[4];
vector float a0, a1, a2, b0, b1, b2, b3, b4, b5, b6, b7, b8, c0, c1, c2;
...
// load a with AoS format, transform it to SoA with shuffles
vector float tmp0 = *(vector float*)a;
vector float tmp1 = *(vector float*)(&a[1].a1);
vector float tmp2 = *(vector float*)(&a[2].a2);
a0 = spu_shuffle(tmp0, tmp1, pattern_0360);
a0 = spu_shuffle(a0, tmp2, pattern_0125);
a1 = spu_shuffle(tmp1, tmp2, pattern_0360);
a1 = spu_shuffle(a1, tmp0, pattern_5012);
a2 = spu_shuffle(tmp2, tmp0, pattern_0360);
a2 = spu_shuffle(a2, tmp1, pattern_2501);
// load b with AoS format, transform it to SoA with shuffles
...
// do computation
c0 = a0*b0 + a1*b3 + a2*b6;
c1 = a0*b1 + a1*b4 + a2*b7;
c2 = a0*b2 + a1*b5 + a2*b8;
// transform results from SoA to AoS format with shuffles, store them to c
...

(c) AoS storage + SoA SIMDization + Multi-layout memory:

struct vec3 {float a0, float a1, float a2};
struct matrix3 {float b0, float b3, float b6, float b1, float b4, float b7, float b2, float b5, float b8};
...
struct vec3 a[4], c[4];
struct matrix3 b[4];
vector float a0, a1, a2, b0, b1, b2, b3, b4, b5, b6, b7, b8, c0, c1, c2;
// global addresses in main memory: gaa (for a), gab (b), gac (c)
...
// reading data from main memory to multi-layout memory
AOS_DMA_GET(a, gaa, 4*sizeof(vec3), tag, vec3);
AOS_DMA_GET(b, gab, 4*sizeof(matrix3), tag, matrix3);
// load a with SoA view
BEGIN_MULTI_VIEW(vec3);
a0 = SOA_GET(&a[0].a0);
a1 = SOA_GET(&a[0].a1);
a2 = SOA_GET(&a[0].a2);
// load b with SoA view
BEGIN_MULTI_VIEW(matrix3);
b0 = SOA_GET(&b[0].b0);
b1 = SOA_GET(&b[0].b1);
...
b8 = SOA_GET(&b[0].b8);
// do computation
c0 = a0*b0 + a1*b3 + a2*b6;
c1 = a0*b1 + a1*b4 + a2*b7;
c2 = a0*b2 + a1*b5 + a2*b8;
// store c with SoA view
BEGIN_MULTI_VIEW(vec3);
SOA_PUT(&c[0].a0, c0);
SOA_PUT(&c[0].a1, c1);
SOA_PUT(&c[0].a2, c2);
// writing data from multi-layout memory to main memory
AOS_DMA_PUT(c, gac, 4*sizeof(vec3), tag, vec3);

[Figure 1: Vector-matrix multiplication on multiple independent data sets, c[4] = a[4] * B[4], where c0_i = a0_i*b0_i + a1_i*b3_i + a2_i*b6_i, c1_i = a0_i*b1_i + a1_i*b4_i + a2_i*b7_i, c2_i = a0_i*b2_i + a1_i*b5_i + a2_i*b8_i (i = 0,1,2,3); panel (a) shows the vector data used in SIMD processing with the SoA SIMDization scheme.]

Assuming […] used in some applications to alleviate the data alignment problem, but at the cost of wasted memory space and additional memory bandwidth, which can become prohibitively expensive for some applications (e.g., the Wilson-Dirac kernel). Fortunately, it is common in applications with high data parallelism that the same processing is applied to multiple independent data sets, just as Figure 1a suggests.
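For reference, one instance of the motivating computation c = a*B can be checked in scalar C. This is our own sketch (names are illustrative, not from the paper), using logical indexing b[3*k + j] for row k, column j of B, which reproduces c0 = a0*b0 + a1*b3 + a2*b6 and the companion formulas independently of how B is stored:

```c
#include <assert.h>

/* Scalar reference for one 1x3 vector times 3x3 matrix product
   c = a * B, with b[3*k + j] holding row k, column j of B, so that
   c[0] = a0*b0 + a1*b3 + a2*b6, c[1] = a0*b1 + a1*b4 + a2*b7,
   c[2] = a0*b2 + a1*b5 + a2*b8. */
static void vec3_mat3_mul(const float a[3], const float b[9], float c[3])
{
    for (int j = 0; j < 3; j++)
        c[j] = a[0]*b[j] + a[1]*b[3 + j] + a[2]*b[6 + j];
}
```

The SIMDized versions in Figure 2 perform four such products at once, one per vector lane.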
Recommended publications
  • Efficient Parallel Algorithms for 3D Laplacian Smoothing on the GPU
    Applied Sciences article: Efficient Parallel Algorithms for 3D Laplacian Smoothing on the GPU. Lei Xiao 1, Guoxiang Yang 1,*, Kunyang Zhao 2,3 and Gang Mei 1,*. 1 School of Engineering and Technology, China University of Geosciences (Beijing), Beijing 100083, China; xiaolei@cugb.edu.cn. 2 Airport Engineering Civil Aviation R&D Base, China Airport Construction Group Co., Ltd., Beijing 100621, China; zhaoky@cacc.com.cn. 3 China Super-Creative Airport Technical Co., Ltd., Beijing 100621, China. * Correspondence: yanggx@cugb.edu.cn (G.Y.); gang.mei@cugb.edu.cn (G.M.); Tel.: +86-10-8232-2627 (G.Y.). Received: 13 November 2019; Accepted: 6 December 2019; Published: 11 December 2019. Abstract: In numerical modeling, mesh quality is one of the decisive factors that strongly affects the accuracy of calculations and the convergence of iterations. To improve mesh quality, the Laplacian mesh smoothing method, which repositions nodes to the barycenter of adjacent nodes without changing the mesh topology, has been widely used. However, smoothing a large-scale three-dimensional mesh is quite computationally expensive, and few studies have focused on accelerating the Laplacian mesh smoothing method by utilizing the graphics processing unit (GPU). This paper presents a GPU-accelerated parallel algorithm for Laplacian smoothing in three dimensions by considering the influence of different data layouts and iteration forms. To evaluate the efficiency of the GPU implementation, the parallel solution is compared with the original serial solution. Experimental results show that our parallel implementation is up to 46 times faster than the serial version. Keywords: computational mesh processing; 3D Laplacian smoothing; parallel algorithm; graphics processing unit
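The smoothing rule this abstract describes, repositioning a node to the barycenter (average) of its adjacent nodes, is tiny in scalar form. A minimal sketch with illustrative names of our own (the paper's GPU kernels apply this per node, in parallel, under different data layouts):

```c
#include <assert.h>

/* One Laplacian smoothing step for a single node: move it to the
   barycenter of its n adjacent nodes. xs/ys/zs hold the neighbor
   coordinates; out receives the smoothed position. */
static void laplacian_smooth_node(const float xs[], const float ys[],
                                  const float zs[], int n, float out[3])
{
    float sx = 0.0f, sy = 0.0f, sz = 0.0f;
    for (int i = 0; i < n; i++) {
        sx += xs[i];
        sy += ys[i];
        sz += zs[i];
    }
    out[0] = sx / n;   /* barycenter = arithmetic mean of neighbors */
    out[1] = sy / n;
    out[2] = sz / n;
}
```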
  • Understanding the Performance of HPC Applications
    Understanding the Performance of HPC Applications Brian Gravelle March 2019 Abstract High performance computing is an important asset to scientific research, enabling the study of phenomena, such as nuclear physics or climate change, that are difficult or impossible to study in traditional experiments, or allowing researchers to utilize large amounts of data from experiments such as the Large Hadron Collider. No matter the use of HPC, the need for performance is always present; however, the fast-changing nature of computer systems means that software must be continually updated to run efficiently on the newest machines. In this paper, we discuss methods and tools used to understand the performance of an application running on HPC systems and how this understanding can translate into improved performance. We primarily focus on node-level issues, but also mention some of the basic issues involved with multi-node analysis as well. 1 Introduction In the modern world, supercomputing is used to approach many of the most important questions in science and technology. Computer simulations, machine learning programs, and other applications can all take advantage of the massively parallel machines known as supercomputers; however, care must be taken to use these machines effectively. With the advent of multi- and many-core processors, GPGPUs, deep cache structures, and FPGA accelerators, getting optimal performance out of such machines becomes increasingly difficult. Although many modern supercomputers are theoretically capable of achieving tens or hundreds of petaflops, applications often reach only 15-20% of that peak performance. In this paper, we review recent research surrounding how users can understand and improve the performance of HPC applications.
  • Best Practice Guide Modern Processors
    Best Practice Guide Modern Processors. Ole Widar Saastad, University of Oslo, Norway; Kristina Kapanova, NCSA, Bulgaria; Stoyan Markov, NCSA, Bulgaria; Cristian Morales, BSC, Spain; Anastasiia Shamakina, HLRS, Germany; Nick Johnson, EPCC, United Kingdom; Ezhilmathi Krishnasamy, University of Luxembourg, Luxembourg; Sebastien Varrette, University of Luxembourg, Luxembourg; Hayk Shoukourian (Editor), LRZ, Germany. Updated 5-5-2021. Table of Contents: 1. Introduction; 2. ARM Processors; 2.1. Architecture; 2.1.1. Kunpeng 920; 2.1.2. ThunderX2; 2.1.3. NUMA architecture; 2.2. Programming Environment; 2.2.1. Compilers; 2.2.2. Vendor performance libraries
  • Kernel Optimization by Layout Restructuring
    Thesis presented at the Université de Bordeaux, École Doctorale de Mathématiques et d'Informatique, by Christopher Haine, to obtain the degree of Doctor, speciality: Computer Science. Kernel optimization by layout restructuring. Defended: 3 July 2017, before a committee composed of: Henri-Pierre CHARLES, Directeur de Recherche, CEA (Rapporteur); Allen MALONY, Professor, University of Oregon (Rapporteur); Emmanuel JEANNOT, Directeur de Recherche, Inria (Examinateur); Pascale ROSSE-LAURENT, Ingénieur, Atos (Examinateur); Olivier AUMAGE, Chargé de Recherche, Inria (Encadrant); Denis BARTHOU, Professeur des Universités, Bordeaux INP (Directeur de Thèse). 2017. Abstract: Careful data layout design is crucial for achieving high performance, as nowadays processors waste a considerable amount of time being stalled by memory transactions; in particular, spatial and temporal locality have to be optimized. However, data layout transformation is an area left largely unexplored by state-of-the-art compilers, due to the difficulty of evaluating the possible performance gains of transformations. Moreover, optimizing data layout is time-consuming and error-prone, and layout transformations are too numerous to be experimented with by hand in the hope of discovering a high-performance version. We propose to guide application programmers through data layout restructuring with extensive feedback, firstly by providing a comprehensive multidimensional description of the initial layout, built via analysis of memory traces collected from the application binary, ultimately aiming at pinpointing problematic strides at the instruction level, independently of the input language. We choose to focus on layout transformations, translatable to C-formalism to aid user understanding, which we apply and assess on a case study composed of two representative multithreaded real-life applications, a cardiac wave simulation and a lattice QCD simulation, with different inputs and parameters.
  • Soft Vector Processors with Streaming Pipelines
    Soft Vector Processors with Streaming Pipelines. A. Severance, J. Edwards, H. Omidian, G. Lemieux (VectorBlox Computing Inc. and Univ. of British Columbia, Vancouver, BC); aaronsev@ece.ubc.ca, jedwards@vectorblox.com, hosseino@ece.ubc.ca, lemieux@ece.ubc.ca. ABSTRACT Soft vector processors (SVPs) achieve significant performance gains through the use of parallel ALUs. However, since ALUs are used in a time-multiplexed fashion, this does not exploit a key strength of FPGA performance: pipeline parallelism. This paper shows how streaming pipelines can be integrated into the datapath of a SVP to achieve dramatic speedups. The SVP plays an important role in supplying the pipeline with high-bandwidth input data and storing its results using on-chip memory. However, the SVP must also perform the housekeeping tasks necessary to keep the pipeline busy. In particular, it orchestrates data movement between on-chip memory and external DRAM, it pre- or post-processes the data using its own ALUs, and it controls the overall sequence of execution. Since the SVP is programmed in C, these tasks are easier to develop and debug than using a traditional HDL approach. Using the N-body problem as a case study, this paper illustrates how custom streaming pipelines are integrated into the SVP datapath and multiple techniques for generating them. [Figure 1: System view of VectorBlox MXP]
  • LLAMA: the Low Level Abstraction for Memory Access
    LLAMA: The Low Level Abstraction For Memory Access. Bernhard Manfred Gruber 1,2,3, Guilherme Amadio 1, Jakob Blomer 1, Alexander Matthes 4,5, René Widera 4, and Michael Bussmann 2. 1 EP-SFT, CERN, Geneva, Switzerland; 2 Center for Advanced Systems Understanding, Saxony, Germany; 3 Technische Universität Dresden, Dresden, Germany; 4 Helmholtz-Zentrum Dresden - Rossendorf, Dresden, Germany; 5 LogMeIn, Dresden, Germany. 31 May 2021 (arXiv:2106.04284v1 [cs.PF] 8 Jun 2021). Abstract The performance gap between CPU and memory widens continuously. Choosing the best memory layout for each hardware architecture is increasingly important as more and more programs become memory bound. For portable codes that run across heterogeneous hardware architectures, the choice of the memory layout for data structures is therefore ideally decoupled from the rest of a program. This can be accomplished via a zero-runtime-overhead abstraction layer, underneath which memory layouts can be freely exchanged. We present the C++ library LLAMA, which provides such a data structure abstraction layer with example implementations for multidimensional arrays of nested, structured data. LLAMA provides fully C++ compliant methods for defining and switching custom memory layouts for user-defined data types. Providing two close-to-life examples, we show that the LLAMA-generated AoS (Array of Struct) and SoA (Struct of Array) layouts produce identical code with the same performance characteristics as manually written data structures. LLAMA's layout-aware copy routines can significantly speed up transfer and reshuffling of data between layouts compared with naive element-wise copying. The library is fully extensible with third-party allocators and allows users to support their own memory layouts with custom mappings.
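The core idea here, a memory layout decoupled from the code that accesses it, can be mimicked in C with an index-mapping function. This is a toy analogue of LLAMA's C++ template mappings, with all names our own: swapping the mapping swaps the layout (AoS vs. SoA) without touching the traversal code.

```c
#include <assert.h>
#include <stddef.h>

#define N_ELEMS  4   /* elements in the batch  */
#define N_FIELDS 3   /* fields per element     */

/* A layout is a function from (element i, field f) to a flat offset. */
static size_t map_aos(size_t i, size_t f) { return i * N_FIELDS + f; }
static size_t map_soa(size_t i, size_t f) { return f * N_ELEMS + i; }

/* Layout-agnostic traversal: sum field f across all elements.
   The same code works on an AoS or a SoA buffer, given its mapping. */
static float sum_field(const float *buf,
                       size_t (*map)(size_t, size_t), size_t f)
{
    float s = 0.0f;
    for (size_t i = 0; i < N_ELEMS; i++)
        s += buf[map(i, f)];
    return s;
}
```

In LLAMA the mapping is resolved at compile time, so the abstraction carries no runtime overhead; the function pointer here is only for illustration.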
  • Intel(R) 64 and IA-32 Architectures Optimization Reference Manual
    Intel® 64 and IA-32 Architectures Optimization Reference Manual Order Number: 248966-020 November 2009 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The Intel® 64 architecture processors may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Hyper-Threading Technology requires a computer system with an Intel® processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system.
  • Broadening the Applicability of FPGA-Based Soft Vector Processors
    Broadening the Applicability of FPGA-based Soft Vector Processors by Aaron Severance. Bachelor of Science in Computer Engineering, Boston University, 2004; Master of Science in Electrical Engineering, University of Washington, 2006. A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES (Electrical and Computer Engineering), The University of British Columbia (Vancouver), March 2015. © Aaron Severance, 2015. Abstract A soft vector processor (SVP) is an overlay on top of FPGAs that allows data-parallel algorithms to be written in software rather than hardware, and yet still achieve hardware-like performance. This ease of use comes at an area and speed penalty, however. Also, since the internal designs of SVPs are based largely on custom CMOS vector processors, there is additional opportunity for FPGA-specific optimizations and enhancements. This thesis investigates and measures the effects of FPGA-specific changes to SVPs that improve performance, reduce area, and improve ease-of-use, thereby expanding their useful range of applications. First, we address applications needing only moderate performance, such as audio filtering, where SVPs need only a small number (one to four) of parallel ALUs. We make implementation and ISA design decisions around the goals of producing a compact SVP that effectively utilizes existing BRAMs and DSP Blocks. The resulting VENICE SVP has 2× better performance per logic block than previous designs. Next, we address performance issues with algorithms where some vector elements 'exit early' while others need additional processing. Simple vector predication causes all elements to exhibit 'worst case' performance.
  • Intel 64 and IA-32 Architectures Optimization Reference Manual [PDF]
    Intel® 64 and IA-32 Architectures Optimization Reference Manual Order Number: 248966-033 June 2016 Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.
  • Execution Unit ISA (Ivy Bridge)
    Intel® OpenSource HD Graphics Programmer’s Reference Manual (PRM) Volume 4 Part 3: Execution Unit ISA (Ivy Bridge) For the 2012 Intel® Core™ Processor Family May 2012 Revision 1.0 NOTICE: This document contains information on products in the design phase of development, and Intel reserves the right to add or remove product features at any time, with or without changes to this open source documentation. Doc Ref #: IHD-OS-V4 Pt 3 – 05 12 Creative Commons License You are free to Share — to copy, distribute, display, and perform the work Under the following conditions: Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). No Derivative Works. You may not alter, transform, or build upon this work. INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or
  • Impact of Data Layouts on the Efficiency of GPU-Accelerated IDW Interpolation
    Mei and Tian, SpringerPlus (2016) 5:104, DOI 10.1186/s40064-016-1731-6, RESEARCH, Open Access. Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation. Gang Mei 1,2* and Hong Tian 3. * Correspondence: gang.mei@cugb.edu.cn. 1 School of Engineering and Technology, China University of Geosciences, No. 29 Xueyuan Road, Beijing 100083, China. Full list of author information is available at the end of the article. Abstract This paper focuses on evaluating the impact of different data layouts on the computational efficiency of the GPU-accelerated Inverse Distance Weighting (IDW) interpolation algorithm. First, we redesign and improve our previous GPU implementation that was performed by exploiting the feature of CUDA dynamic parallelism (CDP). Then we implement three versions of GPU implementations, i.e., the naive version, the tiled version, and the improved CDP version, based upon five data layouts, including the Structure of Arrays (SoA), the Array of Structures (AoS), the Array of aligned Structures (AoaS), the Structure of Arrays of aligned Structures (SoAoS), and the Hybrid layout. We also carry out several groups of experimental tests to evaluate the impact. Experimental results show that the layouts AoS and AoaS achieve better performance than the layout SoA for both the naive version and the tiled version, while the layout SoA is the best choice for the improved CDP version. We also observe that for the two combined data layouts (the SoAoS and the Hybrid), there are no notable performance gains when compared to the other three basic layouts. We recommend that in practical applications, the layout AoaS is the best choice since the tiled version is the fastest one among the three versions.
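For reference, the interpolation being accelerated estimates a value as sum(w_i * z_i) / sum(w_i) with weights w_i = 1/d_i^p. A scalar sketch of our own for p = 2 in two dimensions (function and variable names are illustrative; the paper's GPU versions differ in how the sample arrays are laid out, not in this arithmetic):

```c
#include <assert.h>
#include <math.h>

/* Inverse Distance Weighting estimate at query point (qx, qy),
   with power parameter p = 2, so w_i = 1/d_i^2 and d_i^2 needs
   no square root. n samples at (x[i], y[i]) with values z[i]. */
static float idw2(const float x[], const float y[], const float z[],
                  int n, float qx, float qy)
{
    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < n; i++) {
        float dx = x[i] - qx, dy = y[i] - qy;
        float d2 = dx * dx + dy * dy;
        if (d2 == 0.0f)
            return z[i];        /* query coincides with a sample */
        num += z[i] / d2;       /* sum of w_i * z_i */
        den += 1.0f / d2;       /* sum of w_i       */
    }
    return num / den;
}
```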