Modeling Non-Uniform Memory Access on Large Compute Nodes

Total Pages: 16

File Type: PDF, Size: 1020 KB

Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model

Nicolas Denoyelle, Brice Goglin, Aleksandar Ilic, Emmanuel Jeannot, Leonel Sousa

To cite this version: Nicolas Denoyelle, Brice Goglin, Aleksandar Ilic, Emmanuel Jeannot, Leonel Sousa. Modeling Non-Uniform Memory Access on Large Compute Nodes with the Cache-Aware Roofline Model. IEEE Transactions on Parallel and Distributed Systems, Institute of Electrical and Electronics Engineers, 2019, 30 (6), pp. 1374–1389. 10.1109/TPDS.2018.2883056. hal-01924951

HAL Id: hal-01924951 (https://hal.inria.fr/hal-01924951), submitted on 16 Nov 2018. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Nicolas Denoyelle*‡, Brice Goglin*, Aleksandar Ilic†, Emmanuel Jeannot*§, Leonel Sousa†
* Inria Bordeaux - Sud-Ouest, Univ. Bordeaux, France ({nicolas.denoyelle, brice.goglin, [email protected]})
† INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal ({aleksandar.ilic, [email protected]})
‡ Atos Bull Technologies, 38130 Echirolles, France
§ LaBRI - Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux - Sud-Ouest

Abstract—NUMA platforms, emerging memory architectures with on-package high bandwidth memories, bring new opportunities and challenges to bridge the gap between computing power and memory performance. Heterogeneous memory machines feature several performance trade-offs, depending on the kind of memory used, when writing or reading it. Finding memory performance upper-bounds subject to such trade-offs aligns with the numerous interests of measuring computing system performance. In particular, representing application performance with respect to the platform performance bounds has been addressed in the state-of-the-art Cache-Aware Roofline Model (CARM) to troubleshoot performance issues. In this paper, we present a Locality-Aware extension (LARM) of the CARM to model NUMA platform bottlenecks, such as contention and remote access. On top of this, the new contribution of this paper is the design and validation of a novel hybrid memory bandwidth model. This new hybrid model quantifies the achievable bandwidth upper-bound under the above-described trade-offs with less than 3% error. Hence, when comparing application performance with the maximum attainable performance, software designers can now rely on more accurate information.

Index Terms—Roofline Model, Cache-Aware Roofline Model, Heterogeneous Memory, Knights Landing, Skylake, Platform Modeling, Benchmarking, NUMA, Multi-core/single-chip multiprocessors, Shared Memory, Memory Bandwidth

1 INTRODUCTION

The increasing demands of current applications, both in terms of computation and amount of data to be manipulated, and the modest improvements in the performance of processing cores have led to the development of large multi-core and many-core systems [1]. These platforms embed complex memory hierarchies, spanning from registers to private and shared caches, local main memory, and memory accessed remotely through interconnection networks. In these systems, memory throughput is not uniform, since they embed several kinds of memories and the distance between processor and memories varies. On such Non-Uniform Memory Access (NUMA) architectures, the way data is allocated and accessed has a significant impact on performance [2].

Recently, the latest Intel Xeon Phi processor, code-name Knights Landing (KNL) [3], traces the NUMA roadmap with a processor organized into 4 Sub-NUMA Clusters (SNC-4 mode). Usually, NUMA platforms include several sockets interconnected with processor-specific links (e.g. Quick Path Interconnect [4]) or by custom switches, such as SGI NUMAlink or Bull Coherent Switch [5]. However, the KNL interconnects NUMA clusters at the chip scale (through a 2D mesh of up to 36 dual-core tiles). Though the software may see both types of systems as similar homogeneous NUMA trees, the strong architectural differences between NUMA sockets and KNL chips can impact application performance in different ways and motivate the joint study of both systems.

Additionally, each cluster of the KNL may feature traditional DDR memory as well as 3D-stacked high-bandwidth memory named MCDRAM, which can be used as a hardware-managed cache or as additional software-managed memory. Managing heterogeneous memories in runtime systems brings another level of complexity and makes performance analysis harder and even more necessary. Hence, being able to understand the impact of the memory hierarchy and core layout on application performance, as well as on attainable performance upper-bounds, is of most importance and interest. This is especially true when modeling the architecture and tuning applications to take advantage of the architecture characteristics in order to improve performance. As a reference point, on-chip bandwidth varies by several orders of magnitude (≈ 20 times) between the caches closest to the cores and the local main memory (cf. Section 4). Moreover, purely remote memory access across a tested QPI link delivers half of the local memory throughput, thus pushing further the data transfer time. Additionally, increasing the number of concurrent requests on a single memory also increases the contention and can slow down the perceived local memory bandwidth down to ≈ 46% of its maximum value on a tested Broadwell system (cf. Section 4). On the KNL system with 64 cores, the same memory access pattern decreases the bandwidth down to ≈ 25% of its maximum value (cf. Section 5). Hence, it is clear that the memory access pattern has a significant impact on the delivered throughput. Finally, unlike cross-QPI bandwidth, our experiments in Section 5 show a nearly uniform bandwidth on the interconnection network of the KNL chip. Therefore, it is obvious that the chip layout has a significant impact on the achievable performance.

Fig. 1: CARM chart of a hypothetical compute node composed of one cache level. (The chart plots Performance [GFlops/s] against Arithmetic Intensity [Flops/Byte], with a horizontal fpeak roof, slanted Cache Bandwidth * AI, Memory Bandwidth * AI and Remote Bandwidth * AI roofs, and two marked points: an application with good locality and the same application with worse locality.)

Tuning application performance and inferring their ability to fully exploit the capabilities of those complex systems require modeling and acquiring knowledge about the realistically achievable performance upper-bounds of those systems and their individual components (including the different levels of the memory hierarchy and the interconnection network). The Cache-Aware Roofline Model [6] (CARM) has been recently proposed (by some of the authors of this paper) as an insightful model and an associated methodology aimed at visually aiding performance characterization and optimization of applications running on systems with a cache memory subsystem. CARM has been integrated by Intel into their proprietary tools, and it is described as "an incredibly useful diagnosis tool (that can guide the developers in the application optimization process), ensuring that they can squeeze the maximum performance out of their code with minimal time and effort." However, the CARM refers to systems based on a single-socket computational node with uniform memory access, which does not exhibit the NUMA effects that can also significantly impact performance.

To address these issues, we have proposed, firstly in our previous contribution [7] extended in this paper, a new methodology to enhance the CARM insightfulness and provide locality hints for application optimization on contemporary large shared memory systems, such as multi-socket NUMA systems and Many Integrated Core processors with heterogeneous memory technologies and multiple hardware configurations [7]. However, we noticed that some highly optimized synthetic benchmarks are not correctly characterized yet by our model. Though they are designed ...

... Section 6 discusses the main limitations of the model, and instantiates and validates a heterogeneous bandwidth model able to overcome such limitations. Section 7 gives an overview of the state-of-the-art related works, while Section 8 draws the final conclusion of this research work.

2 LOCALITY AWARE ROOFLINE MODELING

In general, the Roofline modeling [8] is an insightful approach to represent the performance upper-bounds of a processor micro-architecture. Since computations and memory transfers are simultaneously performed, this model is based on the assumption that the overall execution time can be limited either by the time to perform computations or by the time to transfer data. Hence, from the micro-architecture perspective, the overall performance can be limited either by the peak performance of the computational units or by the capabilities of the memory system (i.e. bandwidth). To model the performance limits of contemporary multi-core systems, the Cache-Aware Roofline Model (CARM) [6] explicitly considers both the throughput of the computational engine and the realistically achievable bandwidth of each memory ...
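For readers skimming this excerpt, the rooflines sketched in Fig. 1 can be stated compactly. This is the standard CARM formulation in generic notation (not necessarily the exact symbols the paper uses later):

    F_m(AI) = min(F_peak, B_m * AI),    m ∈ {cache, local memory, remote memory}

where AI is the arithmetic intensity in flops/byte, B_m is the realistically achievable bandwidth of memory level m, and F_peak is the peak compute throughput; an application's attainable performance lies under the roof of whichever memory level it actually stresses.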
Recommended publications
  • 2.5 Classification of Parallel Computers
2.5 Classification of Parallel Computers

2.5.1 Granularity

In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.

• fine level (same operations with different data)
  ◦ vector processors
  ◦ instruction level parallelism
  ◦ fine-grain parallelism:
    – Relatively small amounts of computational work are done between communication events
    – Low computation to communication ratio
    – Facilitates load balancing
    – Implies high communication overhead and less opportunity for performance enhancement
    – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation.
• operation level (different operations simultaneously)
• problem level (independent subtasks)
  ◦ coarse-grain parallelism:
    – Relatively large amounts of computational work are done between communication/synchronization events
    – High computation to communication ratio
    – Implies more opportunity for performance increase
    – Harder to load balance efficiently

2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1)

If there are N elements in the pipeline and each element takes L clock cycles ⇒ the calculation takes L + N cycles; without a pipeline, L ∗ N cycles.

Example of good code for pipelining:

    do i = 1, k
      z(i) = x(i) + y(i)
    end do

Vector processors: fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but, e.g., recursion is hard to optimise for vector processors. Example: Intel MMX, a simple vector processor.
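To make the pipelining/vectorisation point concrete, here is a small C analogue (a sketch, not taken from the course notes): the first loop mirrors the Fortran example and keeps a pipeline or SIMD unit busy because its iterations are independent, while the second shows the kind of recursion (a loop-carried dependence) that is hard to optimise for vector processors.

    /* Pipeline/vector friendly: iterations are independent, so the
       hardware can overlap them or the compiler can emit SIMD. */
    void add_arrays(int k, const double *x, const double *y, double *z)
    {
        for (int i = 0; i < k; i++)
            z[i] = x[i] + y[i];
    }

    /* Recurrence: every iteration needs the previous result, so the
       pipeline stalls and the loop cannot be vectorised as written. */
    void prefix_sum(int k, const double *x, double *s)
    {
        s[0] = x[0];
        for (int i = 1; i < k; i++)
            s[i] = s[i - 1] + x[i];   /* loop-carried dependence */
    }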
  • Chapter 5 Multiprocessors and Thread-Level Parallelism
Computer Architecture: A Quantitative Approach, Fifth Edition. Chapter 5: Multiprocessors and Thread-Level Parallelism. Copyright © 2012, Elsevier Inc. All rights reserved.

Contents: 1. Introduction; 2. Centralized SMA – shared memory architecture; 3. Performance of SMA; 4. DMA – distributed memory architecture; 5. Synchronization; 6. Models of Consistency

1. Introduction. Why multiprocessors?
• Need for more computing power: data intensive applications; utility computing requires powerful processors
• Several ways to increase processor performance: increased clock rate (limited ability); architectural ILP, CPI (increasingly more difficult); multi-processor, multi-core systems more feasible based on current technologies
• Advantages of multiprocessors and multi-core: replication rather than unique design

Multiprocessor types
• Symmetric multiprocessors (SMP): share single memory with uniform memory access/latency (UMA); small number of cores
• Distributed shared memory (DSM): memory distributed among processors; non-uniform memory access/latency (NUMA); processors connected via direct (switched) and non-direct (multi-hop) interconnection networks

Important ideas
• Technology drives the solutions. Multi-cores have altered the game!!
• Thread-level parallelism (TLP) vs ILP.
• Computing and communication deeply intertwined.
• Write serialization exploits broadcast communication
  • Chapter 5 Thread-Level Parallelism
Chapter 5: Thread-Level Parallelism. Instructor: Josep Torrellas, CS433. Copyright Josep Torrellas 1999, 2001, 2002, 2013.

Progress Towards Multiprocessors
+ Rate of speed growth in uniprocessors saturated
+ Wide-issue processors are very complex
+ Wide-issue processors consume a lot of power
+ Steady progress in parallel software: the major obstacle to parallel processing

Flynn's Classification of Parallel Architectures (according to the parallelism in the I and D streams)
• Single I stream, single D stream (SISD): uniprocessor
• Single I stream, multiple D streams (SIMD): same I executed by multiple processors using different D
  – Each processor has its own data memory
  – There is a single control processor that sends the same I to all processors
  – These processors are usually special purpose
• Multiple I streams, single D stream (MISD): no commercial machine
• Multiple I streams, multiple D streams (MIMD)
  – Each processor fetches its own instructions and operates on its own data
  – Architecture of choice for general purpose multiprocessors
  – Flexible: can be used in single user mode or multiprogrammed
  – Use off-the-shelf µprocessors

MIMD Machines
1. Centralized shared memory architectures
   – Small #'s of processors (≈ up to 16-32)
   – Processors share a centralized memory
   – Usually connected in a bus
   – Also called UMA machines (Uniform Memory Access)
2. Machines with physically distributed memory
   – Support many processors
   – Memory distributed among processors
   – Scales the memory bandwidth if most of the accesses are to local memory

(Figure 5.1; Figure 5.2)

2. Machines with physically distributed memory (cont.)
   – Also reduces the memory latency
   – Of course interprocessor communication is more costly and complex
   – Often each node is a cluster (bus-based multiprocessor)
   – 2 types, depending on method used for interprocessor communication: 1.
  • TrafficDB: HERE's High Performance Shared-Memory Data Store
TrafficDB: HERE's High Performance Shared-Memory Data Store. Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe and Anis Moussa (HERE Global B.V.); [email protected], [email protected], [email protected], conor.ettinoff[email protected], [email protected]

ABSTRACT: HERE's traffic-aware services enable route planning and traffic visualisation on web, mobile and connected car applications. These services process thousands of requests per second and require efficient ways to access the information needed to provide a timely response to end-users. The characteristics of road traffic information and these traffic-aware services require storage solutions with specific performance features. A route planning application utilising traffic congestion information to calculate the optimal route from an origin to a destination might hit a database with millions of queries per second. However, existing storage solutions are not prepared to handle such volumes of concurrent read operations, as well as to provide the desired vertical scalability. This paper presents TrafficDB, a shared-memory data store, designed to provide high rates of read operations, enabling applications to directly access the data from memory. Our evaluation demonstrates that TrafficDB handles millions of read operations and provides near-linear scalability on multi-core machines, where additional processes can be spawned to increase the systems' throughput without a noticeable impact on the latency of querying the data store.

Figure 1: HERE Location Cloud is accessible worldwide from web-based applications, mobile devices and connected cars.
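TrafficDB's internal layout is not shown in this excerpt; the sketch below only illustrates the general mechanism it describes, i.e. many reader processes mapping one read-mostly region and querying it as plain memory via POSIX shared memory. The segment name "/traffic_demo" and the record layout are hypothetical.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Hypothetical fixed-size record; a real store would define a proper schema. */
    struct record { int id; double speed; };

    int main(void)
    {
        /* Attach to a segment that a writer process has already created and filled. */
        int fd = shm_open("/traffic_demo", O_RDONLY, 0);
        if (fd < 0) { perror("shm_open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map it read-only: all readers share the same physical pages, so a
           lookup is a plain memory read with no IPC and no deserialisation. */
        const struct record *recs = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (recs == MAP_FAILED) { perror("mmap"); return 1; }

        size_t n = st.st_size / sizeof(struct record);
        for (size_t i = 0; i < n && i < 5; i++)
            printf("record %d: speed %.1f\n", recs[i].id, recs[i].speed);

        munmap((void *)recs, st.st_size);
        close(fd);
        return 0;
    }

With a scheme like this, spawning more reader processes adds throughput with little extra latency, because the query path involves no copies and, if updates are published by swapping in whole new snapshots, no locks either.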
  • Cache-Aware Roofline Model: Upgrading the Loft
Cache-aware Roofline model: Upgrading the loft. Aleksandar Ilic, Frederico Pratas, and Leonel Sousa. INESC-ID/IST, Technical University of Lisbon, Portugal. {ilic, fcpp, las}@inesc-id.pt

Abstract—The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization. The proposed model was experimentally verified for different architectures by taking advantage of built-in hardware counters with a curve fitness above 90%.

Index Terms—Multicore computer architectures, Performance modeling, Application optimization

1 INTRODUCTION

Driven by the increasing complexity of modern applications, microprocessors provide a huge diversity of computational characteristics and capabilities. While this diversity is important to fulfill the existing computational needs, it imposes challenges to fully exploit architectures' potential. A model that provides insights into the system performance capabilities is a valuable tool to assist in the development and optimization of applications and future architectures.

... in Tab. 1. The horizontal part of the Roofline model corresponds to the compute bound region where the Fp can be achieved. The slanted part of the model represents the memory bound region where the performance is limited by BD. The ridge point, where the two lines meet, marks the minimum I required to achieve Fp. In general, the original Roofline modeling concept [10] ties the Fp and the theoretical bandwidth of a single memory level, with the I ...
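In the notation of this excerpt (Fp for peak performance, BD for the bandwidth of the memory level, I for intensity), the ridge point it refers to is simply the intensity at which the two roofs meet:

    I_ridge = Fp / BD

For I below I_ridge the attainable performance is bounded by BD * I (memory bound region); above it, by Fp (compute bound region).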
  • CS 575: the Roofline Model
CS 575: The Roofline Model. Wim Bohm, Colorado State University, Spring 2013. Based on "Roofline: An Insightful Visual Performance Model", David Patterson, U.C. Berkeley, submitted to CACM April 2008.

Analyzing Program Performance
• In empirical Computer Science, we plot functions describing the run time (or the memory use) of a program:
  ◦ This can be as a function of the input size. We have seen this in e.g. cs320 or cs420, where we studied polynomial and exponential (monotonically growing) complexity.
  ◦ In this class we also study program performance as a function of the number of processors.
• In this case the functions are positive and, hopefully, decreasing.
• Also we plot speedup curves, which are usually asymptotic.
  ◦ The roofline model plots GFlops/second as a function of Operational Intensity (Flops/byte).

Straight Lines
• When plotting data we get the most information from straight lines!
  ◦ We can easily recognize a straight line (y = ax + b): the slope (a) and y-intercept (b) tell us all.
  ◦ So we need to turn our data sets into straight lines.
  ◦ This is easiest done using logs, because they turn a multiplicative factor into a shift up (y-intercept), and an exponential into a multiplicative factor (slope).

Exponential functions
log(2^n) = n log 2 (linear in n)
log(3^n) = n log 3 (the slope is the log of the base)
log(4 · 3^n) = n log 3 + log 4 (* shifts it up)
log((3^n)/4) = n log 3 − log 4 (/ shifts it down)

Exponentials: semi-log plot

n    2^n    3^n     20 · 3^n
0    1      1       20
1    2      3       60
2    4      9       180
3    8      27      540
4    16     81      1620
5    32     243     4860
7    128    2187    43740
10   1024   59049   1180980

(Semi-log plot of 2^n, 3^n and 20 · 3^n: y-axis on log scale, x-axis linear; the angle gives the base, the shift the multiplicative factor.)

Polynomials
• What if we take the log of a polynomial? e.g.
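The slope/shift remarks above all follow from one identity, worth writing out once:

    log(a · b^n) = n · log b + log a      e.g.  log(20 · 3^n) = n · log 3 + log 20

so on a semi-log plot the exponential a · b^n is a straight line whose slope is log b (set by the base) and whose y-intercept is log a (the multiplicative factor).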
  • Scheduling of Shared Memory with Multi-Core Performance in Dynamic Allocator Using ARM Processor
ARPN Journal of Engineering and Applied Sciences, Vol. 11, No. 9, May 2016. ISSN 1819-6608. www.arpnjournals.com

SCHEDULING OF SHARED MEMORY WITH MULTI-CORE PERFORMANCE IN DYNAMIC ALLOCATOR USING ARM PROCESSOR. Venkata Siva Prasad Ch. and S. Ravi, Department of Electronics and Communication Engineering, Dr. M.G.R. Educational and Research Institute University, Chennai, India. E-Mail: [email protected]

ABSTRACT: In this paper we propose shared-memory scheduling designed for multi-core processors. Schedulers for shared-memory environments have gained importance in multi-core systems, because performance for workloads greatly depends on whether processing tasks are scheduled to run on the same or a different subset of cores (contention-aware scheduling). To implement an optimal multi-core scheduler with improved prediction of interference between processors in large systems, we executed applications using allocations generated by our new processor-to-core mapping algorithms on ARM FL-2440 board hardware running Linux. The results are obtained by dynamic allocation/de-allocation (reallocation) using a lock-free algorithm to reduce the time constraints, and concurrent components are automatically executed in parallel on multi-core systems to allocate free memory. The demonstrated concepts are scalable to multiple kernels and virtual OS environments using the runtime scheduler in a machine and showed an average improvement (up to 26%), which is very significant.

Keywords: scheduler, shared memory, multi-core processor, Linux-2.6-OpenMP, ARM-FL2440.
  • Beyond the Roofline: Cache-Aware Power and Energy-Efficiency Modeling for Multi-Cores
Beyond the Roofline: Cache-aware Power and Energy-Efficiency Modeling for Multi-cores. Aleksandar Ilic, Member, IEEE, Frederico Pratas, Member, IEEE, and Leonel Sousa, Senior Member, IEEE. IEEE Transactions on Computers (author's version; the version of record is available at http://dx.doi.org/10.1109/TC.2016.2582151).

Abstract—To foster the energy-efficiency in current and future multi-core processors, the benefits and trade-offs of a large set of optimization solutions must be evaluated. For this purpose, it is often crucial to consider how key micro-architecture aspects, such as accessing different memory levels and functional units, affect the attainable power and energy consumption. To ease this process, we propose a set of insightful cache-aware models to characterize the upper-bounds for power, energy and energy-efficiency of modern multi-cores in three different domains of the processor chip: cores, uncore and package. The practical importance of the proposed models is illustrated when optimizing matrix multiplication and deriving a set of power envelopes and energy-efficiency ranges of the micro-architecture for different operating frequencies. The proposed models are experimentally validated on a computing platform with a quad-core Intel 3770K processor by using hardware counters, on-chip power monitoring facilities and assembly micro-benchmarks.

Index Terms—Multicore architectures, Modeling and simulation, Performance evaluation and measurement.

1 INTRODUCTION

On the road to pursuing highly energy-efficient execution, current and future trends in computer architecture move towards complex and heterogeneous designs by incorporating specialized functional units and cores [1]. ...

... The key contribution of this paper is a set of insightful cache-aware models to characterize the upper-bounds for power, energy and energy-efficiency of modern multi- ...
  • The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors
IEEE Transactions on Computers, Vol. 48, No. 2, February 1999, p. 218.

The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors. Vijay S. Pai, Student Member, IEEE, Parthasarathy Ranganathan, Student Member, IEEE, Hazim Abdel-Shafi, Student Member, IEEE, and Sarita Adve, Member, IEEE

Abstract—Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching. Our results show that, while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention. Our results suggest the need for additional latency hiding or reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.

Index Terms—Shared-memory multiprocessors, instruction-level parallelism, software prefetching, performance evaluation.

1 INTRODUCTION

Shared-memory multiprocessors built from commodity microprocessors are being increasingly used to provide ...

... performance improvements from the use of current ILP techniques, but the improvements vary widely.
  • Lecture 17: Multiprocessors
Lecture 17: Multiprocessors
• Topics: multiprocessor intro and taxonomy, symmetric shared-memory multiprocessors (Sections 4.1-4.2)

Taxonomy
• SISD: single instruction and single data stream: uniprocessor
• MISD: no commercial multiprocessor: imagine data going through a pipeline of execution engines
• SIMD: vector architectures: lower flexibility
• MIMD: most multiprocessors today: easy to construct with off-the-shelf computers, most flexibility

Memory Organization - I
• Centralized shared-memory multiprocessor or Symmetric shared-memory multiprocessor (SMP)
• Multiple processors connected to a single centralized memory; since all processors see the same memory organization → uniform memory access (UMA)
• Shared-memory because all processors can access the entire memory address space
• Can centralized memory emerge as a bandwidth bottleneck? Not if you have large caches and employ fewer than a dozen processors

SMPs or Centralized Shared-Memory (diagram: four processors, each with its own caches, connected to a shared main memory and I/O system)

Memory Organization - II
• For higher scalability, memory is distributed among processors → distributed memory multiprocessors
• If one processor can directly address the memory local to another processor, the address space is shared → distributed shared-memory (DSM) multiprocessor
• If memories are strictly local, we need messages to communicate data → cluster of computers or multicomputers
• Non-uniform memory architecture (NUMA) since local memory has lower latency than remote memory

Distributed
  • Use of Shared Memory in the Context of Embedded Multi-Core Processors: Exploration of the Technology and Its Limits
ALMA MATER STUDIORUM – UNIVERSITÀ DI BOLOGNA, Faculty of Engineering. PhD programme in Electronic Engineering, Telecommunications and Information Technologies, Cycle XXV. Competition sector: 09/E3; scientific-disciplinary sector: ING-INF/01.

USE OF SHARED MEMORY IN THE CONTEXT OF EMBEDDED MULTI-CORE PROCESSORS: EXPLORATION OF THE TECHNOLOGY AND ITS LIMITS. Presented by: Paolo Burgio. PhD coordinator: Alessandro Vanelli-Coralli. Advisors: Prof. Luca Benini, Prof. Philippe Coussy. Final examination, academic year 2013/2014.

Use of shared memory in the context of embedded multi-core processors: exploration of the technology and its limits. Paolo Burgio, Department of Electrical, Electronic and Information Engineering "Guglielmo Marconi", Università degli Studi di Bologna, Italy, and Lab-STICC, Université de Bretagne-Sud, France. A thesis submitted for the degree of Philosophiæ Doctor (PhD), December 2013.

Abstract: Over the last decade, embedded systems have embraced many-core architectures. This was necessary to keep meeting the market's demand for performance and low power consumption, especially since smartphones and tablets became part of everyday life. To make the best use of this technology, it is often assumed that the available memory space is entirely shared among the processors. This is of great help both because it simplifies the programmer's work and during the design of the platform and of the runtime themselves. However, the shared-memory paradigm was born on symmetric multi-processor (SMP) systems, and porting it to modern embedded architectures can be problematic. There, for reasons of area and power consumption, data caches tend to be replaced with scratchpad memories, which must be managed explicitly by software, and therefore by the programmer.
  • GPU Computing with CUDA Lecture 3 - Efficient Shared Memory Use
GPU Computing with CUDA. Lecture 3 - Efficient Shared Memory Use. Christopher Cooper, Boston University. August 2011, UTFSM, Valparaíso, Chile.

Outline of lecture
‣ Recap of Lecture 2
‣ Shared memory in detail
‣ Tiling
‣ Bank conflicts
‣ Thread synchronization and atomic operations

Recap
‣ Thread hierarchy
  - Threads are grouped in thread blocks
  - Threads of the same block are executed on the same SM at the same time
‣ Threads can communicate with shared memory
‣ An SM can have up to 8 blocks at the same time
  - Thread blocks are divided sequentially into warps of 32 threads each
  - Threads of the same warp are scheduled together
  - SM implements a zero-overhead warp scheduling
‣ Memory hierarchy: smart use of the memory hierarchy!
‣ Programming model: Finite Difference case
  - One node per thread
  - Node indexing automatically groups into thread blocks! (figure labels: "Thread = node", "Thread block")

Shared Memory
‣ Small (48 kB per SM)
‣ Fast (~4 cycles): on-chip
‣ Private to each block
  - Allows thread communication
‣ How can we use it?

Shared Memory - Making use of it
‣ Looking at a 1D FDM example (similar to lab): ∂u/∂t = −c ∂u/∂x, discretized as u_i^{n+1} = u_i^n − (c∆t/∆x)(u_i^n − u_{i−1}^n)

    __global__ void update(float *u, float *u_prev, int N, float dx, float dt, float c, int BLOCKSIZE)
    {
        // Each thread will load one element
        int i = threadIdx.x + BLOCKSIZE * blockIdx.x;
        if (i >= N) { return; }
        // u_prev[i] = u[i] is done in a separate kernel
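For reference, a minimal serial C version of the same first-order upwind update is sketched below; it assumes the kernel writes u from u_prev and simply carries the left boundary value over, details the excerpt does not spell out.

    /* One time step of u_t + c * u_x = 0 with a first-order upwind scheme. */
    void update_serial(float *u, const float *u_prev, int N, float dx, float dt, float c)
    {
        for (int i = 1; i < N; i++)
            u[i] = u_prev[i] - (c * dt / dx) * (u_prev[i] - u_prev[i - 1]);
        u[0] = u_prev[0];   /* assumed boundary handling: hold the inflow value */
    }

The CUDA kernel in the excerpt distributes exactly this loop across threads, one grid point per thread.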