NUMA and GPU: So, I Know How to Use MPI and OpenMP


Some hot topics in HPC

So, I know how to use MPI and OpenMP... is that all?
• (Un)fortunately, no. Today's lecture is about two "hot topics" in HPC:
• NUMA nodes and thread affinity
• GPUs (accelerators)

Outline
1 UMA and NUMA: Review, Remote access, Thread scheduling
2 Cache memory: Review, False sharing
3 GPUs: What's that?, Architecture, A first example (CUDA), Let's get serious, Asynchronous copies

UMA and NUMA (Review) — What's inside a modern cluster
1. A network
2. Interconnected nodes
3. Nodes with multiple processors/sockets (and accelerators)
4. Processors/sockets with multiple cores

UMA and NUMA (Review) — And what about memory?
From the network point of view:
• Each node (a collection of processors) has access to its own memory
• The nodes communicate by sending messages
• We called that distributed memory and used MPI to handle it
From the node point of view:
• The (node's own) memory is shared among the cores
• We called that shared memory and used OpenMP to handle it
• Ok, but how is it shared?
−→ Uniform Memory Access (UMA)
−→ Non-Uniform Memory Access (NUMA)

UMA and NUMA (Review) — The UMA way
[Figure: cores c0–c7 all connected to a single memory through one bus]
The cores and the memory modules are interconnected by a bus
Every core can access any part of the memory at the same speed
Pros:
• It does not matter where the data are located
• It does not matter where the computations are done
Cons:
• If the number of cores increases, the bus has to be faster
−→ Does not scale
−→ Stuck at around 8 cores on the same memory bus

UMA and NUMA (Review) — The NUMA way I
[Figure: cores c0–c3 attached to Memory0, cores c4–c7 attached to Memory1]
The cores are split into groups
• NUMA nodes
Each NUMA node has fast access to a part of the memory (the UMA way)
The NUMA nodes are interconnected by a bus (or a set of buses)
If a core of a NUMA node needs data it does not own:
• It "asks" the corresponding NUMA node
• Slower mechanism than accessing its own memory

UMA and NUMA (Review) — The NUMA way II
Pros:
• Scales
Cons:
• Data location does matter

UMA and NUMA (Remote access) — The beast I
We will use the following machine:
%> hwloc-info --no-io --of txt
We have two NUMA nodes
We have 6 UMA cores per NUMA node

UMA and NUMA (Remote access) — The beast II
[Figure: topology of the machine as reported by hwloc]

UMA and NUMA (Remote access) — UNIX and malloc() I
Let's try the following code:
%> gcc -O3 firsttouch.c
%> ./a.out
Time to allocate 100000000 bytes
Call to malloc(): 0.000045 [s]
First Touch : 0.037014 [s]
Second Touch : 0.001181 [s]

UMA and NUMA (Remote access) — UNIX and malloc() II
Is it possible to allocate 100 MB in 45 µs?
• That would mean a memory bandwidth of about 2 TB/s
−→ Hum...
Why do the two loops show such different timings?
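The listing of firsttouch.c is not reproduced in this excerpt; a minimal sketch consistent with the reported output, assuming the program simply times the malloc() call and two successive write loops over the buffer (all names and details here are illustrative), could be:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void) {                        /* monotonic wall-clock time in seconds */
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
  const size_t n = 100000000;                    /* 100 MB, as in the slide */
  double t0 = now();
  char *buf = malloc(n);                         /* only reserves address space */
  double t1 = now();
  for (size_t i = 0; i < n; i++) buf[i] = 0;     /* first touch: pages are faulted in here */
  double t2 = now();
  for (size_t i = 0; i < n; i++) buf[i] = 1;     /* second touch: pages already mapped */
  double t3 = now();
  printf("Time to allocate %zu bytes\n", n);
  printf("Call to malloc(): %f [s]\n", t1 - t0);
  printf("First Touch : %f [s]\n", t2 - t1);
  printf("Second Touch : %f [s]\n", t3 - t2);
  printf("checksum: %d\n", (int)buf[n - 1]);     /* keep the stores from being optimized away at -O3 */
  free(buf);
  return 0;
}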
• The first loop is actually the one that allocates the memory:
−→ malloc() just informs the kernel of a future possible allocation
−→ Memory is actually allocated in chunks of (usually) 4 KiB (a page)
−→ The allocation is done when a page is first touched

UMA and NUMA (Remote access) — First touch policy I
In a multithreaded context it is the first touch policy that is used:
When a page is first touched by a thread, it is allocated on the NUMA node that runs this thread

UMA and NUMA (Remote access) — First touch policy II
Let's try the following code:
%> gcc -O3 -fopenmp numacopy.c
%> ./a.out
Time to copy 80000000 bytes
One: 0.009035 [s]
Two: 0.017308 [s]
Ratio: 1.915637 [-]
One: NUMA aware allocation
Two: NUMA unaware allocation

UMA and NUMA (Remote access) — First touch policy III
NUMA unaware allocation is two times slower
Larger NUMA interconnects may have even slower remote access
libnuma may help with handling memory placement
numactl may help for non NUMA aware codes

UMA and NUMA (Remote access) — First touch policy IV
numactl can (among other things) enforce interleaved allocation:
%> gcc -O3 -fopenmp numacopy.c
%> numactl --interleave=all ./a.out
Time to copy 80000000 bytes
One: 0.010230 [s]
Two: 0.009739 [s]
Ratio: 0.952014 [-]
One: NUMA aware allocation
Two: NUMA unaware allocation

UMA and NUMA (Thread scheduling) — Kernel panic?
Ok, I'm allocating memory with a thread on NUMA node i
• Now, this thread has fast access to this memory segment on NUMA node i
What if the kernel's scheduler moves this thread to NUMA node i + 1?
• This has to be avoided!
Can be done in OpenMP:
• OMP_PROC_BIND = [true | false]
• Threads are not allowed to move between processors if set to true
More control with POSIX threads:
• sched_setaffinity() and sched_getaffinity()
• Linux specific: sched.h
libnuma can also help

Cache memory (Review) — You said cache?
Cache memory allows fast access to a part of the memory
Each core has its own cache
[Figure: Core0 accesses Memory through its private Cache0]

Cache memory (False sharing) — In parallel?
What happens if two cores are modifying the same cache line?
[Figure: Cache0 and Cache1 both hold a copy of Mem[128-135]; a write by one core marks the other copy invalid]
The caches may not be coherent any more!
• Synchronization needed −→ Takes time
False sharing:
• From a software point of view: data are not shared
• From a hardware point of view: data are shared (same cache line)

Cache memory (False sharing) — Reduction
Let's try the following code:
%> gcc -fopenmp -O3 false.c
%> ./a.out
Test without padding: 0.001330 [s]
Test with padding: 0.000754 [s]
Ratio: 1.764073 [-]
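false.c itself is not shown in this excerpt; a minimal sketch of the same experiment (illustrative names, assuming 64-byte cache lines and the 6-core-per-node machine above) could look like this: each thread accumulates into its own slot of a shared array, first with the slots packed together, then padded so that each thread owns a full cache line.

#include <stdio.h>
#include <omp.h>

#define NTHREADS 6                                  /* 6 cores per NUMA node on the lecture machine */
#define DBL_PER_LINE 8                              /* 64-byte cache line / sizeof(double) */

/* volatile keeps every update in memory, so the cache-line traffic stays visible at -O3 */
static volatile double packed[NTHREADS];                   /* neighbouring slots share a cache line */
static volatile double padded[NTHREADS][DBL_PER_LINE];     /* one cache line per thread */

int main(void) {
  const int n = 10000000;

  double t0 = omp_get_wtime();
  #pragma omp parallel num_threads(NTHREADS)
  {
    int id = omp_get_thread_num();
    packed[id] = 0.0;
    #pragma omp for
    for (int i = 0; i < n; i++)
      packed[id] += 1.0 / (i + 1.0);      /* every write may invalidate the neighbours' copy */
  }
  double t1 = omp_get_wtime();

  #pragma omp parallel num_threads(NTHREADS)
  {
    int id = omp_get_thread_num();
    padded[id][0] = 0.0;
    #pragma omp for
    for (int i = 0; i < n; i++)
      padded[id][0] += 1.0 / (i + 1.0);   /* each thread stays in its own cache line */
  }
  double t2 = omp_get_wtime();

  printf("Test without padding: %f [s]\n", t1 - t0);
  printf("Test with padding: %f [s]\n", t2 - t1);
  printf("Ratio: %f [-]\n", (t1 - t0) / (t2 - t1));
  return 0;
}

Built with the same gcc -fopenmp -O3 line as above; the exact ratio depends on the machine and on how the threads are pinned.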
Cache memory (False sharing) — High level cache
Synchronization can be achieved through a shared higher level of cache
On multi-socket motherboards synchronization must go through RAM

GPUs (What's that?) — Surely about graphics!
GPU: Graphics Processing Unit
Handles 3D graphics:
• Projection of a 3D scene onto a 2D plane
• Rasterisation of the 2D plane
Most modern devices also handle shading, reflections, etc.
Specialized hardware for intensive work
Healthy video game industry pushing this technology
• Relatively fair price

GPUs (What's that?) — Unreal Engine and Malcom
[Figure: https://en.wikipedia.org/wiki/File:Unreal_Engine_Comparison.jpg © Epic Games]

GPUs (What's that?) — What about HPC? I
Highly parallel processors
• More than 2000 processing units on an NVIDIA GeForce GTX 780
Better thermal efficiency than a CPU
Cheap

GPUs (What's that?) — What about HPC? II
Around 2007, GPUs started supporting standard floating point arithmetic (IEEE 754)
Introduction of C extensions:
• CUDA: proprietary (NVIDIA)
• OpenCL: open (Khronos Group)

GPUs (Architecture) — CPU vs GPU
Let's look at the chips:
• On a CPU, control logic and memory are dominant
• On a GPU, floating point units are dominant

GPUs (Architecture) — Vocabulary
A device is composed of:
• The GPU
• The GPU's own RAM
• An interface with the motherboard (PCI-Express)
A host is composed of:
• The CPU
• The CPU's own RAM
• The motherboard

GPUs (Architecture) — Inside the GPU
A GPU is composed of streaming multiprocessors (SM(X)):
• 12 on the NVIDIA GeForce GTX 780
A high capacity (from 0.5 GB to 4 GB) RAM is shared among the SMs
An SM is composed of streaming processors (SP):
• Floating point units
• 192 on the NVIDIA GeForce GTX 780
• 32 on the NVIDIA GeForce GT 430 (single precision)
• 16 on the NVIDIA GeForce GT 430 (double precision)
• 8 on the NVIDIA GeForce 8500 GT (single precision only)
An SM is also composed of:
• Memory units
• Control units
• Special function units (SFUs)

GPUs (Architecture) — A streaming multiprocessor
One instruction cache and one instruction fetch unit per SM:
Within an SM, the same instruction is executed by all the SPs
Single Instruction Multiple Threads (SIMT)
• SIMD without locality

GPUs (Architecture) — The big picture
[Figure: overall view of the host and the device]

GPUs (Architecture) — Host job
The host:
• Allocates and deallocates memory on the device
• Sends data to the device (synchronously or asynchronously)
• Fetches data from the device (synchronously or asynchronously)
• Sends and executes code on the device
−→ This code is called a kernel
−→ Calls are asynchronous
• The host needs to statically split the threads among the SMs
−→ Blocks of threads
−→ Blocks are distributed among the SMs
−→ (some dynamism has been introduced in newer architectures)
• Good practice: have many more threads than SPs
−→ Keeps the SPs busy during memory accesses
−→ A kind of Simultaneous MultiThreading (SMT)

GPUs (Architecture) — Device job
The device can:
• Distribute the blocks of threads among the SMs
• Execute the kernel
• Handle the send/fetch requests of the host at the same time
−→ Task parallelism

GPUs (Architecture) — More?
I could keep talking about:
• Memory limitations
• Branching operations
• Thread block limitations
• ...
but let's stop here for the architecture; we have the basics.
What is important to remember:
• Many floating point units
• Little "near core" memory
• Copies between host and device
• SIMT
Let's try some code

GPUs (A first example (CUDA)) — hello, world!
Let's try the vector addition c = a + b

GPUs (A first example (CUDA)) — main
int main(void){
  int N = 1742;                        // Vector size
  float *a, *b, *c;
  // Allocate host memory
  a = (float*)malloc(sizeof(float)*N);
  b = (float*)malloc(sizeof(float)*N);
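The listing is cut off at this point in the excerpt. A minimal, self-contained sketch of how the vector addition c = a + b is typically completed in CUDA (device pointer names, initialization values, and the block size below are illustrative, not taken from the original slides) might be:

// vecadd.cu -- hypothetical completion of the example; compile with: nvcc -O3 vecadd.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < n)                                      // guard: N is not a multiple of the block size
    c[i] = a[i] + b[i];
}

int main(void) {
  int N = 1742;                                   // Vector size, as in the slides
  size_t bytes = sizeof(float) * N;

  // Allocate and initialize host memory
  float *a = (float *)malloc(bytes);
  float *b = (float *)malloc(bytes);
  float *c = (float *)malloc(bytes);
  for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

  // Allocate device memory
  float *d_a, *d_b, *d_c;
  cudaMalloc((void **)&d_a, bytes);
  cudaMalloc((void **)&d_b, bytes);
  cudaMalloc((void **)&d_c, bytes);

  // Copy the inputs to the device (synchronous copies)
  cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

  // Launch the kernel: one thread per element, grouped in blocks of 256 threads
  int threadsPerBlock = 256;
  int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
  vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, N);

  // Copy the result back (implicitly waits for the kernel to finish)
  cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

  printf("c[0] = %f, c[N-1] = %f\n", c[0], c[N - 1]);

  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  free(a); free(b); free(c);
  return 0;
}

Each thread computes one element; the index guard is needed because 1742 is not a multiple of the block size. The cudaMemcpy calls are the synchronous variant mentioned in the "Host job" slide; the asynchronous variants (cudaMemcpyAsync with streams) belong to the "Asynchronous copies" part of the outline.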