CIS 501 Computer Architecture (Martin/Roth), Unit 9: Multicore (Shared Memory Multiprocessors)

This Unit: Shared Memory Multiprocessors

[Figure: App, App, App over system software, with one multiplexed uniprocessor (Mem, CPU, I/O) vs. multiple CPUs sharing memory]

• Thread-level parallelism (TLP)
• Shared memory model
  • Multiplexed uniprocessor
  • Hardware multithreading
  • Multiprocessing
• Synchronization
  • Lock implementation
  • Locking gotchas
• Cache coherence
  • Bus-based protocols
  • Directory protocols
• Memory consistency models

Readings

• H+P: Chapter 4

Multiplying Performance

• A single processor can only be so fast
  • Limited clock frequency
  • Limited instruction-level parallelism
  • Limited cache hierarchy
• What if we need even more computing power?
  • Use multiple processors! But how?
• High-end example: Sun Ultra Enterprise 25k
  • 72 UltraSPARC IV+ processors, 1.5 GHz
  • 1024 GB of memory
  • Niche: large database servers
  • $$$

Multicore: Mainstream Multiprocessors

• Multicore chips
  • IBM Power5: two 2+ GHz PowerPC cores; shared 1.5 MB L2, L3 tags
  • AMD Quad Phenom: four 2+ GHz cores; per-core 512 KB L2 cache; shared 2 MB L3 cache
  • Intel Core 2 Quad: four cores; two shared 4 MB L2 caches
  • Sun Niagara: 8 cores, each 4-way threaded; shared 2 MB L2, shared FP unit; for servers, not desktops
• Why multicore? What else would you do with 500 million transistors?

Application Domains for Multiprocessors

• Scientific computing/supercomputing
  • Examples: weather simulation, aerodynamics, protein folding
  • Large grids, integrating changes over time
  • Each processor computes for a part of the grid
• Server workloads
  • Example: airline reservation database
  • Many concurrent updates, searches, lookups, queries
  • Processors handle different requests
• Media workloads
  • Processors compress/decompress different parts of images/frames
• Desktop workloads…
• Gaming workloads…
• But software must be written to expose the parallelism

But First, Uniprocessor Concurrency

• Software "thread": an independent flow of execution
  • Context state: PC, registers
• Threads generally share the same memory space
  • A "process" is like a thread, but with its own memory space
  • Java has thread support built in; C/C++ support the P-threads library
• Generally, system software (the OS) manages threads
  • "Thread scheduling", "context switching"
  • All threads share the one processor
  • A hardware timer interrupt occasionally triggers the OS
  • Quickly swapping threads gives the illusion of concurrent execution
• Much more in CIS 380

Multithreaded Programming Model

• Programmer explicitly creates multiple threads (see the sketch below)
• All loads & stores go to a single shared memory space
• Each thread has a private stack frame for local variables
• A "thread switch" can occur at any time
  • Pre-emptive multithreading by the OS
• Common uses:
  • Handling user interaction (GUI programming)
  • Handling I/O latency (send network message, wait for response)
  • Expressing parallel work via thread-level parallelism (TLP)
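To make the model concrete, here is a minimal P-threads sketch of explicit thread creation; the worker function and thread count are illustrative, not from the slides:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Each thread runs this function; its argument carries a private ID. */
    static void *worker(void *arg) {
        long id = (long)arg;   /* private: lives in this thread's stack/registers */
        printf("hello from thread %ld\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        /* The programmer explicitly creates the threads... */
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        /* ...and joins them; all threads share the program's global memory space. */
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }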
Aside: Hardware Multithreading

[Figure: one pipeline with two thread contexts (PC and register file replicated; I$ and D$ shared) next to the simplest multiprocessor, which replicates the whole pipeline]

• Hardware multithreading (MT)
  • Multiple threads dynamically share a single pipeline (and caches)
  • Replicate the thread contexts: PC and register file
  • Coarse-grain MT: switch on L2 misses. Why?
  • Simultaneous MT: no explicit switching; fine-grain interleaving
    • Pentium 4 is 2-way hyper-threaded, leveraging the out-of-order core
+ MT improves utilization and throughput
  • Single programs utilize <50% of the pipeline (branches, cache misses)
– MT does not improve single-thread performance
  • Individual threads run as fast or even slower

Simplest Multiprocessor

• Replicate the entire processor pipeline!
  • Instead of replicating just the register file & PC
  • Exception: share the caches (we'll address this bottleneck later)
• Same "shared memory" or "multithreaded" model
  • Loads and stores from the two processors are interleaved
• Advantages/disadvantages over hardware multithreading?

Shared Memory Implementations

• Multiplexed uniprocessor
  • Runtime system and/or OS occasionally pre-empts & swaps threads
  • Interleaved, but no parallelism
• Hardware multithreading
  • Tolerates pipeline latencies; higher efficiency
  • Same interleaved shared-memory model
• Multiprocessing
  • Multiplies execution resources; higher peak performance
  • Same interleaved shared-memory model
  • Foreshadowing: allows private caches, further disentangling the cores
• All have the same shared memory programming model

Shared Memory Issues

• Three in particular, not unrelated to each other
• Synchronization
  • How to regulate access to shared data?
  • How to implement critical sections?
• Cache coherence
  • How to make writes in one cache "show up" in others?
• Memory consistency model
  • How to keep the programmer sane while letting hardware optimize?
  • How to reconcile shared memory with out-of-order execution?

Example: Parallelizing Matrix Multiply

[Figure: C = A × B; each processor computes the row of C selected by my_id()]

    for (I = 0; I < 100; I++)
      for (J = 0; J < 100; J++)
        for (K = 0; K < 100; K++)
          C[I][J] += A[I][K] * B[K][J];

• How to parallelize matrix multiply over 100 processors?
• One possibility: give each processor one value of I, i.e., one row of C (100 iterations of the J loop)

    for (J = 0; J < 100; J++)
      for (K = 0; K < 100; K++)
        C[my_id()][J] += A[my_id()][K] * B[K][J];

• Each processor runs a copy of the loop above
• The my_id() function gives each processor an ID from 0 to N
  • A parallel processing library provides this function (a runnable sketch follows below)
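As a concrete version of the row-per-processor scheme above, here is a P-threads sketch in which the thread index plays the role of my_id(); array initialization and error checking are omitted, and the function names are illustrative:

    #include <pthread.h>

    #define N 100
    static double A[N][N], B[N][N], C[N][N];   /* shared; initialization omitted */

    /* SPMD worker: the thread's ID selects the single row of C it computes. */
    static void *mm_row(void *arg) {
        long id = (long)arg;                   /* plays the role of my_id() */
        for (int J = 0; J < N; J++)
            for (int K = 0; K < N; K++)
                C[id][J] += A[id][K] * B[K][J];
        return NULL;
    }

    int main(void) {
        pthread_t t[N];
        for (long i = 0; i < N; i++)
            pthread_create(&t[i], NULL, mm_row, (void *)i);
        for (int i = 0; i < N; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

No locking is needed here because each thread writes a disjoint row of C and only reads A and B; the withdrawal example that follows shows what happens when writes do overlap.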
Example: Thread-Level Parallelism

    struct acct_t { int bal; };
    shared struct acct_t accts[MAX_ACCT];
    int id, amt;
    if (accts[id].bal >= amt) {
      accts[id].bal -= amt;
      give_cash();
    }

    0: addi r1,accts,r3
    1: ld 0(r3),r4
    2: blt r4,r2,6
    3: sub r4,r2,r4
    4: st r4,0(r3)
    5: call give_cash

• Thread-level parallelism (TLP)
  • Collection of asynchronous tasks: not started and stopped together
  • Data shared "loosely" (sometimes yes, mostly no), dynamically
  • Example: database/web server (each query is a thread)
• accts is shared, so it can't be register allocated (even if it were a scalar)
• id and amt are private variables, register allocated to r1, r2
• This is the running example

An Example Execution

• Two $100 withdrawals from account #241 at two ATMs
• Each transaction maps to a thread on a different processor
• Track accts[241].bal (its address is in r3)

    Time  Thread 0               Thread 1               Mem
          0: addi r1,accts,r3                           500
          1: ld 0(r3),r4
          2: blt r4,r2,6
          3: sub r4,r2,r4
          4: st r4,0(r3)                                400
          5: call give_cash
                                 0: addi r1,accts,r3
                                 1: ld 0(r3),r4
                                 2: blt r4,r2,6
                                 3: sub r4,r2,r4
                                 4: st r4,0(r3)         300
                                 5: call give_cash

A Problem Execution

    Time  Thread 0               Thread 1               Mem
          0: addi r1,accts,r3                           500
          1: ld 0(r3),r4
          2: blt r4,r2,6
          3: sub r4,r2,r4
          <<< Interrupt >>>
                                 0: addi r1,accts,r3
                                 1: ld 0(r3),r4
                                 2: blt r4,r2,6
                                 3: sub r4,r2,r4
                                 4: st r4,0(r3)         400
                                 5: call give_cash
          4: st r4,0(r3)                                400
          5: call give_cash

• Problem: wrong account balance! Why?
• Solution: synchronize access to the account balance

Synchronization

• Synchronization: a key issue for shared memory
• Regulate access to shared data (mutual exclusion)
• Software constructs: semaphore, monitor, mutex
• Low-level primitive: lock
  • Operations: acquire(lock) and release(lock)
  • The region between acquire and release is a critical section
  • Threads must interleave acquire and release; an interfering acquire will block

    struct acct_t { int bal; };
    shared struct acct_t accts[MAX_ACCT];
    shared int lock;
    int id, amt;
    acquire(lock);
    if (accts[id].bal >= amt) {
      accts[id].bal -= amt;   // critical section
      give_cash();
    }
    release(lock);

A Synchronized Execution

    Time  Thread 0               Thread 1               Mem
          call acquire(lock)                            500
          0: addi r1,accts,r3
          1: ld 0(r3),r4
          2: blt r4,r2,6
          3: sub r4,r2,r4
          <<< Interrupt >>>
                                 call acquire(lock)     Spins!
                                 <<< Interrupt >>>
          4: st r4,0(r3)                                400
          call release(lock)
          5: call give_cash
                                 (still in acquire)
                                 0: addi r1,accts,r3
                                 1: ld 0(r3),r4
                                 2: blt r4,r2,6
                                 3: sub r4,r2,r4
                                 4: st r4,0(r3)         300
                                 5: call give_cash

• Fixed! But how do we implement acquire & release?

Strawman Lock (Incorrect)

• Spin lock: a software lock implementation
• acquire(lock): while (lock != 0); lock = 1;
  • "Spin" while lock is 1, waiting for it to turn 0

    A0: ld 0(&lock),r6
    A1: bnez r6,A0
    A2: addi r6,1,r6
    A3: st r6,0(&lock)
    CRITICAL_SECTION

• release(lock): lock = 0;

    R0: st r0,0(&lock)   // r0 holds 0

Strawman Lock (Incorrect)

    Time  Thread 0               Thread 1               Mem
          A0: ld 0(&lock),r6                            0
          A1: bnez r6,A0
          A2: addi r6,1,r6
                                 A0: ld 0(&lock),r6
                                 A1: bnez r6,A0
          A3: st r6,0(&lock)                            1
          CRITICAL_SECTION
                                 A2: addi r6,1,r6
                                 A3: st r6,0(&lock)     1
                                 CRITICAL_SECTION

• The spin lock makes intuitive sense, but it doesn't actually work
  • The loads/stores of two acquire sequences can be interleaved
  • The lock acquire sequence is not atomic
  • Same problem as before!
  • Note: release is trivially atomic

A Correct Implementation: SYSCALL Lock

    ACQUIRE_LOCK:
    A1: disable_interrupts
    A2: ld r6,0(&lock)
    A3: bnez …

Better Spin Lock: Use Atomic Swap

• The ISA provides an atomic lock-acquisition instruction
  • Example: atomic swap: swap r1,0(&lock) (effectively: mov r1->r2 …)

(The source excerpt ends here; concrete sketches of both locking ideas follow below.)
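Picking up the atomic-swap idea, here is a minimal C11 sketch of a spin lock in which atomic_exchange plays the role of the swap instruction; this illustrates the technique and is not the slides' exact implementation:

    #include <stdatomic.h>

    static atomic_int lock = 0;   /* 0 = free, 1 = held */

    static void acquire(atomic_int *l) {
        /* Atomically swap 1 into the lock. If the old value was already 1,
           another thread holds the lock, so spin and retry. */
        while (atomic_exchange(l, 1) == 1)
            ;   /* spin */
    }

    static void release(atomic_int *l) {
        atomic_store(l, 0);   /* release is a single store, trivially atomic */
    }

Because the read of the old value and the write of 1 happen as one indivisible operation, two interfering acquires can no longer interleave the way the strawman lock's load and store did.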
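In practice, application code would use a library lock rather than rolling its own. Here is a sketch of the running withdrawal example with a POSIX mutex; MAX_ACCT and the give_cash stub are illustrative assumptions:

    #include <pthread.h>

    #define MAX_ACCT 1024

    struct acct_t { int bal; };
    static struct acct_t accts[MAX_ACCT];                      /* shared */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* shared lock */

    static void give_cash(void) { /* dispense bills (stub) */ }

    /* One withdrawal transaction; id and amt are private to each caller. */
    static void withdraw(int id, int amt) {
        pthread_mutex_lock(&lock);        /* acquire(lock) */
        if (accts[id].bal >= amt) {
            accts[id].bal -= amt;         /* critical section */
            give_cash();
        }
        pthread_mutex_unlock(&lock);      /* release(lock) */
    }

A single global lock serializes every account; a per-account lock would admit more concurrency, at the cost of the locking gotchas previewed in the unit outline.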