MIMD & Shared-Nothing Systems


Programmierung Paralleler und Verteilter Systeme (PPV), Sommer 2015
Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze

Multiple Instruction Multiple Data (MIMD)
Most common parallel hardware architecture today
■ Example: all many-core processors, clusters, distributed systems
From the software perspective [Pfister]:
■ SPMD - Single Program Multiple Data
□ Sometimes denoted as 'application cluster'
□ Examples: load-balancing cluster or failover cluster for databases, web servers, application servers, ...
■ MPMD - Multiple Program Multiple Data
□ Multiple implementations work together on one parallel computation
□ Example: master/worker cluster, map/reduce framework

MIMD Classification
(Figure: taxonomy of MIMD systems)

Memory Architectures
■ Uniform Memory Access (UMA)
■ Non-Uniform Memory Access (NUMA)
■ Distributed Memory
■ Hybrid

Shared Memory vs. Distributed Memory Systems
Shared memory (SM) systems
■ SM-SIMD: single-CPU vector processors
■ SM-MIMD: multi-CPU vector processors, OpenMP
■ Variant: clustered shared-memory systems (NEC SX-6, Cray SV1ex)
Distributed memory (DM) systems
■ DM-SIMD: processor-array machines; lock-step approach; front processor and control processor
■ DM-MIMD: large variety in interconnection networks
Distributed (virtual) shared-memory systems
■ High-Performance Fortran, TreadMarks

Shared Memory Architectures
All processors act independently and access the same global address space; changes in one memory location are visible to all others.
Uniform memory access (UMA) system
■ Equal load and store access for all processors to all memory
■ Default approach for the majority of SMP systems in the past
Non-uniform memory access (NUMA) system
■ Delay on memory access depends on the accessed region
■ Typically realized by a processor interconnection network and local memories
■ Cache-coherent NUMA (CC-NUMA), completely implemented in hardware
■ About to become the standard approach with recent x86 chips

NUMA Classification

MIMD Computer Systems
■ Sequent Balance

Sequent Symmetry
Sequent was bought by IBM in 1999. IBM produced several Intel-based servers based on Sequent's later NUMA architecture.

Caches – managing bus contention
Effect of write-through and write-back cache coherency protocols on the Sequent Symmetry

Intel Paragon XP/S
■ i860 RISC processor (64 bit, 50 MHz, 75 MFlops)
■ Standard OS (Mach) on each node
■ "Cluster in a box"

Intel Paragon XP/S – interconnection network

Intel Paragon XP/S – partitioning

IBM SP/2

Example: Intel Nehalem SMP System
(Figure: two quad-core sockets, each with its own L3 cache, memory controller and local memory, connected by QPI links to each other and to the I/O hubs)

An Intel Nehalem Cluster: SMP + NUMA
(Figure: several Nehalem SMP nodes connected by a network)

CC-NUMA
Still the SMP programming model, but non-NUMA-aware software scales poorly.
Different implementations lead to a diffuse understanding of "node"; typical definition:
■ Region of memory where every byte has the same distance from each processor
Tackles the scalability problems of pure SMP architectures while keeping the location-independence promise.
Recent research tendency towards non-cache-coherent NUMA approaches (Intel Single-Chip Cloud Computer).
(Figure: processors A-D with private caches, attached to memories via a high-speed interconnect)
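
One common way to make shared-memory code NUMA-aware, as the scaling remark above suggests, is first-touch page placement: the operating system places each memory page on the NUMA node of the thread that first writes it. A minimal C/OpenMP sketch (illustrative only, not from the slides; the array size and static schedule are arbitrary choices):

    #include <stdlib.h>

    #define N (1 << 24)

    int main(void)
    {
        double *a = malloc(N * sizeof *a);      /* pages are not placed yet */

        /* First touch: each thread initializes the block it will work on later,
           so the OS places those pages on that thread's NUMA node. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 0.0;

        /* Same static schedule => same thread-to-data mapping => mostly local accesses. */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < N; i++)
            a[i] = 2.0 * a[i] + 1.0;

        free(a);
        return 0;
    }

Because both loops use the same static schedule, the thread that initialized a block of the array is also the one that later computes on it, so most memory traffic stays on the local memory controller.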
Scalable Coherent Interface (SCI)
ANSI / IEEE standard for a NUMA interconnect, used in the HPC world
■ 64-bit global address space, translation by the SCI bus adapter (I/O window)
■ Used as a 2D / 3D torus
(Figure: processors with caches and memories coupled through SCI caches and SCI bridges)

Experimental Approaches
■ Systolic arrays
■ Data-flow architectures
■ Problem: common clock – the maximum signal path is restricted by the frequency
■ Fault contention: a single faulty processing element will break the entire machine

Parallel Processing
■ Inside the processor
□ Instruction-level parallelism (ILP)
□ Multicore
□ Shared memory
■ With multiple processing elements in one machine
□ Multiprocessing
□ Shared memory
■ With multiple processing elements in many machines
□ Multicomputer
□ Shared nothing (in terms of a globally accessible memory)

Clusters
■ Collection of stand-alone machines connected by a local network
□ Cost-effective technique for building a large-scale parallel computer
□ Low cost of both hardware and software
□ Users are builders, have control over their own system (hardware and software); low cost as a major issue
■ Distributed processing as an extension of DM-MIMD
□ Communication orders of magnitude slower than with shared memory
□ Only feasible for coarse-grained parallel activities
(Figure: web servers behind an Internet-facing load balancer as a cluster example)

History of Clusters
■ 1977: ARCnet (Datapoint)
□ LAN protocol, DATABUS programming language
□ Single computer with terminals
□ Addition of 'compute resource' and 'data resource' computers transparent to the application
■ May 1983: VAXCluster (DEC)
□ Cluster of VAX computers, no single point of failure
□ Everything duplicated
□ High-speed messaging interconnect
□ Distributed version of the VMS OS
□ Distributed lock manager for shared resources

History of Clusters – NOW
■ Berkeley Network Of Workstations (NOW) – 1995
■ Building a large-scale parallel computing system with COTS hardware
■ GLUnix operating system
□ Transparent remote execution, network PIDs
□ Load balancing
□ Virtual node numbers (for communication)
■ Network RAM – idle machines as paging device
■ Collection of low-latency, parallel communication primitives – 'active messages'
■ Berkeley sockets, shared address space parallel C, MPI

Cluster System Classes
■ High-availability (HA) clusters – improvement of cluster availability
□ Linux-HA project (multi-protocol heartbeat, resource grouping)
■ Load-balancing clusters – server farm for increased performance / availability
□ Linux Virtual Server (IP load & application-level balancing)
■ High-performance computing (HPC) clusters – increased performance by splitting tasks among different nodes
□ Speed up the computation of one distributed job (FLOPS)
■ High-throughput computing (HTC) clusters – maximize the number of finished jobs
□ All kinds of simulations, especially parameter sweeps
□ Special case: idle-time computing for cycle harvesting
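
Since cluster communication is orders of magnitude slower than shared-memory access, work is split into coarse-grained chunks and only small results cross the network. A minimal MPI sketch in C (illustrative only, not part of the original slides):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each node computes a large local chunk of work ... */
        long long local = 0, total = 0;
        for (long long i = rank; i < 100000000; i += size)
            local += i;

        /* ... and only one small partial result crosses the slow interconnect. */
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %lld\n", total);

        MPI_Finalize();
        return 0;
    }

The communication-to-computation ratio stays low because each process sends a single integer regardless of how much local work it did, which is what makes coarse-grained parallelism feasible on a cluster.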
Massively Parallel Processing (MPP)
■ Hierarchical SIMD / MIMD architecture with a lot of processors
□ Still standard components (in contrast to mainframes)
□ Specialized setup of these components
□ Host nodes responsible for loading program and data to the PEs
□ High-performance interconnect (bus, ring, 2D mesh, hypercube, tree, ...)
□ For embarrassingly parallel applications, mostly simulations (atom bomb, climate, earthquake, airplane, car crash, ...)
■ Examples
□ Distributed Array Processor (1979), 64x64 single-bit PEs
□ BlueGene/L (2007), 106,496 nodes x 2 PowerPC (700 MHz)
□ IBM Sequoia (2012), 16.3 PFlops, 1.6 PB memory, 98,304 compute nodes, 1.6 million cores, 7890 kW power

Blue Gene/L and Blue Gene/P (excerpt from the IBM Redbook "Evolution of the IBM System Blue Gene Solution")
The Blue Gene/P system has the familiar, slanted profile that was introduced with the Blue Gene/L system. However, the increased compute power requires an increase in airflow, resulting in a larger footprint: each air plenum on the Blue Gene/P system is just over ten inches wider than on the previous model, and each Blue Gene/P rack is approximately four inches wider. Two additional bulk power modules are mounted in the bulk power enclosure on top of the rack. Rather than a circuit-breaker style switch, an on/off toggle switch powers on the machine.
Blue Gene/L packaging (Figure 1-1):
■ Chip: 2 processors, 2.8/5.6 GF/s, 4 MB
■ Compute card: 2 chips (1x2x1), 5.6/11.2 GF/s, 1.0 GB
■ Node card: 32 chips (4x4x2), 16 compute and 0-2 I/O cards, 90/180 GF/s, 16 GB
■ Rack: 32 node cards, 2.8/5.6 TF/s, 512 GB
■ System: 64 racks (64x32x32), 180/360 TF/s, 32 TB
The changes in the Blue Gene/P packaging start at the lowest point of the chain. Each chip is made up of four processors rather than two as in the Blue Gene/L system. At the next level, only one chip is on each compute (processor) card. This design is easier to maintain with less waste: on the Blue Gene/L system, replacing a compute node because of a single failed processor requires discarding one usable chip, because the chips are packaged two per card; the Blue Gene/P design has only one chip per processor card, eliminating the disposal of a good chip when a compute card is replaced. Each node card still has 32 chips, but the maximum number of I/O nodes per node card is now two, so only two Ethernet ports are on the front of each node card. Like the Blue Gene/L system, there are two midplanes per rack.

Blue Gene/P
■ Chip: 4 processors, 13.6 GF/s, 8 MB EDRAM
■ Compute card: 1 chip, 20 DRAMs, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB as of 6/30/08)
■ Node card: 32 chips (4x4x2), 32 compute and 0-1 I/O cards, 435 GF/s, 64 (128) GB
■ Rack: 32 node cards, cabled 8x8x16, 13.9 TF/s, 2 (4) TB
■ System: 72 racks (72x32x32), 1 PF/s, 144 (288) TB

Blue Gene/Q
1. Chip: 16+2 µP cores
2. Single-chip module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards with 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
7. System: 96 racks, 20 PF/s
■ Sustained single-node performance: 10x P, 20x L
■ MF/Watt: (6x) P, (10x) L (~2 GF/W, Green 500 criteria)
■ Software and hardware support for programming models for exploitation of node hardware concurrency
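
As a sanity check, the Blue Gene/P peak-performance figures above compose multiplicatively from chip to system (the 850 MHz clock of the PowerPC 450 cores and the 4 flops per cycle are assumptions, not stated on the slide):

    13.6 GF/s per chip        ≈ 4 cores x 850 MHz x 4 flops/cycle
    13.6 GF/s x 32 chips      ≈ 435 GF/s per node card
    435 GF/s x 32 node cards  ≈ 13.9 TF/s per rack
    13.9 TF/s x 72 racks      ≈ 1 PF/s for the full system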
Recommended publications
  • 2.5 Classification of Parallel Computers
2.5 Classification of Parallel Computers
2.5.1 Granularity
In parallel computing, granularity means the amount of computation in relation to communication or synchronisation. Periods of computation are typically separated from periods of communication by synchronization events.
• fine level (same operations with different data)
◦ vector processors
◦ instruction-level parallelism
◦ fine-grain parallelism:
– Relatively small amounts of computational work are done between communication events
– Low computation to communication ratio
– Facilitates load balancing
– Implies high communication overhead and less opportunity for performance enhancement
– If granularity is too fine, it is possible that the overhead required for communication and synchronization between tasks takes longer than the computation itself
• operation level (different operations simultaneously)
• problem level (independent subtasks)
◦ coarse-grain parallelism:
– Relatively large amounts of computational work are done between communication/synchronization events
– High computation to communication ratio
– Implies more opportunity for performance increase
– Harder to load balance efficiently
2.5.2 Hardware: Pipelining
(was used in supercomputers, e.g. the Cray-1)
With N elements to process and L clock cycles needed per element, the pipelined calculation takes roughly L + N cycles; without a pipeline it takes L * N cycles.
Example of good code for pipelining:

    do i = 1, k
        z(i) = x(i) + y(i)
    end do

Vector processors offer fast vector operations (operations on arrays). The previous example is also good for a vector processor (vector addition), but recursion, for example, is hard to optimise for vector processors. Example: Intel MMX – a simple vector processor.
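
The Intel MMX remark above has a modern analogue in the SSE instruction set. A short C sketch using SSE intrinsics performs the same vector addition four single-precision elements per instruction (illustrative only; assumes an x86 CPU with SSE and that n is a multiple of 4):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* z[i] = x[i] + y[i], four floats per instruction. */
    void vec_add(const float *x, const float *y, float *z, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(&x[i]);          /* load 4 elements of x */
            __m128 vy = _mm_loadu_ps(&y[i]);          /* load 4 elements of y */
            _mm_storeu_ps(&z[i], _mm_add_ps(vx, vy)); /* add and store 4 results */
        }
    }

As with the pipelined loop, every iteration is independent, which is exactly the regular structure a vector unit needs; a recursive dependence between iterations would prevent this transformation.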
  • Chapter 5 Multiprocessors and Thread-Level Parallelism
Computer Architecture: A Quantitative Approach, Fifth Edition. Chapter 5: Multiprocessors and Thread-Level Parallelism. Copyright © 2012, Elsevier Inc.
Contents: 1. Introduction, 2. Centralized SMA – shared memory architecture, 3. Performance of SMA, 4. DMA – distributed memory architecture, 5. Synchronization, 6. Models of Consistency.
1. Introduction. Why multiprocessors? The need for more computing power: data-intensive applications, and utility computing requires powerful processors. There are several ways to increase processor performance: increasing the clock rate has limited ability, and architectural ILP / CPI improvements are increasingly more difficult, so multi-processor and multi-core systems are more feasible based on current technologies. Advantage of multiprocessors and multi-core: replication rather than unique design.
Multiprocessor types: Symmetric multiprocessors (SMP) share a single memory with uniform memory access/latency (UMA) and have a small number of cores. Distributed shared memory (DSM) distributes memory among processors, with non-uniform memory access/latency (NUMA); processors are connected via direct (switched) and non-direct (multi-hop) interconnection networks.
Important ideas: technology drives the solutions; multi-cores have altered the game; thread-level parallelism (TLP) vs ILP; computing and communication are deeply intertwined; write serialization exploits broadcast communication
  • Chapter 5 Thread-Level Parallelism
Chapter 5: Thread-Level Parallelism. Instructor: Josep Torrellas, CS433. Copyright Josep Torrellas 1999, 2001, 2002, 2013
Progress towards multiprocessors: the rate of speed growth in uniprocessors has saturated; wide-issue processors are very complex and consume a lot of power; there is steady progress in parallel software, the major obstacle to parallel processing.
Flynn's classification of parallel architectures, according to the parallelism in the instruction (I) and data (D) streams:
• Single I stream, single D stream (SISD): uniprocessor
• Single I stream, multiple D streams (SIMD): the same instruction is executed by multiple processors using different data
– Each processor has its own data memory
– There is a single control processor that sends the same instruction to all processors
– These processors are usually special purpose
• Multiple I streams, single D stream (MISD): no commercial machine
• Multiple I streams, multiple D streams (MIMD)
– Each processor fetches its own instructions and operates on its own data
– Architecture of choice for general-purpose multiprocessors
– Flexible: can be used in single-user mode or multiprogrammed
– Uses off-the-shelf microprocessors
MIMD machines:
1. Centralized shared-memory architectures
– Small numbers of processors (up to about 16-32)
– Processors share a centralized memory, usually connected by a bus
– Also called UMA machines (Uniform Memory Access)
2. Machines with physically distributed memory (Figures 5.1 and 5.2)
– Support many processors
– Memory is distributed among processors
– Scales the memory bandwidth if most of the accesses are to local memory
– Also reduces the memory latency
– Of course, interprocessor communication is more costly and complex
– Often each node is a cluster (bus-based multiprocessor)
– 2 types, depending on the method used for interprocessor communication
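
The MIMD definition above ("each processor fetches its own instructions and operates on its own data") can be illustrated with two POSIX threads that run different functions at the same time; a minimal C sketch (illustrative only; the worker functions are hypothetical):

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double data[N];

    /* Thread 1: one instruction stream, sums the array. */
    static void *sum_worker(void *arg)
    {
        static double sum = 0.0;
        for (int i = 0; i < N; i++) sum += data[i];
        return &sum;
    }

    /* Thread 2: a completely different instruction stream, finds the maximum. */
    static void *max_worker(void *arg)
    {
        static double max = 0.0;
        for (int i = 0; i < N; i++) if (data[i] > max) max = data[i];
        return &max;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) data[i] = (i % 97) * 0.5;

        pthread_t t1, t2;
        pthread_create(&t1, NULL, sum_worker, NULL);
        pthread_create(&t2, NULL, max_worker, NULL);

        void *r1, *r2;
        pthread_join(t1, &r1);
        pthread_join(t2, &r2);
        printf("sum=%.1f max=%.1f\n", *(double *)r1, *(double *)r2);
        return 0;
    }

Under SIMD both processing elements would have to execute the same operation in lock step; under MIMD each core is free to follow its own control flow.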
  • Computer Hardware Architecture Lecture 4
Computer Hardware Architecture, Lecture 4. Manfred Liebmann, Technische Universität München, Chair of Optimal Control, Center for Mathematical Sciences, M17, [email protected]. November 10, 2015
Reading list:
• Pacheco – An Introduction to Parallel Programming (Chapters 1-2): introduction to computer hardware architecture from the parallel programming angle
• Hennessy, Patterson – Computer Architecture: A Quantitative Approach: reference book for computer hardware architecture
All books are available on the Moodle platform!
UMA architecture (Figure 1: a uniform memory access (UMA) multicore system): access times to main memory are the same for all cores in the system.
NUMA architecture (Figure 2: a non-uniform memory access (NUMA) multicore system): access times to main memory differ from core to core depending on the proximity of the main memory. This architecture is often used in dual- and quad-socket servers, due to improved memory bandwidth.
Cache coherence (Figure 3: a shared memory system with two cores and two caches): what happens if the same data element z1 is manipulated in two different caches? The hardware enforces cache coherence, i.e. consistency between the caches. Expensive!
False sharing: the cache coherence protocol works on the granularity of a cache line. If two threads manipulate different elements within a single cache line, the cache coherency protocol is activated to ensure consistency, even if every thread is only manipulating its own data.
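
The false-sharing effect described above is easy to reproduce. A small C/OpenMP sketch (illustrative only; the 64-byte pad assumes a typical x86 cache-line size):

    #include <omp.h>
    #include <stdio.h>

    #define ITER 100000000L

    /* Two counters in the same cache line: every increment by one thread
       invalidates the line in the other thread's cache. */
    static struct { long a; long b; } same_line;

    /* Padding pushes the second counter onto a different cache line. */
    static struct { long a; char pad[64]; long b; } padded;

    int main(void)
    {
        double t0 = omp_get_wtime();
        #pragma omp parallel sections
        {
            #pragma omp section
            for (long i = 0; i < ITER; i++) same_line.a++;
            #pragma omp section
            for (long i = 0; i < ITER; i++) same_line.b++;
        }
        printf("same cache line: %.2f s\n", omp_get_wtime() - t0);

        t0 = omp_get_wtime();
        #pragma omp parallel sections
        {
            #pragma omp section
            for (long i = 0; i < ITER; i++) padded.a++;
            #pragma omp section
            for (long i = 0; i < ITER; i++) padded.b++;
        }
        printf("padded:          %.2f s\n", omp_get_wtime() - t0);
        return 0;
    }

Both loops are logically independent, yet the unpadded version keeps the coherence protocol busy bouncing one cache line between the cores; the padded version typically runs several times faster.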
  • Massively Parallel Computing with CUDA
Massively Parallel Computing with CUDA. Antonino Tumeo, Politecnico di Milano
"GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs." – Jack Dongarra, Professor, University of Tennessee; author of "Linpack"
Why use the GPU? The GPU has evolved into a very flexible and powerful processor: it is programmable using high-level languages, it supports 32-bit and 64-bit floating point IEEE-754 precision, it offers lots of GFLOPS, and there is a GPU in every PC and workstation.
What is behind such an evolution? The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control (figure: CPU die area dominated by control logic and cache vs. GPU die area dominated by ALUs). The fast-growing video game industry exerts strong economic pressure that forces constant innovation.
GPUs: each NVIDIA GPU has 240 parallel cores and 1.4 billion transistors. Within each core there is a floating point unit, a logic unit (add, sub, mul, madd), a move/compare unit and a branch unit. Cores are managed by a thread manager that can spawn and manage 12,000+ threads per core with zero-overhead thread switching, giving 1 Teraflop of processing power.
Heterogeneous computing domains: graphics and massive data parallelism (parallel computing) on the GPU; instruction-level (sequential) work on the CPU
  • Parallel Computer Architecture
Parallel Computer Architecture. Introduction to Parallel Computing, CIS 410/510, Department of Computer and Information Science. Lecture 2 – Parallel Architecture
Outline:
• Parallel architecture types
• Instruction-level parallelism
• Vector processing
• SIMD
• Shared memory
◦ Memory organization: UMA, NUMA
◦ Coherency: CC-UMA, CC-NUMA
• Interconnection networks
• Distributed memory
• Clusters
• Clusters of SMPs
• Heterogeneous clusters of SMPs
Parallel architecture types:
• Uniprocessor: scalar processor, vector processor, or Single Instruction Multiple Data (SIMD)
• Shared memory multiprocessor (SMP): shared memory address space, bus-based memory system or interconnection network
• Distributed memory multiprocessor: message passing between nodes, each node with its own memory and network interface
• Cluster of SMPs: shared memory addressing within an SMP node, message passing between SMP nodes
• Massively parallel processor (MPP): many, many processors; a cluster of SMPs can also be regarded as an MPP if the processor number is large
  • TrafficDB: HERE's High Performance Shared-Memory Data Store
TrafficDB: HERE's High Performance Shared-Memory Data Store. Ricardo Fernandes, Piotr Zaczkowski, Bernd Göttler, Conor Ettinoffe, Anis Moussa (HERE Global B.V.)
ABSTRACT: HERE's traffic-aware services enable route planning and traffic visualisation on web, mobile and connected car applications. These services process thousands of requests per second and require efficient ways to access the information needed to provide a timely response to end-users. The characteristics of road traffic information and these traffic-aware services require storage solutions with specific performance features. A route planning application utilising traffic congestion information to calculate the optimal route from an origin to a destination might hit a database with millions of queries per second. However, existing storage solutions are not prepared to handle such volumes of concurrent read operations, as well as to provide the desired vertical scalability. This paper presents TrafficDB, a shared-memory data store, designed to provide high rates of read operations, enabling applications to directly access the data from memory. Our evaluation demonstrates that TrafficDB handles millions of read operations and provides near-linear scalability on multi-core machines, where additional processes can be spawned to increase the systems' throughput without a noticeable impact on the latency of querying the data store. (Figure 1: HERE Location Cloud is accessible worldwide from web-based applications, mobile devices and connected cars.)
  • A Review of Multicore Processors with Parallel Programming
International Journal of Engineering Technology, Management and Applied Sciences, www.ijetmas.com, September 2015, Volume 3, Issue 9, ISSN 2349-4476
A Review of Multicore Processors with Parallel Programming. Anchal Thakur (Research Scholar, CSE Department, L.R. Institute of Engineering and Technology, Solan, India), Ravinder Thakur (Assistant Professor, CSE Department, L.R. Institute of Engineering and Technology, Solan, India)
ABSTRACT: When computers were first introduced in the market, they came with single processors, which limited the performance and efficiency of the computers. The classic way of overcoming the performance issue was to use bigger processors for executing the data with higher speed. Big processors did improve the performance to a certain extent, but these processors consumed a lot of power, which started overheating the internal circuits. To achieve efficiency and speed simultaneously, CPU architects developed multicore processor units in which two or more processors are used to execute a task. The multicore technology offers better response time while running big applications, better power management and faster execution time. Multicore processors also give developers an opportunity to use parallel programming to execute tasks in parallel. These days parallel programming is used to execute a task by distributing it into smaller instructions and executing them on different cores. By using parallel programming, the complex tasks that are carried out in a multicore environment can be executed with higher efficiency and performance.
Keywords: Multicore Processing, Multicore Utilization, Parallel Processing.
INTRODUCTION: From the day computers were invented, great importance has been given to their efficiency at executing tasks.
  • Vector Vs. Scalar Processors: a Performance Comparison Using a Set of Computational Science Benchmarks
Vector vs. Scalar Processors: A Performance Comparison Using a Set of Computational Science Benchmarks. Mike Ashworth, Ian J. Bush and Martyn F. Guest, Computational Science & Engineering Department, CCLRC Daresbury Laboratory
ABSTRACT: Despite a significant decline in their popularity in the last decade, vector processors are still with us, and manufacturers such as Cray and NEC are bringing new products to market. We have carried out a performance comparison of three full-scale applications: the first, SBLI, a Direct Numerical Simulation code from Computational Fluid Dynamics; the second, DL_POLY, a molecular dynamics code; and the third, POLCOMS, a coastal-ocean model. Comparing the performance of the Cray X1 vector system with two massively parallel (MPP) micro-processor-based systems we find three rather different results. The SBLI PCHAN benchmark performs excellently on the Cray X1 with no code modification, showing 100% vectorisation and significantly outperforming the MPP systems. The performance of DL_POLY was initially poor, but we were able to make significant improvements through a few simple optimisations. The POLCOMS code has been substantially restructured for cache-based MPP systems and now does not vectorise at all well on the Cray X1, leading to poor performance. We conclude that both vector and MPP systems can deliver high performance levels but that, depending on the algorithm, careful software design may be necessary if the same code is to achieve high performance on different architectures.
KEYWORDS: vector processor, scalar processor, benchmarking, parallel computing, CFD, molecular dynamics, coastal ocean modelling
1. Introduction: All of the key computational science groups in the UK made use of vector supercomputers during their halcyon days of the 1970s, 1980s and into the early 1990s [1]-[3]. Vector computers entered the scene at a very early
  • Threading SIMD and MIMD in the Multicore Context: the UltraSPARC T2
SIMD and MIMD in the Multicore Context
Overview (note: Tute 02 this Weds – handouts):
• Flynn's Taxonomy:

                      Single Instruction   Multiple Instruction
    Single Data       SISD                 MISD
    Multiple Data     SIMD                 MIMD

• multicore architecture concepts: hardware threading; SIMD vs MIMD in the multicore context
• T2: design features for multicore – system on a chip; execution: (in-order) pipeline, instruction latency; thread scheduling; caches: associativity, coherence, prefetch; memory system: crossbar, memory controller; speculation; power savings; OpenSPARC
• T2 performance (why the T2 is designed as it is)
• the Rock processor (slides by Andrew Over; ref: Tremblay, IEEE Micro 2009)
SIMD and MIMD in the multicore context:
• for SIMD, the control unit and processor state (registers) can be shared
• however, SIMD is limited to data parallelism (through multiple ALUs); algorithms need a regular structure, e.g. dense linear algebra, graphics
• SSE2, Altivec, Cell SPE (128-bit registers); e.g. a 4x32-bit add: Rx holds x3 x2 x1 x0, Ry holds y3 y2 y1 y0, and Rz receives z3 z2 z1 z0 with zi = xi + yi
• design requires massive effort; requires support from a commodity environment
• massive parallelism (e.g. nVidia GPGPU), but memory is still a bottleneck
• multicore (CMT) is MIMD; hardware threading can be regarded as MIMD
• higher hardware costs; larger shared resources (caches, TLBs) are also needed => less parallelism than for SIMD
Hardware (multi)threading:
• recall concurrent execution on a single CPU: switching between threads (or processes) requires saving (in memory) the thread state (register values)
• motivation: utilize the CPU better when a thread is stalled for I/O (6300 Lect O1, p9-10)
• what are the costs? do the same for smaller stalls? (e.g.
The UltraSPARC T2: System on a Chip (OpenSparc Slide Cast Ch 5: p79-81, 89)
• aggressively multicore: 8 cores, each with 8-way hardware threading (64 virtual CPUs)
  • Computer Architecture: Parallel Processing Basics
Computer Architecture: Parallel Processing Basics. Onur Mutlu & Seth Copen Goldstein, Carnegie Mellon University, 9/9/13
Today: What is parallel processing? Why? Kinds of parallel processing; multiprocessing and multithreading; measuring success: speedup, Amdahl's Law, bottlenecks to parallelism.
Concurrent systems span embedded-physical systems (sensor networks, Claytronics), geographically distributed systems (the power grid, the Internet), cloud computing (EC2, Tashi) and parallel systems. A rough comparison of their attributes:

                             Physical   Geographical   Cloud    Parallel
    Geophysical location     +++        ++             ---      ---
    Relative location        +++        +++            +        -
    Faults                   ++++       +++            ++++     --
    Number of processors     +++        +++            +        -
    Network structure        varies     varies         fixed    fixed
    Network connectivity     ---        ---            +        +

Concurrent system challenge: programming. The old joke: how long does it take to write a parallel program? One graduate-student year.
Parallel programming again?? Increased demand (multicore), increased scale (cloud), improved compute/communicate ratio, change in application focus (irregular, recursive data structures).
Why parallel computers? Parallelism: doing multiple things at a time. Things: instructions,
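
Amdahl's Law, referenced in the outline above, bounds the achievable speedup. With a parallel fraction f of the program run on n processors:

    S(n) = 1 / ((1 - f) + f / n)

For example, f = 0.9 on n = 16 processors gives S = 1 / (0.1 + 0.9/16) = 6.4, and even with unlimited processors the speedup cannot exceed 1 / (1 - f) = 10.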
  • Thread-Level Parallelism I
Great Ideas in Computer Architecture (a.k.a. Machine Structures): Thread-Level Parallelism I. UC Berkeley Teaching Professor Dan Garcia, Professor Bora Nikolić. cs61c.org
Improving performance:
1. Increase the clock rate fs – has reached a practical maximum for today's technology (< 5 GHz for general-purpose computers)
2. Lower CPI (cycles per instruction) – SIMD, "instruction-level parallelism"
3. Perform multiple tasks simultaneously (today's lecture) – multiple CPUs, each executing a different program; tasks may be related (e.g. each CPU performs part of a big matrix multiplication) or unrelated (e.g. distribute different web http requests over different computers, or run pptx (view lecture slides) and a browser (youtube) simultaneously)
4. Do all of the above: high fs, SIMD, multiple parallel tasks
New-school machine structures – software harnesses hardware parallelism to achieve high performance: parallel requests assigned to a computer (e.g. search "Cats"), parallel threads assigned to a core (e.g. lookup, ads), parallel instructions (>1 instruction at one time, e.g. 5 pipelined instructions), parallel data (>1 data item at one time, e.g. an add of 4 pairs of words), and hardware descriptions (all gates work in parallel at the same time), spanning warehouse-scale computers, computers, cores, memory (cache), input/output, execution units / functional blocks, main memory and logic gates.
Parallel computer architectures: massive array
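
The "each CPU performs part of a big matrix multiplication" case above maps directly onto a parallel loop; a minimal C/OpenMP sketch (illustrative only; N and the static schedule are arbitrary choices):

    #define N 512

    /* C = A * B; each thread computes a contiguous block of rows of C. */
    void matmul(const double A[N][N], const double B[N][N], double C[N][N])
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }

No synchronization is needed inside the loop because the threads write to disjoint rows of C; the "related tasks" case from point 3 above is exactly this pattern.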