Parallel Processing SSC-0114 ARQUITETURA DE COMPUTADORES

Total pages: 16

File type: PDF, size: 1020 KB

Parallel Processing SSC-0114 ARQUITETURA DE COMPUTADORES – 28/09/2017

Recap
• Tesla P100 vs. V100 (the remaining recap slides are figures from the previous lecture).

News in architecture
• Intel takes aim at AI with a new neuromorphic chip: https://www.top500.org/news/intel-take-aim-at-ai-with-new-neuromorphic-chip-1/
• Paderborn University will offer an Intel CPU-FPGA cluster for researchers: https://www.top500.org/news/paderborn-university-will-offer-intel-cpu-fpga-cluster-for-researchers/

Agenda
• Parallel processing
• Parallel computing: Flynn's taxonomy
• Parallel computing
  – Symmetric multiprocessing (SMP)
  – Non-uniform memory access (NUMA)
  – Clusters

Parallel Processing
• Definition: using multiple processors simultaneously to solve a problem.
• Purpose: improve performance – a shorter run time, i.e., less time needed to solve the problem.
• Motivations:
  1. More complex problems
  2. Growing problem sizes
  3. Concerns about the limits of microelectronics
  4. Concerns about energy/power consumption
  5. ...
• Classifications?

Flynn's Taxonomy
Mike Flynn, "Very High-Speed Computing Systems," Proceedings of the IEEE, 1966. Four classes, defined by the number of instruction and data streams:
• SISD – a single instruction operates on a single data element. Corresponds to the von Neumann architecture (uniprocessor); includes ILP techniques such as superscalar and speculative execution.
• SIMD – a single instruction operates on multiple data elements. Examples: vector processors, GPUs (SIMT).
• MISD – multiple instructions operate on a single data element. Example: systolic arrays.
• MIMD – multiple instructions operate on multiple data elements. Examples: multiprocessors, multithreaded processors.

Multiprogramming and Multiprocessing
• SISD architectures (illustrated by figures in the slides).
• SIMD architectures: concurrency arises from performing the same operation on different pieces of data – data parallelism within the same process.
• MIMD architectures: concurrency arises from performing different operations on different pieces of data – control/thread parallelism, i.e., different threads of control executing in parallel (multithreading, multiprocessing).

Taxonomy of Parallel Processor Architectures (figure).

Multiple Instruction Multiple Data (MIMD)
• The most common parallel hardware architecture today. Examples: all many-core processors, clusters, distributed systems.
• In contrast to a SIMD computer, a MIMD computer can execute different programs on different processors – flexibility (a short sketch contrasting the two styles follows below).
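The SIMD/MIMD distinction can be made concrete in code. The fragment below is an illustrative sketch, not part of the original slides; it assumes a C compiler with OpenMP support (e.g., gcc -fopenmp). The first loop is data parallelism in the SIMD spirit – one operation applied to every element of an array – while the two sections afterwards are task parallelism in the MIMD spirit – different instruction streams running on different data.

```c
#include <omp.h>
#include <stdio.h>

#define N 1024

int main(void) {
    static float x[N], y[N], z[N];

    /* SIMD-style data parallelism: the same operation is applied to every
     * element; the compiler may map the loop onto vector instructions
     * and distribute iterations over threads. */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        x[i] = (float)i;
        y[i] = 2.0f * (float)i;
        z[i] = x[i] + y[i];
    }

    /* MIMD-style task parallelism: two threads execute different code
     * (a sum and a maximum search) on the shared array. */
    double sum = 0.0, maxval = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < N; i++) sum += z[i];

        #pragma omp section
        for (int i = 0; i < N; i++)
            if (z[i] > maxval) maxval = z[i];
    }

    printf("sum = %.1f, max = %.1f\n", sum, maxval);
    return 0;
}
```

A GPU would express the first loop in SIMT form, with one thread per element; the second pattern is what MIMD machines (multiprocessors, clusters) handle naturally.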
MIMD – From a Software Perspective [Gregory F. Pfister, 1993]
• SPMD – Single Program, Multiple Data
  – Sometimes denoted an "application cluster"
  – Examples: load-balancing or failover clusters for databases, web servers, application servers, ...
• MPMD – Multiple Program, Multiple Data
  – Multiple programs work together on one parallel computation
  – Examples: master/worker cluster, map/reduce framework

MIMD – Main Characteristics
• MIMD processors work asynchronously and do not have to synchronize with each other; local clocks can run at higher speeds.
• They can be built from commodity (off-the-shelf) microprocessors with relatively little effort.
• They are also highly scalable, provided that an appropriate memory organization is used.

SIMD vs. MIMD
• A SIMD computer requires less hardware than a MIMD computer (a single control unit).
• However, SIMD processors are specially designed and tend to be expensive, with long design cycles.
• Not all applications are naturally suited to SIMD processors.
• Conceptually, MIMD computers also cover the SIMD case: let all processors execute the same program (single program, multiple data streams – SPMD).
• On the other hand, SIMD architectures are more power efficient, since one instruction controls many computations. For regular vector computation it is still more appropriate to use a SIMD machine (e.g., as implemented in GPUs).

MIMD Communication Methods
1. Centralized (shared) memory architectures: the shared memory sits at a central location – usually consisting of several interleaved modules – at the same distance from every processor.
  – Symmetric multiprocessor (SMP)
  – Uniform memory access (UMA)
2. Distributed memory: memory is distributed to each processor, improving scalability.
  – Message-passing architectures: no processor can directly access another processor's memory.
  – Hardware distributed shared memory (DSM): memory is distributed, but the address space is shared – non-uniform memory access (NUMA).
  – Software DSM: an OS layer built on top of a message-passing multiprocessor that gives the programmer a shared-memory view.

MIMD with Shared Memory
• Bottleneck: memory access.
• Tightly coupled, not very scalable.
• Typically called multiprocessor systems.

MIMD with Distributed Memory
• Variants: shared nothing, shared disks (RAID).
• Loosely coupled with message passing, scalable.
• Typically called multicomputer systems.

Shared Memory Architectures
• All processors act independently but access the same global address space.
• Changes to one memory location are visible to all the others.
• Uniform memory access (UMA) system:
  – Equal load and store access for all processors to all of memory.
  – The default approach for the majority of SMP (symmetric) systems in the past.
• Non-uniform memory access (NUMA) system:
  – The delay of a memory access depends on the region being accessed.
  – Typically realized with a processor interconnection network and local memories.
  – Cache-coherent NUMA (CC-NUMA) is implemented completely in hardware.
  – About to become the standard approach with recent x86 chips (a first-touch allocation sketch follows below).

NUMA Classification (figure).
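On a CC-NUMA machine the physical placement of data matters for performance. The sketch below is an illustration, not part of the slides; it assumes Linux's default first-touch page placement and a C compiler with OpenMP. The arrays are initialized in parallel with the same static schedule later used for the computation, so each page is first touched – and therefore allocated – on the NUMA node of the thread that will use it.

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)          /* 16M doubles per array, spanning many pages */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;

    /* First touch: under the default first-touch policy, each page is
     * allocated on the NUMA node of the thread that faults it in first,
     * so the static schedule decides the placement. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Compute with the same schedule: each thread now reads memory that
     * is (mostly) local to its own NUMA node. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %.1f\n", sum);
    free(a);
    free(b);
    return 0;
}
```

Initializing the arrays serially instead would place all pages on the master thread's node and turn the interconnect into the bottleneck; tools such as numactl or libnuma give explicit control when the first-touch idiom is not enough.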
MIMD Approaches
• SMP (symmetric multiprocessors)
• NUMA architectures
• Clusters

Symmetric Multiprocessors (SMP)
• Originally used in science, industry, and business, where custom multithreaded software could be written.
• A stand-alone computer with the following traits:
  – Two or more processors of comparable capacity.
  – The processors share the same memory – a single memory or a set of memory modules – and the same I/O.
  – The processors are connected by a bus or another interconnection.
• Memory access time is approximately the same for each processor.
• All processors share access to I/O, either through the same channels or through different channels that lead to the same devices.
• All processors can perform the same functions (hence "symmetric").
• The system is controlled by an integrated operating system that provides the interaction between the processors.
• The performance improvement of an SMP will not be a multiple of the number of processors – it is choked by RAM access and cache coherence.

Organization of a Tightly Coupled Multiprocessor
• Individual processors are self-contained: each has its own control unit, ALU, registers, one or more levels of cache, and private main memory.
• Access to shared memory and to I/O devices goes through an interconnection network.
• Processors communicate through memory, in a common data area.
• Memory is often organized to provide simultaneous access to separate blocks of memory.
• Interconnection alternatives: time-shared (common) bus, multiport memory, central controller (arbitrator).

Time-Shared Bus
• A simple extension of the single-processor bus; the relationship can be likened to that between DMA and a single processor.
• Features provided:
  – Addressing: distinguish the modules on the bus.
  – Arbitration: any module can be a temporary master, so an arbitration scheme is needed.
  – Time sharing: while one module holds the bus, the others must wait and may have to suspend.
• There are now multiple processors, as well as multiple I/O modules, on a single bus (figure: SMP with bus time sharing).
• Waiting for the bus creates a bottleneck, so only a limited number of processors can be managed. How to improve memory access?
• Local CPU caches decrease bus usage, so more processors can be used – 16 or 32 (TANENBAUM; GOODMAN, 1999) – but how to deal with cache coherence? (figure: SMP with bus time sharing + cache)
• Advantages:
  – Simplicity: easy to understand, and its use has been well tested with DMA.
  – Flexibility: adding a processor is simply a matter of attaching it to the bus.
  – Reliability: as long as arbitration does not depend on a single controller, there is no single point of failure.
• Disadvantages:
  – Waiting for the bus creates a bottleneck. This can be mitigated with per-processor caches (usually L1 and L2), but then a cache coherence policy must be used, usually implemented in hardware (the false-sharing sketch at the end of this section illustrates the cost).

Multiport Memory
• Direct, independent access to the memory modules by each processor and I/O module.
• The memory handles the "arbitration": logic internal to the memory is required to resolve conflicts.
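The cache-coherence cost mentioned under the time-shared bus can be observed directly. The sketch below is illustrative, not from the slides; it assumes POSIX threads and a 64-byte cache line. Two threads increment two distinct counters: when the counters share a cache line, every increment invalidates the other core's copy and the line ping-pongs between caches (false sharing); padding each counter onto its own line removes the coherence traffic.

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L
#define LINE  64                      /* assumed cache-line size in bytes */

/* Two counters packed into the same cache line (false sharing)... */
static _Alignas(LINE) struct { volatile long a, b; } same_line;

/* ...versus two counters padded onto separate cache lines. */
static struct { volatile long v; char pad[LINE - sizeof(long)]; } padded[2];

static void *bump(void *arg) {
    volatile long *counter = arg;
    for (long i = 0; i < ITERS; i++)
        (*counter)++;                 /* each thread touches only its own counter */
    return NULL;
}

static void run(volatile long *c0, volatile long *c1, const char *label) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, (void *)c0);
    pthread_create(&t1, NULL, bump, (void *)c1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%-11s %ld %ld\n", label, *c0, *c1);
}

int main(void) {
    run(&same_line.a, &same_line.b, "same line:");   /* coherence traffic dominates */
    run(&padded[0].v, &padded[1].v, "padded:");      /* each line stays in one cache */
    return 0;
}
```

Timing the two runs (for example with clock_gettime) typically shows the padded version running several times faster on an SMP, even though both perform exactly the same number of increments.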