The Distribution of OpenCL Kernel Execution Across Multiple Devices

File type: PDF, 16 pages, 1020 KB

The Distribution of OpenCL Kernel Execution Across Multiple Devices

by Steven Gurfinkel

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Graduate Department of the Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto.

© Copyright 2014 by Steven Gurfinkel

Abstract

Many computer systems now include both CPUs and programmable GPUs. OpenCL, a new programming framework, can program individual CPUs or GPUs; however, distributing a problem across multiple devices is more difficult. This thesis contributes three OpenCL runtimes that automatically distribute a problem across multiple devices: DualCL and m2sOpenCL, which distribute tasks across a single system's CPU and GPU, and DistCL, which distributes tasks across a cluster's GPUs. DualCL and DistCL run on existing hardware; m2sOpenCL runs in simulation. On a system with a discrete GPU and on a system with integrated CPU and GPU devices, running programs from the Rodinia benchmark suite, DualCL improves performance over a single device when host memory is used. Running similar benchmarks, m2sOpenCL shows that reducing the overheads present in current systems improves performance. DistCL accelerates unmodified compute-intensive OpenCL kernels, obtained from Rodinia, AMD samples, and elsewhere, when distributing them across a cluster.

Acknowledgements

I must start by thanking my supervisors, professors Natalie Enright Jerger and Jason H. Anderson. Without their guidance, care, commitment, and knowledge this work would not have been possible. They always came through for me when I needed some last-minute help, and for that I am particularly grateful. It was an honour to be mentored by such great researchers. I must also thank my fellow graduate student Tahir Diop. Our close collaboration was both fun and productive. Furthermore, I thank my committee, professors Tarek S. Abdelrahman, Paul Chow, and Wai Tung Ng, for their valuable insights and feedback. Of course, I also thank my family and friends for their support throughout my degree. I thank both my parents for starting me on the path to science and engineering; their academic achievements serve as an inspiration for my own.

Contents

1 Introduction
  1.1 Contributions
  1.2 Outline
2 Background
  2.1 OpenCL
    2.1.1 Device Model
    2.1.2 Memory Coherence and Consistency
    2.1.3 Host Model
  2.2 Other Heterogeneous Programming Models
    2.2.1 CUDA
    2.2.2 BOLT
  2.3 The AMD Fusion APU
  2.4 Multi2Sim
  2.5 Related Work
    2.5.1 Distribution with Custom Interfaces
    2.5.2 Distribution of OpenCL Kernels
    2.5.3 Making Remote Devices Appear Local
    2.5.4 Characterizing and Simulating Heterogeneous Execution
3 Co-operative OpenCL Execution
  3.1 Outline
  3.2 Dividing Kernel Execution
  3.3 DualCL
    3.3.1 Memory Allocation
    3.3.2 Partitioning Kernels
  3.4 m2sOpenCL
  3.5 Partitioning Algorithms
    3.5.1 Static Partitioning
    3.5.2 Block by Block (block)
    3.5.3 NUMA (numa)
    3.5.4 First-Past-The-Post (first)
    3.5.5 Same Latency (same)
  3.6 Experimental Study
    3.6.1 Benchmarks
  3.7 Results and Discussion
    3.7.1 Memory Placement
    3.7.2 Partitioning Algorithms
    3.7.3 Individual Benchmarks
    3.7.4 Simulated Results
  3.8 Conclusions
4 CPU Execution for Multi2Sim
  4.1 Overall Design
    4.1.1 The OpenCL Device Model for CPUs
  4.2 Behaviour of a Work-Item
    4.2.1 Work-Item Stack Layout
  4.3 Fibres
  4.4 Scheduling and Allocating Work-Items
    4.4.1 Stack Allocation Optimization
  4.5 Whole-Kernel Execution and Scheduling
  4.6 Performance Validation
  4.7 Conclusion
5 Distributed OpenCL Execution
  5.1 Introduction
  5.2 DistCL
    5.2.1 Partitioning
    5.2.2 Dependencies
    5.2.3 Scheduling Work
    5.2.4 Transferring Buffers
  5.3 Experimental Setup
    5.3.1 Cluster
  5.4 Results and Discussion
  5.5 Conclusion
6 Conclusion
  6.1 Future Work
Appendices
A The Accuracy of the Multi2Sim CPU Device
  A.1 Microbenchmarks
    A.1.1 Register
    A.1.2 Dependent Add
    A.1.3 Independent Add
    A.1.4 Memory Throughput
    A.1.5 Memory Latency
  A.2 Issues and Best Configuration
B The DistCL and Metafunction Tool Manual
  B.1 Introduction
    B.1.1 Intended Audience
  B.2 Building DistCL
  B.3 Linking DistCL to a Host Program
    B.3.1 Linking Metafunctions
  B.4 Running DistCL
    B.4.1 distcl.conf
  B.5 Writing Metafunctions
    B.5.1 Using require_region
    B.5.2 The Tool
    B.5.3 Building the Tool
  B.6 Enhancing DistCL
    B.6.1 Partition Strategies
  B.7 Limitations
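The core mechanism behind all three runtimes, dividing one kernel's NDRange among devices, can be sketched in ordinary OpenCL host code. The snippet below is a minimal illustration, not DualCL's or DistCL's actual code: the function name and the static 50/50 split are assumptions, and it ignores the buffer-sharing problem the thesis's runtimes actually solve.

    /* Minimal sketch of NDRange splitting across two devices.
     * Not DualCL's or DistCL's implementation; a static 50/50
     * partition is assumed and buffer sharing is ignored. */
    #include <CL/cl.h>

    void run_split(cl_kernel kern, cl_command_queue q[2], size_t global_size)
    {
        size_t half      = global_size / 2;
        size_t offset[2] = { 0, half };
        size_t count[2]  = { half, global_size - half };

        for (int d = 0; d < 2; d++) {
            /* Each device runs a contiguous slice; get_global_id()
             * inside the kernel stays correct because the slice is
             * expressed through the global work offset. */
            clEnqueueNDRangeKernel(q[d], kern, 1, &offset[d], &count[d],
                                   NULL, 0, NULL, NULL);
        }
        for (int d = 0; d < 2; d++)
            clFinish(q[d]);
    }

The hard parts, which Chapter 3's partitioning algorithms and DistCL's metafunctions address, are choosing the split dynamically and transferring only the buffer regions each device will actually touch.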
Recommended publications
  • Heterogeneous Computing for Advanced Driver Assistance Systems
    TECHNISCHE UNIVERSITÄT MÜNCHEN, Lehrstuhl für Robotik, Künstliche Intelligenz und Echtzeitsysteme. Heterogeneous Computing for Advanced Driver Assistance Systems. Xiebing Wang. Complete reprint of the dissertation approved by the Faculty of Informatics of the Technische Universität München to obtain the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.). Chair: Prof. Dr. Daniel Cremers. Examiners of the dissertation: 1. Prof. Dr.-Ing. habil. Alois Knoll; 2. Assistant Prof. Xuehai Qian, Ph.D.; 3. Prof. Dr. Kai Huang. The dissertation was submitted to the Technische Universität München on 25.04.2019 and accepted by the Faculty of Informatics on 17.09.2019.
    Abstract: Advanced Driver Assistance Systems (ADAS) are an indispensable functionality in state-of-the-art intelligent cars, and the deployment of ADAS in automated driving vehicles will become a standard in the near future. Current research and development of ADAS still faces several problems. First of all, the huge amount of perception data captured by massive vehicular sensors poses a severe computation challenge for the implementation of real-time ADAS applications. Secondly, conventional automotive Electronic Control Units (ECUs) have to cope with knotty issues such as technology discontinuation and the consequent tedious hardware/software (HW/SW) maintenance. Lastly, ADAS should be seamlessly shifted towards a mixed and scalable system in which safety-, security-, and real-time-critical components must coexist with the less critical counterparts, while next-generation computation resources can still be added flexibly so as to provide sufficient computing capacity. This thesis gives a systematic study of applying the emerging heterogeneous computing techniques to the design of an automated driving module and the implementation of real-time ADAS applications.
  • Parallel Computer Architecture
    Parallel Computer Architecture. Introduction to Parallel Computing, CIS 410/510, Department of Computer and Information Science. Lecture 2 – Parallel Architecture.
    Outline: parallel architecture types; instruction-level parallelism; vector processing; SIMD; shared memory (memory organization: UMA, NUMA; coherency: CC-UMA, CC-NUMA); interconnection networks; distributed memory; clusters; clusters of SMPs; heterogeneous clusters of SMPs.
    Parallel architecture types: a uniprocessor (scalar processor, vector processor, or Single Instruction Multiple Data (SIMD) processor); a shared-memory multiprocessor (SMP), with a shared memory address space over a bus-based memory system or an interconnection network; a distributed-memory multiprocessor, with message passing between nodes, of which the Massively Parallel Processor (MPP), with very many processors, is the extreme case; and a cluster of SMPs, with shared-memory addressing within an SMP node and message passing between SMP nodes (which can also be regarded as an MPP if the processor count is large). [Architecture diagrams omitted.]
    Introduction to Parallel Computing, University of Oregon, IPCC.
  • AMD APP SDK V2.8.1
    AMD APP SDK v2.8.1 FAQ – General Questions.
    1. Do I need to use additional software with the SDK? To run an OpenCL™ application, you must have an OpenCL™ runtime on your system. If your system includes a recent AMD discrete GPU or an APU, you should also install the latest Catalyst™ drivers, which can be downloaded from AMD.com. Information on supported devices can be found at developer.amd.com/appsdk. If your system does not include a recent AMD discrete GPU or APU, the SDK installs a CPU-only OpenCL™ runtime. Also, we recommend using the debugging, profiling, and analysis tools contained in the AMD CodeXL heterogeneous compute tools suite.
    2. Which versions of the OpenCL™ standard does this SDK support? AMD APP SDK 2.8.1 supports the development of applications using the OpenCL™ Specification v1.2.
    3. Will applications developed to execute on OpenCL™ 1.1 still operate in an OpenCL™ 1.2 environment? OpenCL™ is designed to be backwards compatible. The OpenCL™ 1.2 runtime delivered with the AMD Catalyst drivers runs any OpenCL™ 1.1-compliant application. However, an OpenCL™ 1.2-compliant application will not execute on an OpenCL™ 1.1 runtime if APIs only supported by OpenCL™ 1.2 are used.
    4. Does AMD provide any additional OpenCL™ samples, other than those contained within the SDK? The most recent versions of all of the samples contained within the SDK are also available for individual download from the developer.amd.com/appsdk “Samples & Demos” page. This page also contains additional samples that either were too large to include in the SDK, or which have been developed since the most recent SDK release.
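    The backward-compatibility caveat in question 3 implies a runtime check before calling 1.2-only APIs. A minimal sketch of such a check, using only the standard clGetDeviceInfo query (this is not SDK sample code):

        /* Sketch: test whether a device supports OpenCL 1.2 before
         * using APIs that exist only in 1.2. Error handling omitted. */
        #include <CL/cl.h>
        #include <stdio.h>

        int device_supports_cl12(cl_device_id dev)
        {
            char ver[256];  /* CL_DEVICE_VERSION: "OpenCL <major>.<minor> ..." */
            clGetDeviceInfo(dev, CL_DEVICE_VERSION, sizeof ver, ver, NULL);

            int major = 0, minor = 0;
            sscanf(ver, "OpenCL %d.%d", &major, &minor);
            return major > 1 || (major == 1 && minor >= 2);
        }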
  • AMD APP SDK V2.9.1 Getting Started
    AMD APP SDK v2.9.1 Getting Started – Overview. The AMD APP SDK is provided to the developer community to accelerate programming in a heterogeneous environment by enabling AMD GPUs to work in concert with the system's x86 CPU cores. The SDK provides samples, documentation, and other materials to quickly get you started leveraging accelerated compute using OpenCL™, Bolt, OpenCV, or C++ AMP for your C/C++ application, or Aparapi for your Java application. This document provides instructions on using the AMD APP SDK: the necessary prerequisite installations, environment settings, and build and execute instructions for the samples. Quick links to the important sections: Section 2, “APP SDK on Windows” (2.1 Installation; 2.2 General Prerequisites; 2.3 OpenCL; 2.4 BOLT; 2.5 C++ AMP; 2.6 Aparapi; 2.7 OpenCV); Section 3, “APP SDK on Linux” (3.1 Installation; 3.2 General prerequisites; 3.3 OpenCL; 3.4 BOLT; 3.5 Aparapi; 3.6 OpenCV); Appendix A, “Important Notes”; Appendix C, “CMAKE”; Appendix D, “Building OpenCV from sources”.
    APP SDK on Windows – Installation. The AMD APP SDK 2.9.1 installer is delivered as a self-extracting installer for 32-bit and 64-bit systems on Windows. For details on how to install the APP SDK on Windows, see the AMD APP SDK Installation Notes document. The default installation path is C:\Users\<userName>\AMD APP SDK\<appSdkVersion>\.
  • Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach
    Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach. Paul Caheny, Lluc Alvarez, Said Derradji, Mateo Valero, Fellow, IEEE, Miquel Moretó, Marc Casas. IEEE Transactions on Parallel and Distributed Systems, DOI 10.1109/TPDS.2017.2787123.
    Abstract—Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat memory address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform comprising 288 cores through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol combined with runtime-managed NUMA-aware scheduling and data allocation techniques to make most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 3.14x to 9.97x and coherence traffic reductions of up to 99% in comparison to NUMA-oblivious scheduling and data allocation.
    Index Terms—Cache Coherence, NUMA, Task-based programming models.
    1 INTRODUCTION. The ccNUMA approach to memory system architecture has become a ubiquitous choice in the design space [10], [28]. … the data is allocated within the NUMA regions of the system …
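    To give a concrete flavour of the runtime-managed, NUMA-aware data allocation the abstract refers to, here is a minimal sketch using Linux's libnuma. The library choice is an assumption for illustration; the paper's own mechanism lives inside a task-based runtime rather than application code.

        /* Sketch: placing data on a chosen NUMA node with libnuma
         * (link with -lnuma). Illustrative only. */
        #include <numa.h>
        #include <stdlib.h>

        double *alloc_on_node(size_t n, int node)
        {
            if (numa_available() < 0)            /* no NUMA support: fall back */
                return malloc(n * sizeof(double));

            /* Pages land on `node`; scheduling the threads that touch
             * them on the same node keeps coherence traffic local.
             * Release with numa_free(), not free(). */
            return numa_alloc_onnode(n * sizeof(double), node);
        }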
  • Thread-Level Parallelism – Part 1
    Chapter 5: Thread-Level Parallelism – Part 1.
    Introduction. What is a parallel or multiprocessor system? Multiple processor units working together to solve the same problem; the key architectural issue is the communication model. Why parallel architectures? Absolute performance, plus technology and architecture trends: the end of Dennard scaling, the ILP wall, and Moore's law lead to multicore chips, and connecting multicores together gives even more parallelism. Other topics: Flynn classification, communication models, centralized and distributed shared-memory architectures, parallel programming, synchronization, and memory consistency models.
    Performance potential. Amdahl's law is pessimistic. Let s be the serial part and p the part that can be parallelized n ways (s + p = 1). Serial: SSPPPPPP (8 units). On 6 processors: SS, then all six P's in parallel, for 3 units, so speedup = 8/3 ≈ 2.67. In general, T(n) = s + p/n; as n → ∞, T(n) → s, so speedup → 1/s. Pessimistic.
    Performance potential (cont.). Gustafson's corollary: Amdahl's law holds if we run the same problem size on larger machines, but in practice we run larger problems and "wait" the same time. Assume that for larger problem sizes the serial time stays fixed (at s) and the parallel time is proportional to problem size (the truth is more complicated). Old serial: SSPPPPPP. On 6 processors we run a 6x-larger parallel part in the same time; the hypothetical serial machine would need SS plus 36 P's, so scaled speedup = (8 + 5·6)/8 = 4.75. Here T'(n) = s + n·p, and T'(n) → ∞ as n → ∞. How does your algorithm "scale up"?
    Flynn classification: Single-Instruction Single-Data …
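    Restated compactly in LaTeX (standard formulations of the two laws, with the excerpt's 2-serial/6-parallel example filled in):

        % requires amsmath
        % Amdahl: fixed problem size; serial fraction s, parallel fraction p = 1 - s
        \[
          T(n) = s + \frac{p}{n}, \qquad
          \mathrm{Speedup}(n) = \frac{1}{s + p/n} \longrightarrow \frac{1}{s}
          \quad (n \to \infty)
        \]
        % The excerpt's example: 2 serial + 6 parallel units on 6 processors
        \[
          \mathrm{Speedup} = \frac{2 + 6}{2 + 6/6} = \frac{8}{3} \approx 2.67
        \]
        % Gustafson: problem grows with n; serial time fixed, parallel part scales
        \[
          T'(n) = s + n\,p, \qquad
          \text{scaled speedup} = \frac{s + n\,p}{s + p}
          = \frac{2 + 6 \cdot 6}{8} = 4.75 \;\text{in the example}
        \]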
  • AMD APP SDK V3.0 Beta
    AMD APP SDK v3.0 Beta FAQ – General Questions.
    1. Do I need to use additional software with the SDK? For information about the additional software to be used with the AMD APP SDK, see the AMD APP SDK Getting Started Guide. Also, we recommend using the debugging, profiling, and analysis tools contained in the AMD CodeXL heterogeneous compute tools suite.
    2. Which versions of the OpenCL™ standard does this SDK support? AMD APP SDK version 3.0 Beta supports the development of applications using the OpenCL™ Specification version 2.0.
    3. Will applications developed to execute on OpenCL™ 1.2 still operate in an OpenCL™ 2.0 environment? OpenCL™ is designed to be backwards compatible. The OpenCL™ 2.0 runtime delivered with the AMD Catalyst drivers runs any OpenCL™ 1.2-compliant application. However, an OpenCL™ 2.0-compliant application will not execute on an OpenCL™ 1.2 runtime if APIs only supported by OpenCL™ 2.0 are used.
    4. Does AMD provide any additional OpenCL™ samples, other than those contained within the SDK? The most recent versions of all of the samples contained within the SDK are also available for individual download from the developer.amd.com/appsdk “Samples & Demos” page. This page also contains additional samples that either were too large to include in the SDK, or which have been developed since the most recent SDK release. Check the AMD APP SDK web page for new, updated, or large samples.
    5. How often can I expect to get AMD APP SDK updates? Developers can expect that the AMD APP SDK may be updated two to three times a year.
  • AMD APP SDK Developer Release Notes
    AMD APP SDK v3.0 Beta Developer Release Notes – What's New in AMD APP SDK v3.0 Beta.
    New features in AMD APP SDK v3.0 Beta:
    - OpenCL 2.0: 20 samples that demonstrate various features of OpenCL 2.0, such as Shared Virtual Memory, Platform Atomics, Device-side Enqueue, Pipes, new workgroup built-in functions, Program Scope Variables, Generic Address Space, and OpenCL 2.0 image features. For the complete list of the samples, see the AMD APP SDK Samples Release Notes (AMD_APP_SDK_Release_Notes_Samples.pdf) document.
    - Support for the Bolt 1.3 library, plus 6 additional samples that demonstrate various APIs in the Bolt C++ AMP library.
    - One new sample that demonstrates the consumption of a SPIR 1.2 binary.
    - Enhancements and bug fixes in several samples.
    - A lightweight installer that supports customized online installation and the ability to download the full installer for install and distribution.
    New features for AMD CodeXL version 1.6: GPU Profiler support for OpenCL 2.0, API-level debugging for OpenCL 2.0, and Power Profiling. For information about CodeXL and about how to use CodeXL to gather performance data about your OpenCL application, such as application traces and timeline views, see the CodeXL home page.
    Important note: OpenCL 2.0 runtime support is limited to 64-bit applications running on 64-bit Windows and Linux operating systems only.
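    As a taste of one OpenCL 2.0 feature in that list, here is a minimal sketch of coarse-grained Shared Virtual Memory. It is not one of the SDK's 20 samples, just an illustration using the standard 2.0 entry points; error handling is omitted.

        /* Sketch: coarse-grained SVM in OpenCL 2.0. A buffer from
         * clSVMAlloc is passed to a kernel by pointer and, after
         * mapping, read/written directly by the host. */
        #include <CL/cl.h>

        int use_svm(cl_context ctx, cl_command_queue q, cl_kernel kern, size_t n)
        {
            float *data = clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                     n * sizeof(float), 0);
            if (!data) return -1;

            /* Host writes: map, fill, unmap (coarse-grained sync). */
            clEnqueueSVMMap(q, CL_TRUE, CL_MAP_WRITE, data,
                            n * sizeof(float), 0, NULL, NULL);
            for (size_t i = 0; i < n; i++) data[i] = (float)i;
            clEnqueueSVMUnmap(q, data, 0, NULL, NULL);

            clSetKernelArgSVMPointer(kern, 0, data);  /* pointer, not cl_mem */
            size_t gsz = n;
            clEnqueueNDRangeKernel(q, kern, 1, NULL, &gsz, NULL, 0, NULL, NULL);
            clFinish(q);

            clSVMFree(ctx, data);
            return 0;
        }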
  • Efficient Synchronization Mechanisms for Scalable GPU Architectures
    Efficient Synchronization Mechanisms for Scalable GPU Architectures, by Xiaowei Ren. M.Sc., Xi'an Jiaotong University, 2015; B.Sc., Xi'an Jiaotong University, 2012. A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Faculty of Graduate and Postdoctoral Studies (Electrical and Computer Engineering), The University of British Columbia (Vancouver), October 2020. © Xiaowei Ren, 2020.
    Examining committee: Mieszko Lis, Electrical and Computer Engineering (Supervisor); Steve Wilton, Electrical and Computer Engineering (Supervisory Committee Member); Konrad Walus, Electrical and Computer Engineering (University Examiner); Ivan Beschastnikh, Computer Science (University Examiner); Vijay Nagarajan, School of Informatics, University of Edinburgh (External Examiner); Tor Aamodt, Electrical and Computer Engineering (Supervisory Committee Member).
    Abstract: The Graphics Processing Unit (GPU) has become a mainstream computing platform for a wide range of applications. Unlike latency-critical Central Processing Units (CPUs), throughput-oriented GPUs provide high performance by exploiting massive application parallelism.
  • Press Release
    PRESS RELEASE (cleared Market Wire at 8:00 a.m. ET, Tuesday, December 4, 2012). Contact: Travis Williams, AMD Public Relations, (512) 602-4863, [email protected].
    AMD Paves Ease-of-Programming Path to Heterogeneous System Architecture with New APP SDK 2.8 and Unified Developer Tool Suite – New Array of Developer Tool Kits, Suites and Libraries Makes Heterogeneous Compute Programming More Accessible on AMD Platforms.
    SUNNYVALE, Calif. — Dec. 4, 2012 — AMD (NYSE: AMD) today announced availability of the AMD APP SDK 2.8 and the AMD CodeXL unified tool suite to provide developers the tools and resources needed to accelerate applications with AMD accelerated processing units (APUs) and graphics processing units (GPUs). The APP SDK 2.8 and CodeXL tool suite provide access to code samples, white papers, libraries and tools to leverage the processing power of heterogeneous compute with OpenCL™, C++, DirectCompute and more. “With CodeXL and APP SDK 2.8, our highest performing SDK to date which leaps past the competition in performance on standard benchmarks1, AMD continues to empower developers with the resources they need for greater performance and power-efficient applications,” said Manju Hegde, corporate vice president, Heterogeneous Applications and Developer Solutions, AMD.
  • CIS 501 Computer Architecture This Unit: Shared Memory
    CIS 501 Computer Architecture, Unit 9: Multicore (Shared Memory Multiprocessors). Slides originally developed by Amir Roth with contributions by Milo Martin at University of Pennsylvania, with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.
    This unit: shared memory multiprocessors – thread-level parallelism (TLP); shared memory model; multiplexed uniprocessor; hardware multithreading; multiprocessing; synchronization; lock implementation; locking gotchas; cache coherence; bus-based protocols; directory protocols; memory consistency models.
    Readings: textbook (MA:FSPTCM), Sections 7.0, 7.1.3, 7.2-7.4, and Section 8.2.
    Beyond implicit parallelism. Consider "daxpy":

        void daxpy(double *x, double *y, double *z, double a) {
            for (int i = 0; i < SIZE; i++)
                z[i] = a * x[i] + y[i];
        }

    There is lots of instruction-level parallelism (ILP) here. Great! But how much can we really exploit? 4 wide? 8 wide? There are limits to (efficient) super-scalar execution. But if SIZE is 10,000, the loop has 10,000-way parallelism! How do we exploit it?
    Explicit parallelism / multiplying performance. A single processor can only be so fast: limited clock frequency, limited instruction-level parallelism, limited cache hierarchy. Break it …
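    The slides' closing question, how to exploit the loop's 10,000-way parallelism, is conventionally answered with thread-level parallelism. A minimal sketch using OpenMP, which is an assumption for illustration (the unit's own answer is the multicore hardware it goes on to describe):

        /* Sketch: exploiting daxpy's loop-level parallelism with
         * threads. Compile with -fopenmp. */
        #include <stddef.h>

        void daxpy_parallel(double *x, double *y, double *z,
                            double a, size_t size)
        {
            #pragma omp parallel for   /* each thread takes a chunk of i */
            for (size_t i = 0; i < size; i++)
                z[i] = a * x[i] + y[i];
        }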
  • Shared-Memory Multiprocessors Gates & Transistors
    CIS 501, Introduction to Computer Architecture, Unit 11: Shared-Memory Multiprocessors. This unit covers three issues: cache coherence, synchronization, and memory consistency; and two cache coherence approaches: "snooping" (SMPs, < 16 processors) and "directory"/scalable (lots of processors).
    Readings: H+P, Chapter 6.
    Thread-level parallelism (TLP): a collection of asynchronous tasks, not started and stopped together, with data shared loosely and dynamically. Example: a database/web server (each query is a thread). Running example:

        struct acct_t { int bal; };
        shared struct acct_t accts[MAX_ACCT];
        int id, amt;
        if (accts[id].bal >= amt) {
            accts[id].bal -= amt;
            spew_cash();
        }

    and its compiled form:

        0: addi r1,accts,r3
        1: ld 0(r3),r4
        2: blt r4,r2,6
        3: sub r4,r2,r4
        4: st r4,0(r3)
        5: call spew_cash

    accts is shared and can't be register allocated even if it were scalar; id and amt are private variables, register allocated to r1 and r2. Focus on this.
    Shared memory: multiple execution contexts sharing a single address space, running multiple programs (MIMD) or, more frequently, multiple copies of one program (SPMD). Communication is implicit (automatic) via loads and stores. Plus: simple software, no need for messages, communication happens naturally (maybe too naturally), and it supports irregular, dynamic communication patterns, both DLP and TLP. Minus: complex hardware, which must create a uniform view of memory, with several aspects to this as we will see.
    Shared-memory multiprocessors: paired vs. …
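    The point of the example is that the load at instruction 1 and the store at instruction 4 form a read-modify-write that races under TLP. A minimal sketch of the conventional fix, a per-account lock, using pthreads as an assumption (the unit itself covers how such locks are implemented from hardware primitives):

        /* Sketch: making the debit atomic with a per-account lock.
         * MAX_ACCT's value is assumed; spew_cash() is from the slide.
         * Each lock must be initialized, e.g. with pthread_mutex_init(). */
        #include <pthread.h>

        #define MAX_ACCT 1024          /* assumed value */
        void spew_cash(void);

        struct acct_t {
            int bal;
            pthread_mutex_t lock;
        };
        struct acct_t accts[MAX_ACCT]; /* shared across threads */

        void debit(int id, int amt)
        {
            pthread_mutex_lock(&accts[id].lock);
            if (accts[id].bal >= amt) {   /* test and update are now
                                             one critical section */
                accts[id].bal -= amt;
                spew_cash();
            }
            pthread_mutex_unlock(&accts[id].lock);
        }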