Breadth First Search Vectorization on the Intel Xeon Phi

Total Page:16

File Type:pdf, Size:1020Kb

Breadth First Search Vectorization on the Intel Xeon Phi Breadth First Search Vectorization on the Intel Xeon Phi Mireya Paredes Graham Riley Mikel Luján University of Manchester University of Manchester University of Manchester Oxford Road, M13 9PL Oxford Road, M13 9PL Oxford Road, M13 9PL Manchester, UK Manchester, UK Manchester, UK [email protected] [email protected] [email protected] ABSTRACT Today, scientific experiments in many domains, and large Breadth First Search (BFS) is a building block for graph organizations can generate huge amounts of data, e.g. social algorithms and has recently been used for large scale anal- networks, health-care records and bio-informatics applica- ysis of information in a variety of applications including so- tions [8]. Graphs seem to be a good match for important cial networks, graph databases and web searching. Due to and large dataset analysis as they can abstract real world its importance, a number of different parallel programming networks. To process large graphs, different techniques have models and architectures have been exploited to optimize been applied, including parallel programming. The main the BFS. However, due to the irregular memory access pat- challenge in the parallelization of graph processing is that terns and the unstructured nature of the large graphs, its large graphs are often unstructured and highly irregular and efficient parallelization is a challenge. The Xeon Phi is a this limits their scalability and performance when executing massively parallel architecture available as an off-the-shelf on off-the-shelf systems [18]. accelerator, which includes a powerful 512 bit vector unit The Breadth First Search (BFS) is a building block of with optimized scatter and gather functions. Given its po- graph algorithms. Despite its simplicity, its parallelization tential benefits, work related to graph traversing on this has been a real challenge but it remains a good candidate architecture is an active area of research. for acceleration. To date, different accelerators have been We present a set of experiments in which we explore ar- targeted to improve the performance of this algorithm such chitectural features of the Xeon Phi and how best to exploit as GPGPUs [11] and FPGAs [28]. The Intel Xeon Phi, also them in a top-down BFS algorithm but the techniques can known as MIC (Intel's Many Integrated Core Architecture), be applied to the current state-of-the-art hybrid, top-down is a massively parallel off-the-shelf system consisting of up plus bottom-up, algorithms. to 60 cores with 4-way simultaneous multi-threading (SMT) We focus on the exploitation of the vector unit by devel- for a maximum of 240 logical cores [21]. As well as this oping an improved highly vectorized OpenMP parallel al- thread-level parallelism, each core has a 512 bit vector unit gorithm, using vector intrinsics, and understanding the use allowing data-level parallelism to be extracted by using Sin- of data alignment and prefetching. In addition, we inves- gle Instruction Multiple Data (SIMD) programming. We tigate the impact of hyperthreading and thread affinity on believe that by exploiting both of these forms of parallelism, performance, a topic that appears under researched in the the Xeon Phi is an interesting platform for exploring parallel literature. As a result, we achieve what we believe is the implementations of graph algorithms. fastest published top-down BFS algorithm on the version of The goal of this study is to demonstrate through experi- Xeon Phi used in our experiments. The vectorized BFS top- ments and analysis the impact of using the Xeon Phi archi- down source code presented in this paper can be available tecture to accelerate BFS in a single Xeon Phi device, as a on request as free-to-use software. step towards the multi-device solutions that will be needed to tackle very large graph-based datasets. As a starting point, we took the description of the top-down BFS algo- arXiv:1604.02844v1 [cs.DC] 11 Apr 2016 Keywords rithm in [9] which is summarised in Section 3. Although this graph algorithms; BFS; large scale; irregular; graph travers- offers a good starting point on the Xeon Phi, there are still ing; hyperthreading; thread affinity; prefetching some architectural features that need to be well understood in order to be exploited. The contribution of this paper is twofold; first, we present the results of a series of exper- 1. INTRODUCTION iments demonstrating the benefit of successfully exploiting architectural features of the Xeon Phi focusing on the vector unit (programmed via vector intrinsics), with data align- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed ment and prefetching. In addition, we present the results for profit or commercial advantage and that copies bear this notice and the full cita- of investigations into the impact of hyperthreading (Intel's tion on the first page. Copyrights for components of this work owned by others than term for SMT) and thread affinity when the Phi is under- ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- publish, to post on servers or to redistribute to lists, requires prior specific permission populated with threads. and/or a fee. Request permissions from [email protected]. The structure of the paper is as follows. The Xeon Phi WOODSTOCK ’97 El Paso, Texas USA architecture is presented in Section 2. We present the pro- c 2016 ACM. ISBN 123-4567-24-567/08/06. $15.00 cedure of vectorizing the BFS algorithm on the Xeon Phi, DOI: 10.475/123 4 Core Core the compiler to vectorize certain parts in the code. Manual vectorization can be set by using SIMD pragma directives L2 L2 supported in the compiler. The compiler also supports a wide range of intrinsics which allow a programmer low-level control of the vector unit. TD TD 3. BREADTH FIRST SEARCH TD TD GDDR 5 The Breadth First Search (BFS) algorithm is one of the GDDR 5 building blocks for graph analysis algorithms including be- tweenness centrality, shortest path and connected compo- L2 L2 nents [4]. Given a graph G and a starting vertex s, the BFS systematically explores all the edges E of G to trace all the Core Core vertices V that can be reached from s [7], where jV j is the number of vertices and jEj is the number of edges. The Figure 1: The Intel R Xeon Phi Microarchitecture. output is a BFS spanning tree (bfs tree of all the vertices encountered from s. starting with the serial BFS algorithm in Section 3.1, fol- The simplest sequential BFS algorithm consists in having lowed by an initial parallel version in Section 3.2, which a queue that contains a list of vertices waiting to be pro- we call non-simd version. Next, we discuss the simd ver- cessed. Enqueue and dequeue operations on the queue can sion that exploits the vector unit on the Xeon Phi in Sec- be implemented in constant time Θ(1) [16]. Thus, the run- tion 4. We then present performance comparisons between ning time of this serial BFS algorithm is O(V + E). Despite the non-simd and simd versions and explore the impact of the queue being simple and efficient, during parallelization, parallelism at two levels of granularity (OpenMP threading it has some drawbacks. The queue can be a bottleneck since and the vector unit), achieving better results for the native it implies a vertex processing order because the dequeue op- BFS top-down algorithm on the Xeon Phi than those previ- eration typically takes out the vertex that has been added ously presented in [10] and [24]. In Section 4 we explore the first. To address this problem vertices can be partitioned impact of data prefetching and discuss the effect of thread into layers. A layer consists of a set of all vertices with the 1 affinity mapping, leading us to key optimizations, and moti- same distance from the source vertex. Processing vertices vating future work on the possibility of using helper threads by layers avoids the order restriction imposed by the queue, to improve performance. The experimental setup and anal- allowing vertices to be explored in any order as long as they ysis of our results are shown in Sections 5 and 6. Related are in the same layer. However, each layer has to be pro- work is briefly discussed in Section 7 and conclusions and cessed in sequence; that is all vertices with distance k, (layer future work are discussed in Section 8. Lk) are processed before those in layer Lk+1. Traversing the graph from the vertices in the top layer 2. THE XEON PHI ARCHITECTURE down to their children in the bottom layer is a conventional approach known as top-down whereas traversing from the R TM The Intel Xeon Phi coprocessor used in this work is vertices in the bottom layer to find their parent in the top composed of 60, 4 way-SMT Pentium-based cores [21] and a layer is known as bottom-up approach. The bottom-up con- main memory of 8 GB. Each core contains a powerful 512- cept was introduced by [3] in a hybrid approach that explores bits vector processing unit and a cache memory divided into the graph in both directions, first traversing the graph with L1 (32KB) and L2 (512KB) kept fully coherent by a global- the top-down and then swaping to the bottom-up approach distributed tag directory (TD), coordinated by the cache co- in the middle layers to finally swap back to the top-down.
Recommended publications
  • Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline
    Introduction to multi-threading and vectorization Matti Kortelainen LArSoft Workshop 2019 25 June 2019 Outline Broad introductory overview: • Why multithread? • What is a thread? • Some threading models – std::thread – OpenMP (fork-join) – Intel Threading Building Blocks (TBB) (tasks) • Race condition, critical region, mutual exclusion, deadlock • Vectorization (SIMD) 2 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading Image courtesy of K. Rupp 3 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading • One process on a node: speedups from parallelizing parts of the programs – Any problem can get speedup if the threads can cooperate on • same core (sharing L1 cache) • L2 cache (may be shared among small number of cores) • Fully loaded node: save memory and other resources – Threads can share objects -> N threads can use significantly less memory than N processes • If smallest chunk of data is so big that only one fits in memory at a time, is there any other option? 4 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization What is a (software) thread? (in POSIX/Linux) • “Smallest sequence of programmed instructions that can be managed independently by a scheduler” [Wikipedia] • A thread has its own – Program counter – Registers – Stack – Thread-local memory (better to avoid in general) • Threads of a process share everything else, e.g. – Program code, constants – Heap memory – Network connections – File handles
    [Show full text]
  • CSC 553 Operating Systems Multiple Processes
    CSC 553 Operating Systems Lecture 4 - Concurrency: Mutual Exclusion and Synchronization Multiple Processes • Operating System design is concerned with the management of processes and threads: • Multiprogramming • Multiprocessing • Distributed Processing Concurrency Arises in Three Different Contexts: • Multiple Applications – invented to allow processing time to be shared among active applications • Structured Applications – extension of modular design and structured programming • Operating System Structure – OS themselves implemented as a set of processes or threads Key Terms Related to Concurrency Principles of Concurrency • Interleaving and overlapping • can be viewed as examples of concurrent processing • both present the same problems • Uniprocessor – the relative speed of execution of processes cannot be predicted • depends on activities of other processes • the way the OS handles interrupts • scheduling policies of the OS Difficulties of Concurrency • Sharing of global resources • Difficult for the OS to manage the allocation of resources optimally • Difficult to locate programming errors as results are not deterministic and reproducible Race Condition • Occurs when multiple processes or threads read and write data items • The final result depends on the order of execution – the “loser” of the race is the process that updates last and will determine the final value of the variable Operating System Concerns • Design and management issues raised by the existence of concurrency: • The OS must: – be able to keep track of various processes
    [Show full text]
  • Towards Gradually Typed Capabilities in the Pi-Calculus
    Towards Gradually Typed Capabilities in the Pi-Calculus Matteo Cimini University of Massachusetts Lowell Lowell, MA, USA matteo [email protected] Gradual typing is an approach to integrating static and dynamic typing within the same language, and puts the programmer in control of which regions of code are type checked at compile-time and which are type checked at run-time. In this paper, we focus on the π-calculus equipped with types for the modeling of input-output capabilities of channels. We present our preliminary work towards a gradually typed version of this calculus. We present a type system, a cast insertion procedure that automatically inserts run-time checks, and an operational semantics of a π-calculus that handles casts on channels. Although we do not claim any theoretical results on our formulations, we demonstrate our calculus with an example and discuss our future plans. 1 Introduction This paper presents preliminary work on integrating dynamically typed features into a typed π-calculus. Pierce and Sangiorgi have defined a typed π-calculus in which channels can be assigned types that express input and output capabilities [16]. Channels can be declared to be input-only, output-only, or that can be used for both input and output operations. As a consequence, this type system can detect unintended or malicious channel misuses at compile-time. The static typing nature of the type system prevents the modeling of scenarios in which the input- output discipline of a channel is discovered at run-time. In these scenarios, such a discipline must be enforced with run-time type checks in the style of dynamic typing.
    [Show full text]
  • Enhanced Debugging for Data Races in Parallel
    ENHANCED DEBUGGING FOR DATA RACES IN PARALLEL PROGRAMS USING OPENMP A Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for the Degree Master of Science By Marcus W. Hervey December 2016 ENHANCED DEBUGGING FOR DATA RACES IN PARALLEL PROGRAMS USING OPENMP Marcus W. Hervey APPROVED: Dr. Edgar Gabriel, Chairman Dept. of Computer Science Dr. Shishir Shah Dept. of Computer Science Dr. Barbara Chapman Dept. of Computer Science, Stony Brook University Dean, College of Natural Sciences and Mathematics ii iii Acknowledgements A special thank you to Dr. Barbara Chapman, Ph.D., Dr. Edgar Gabriel, Ph.D., Dr. Shishir Shah, Ph.D., Dr. Lei Huang, Ph.D., Dr. Chunhua Liao, Ph.D., Dr. Laksano Adhianto, Ph.D., and Dr. Oscar Hernandez, Ph.D. for their guidance and support throughout this endeavor. I would also like to thank Van Bui, Deepak Eachempati, James LaGrone, and Cody Addison for their friendship and teamwork, without which this endeavor would have never been accomplished. I dedicate this thesis to my parents (Billy and Olivia Hervey) who have always chal- lenged me to be my best, and to my wife and kids for their love and sacrifice throughout this process. ”Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time.” – Thomas Alva Edison iv ENHANCED DEBUGGING FOR DATA RACES IN PARALLEL PROGRAMS USING OPENMP An Abstract of a Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for the Degree Master of Science By Marcus W.
    [Show full text]
  • Reconciling Real and Stochastic Time: the Need for Probabilistic Refinement
    DOI 10.1007/s00165-012-0230-y The Author(s) © 2012. This article is published with open access at Springerlink.com Formal Aspects Formal Aspects of Computing (2012) 24: 497–518 of Computing Reconciling real and stochastic time: the need for probabilistic refinement J. Markovski1,P.R.D’Argenio2,J.C.M.Baeten1,3 and E. P. de Vink1,3 1 Eindhoven University of Technology, Eindhoven, The Netherlands. E-mail: [email protected] 2 FaMAF, Universidad Nacional de Cordoba,´ Cordoba,´ Argentina 3 Centrum Wiskunde & Informatica, Amsterdam, The Netherlands Abstract. We conservatively extend an ACP-style discrete-time process theory with discrete stochastic delays. The semantics of the timed delays relies on time additivity and time determinism, which are properties that enable us to merge subsequent timed delays and to impose their synchronous expiration. Stochastic delays, however, interact with respect to a so-called race condition that determines the set of delays that expire first, which is guided by an (implicit) probabilistic choice. The race condition precludes the property of time additivity as the merger of stochastic delays alters this probabilistic behavior. To this end, we resolve the race condition using conditionally-distributed unit delays. We give a sound and ground-complete axiomatization of the process the- ory comprising the standard set of ACP-style operators. In this generalized setting, the alternative composition is no longer associative, so we have to resort to special normal forms that explicitly resolve the underlying race condition. Our treatment succeeds in the initial challenge to conservatively extend standard time with stochastic time. However, the ‘dissection’ of the stochastic delays to conditionally-distributed unit delays comes at a price, as we can no longer relate the resolved race condition to the original stochastic delays.
    [Show full text]
  • Lecture 18 18.1 Distributed and Concurrent Systems 18.2 Race
    CMPSCI 691ST Systems Fall 2011 Lecture 18 Lecturer: Emery Berger Scribe: Sean Barker 18.1 Distributed and Concurrent Systems In this lecture we discussed two broad issues in concurrent systems: clocks and race detection. The first paper we read introduced the Lamport clock, which is a scalar providing a logical clock value for each process. Every time a process sends a message, it increments its clock value, and every time a process receives a message, it sets its clock value to the max of the current value and the message value. Clock values are used to provide a `happens-before' relationship, which provides a consistent partial ordering of events across processes. Lamport clocks were later extended to vector clocks, in which each of n processes maintains an array (or vector) of n clock values (one for each process) rather than just a single value as in Lamport clocks. Vector clocks are much more precise than Lamport clocks, since they can check whether intermediate events from anyone have occurred (by looking at the max across the entire vector), and have essentially replaced Lamport clocks today. Moreover, since distributed systems are already sending many messages, the cost of maintaining vector clocks is generally not a big deal. Clock synchronization is still a hard problem, however, because physical clock values will always have some amount of drift. The FastTrack paper we read leverages vector clocks to do race detection. 18.2 Race Detection Broadly speaking, a race condition occurs when at least two threads are accessing a variable, and one of them writes the value while another is reading.
    [Show full text]
  • Rapid Prototyping of High-Performance Concurrent Java Applications
    Rapid Prototyping of High-Performance Concurrent Java Applications K.J.R. Powell Master of Science School of Computer Science School of Informatics University of Edinburgh 2002 (Graduation date:December 2002) Abstract The design and implementation of concurrent applications is more challenging than that of sequential applications. The aim of this project is to address this challenge by producing an application which can generate skeleton Java systems from a high-level PEPA modelling language description. By automating the process of translating the design into a Java skeleton system, the result will maintain the performance and be- havioural characteristics of the model, providing a sound framework for completing the concurrent application. This method accelerates the process of initial implementation whilst ensuring the system characterisics remain true to the high-level model. i Acknowledgements Many thanks to my supervisor Stephen Gilmore for his excellent guidance and ideas. I’d also like to express my gratitude to L.L. for his assistance above and beyond the call of duty in the field of editing. ii Declaration I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified. (K.J.R. Powell) iii Table of Contents 1 Introduction 1 1.1 Principal Goal . 1 1.2 PEPA and Java . 2 1.3 Project Objective . 2 2 Performance Evaluation Process Algebra 4 2.1 Overview of PEPA and its role in design . 4 2.2 The Syntax of PEPA .
    [Show full text]
  • Towards Systematic Black-Box Testing for Exploitable Race Conditions in Web Apps
    1 Faculty of Electrical Engineering, Mathematics & Computer Science Towards Systematic Black-Box Testing for Exploitable Race Conditions in Web Apps Rob J. van Emous [email protected] Master Thesis June 2019 Supervisors: prof. dr. M. Huisman dr. ing. E. Tews (Computest) M.Sc. D. Keuper Formal Methods and Tools Group Faculty of Electrical Engineering, Mathematics and Computer Science University of Twente P.O. Box 217 7500 AE Enschede The Netherlands ABSTRACT As web applications get more complicated and more integrated into our daily lives, securing them against known cyber attacks is of great importance. Performing se- curity tests is the primary way to discover the current risks and act accordingly. In order to perform these test in a systematic way, testers create and use numer- ous vulnerability lists and testing guidelines. Often, dedicated security institutions like Escal Institute of Advanced Technologies (SANS), MITRE, Certified Secure, and the Open Web Application Security Project (OWASP) create these guidelines. These lists are not meant to be exhaustive, but as the introduction in the Common Weakness Enumeration (CWE) of MITRE and SANS by Martin et al. (2011) puts it: they "(..) evaluated each weakness based on prevalence, importance, and the likelihood of exploit". This seems like a reasonable way of thinking, but as we have shown in this research, it can also lead to oversight of essential, but stealthy security issues. In this research, we will focus on one such stealthy issue in web apps called a race condition. Race conditions are known for a very long time as research by Abbott et al.
    [Show full text]
  • Iota: a Calculus for Internet of Things Automation
    Iota: A Calculus for Internet of Things Automation Julie L. Newcomb Satish Chandra∗ Jean-Baptiste Jeannin† Paul G. Allen School of Computer Samsung Research America Samsung Research America Science & Engineering Mountain View, CA, USA Mountain View, CA, USA University of Washington schandra@acm:org jb:jeannin@gmail:com Seattle, WA, USA newcombj@cs:washington:edu Cole Schlesinger‡ Manu Sridharan§ Samsung Research America Samsung Research America Mountain View, CA, USA Mountain View, CA, USA cole@schlesinger:tech manu@sridharan:net Abstract analyses from the software engineering literature, and ex- Programmatically controllable home devices are proliferat- pressive by encoding sixteen programs from these home ing, ranging from lights, locks, and motion sensors to smart automation platforms. Along the way, we highlight how the refrigerators, televisions, and cameras, giving end users un- design of the Iota semantics rules out subtle classes of bugs. precedented control over their environment. New domain- CCS Concepts • Software and its engineering → Gen- specific languages are emerging to supplant general purpose eral programming languages; • Social and professional programming platforms as a means for end users to con- topics → History of programming languages; figure home automation. These languages, based on event- condition-action (ECA) rules, have an appealing simplicity. Keywords domain-specific languages, language design and But programmatic control allows users write to programs implementation with bugs, introducing the frustrations of software engineer- ACM Reference Format: ing with none of the tool support. The subtle semantics of Julie L. Newcomb, Satish Chandra, Jean-Baptiste Jeannin, Cole the home automation domain—as well as the varying in- Schlesinger, and Manu Sridharan.
    [Show full text]
  • Preview – Interprocess Communication
    Preview – Interprocess Communication Race Condition Mutual Exclusion with Busy Waiting Disabling Interrupts Lock Variables Strict Alternation Peterson’s Solution Test and Set Lock – help from hardware Sleep and Wakeup The Producer-Consumer Problem Race Condition in Producer-Consumer Problem Semaphores The producer-consumer problem using semaphores Mutexes Monitors CS 431 Operating Systems 1 Interprocess Communication Critical section (critical region) – The part of program where the shared memory is accessed. Three issues in InterProcess Communication (IPC) 1. How one process can pass information to another 2. How to make sure two or more processes do not get into the critical region at the same time 3. Proper sequencing when dependencies are present (e.g. A creates outputs, B consumes the outputs) CS 431 Operating Systems 2 Race Condition A situation where two or more processes are reading or writing some shared data and the final result depends on who runs precisely when, is called race condition. Network Printer Spooler Directory CS 431 Operating Systems 3 Race Condition Process B will never Process A tries to send a job to receive any output from spooler, Process A reads the printer !! in = 7, process A times out and goes to “ready” state before updating in = in + 1. CPU switches to Process B. Process B also tries to send a job to spooler. Process B reads in = 7, writes its job name in slot 7, updates i to 8 and then goes to “ready” state. CPU switches to Process A. Process A starts from the place it left off. Process writes its job name in slot 7, update i to 8.
    [Show full text]
  • Comparison of Modeling Approaches to Business Process Performance Evaluation Kelly Rosa Braghetto, João Eduardo Ferreira, Jean-Marc Vincent
    Comparison of Modeling Approaches to Business Process Performance Evaluation Kelly Rosa Braghetto, João Eduardo Ferreira, Jean-Marc Vincent To cite this version: Kelly Rosa Braghetto, João Eduardo Ferreira, Jean-Marc Vincent. Comparison of Modeling Ap- proaches to Business Process Performance Evaluation. [Research Report] RR-7065, INRIA. 2009. inria-00424452 HAL Id: inria-00424452 https://hal.inria.fr/inria-00424452 Submitted on 15 Oct 2009 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE Comparison of Modeling Approaches to Business Process Performance Evaluation Kelly Rosa Braghetto — João Eduardo Ferreira — Jean-Marc Vincent N° 7065 October 2009 apport de recherche ISSN 0249-6399 ISRN INRIA/RR--7065--FR+ENG Comparison of Modeling Approaches to Business Process Performance Evaluation Kelly Rosa Braghetto ∗ † , João Eduardo Ferreira ∗ , Jean-Marc Vincent ‡ Thème : Calcule distribué et applications à très haute performance Équipe-Projet MESCAL Rapport de recherche n° 7065 — October
    [Show full text]
  • Critical Section and Mutual Exclusion
    Part III Synchronization Critical Section and Mutual Exclusion The question of whether computers can think is just like the question of whether submarines can swim 1 Fall 2015 Edsger W. Dijkstra Process Synchronization Topics . Why is synchronization needed? . Race Conditions . Critical Sections . Pure Software Solutions . Hardware Support . Semaphores . Race Conditions, Revisited . Monitors 2 Synchronization Needed! 1/6 int a[3] = { 3, 4, 5}; Process 1 Process 2 a[1] = a[0] + a[1]; a[2] = a[1] + a[2]; a[3] = { 3, ?, ? } Statement level execution interleaving 3 Synchronization Needed! 2/6 int a[3] = { 3, 4, 5}; Process 1 Process 2 a[1] = a[0] + a[1]; a[2] = a[1] + a[2]; .If process 1 updates a[1] first, a[1] is 7, and a[ ]={3,7,5} .Then, process 2 uses the new a[1] to computes a[2], and a[ ]={3,7,12} .If process 2 uses a[1] first, now a[2] is 9, and a[ ]={3,4,9} .Then, process 1 computes a[1], and a[ ]={3,7,9} Results are non-deterministic! 4 Synchronization Needed! 3/6 int Count = 10; Higher-level language statements are not atomic Process 1 Process 2 Count++; Count--; Count = 9, 10 or 11? 5 Synchronization Needed! 4/6 int Count = 10; Process 1 Process 2 LOAD Reg, Count LOAD Reg, Count ADD #1 SUB #1 STORE Reg, Count STORE Reg, Count The problem is that the execution flow may be switched in the middle. Results become non-deterministic! instruction level execution interleaving 6 Synchronization Needed! 5/6 Process 1 Process 2 Inst Reg Memory Inst Reg Memory LOAD 10 10 LOAD 10 10 SUB 9 10 ADD 11 10 STORE 11 11 overwrites the previous value 11 STORE 9 9 Always use instruction level interleaving to show race conditions 7 Synchronization Needed! 6/6 Process 1 Process 2 Inst Reg Memory Inst Reg Memory LOAD 10 10 ADD 11 10 LOAD 10 10 SUB 9 10 STORE 9 9 STORE 11 11 overwrites the previous value 9 Always use instruction level interleaving to show race conditions 8 Race Conditions .
    [Show full text]