A Programmer-Oriented Approach to Safe Concurrency
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline
Introduction to multi-threading and vectorization Matti Kortelainen LArSoft Workshop 2019 25 June 2019 Outline Broad introductory overview: • Why multithread? • What is a thread? • Some threading models – std::thread – OpenMP (fork-join) – Intel Threading Building Blocks (TBB) (tasks) • Race condition, critical region, mutual exclusion, deadlock • Vectorization (SIMD) 2 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading Image courtesy of K. Rupp 3 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization Motivations for multithreading • One process on a node: speedups from parallelizing parts of the programs – Any problem can get speedup if the threads can cooperate on • same core (sharing L1 cache) • L2 cache (may be shared among small number of cores) • Fully loaded node: save memory and other resources – Threads can share objects -> N threads can use significantly less memory than N processes • If smallest chunk of data is so big that only one fits in memory at a time, is there any other option? 4 6/25/19 Matti Kortelainen | Introduction to multi-threading and vectorization What is a (software) thread? (in POSIX/Linux) • “Smallest sequence of programmed instructions that can be managed independently by a scheduler” [Wikipedia] • A thread has its own – Program counter – Registers – Stack – Thread-local memory (better to avoid in general) • Threads of a process share everything else, e.g. – Program code, constants – Heap memory – Network connections – File handles -
Lecture 4 Resource Protection and Thread Safety
Concurrency and Correctness – Resource Protection and TS Lecture 4 Resource Protection and Thread Safety 1 Danilo Piparo – CERN, EP-SFT Concurrency and Correctness – Resource Protection and TS This Lecture The Goals: 1) Understand the problem of contention of resources within a parallel application 2) Become familiar with the design principles and techniques to cope with it 3) Appreciate the advantage of non-blocking techniques The outline: § Threads and data races: synchronisation issues § Useful design principles § Replication, atomics, transactions and locks § Higher level concrete solutions 2 Danilo Piparo – CERN, EP-SFT Concurrency and Correctness – Resource Protection and TS Threads and Data Races: Synchronisation Issues 3 Danilo Piparo – CERN, EP-SFT Concurrency and Correctness – Resource Protection and TS The Problem § Fastest way to share data: access the same memory region § One of the advantages of threads § Parallel memory access: delicate issue - race conditions § I.e. behaviour of the system depends on the sequence of events which are intrinsically asynchronous § Consequences, in order of increasing severity § Catastrophic terminations: segfaults, crashes § Non-reproducible, intermittent bugs § Apparently sane execution but data corruption: e.g. wrong value of a variable or of a result Operative definition: An entity which cannot run w/o issues linked to parallel execution is said to be thread-unsafe (the contrary is thread-safe) 4 Danilo Piparo – CERN, EP-SFT Concurrency and Correctness – Resource Protection and TS To Be Precise: Data Race Standard language rules, §1.10/4 and /21: • Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location. -
Clojure, Given the Pun on Closure, Representing Anything Specific
dynamic, functional programming for the JVM “It (the logo) was designed by my brother, Tom Hickey. “It I wanted to involve c (c#), l (lisp) and j (java). I don't think we ever really discussed the colors Once I came up with Clojure, given the pun on closure, representing anything specific. I always vaguely the available domains and vast emptiness of the thought of them as earth and sky.” - Rich Hickey googlespace, it was an easy decision..” - Rich Hickey Mark Volkmann [email protected] Functional Programming (FP) In the spirit of saying OO is is ... encapsulation, inheritance and polymorphism ... • Pure Functions • produce results that only depend on inputs, not any global state • do not have side effects such as Real applications need some changing global state, file I/O or database updates side effects, but they should be clearly identified and isolated. • First Class Functions • can be held in variables • can be passed to and returned from other functions • Higher Order Functions • functions that do one or both of these: • accept other functions as arguments and execute them zero or more times • return another function 2 ... FP is ... Closures • main use is to pass • special functions that retain access to variables a block of code that were in their scope when the closure was created to a function • Partial Application • ability to create new functions from existing ones that take fewer arguments • Currying • transforming a function of n arguments into a chain of n one argument functions • Continuations ability to save execution state and return to it later think browser • back button 3 .. -
CSC 553 Operating Systems Multiple Processes
CSC 553 Operating Systems Lecture 4 - Concurrency: Mutual Exclusion and Synchronization Multiple Processes • Operating System design is concerned with the management of processes and threads: • Multiprogramming • Multiprocessing • Distributed Processing Concurrency Arises in Three Different Contexts: • Multiple Applications – invented to allow processing time to be shared among active applications • Structured Applications – extension of modular design and structured programming • Operating System Structure – OS themselves implemented as a set of processes or threads Key Terms Related to Concurrency Principles of Concurrency • Interleaving and overlapping • can be viewed as examples of concurrent processing • both present the same problems • Uniprocessor – the relative speed of execution of processes cannot be predicted • depends on activities of other processes • the way the OS handles interrupts • scheduling policies of the OS Difficulties of Concurrency • Sharing of global resources • Difficult for the OS to manage the allocation of resources optimally • Difficult to locate programming errors as results are not deterministic and reproducible Race Condition • Occurs when multiple processes or threads read and write data items • The final result depends on the order of execution – the “loser” of the race is the process that updates last and will determine the final value of the variable Operating System Concerns • Design and management issues raised by the existence of concurrency: • The OS must: – be able to keep track of various processes -
Towards Gradually Typed Capabilities in the Pi-Calculus
Towards Gradually Typed Capabilities in the Pi-Calculus Matteo Cimini University of Massachusetts Lowell Lowell, MA, USA matteo [email protected] Gradual typing is an approach to integrating static and dynamic typing within the same language, and puts the programmer in control of which regions of code are type checked at compile-time and which are type checked at run-time. In this paper, we focus on the π-calculus equipped with types for the modeling of input-output capabilities of channels. We present our preliminary work towards a gradually typed version of this calculus. We present a type system, a cast insertion procedure that automatically inserts run-time checks, and an operational semantics of a π-calculus that handles casts on channels. Although we do not claim any theoretical results on our formulations, we demonstrate our calculus with an example and discuss our future plans. 1 Introduction This paper presents preliminary work on integrating dynamically typed features into a typed π-calculus. Pierce and Sangiorgi have defined a typed π-calculus in which channels can be assigned types that express input and output capabilities [16]. Channels can be declared to be input-only, output-only, or that can be used for both input and output operations. As a consequence, this type system can detect unintended or malicious channel misuses at compile-time. The static typing nature of the type system prevents the modeling of scenarios in which the input- output discipline of a channel is discovered at run-time. In these scenarios, such a discipline must be enforced with run-time type checks in the style of dynamic typing. -
Java Concurrency in Practice
Java Concurrency in practice Chapters: 1,2, 3 & 4 Bjørn Christian Sebak ([email protected]) Karianne Berg ([email protected]) INF329 – Spring 2007 Chapter 1 - Introduction Brief history of concurrency Before OS, a computer executed a single program from start to finnish But running a single program at a time is an inefficient use of computer hardware Therefore all modern OS run multiple programs (in seperate processes) Brief history of concurrency (2) Factors for running multiple processes: Resource utilization: While one program waits for I/O, why not let another program run and avoid wasting CPU cycles? Fairness: Multiple users/programs might have equal claim of the computers resources. Avoid having single large programs „hog“ the machine. Convenience: Often desirable to create smaller programs that perform a single task (and coordinate them), than to have one large program that do ALL the tasks What is a thread? A „lightweight process“ - each process can have many threads Threads allow multiple streams of program flow to coexits in a single process. While a thread share process-wide resources like memory and files with other threads, they all have their own program counter, stack and local variables Benefits of threads 1) Exploiting multiple processors 2) Simplicity of modeling 3) Simplified handling of asynchronous events 4) More responsive user interfaces Benefits of threads (2) Exploiting multiple processors The processor industry is currently focusing on increasing number of cores on a single CPU rather than increasing clock speed. Well-designed programs with multiple threads can execute simultaneously on multiple processors, increasing resource utilization. -
Enhanced Debugging for Data Races in Parallel
ENHANCED DEBUGGING FOR DATA RACES IN PARALLEL PROGRAMS USING OPENMP A Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for the Degree Master of Science By Marcus W. Hervey December 2016 ENHANCED DEBUGGING FOR DATA RACES IN PARALLEL PROGRAMS USING OPENMP Marcus W. Hervey APPROVED: Dr. Edgar Gabriel, Chairman Dept. of Computer Science Dr. Shishir Shah Dept. of Computer Science Dr. Barbara Chapman Dept. of Computer Science, Stony Brook University Dean, College of Natural Sciences and Mathematics ii iii Acknowledgements A special thank you to Dr. Barbara Chapman, Ph.D., Dr. Edgar Gabriel, Ph.D., Dr. Shishir Shah, Ph.D., Dr. Lei Huang, Ph.D., Dr. Chunhua Liao, Ph.D., Dr. Laksano Adhianto, Ph.D., and Dr. Oscar Hernandez, Ph.D. for their guidance and support throughout this endeavor. I would also like to thank Van Bui, Deepak Eachempati, James LaGrone, and Cody Addison for their friendship and teamwork, without which this endeavor would have never been accomplished. I dedicate this thesis to my parents (Billy and Olivia Hervey) who have always chal- lenged me to be my best, and to my wife and kids for their love and sacrifice throughout this process. ”Our greatest weakness lies in giving up. The most certain way to succeed is always to try just one more time.” – Thomas Alva Edison iv ENHANCED DEBUGGING FOR DATA RACES IN PARALLEL PROGRAMS USING OPENMP An Abstract of a Thesis Presented to the Faculty of the Department of Computer Science University of Houston In Partial Fulfillment of the Requirements for the Degree Master of Science By Marcus W. -
Reconciling Real and Stochastic Time: the Need for Probabilistic Refinement
DOI 10.1007/s00165-012-0230-y The Author(s) © 2012. This article is published with open access at Springerlink.com Formal Aspects Formal Aspects of Computing (2012) 24: 497–518 of Computing Reconciling real and stochastic time: the need for probabilistic refinement J. Markovski1,P.R.D’Argenio2,J.C.M.Baeten1,3 and E. P. de Vink1,3 1 Eindhoven University of Technology, Eindhoven, The Netherlands. E-mail: [email protected] 2 FaMAF, Universidad Nacional de Cordoba,´ Cordoba,´ Argentina 3 Centrum Wiskunde & Informatica, Amsterdam, The Netherlands Abstract. We conservatively extend an ACP-style discrete-time process theory with discrete stochastic delays. The semantics of the timed delays relies on time additivity and time determinism, which are properties that enable us to merge subsequent timed delays and to impose their synchronous expiration. Stochastic delays, however, interact with respect to a so-called race condition that determines the set of delays that expire first, which is guided by an (implicit) probabilistic choice. The race condition precludes the property of time additivity as the merger of stochastic delays alters this probabilistic behavior. To this end, we resolve the race condition using conditionally-distributed unit delays. We give a sound and ground-complete axiomatization of the process the- ory comprising the standard set of ACP-style operators. In this generalized setting, the alternative composition is no longer associative, so we have to resort to special normal forms that explicitly resolve the underlying race condition. Our treatment succeeds in the initial challenge to conservatively extend standard time with stochastic time. However, the ‘dissection’ of the stochastic delays to conditionally-distributed unit delays comes at a price, as we can no longer relate the resolved race condition to the original stochastic delays. -
C/C++ Thread Safety Analysis
C/C++ Thread Safety Analysis DeLesley Hutchins Aaron Ballman Dean Sutherland Google Inc. CERT/SEI Email: [email protected] Email: [email protected] Email: [email protected] Abstract—Writing multithreaded programs is hard. Static including MacOS, Linux, and Windows. The analysis is analysis tools can help developers by allowing threading policies currently implemented as a compiler warning. It has been to be formally specified and mechanically checked. They essen- deployed on a large scale at Google; all C++ code at Google is tially provide a static type system for threads, and can detect potential race conditions and deadlocks. now compiled with thread safety analysis enabled by default. This paper describes Clang Thread Safety Analysis, a tool II. OVERVIEW OF THE ANALYSIS which uses annotations to declare and enforce thread safety policies in C and C++ programs. Clang is a production-quality Thread safety analysis works very much like a type system C++ compiler which is available on most platforms, and the for multithreaded programs. It is based on theoretical work analysis can be enabled for any build with a simple warning on race-free type systems [3]. In addition to declaring the flag: −Wthread−safety. The analysis is deployed on a large scale at Google, where type of data ( int , float , etc.), the programmer may optionally it has provided sufficient value in practice to drive widespread declare how access to that data is controlled in a multithreaded voluntary adoption. Contrary to popular belief, the need for environment. annotations has not been a liability, and even confers some Clang thread safety analysis uses annotations to declare benefits with respect to software evolution and maintenance. -
Lecture 18 18.1 Distributed and Concurrent Systems 18.2 Race
CMPSCI 691ST Systems Fall 2011 Lecture 18 Lecturer: Emery Berger Scribe: Sean Barker 18.1 Distributed and Concurrent Systems In this lecture we discussed two broad issues in concurrent systems: clocks and race detection. The first paper we read introduced the Lamport clock, which is a scalar providing a logical clock value for each process. Every time a process sends a message, it increments its clock value, and every time a process receives a message, it sets its clock value to the max of the current value and the message value. Clock values are used to provide a `happens-before' relationship, which provides a consistent partial ordering of events across processes. Lamport clocks were later extended to vector clocks, in which each of n processes maintains an array (or vector) of n clock values (one for each process) rather than just a single value as in Lamport clocks. Vector clocks are much more precise than Lamport clocks, since they can check whether intermediate events from anyone have occurred (by looking at the max across the entire vector), and have essentially replaced Lamport clocks today. Moreover, since distributed systems are already sending many messages, the cost of maintaining vector clocks is generally not a big deal. Clock synchronization is still a hard problem, however, because physical clock values will always have some amount of drift. The FastTrack paper we read leverages vector clocks to do race detection. 18.2 Race Detection Broadly speaking, a race condition occurs when at least two threads are accessing a variable, and one of them writes the value while another is reading. -
Programming Clojure Second Edition
Extracted from: Programming Clojure Second Edition This PDF file contains pages extracted from Programming Clojure, published by the Pragmatic Bookshelf. For more information or to purchase a paperback or PDF copy, please visit http://www.pragprog.com. Note: This extract contains some colored text (particularly in code listing). This is available only in online versions of the books. The printed versions are black and white. Pagination might vary between the online and printer versions; the content is otherwise identical. Copyright © 2010 The Pragmatic Programmers, LLC. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior consent of the publisher. The Pragmatic Bookshelf Dallas, Texas • Raleigh, North Carolina Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and The Pragmatic Programmers, LLC was aware of a trademark claim, the designations have been printed in initial capital letters or in all capitals. The Pragmatic Starter Kit, The Pragmatic Programmer, Pragmatic Programming, Pragmatic Bookshelf, PragProg and the linking g device are trade- marks of The Pragmatic Programmers, LLC. Every precaution was taken in the preparation of this book. However, the publisher assumes no responsibility for errors or omissions, or for damages that may result from the use of information (including program listings) contained herein. Our Pragmatic courses, workshops, and other products can help you and your team create better software and have more fun. -
Breadth First Search Vectorization on the Intel Xeon Phi
Breadth First Search Vectorization on the Intel Xeon Phi Mireya Paredes Graham Riley Mikel Luján University of Manchester University of Manchester University of Manchester Oxford Road, M13 9PL Oxford Road, M13 9PL Oxford Road, M13 9PL Manchester, UK Manchester, UK Manchester, UK [email protected] [email protected] [email protected] ABSTRACT Today, scientific experiments in many domains, and large Breadth First Search (BFS) is a building block for graph organizations can generate huge amounts of data, e.g. social algorithms and has recently been used for large scale anal- networks, health-care records and bio-informatics applica- ysis of information in a variety of applications including so- tions [8]. Graphs seem to be a good match for important cial networks, graph databases and web searching. Due to and large dataset analysis as they can abstract real world its importance, a number of different parallel programming networks. To process large graphs, different techniques have models and architectures have been exploited to optimize been applied, including parallel programming. The main the BFS. However, due to the irregular memory access pat- challenge in the parallelization of graph processing is that terns and the unstructured nature of the large graphs, its large graphs are often unstructured and highly irregular and efficient parallelization is a challenge. The Xeon Phi is a this limits their scalability and performance when executing massively parallel architecture available as an off-the-shelf on off-the-shelf systems [18]. accelerator, which includes a powerful 512 bit vector unit The Breadth First Search (BFS) is a building block of with optimized scatter and gather functions.