Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Other Apis What’S Wrong with Openmp?
Threaded Programming Other APIs What’s wrong with OpenMP? • OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming CPU cycles. – cannot arbitrarily start/stop threads – cannot put threads to sleep and wake them up later • OpenMP is good for programs where each thread is doing (more-or-less) the same thing. • Although OpenMP supports C++, it’s not especially OO friendly – though it is gradually getting better. • OpenMP doesn’t support other popular base languages – e.g. Java, Python What’s wrong with OpenMP? (cont.) Can do this Can do this Can’t do this Threaded programming APIs • Essential features – a way to create threads – a way to wait for a thread to finish its work – a mechanism to support thread private data – some basic synchronisation methods – at least a mutex lock, or atomic operations • Optional features – support for tasks – more synchronisation methods – e.g. condition variables, barriers,... – higher levels of abstraction – e.g. parallel loops, reductions What are the alternatives? • POSIX threads • C++ threads • Intel TBB • Cilk • OpenCL • Java (not an exhaustive list!) POSIX threads • POSIX threads (or Pthreads) is a standard library for shared memory programming without directives. – Part of the ANSI/IEEE 1003.1 standard (1996) • Interface is a C library – no standard Fortran interface – can be used with C++, but not OO friendly • Widely available – even for Windows – typically installed as part of OS – code is pretty portable • Lots of low-level control over behaviour of threads • Lacks a proper memory consistency model Thread forking #include <pthread.h> int pthread_create( pthread_t *thread, const pthread_attr_t *attr, void*(*start_routine, void*), void *arg) • Creates a new thread: – first argument returns a pointer to a thread descriptor. -
Analysis of GPU-Libraries for Rapid Prototyping Database Operations
Analysis of GPU-Libraries for Rapid Prototyping Database Operations A look into library support for database operations Harish Kumar Harihara Subramanian Bala Gurumurthy Gabriel Campero Durand David Broneske Gunter Saake University of Magdeburg Magdeburg, Germany fi[email protected] Abstract—Using GPUs for query processing is still an ongoing Usually, library operators for GPUs are either written by research in the database community due to the increasing hardware experts [12] or are available out of the box by device heterogeneity of GPUs and their capabilities (e.g., their newest vendors [13]. Overall, we found more than 40 libraries for selling point: tensor cores). Hence, many researchers develop optimal operator implementations for a specific device generation GPUs each packing a set of operators commonly used in one involving tedious operator tuning by hand. On the other hand, or more domains. The benefits of those libraries is that they are there is a growing availability of GPU libraries that provide constantly updated and tested to support newer GPU versions optimized operators for manifold applications. However, the and their predefined interfaces offer high portability as well as question arises how mature these libraries are and whether they faster development time compared to handwritten operators. are fit to replace handwritten operator implementations not only w.r.t. implementation effort and portability, but also in terms of This makes them a perfect match for many commercial performance. database systems, which can rely on GPU libraries to imple- In this paper, we investigate various general-purpose libraries ment well performing database operators. Some example for that are both portable and easy to use for arbitrary GPUs such systems are: SQreamDB using Thrust [14], BlazingDB in order to test their production readiness on the example of using cuDF [15], Brytlyt using the Torch library [16]. -
Designing an Ultra Low-Overhead Multithreading Runtime for Nim
Designing an ultra low-overhead multithreading runtime for Nim Mamy Ratsimbazafy Weave [email protected] https://github.com/mratsim/weave Hello! I am Mamy Ratsimbazafy During the day blockchain/Ethereum 2 developer (in Nim) During the night, deep learning and numerical computing developer (in Nim) and data scientist (in Python) You can contact me at [email protected] Github: mratsim Twitter: m_ratsim 2 Where did this talk came from? ◇ 3 years ago: started writing a tensor library in Nim. ◇ 2 threading APIs at the time: OpenMP and simple threadpool ◇ 1 year ago: complete refactoring of the internals 3 Agenda ◇ Understanding the design space ◇ Hardware and software multithreading: definitions and use-cases ◇ Parallel APIs ◇ Sources of overhead and runtime design ◇ Minimum viable runtime plan in a weekend 4 Understanding the 1 design space Concurrency vs parallelism, latency vs throughput Cooperative vs preemptive, IO vs CPU 5 Parallelism is not 6 concurrency Kernel threading 7 models 1:1 Threading 1 application thread -> 1 hardware thread N:1 Threading N application threads -> 1 hardware thread M:N Threading M application threads -> N hardware threads The same distinctions can be done at a multithreaded language or multithreading runtime level. 8 The problem How to schedule M tasks on N hardware threads? Latency vs 9 Throughput - Do we want to do all the work in a minimal amount of time? - Numerical computing - Machine learning - ... - Do we want to be fair? - Clients-server - Video decoding - ... Cooperative vs 10 Preemptive Cooperative multithreading: -
Fiber Context As a first-Class Object
Document number: P0876R6 Date: 2019-06-17 Author: Oliver Kowalke ([email protected]) Nat Goodspeed ([email protected]) Audience: SG1 fiber_context - fibers without scheduler Revision History . .1 abstract . .2 control transfer mechanism . .2 std::fiber_context as a first-class object . .3 encapsulating the stack . .4 invalidation at resumption . .4 problem: avoiding non-const global variables and undefined behaviour . .5 solution: avoiding non-const global variables and undefined behaviour . .6 inject function into suspended fiber . 11 passing data between fibers . 12 termination . 13 exceptions . 16 std::fiber_context as building block for higher-level frameworks . 16 interaction with STL algorithms . 18 possible implementation strategies . 19 fiber switch on architectures with register window . 20 how fast is a fiber switch . 20 interaction with accelerators . 20 multi-threading environment . 21 acknowledgments . 21 API ................................................................ 22 33.7 Cooperative User-Mode Threads . 22 33.7.1 General . 22 33.7.2 Empty vs. Non-Empty . 22 33.7.3 Explicit Fiber vs. Implicit Fiber . 22 33.7.4 Implicit Top-Level Function . 22 33.7.5 Header <experimental/fiber_context> synopsis . 23 33.7.6 Class fiber_context . 23 33.7.7 Function unwind_fiber() . 27 references . 29 Revision History This document supersedes P0876R5. Changes since P0876R5: • std::unwind_exception removed: stack unwinding must be performed by platform facilities. • fiber_context::can_resume_from_any_thread() renamed to can_resume_from_this_thread(). • fiber_context::valid() renamed to empty() with inverted sense. • Material has been added concerning the top-level wrapper logic governing each fiber. The change to unwinding fiber stacks using an anonymous foreign exception not catchable by C++ try / catch blocks is in response to deep discussions in Kona 2019 of the surprisingly numerous problems surfaced by using an ordinary C++ exception for that purpose. -
Fiber-Based Architecture for NFV Cloud Databases
Fiber-based Architecture for NFV Cloud Databases Vaidas Gasiunas, David Dominguez-Sal, Ralph Acker, Aharon Avitzur, Ilan Bronshtein, Rushan Chen, Eli Ginot, Norbert Martinez-Bazan, Michael Muller,¨ Alexander Nozdrin, Weijie Ou, Nir Pachter, Dima Sivov, Eliezer Levy Huawei Technologies, Gauss Database Labs, European Research Institute fvaidas.gasiunas,david.dominguez1,fi[email protected] ABSTRACT to a specific workload, as well as to change this alloca- The telco industry is gradually shifting from using mono- tion on demand. Elasticity is especially attractive to op- lithic software packages deployed on custom hardware to erators who are very conscious about Total Cost of Own- using modular virtualized software functions deployed on ership (TCO). NFV meets databases (DBs) in more than cloudified data centers using commodity hardware. This a single aspect [20, 30]. First, the NFV concepts of modu- transformation is referred to as Network Function Virtual- lar software functions accelerate the separation of function ization (NFV). The scalability of the databases (DBs) un- (business logic) and state (data) in the design and imple- derlying the virtual network functions is the cornerstone for mentation of network functions (NFs). This in turn enables reaping the benefits from the NFV transformation. This independent scalability of the DB component, and under- paper presents an industrial experience of applying shared- lines its importance in providing elastic state repositories nothing techniques in order to achieve the scalability of a DB for various NFs. In principle, the NFV-ready DBs that are in an NFV setup. The special combination of requirements relevant to our discussion are sophisticated main-memory in NFV DBs are not easily met with conventional execution key-value (KV) stores with stringent high availability (HA) models. -
A Thread Performance Comparison: Windows NT and Solaris on a Symmetric Multiprocessor
The following paper was originally published in the Proceedings of the 2nd USENIX Windows NT Symposium Seattle, Washington, August 3–4, 1998 A Thread Performance Comparison: Windows NT and Solaris on A Symmetric Multiprocessor Fabian Zabatta and Kevin Ying Brooklyn College and CUNY Graduate School For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: [email protected] 4. WWW URL:http://www.usenix.org/ A Thread Performance Comparison: Windows NT and Solaris on A Symmetric Multiprocessor Fabian Zabatta and Kevin Ying Logic Based Systems Lab Brooklyn College and CUNY Graduate School Computer Science Department 2900 Bedford Avenue Brooklyn, New York 11210 {fabian, kevin}@sci.brooklyn.cuny.edu Abstract Manufacturers now have the capability to build high performance multiprocessor machines with common CPU CPU CPU CPU cache cache cache cache PC components. This has created a new market of low cost multiprocessor machines. However, these machines are handicapped unless they have an oper- Bus ating system that can take advantage of their under- lying architectures. Presented is a comparison of Shared Memory I/O two such operating systems, Windows NT and So- laris. By focusing on their implementation of Figure 1: A basic symmetric multiprocessor architecture. threads, we show each system's ability to exploit multiprocessors. We report our results and inter- pretations of several experiments that were used to compare the performance of each system. What 1.1 Symmetric Multiprocessing emerges is a discussion on the performance impact of each implementation and its significance on vari- Symmetric multiprocessing (SMP) is the primary ous types of applications. -
Scalable and High Performance MPI Design for Very Large
SCALABLE AND HIGH-PERFORMANCE MPI DESIGN FOR VERY LARGE INFINIBAND CLUSTERS DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University By Sayantan Sur, B. Tech ***** The Ohio State University 2007 Dissertation Committee: Approved by Prof. D. K. Panda, Adviser Prof. P. Sadayappan Adviser Prof. S. Parthasarathy Graduate Program in Computer Science and Engineering c Copyright by Sayantan Sur 2007 ABSTRACT In the past decade, rapid advances have taken place in the field of computer and network design enabling us to connect thousands of computers together to form high-performance clusters. These clusters are used to solve computationally challenging scientific problems. The Message Passing Interface (MPI) is a popular model to write applications for these clusters. There are a vast array of scientific applications which use MPI on clusters. As the applications operate on larger and more complex data, the size of the compute clusters is scaling higher and higher. Thus, in order to enable the best performance to these scientific applications, it is very critical for the design of the MPI libraries be extremely scalable and high-performance. InfiniBand is a cluster interconnect which is based on open-standards and gaining rapid acceptance. This dissertation presents novel designs based on the new features offered by InfiniBand, in order to design scalable and high-performance MPI libraries for large-scale clusters with tens-of-thousands of nodes. Methods developed in this dissertation have been applied towards reduction in overall resource consumption, increased overlap of computa- tion and communication, improved performance of collective operations and finally designing application-level benchmarks to make efficient use of modern networking technology. -
DNA Scaffolds Enable Efficient and Tunable Functionalization of Biomaterials for Immune Cell Modulation
ARTICLES https://doi.org/10.1038/s41565-020-00813-z DNA scaffolds enable efficient and tunable functionalization of biomaterials for immune cell modulation Xiao Huang 1,2, Jasper Z. Williams 2,3, Ryan Chang1, Zhongbo Li1, Cassandra E. Burnett4, Rogelio Hernandez-Lopez2,3, Initha Setiady1, Eric Gai1, David M. Patterson5, Wei Yu2,3, Kole T. Roybal2,4, Wendell A. Lim2,3 ✉ and Tejal A. Desai 1,2 ✉ Biomaterials can improve the safety and presentation of therapeutic agents for effective immunotherapy, and a high level of control over surface functionalization is essential for immune cell modulation. Here, we developed biocompatible immune cell-engaging particles (ICEp) that use synthetic short DNA as scaffolds for efficient and tunable protein loading. To improve the safety of chimeric antigen receptor (CAR) T cell therapies, micrometre-sized ICEp were injected intratumorally to present a priming signal for systemically administered AND-gate CAR-T cells. Locally retained ICEp presenting a high density of priming antigens activated CAR T cells, driving local tumour clearance while sparing uninjected tumours in immunodeficient mice. The ratiometric control of costimulatory ligands (anti-CD3 and anti-CD28 antibodies) and the surface presentation of a cytokine (IL-2) on ICEp were shown to substantially impact human primary T cell activation phenotypes. This modular and versatile bio- material functionalization platform can provide new opportunities for immunotherapies. mmune cell therapies have shown great potential for cancer the specific antibody27, and therefore a high and controllable den- treatment, but still face major challenges of efficacy and safety sity of surface biomolecules could allow for precise control of ligand for therapeutic applications1–3, for example, chimeric antigen binding and cell modulation. -
Parallel Computing a Key to Performance
Parallel Computing A Key to Performance Dheeraj Bhardwaj Department of Computer Science & Engineering Indian Institute of Technology, Delhi –110 016 India http://www.cse.iitd.ac.in/~dheerajb Dheeraj Bhardwaj <[email protected]> August, 2002 1 Introduction • Traditional Science • Observation • Theory • Experiment -- Most expensive • Experiment can be replaced with Computers Simulation - Third Pillar of Science Dheeraj Bhardwaj <[email protected]> August, 2002 2 1 Introduction • If your Applications need more computing power than a sequential computer can provide ! ! ! ❃ Desire and prospect for greater performance • You might suggest to improve the operating speed of processors and other components. • We do not disagree with your suggestion BUT how long you can go ? Can you go beyond the speed of light, thermodynamic laws and high financial costs ? Dheeraj Bhardwaj <[email protected]> August, 2002 3 Performance Three ways to improve the performance • Work harder - Using faster hardware • Work smarter - - doing things more efficiently (algorithms and computational techniques) • Get help - Using multiple computers to solve a particular task. Dheeraj Bhardwaj <[email protected]> August, 2002 4 2 Parallel Computer Definition : A parallel computer is a “Collection of processing elements that communicate and co-operate to solve large problems fast”. Driving Forces and Enabling Factors Desire and prospect for greater performance Users have even bigger problems and designers have even more gates Dheeraj Bhardwaj <[email protected]> -
Parallelism in Cilk Plus
Cilk Plus: Language Support for Thread and Vector Parallelism Arch D. Robison Intel Sr. Principal Engineer Outline Motivation for Intel® Cilk Plus SIMD notations Fork-Join notations Karatsuba multiplication example GCC branch Copyright© 2012, Intel Corporation. All rights reserved. 2 *Other brands and names are the property of their respective owners. Multi-Threading and Vectorization are Essential to Performance Latest Intel® Xeon® chip: 8 cores 2 independent threads per core 8-lane (single prec.) vector unit per thread = 128-fold potential for single socket Intel® Many Integrated Core Architecture >50 cores (KNC) ? independent threads per core 16-lane (single prec.) vector unit per thread = parallel heaven Copyright© 2012, Intel Corporation. All rights reserved. 3 *Other brands and names are the property of their respective owners. Importance of Abstraction Software outlives hardware. Recompiling is easier than rewriting. Coding too closely to hardware du jour makes moving to new hardware expensive. C++ philosophy: abstraction with minimal penalty Do not expect compiler to be clever. But let it do tedious bookkeeping. Copyright© 2012, Intel Corporation. All rights reserved. 4 *Other brands and names are the property of their respective owners. “Three Layer Cake” Abstraction Message Passing exploit multiple nodes Fork-Join exploit multiple cores exploit parallelism at multiple algorithmic levels SIMD exploit vector hardware Copyright© 2012, Intel Corporation. All rights reserved. 5 *Other brands and names are the property of their respective owners. Composition Message Driven compose via send/receive Fork-Join compose via call/return SIMD compose sequentially Copyright© 2012, Intel Corporation. All rights reserved. 6 *Other brands and names are the property of their respective owners. -
User's Manual
rBOX610 Linux Software User’s Manual Disclaimers This manual has been carefully checked and believed to contain accurate information. Axiomtek Co., Ltd. assumes no responsibility for any infringements of patents or any third party’s rights, and any liability arising from such use. Axiomtek does not warrant or assume any legal liability or responsibility for the accuracy, completeness or usefulness of any information in this document. Axiomtek does not make any commitment to update the information in this manual. Axiomtek reserves the right to change or revise this document and/or product at any time without notice. No part of this document may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of Axiomtek Co., Ltd. Trademarks Acknowledgments Axiomtek is a trademark of Axiomtek Co., Ltd. ® Windows is a trademark of Microsoft Corporation. Other brand names and trademarks are the properties and registered brands of their respective owners. Copyright 2014 Axiomtek Co., Ltd. All Rights Reserved February 2014, Version A2 Printed in Taiwan ii Table of Contents Disclaimers ..................................................................................................... ii Chapter 1 Introduction ............................................. 1 1.1 Specifications ...................................................................................... 2 Chapter 2 Getting Started ...................................... -
A Cilk Implementation of LTE Base-Station Up- Link on the Tilepro64 Processor
A Cilk Implementation of LTE Base-station up- link on the TILEPro64 Processor Master of Science Thesis in the Programme of Integrated Electronic System Design HAO LI Chalmers University of Technology Department of Computer Science and Engineering Goteborg,¨ Sweden, November 2011 The Author grants to Chalmers University of Technology and University of Gothen- burg the non-exclusive right to publish the Work electronically and in a non-commercial purpose make it accessible on the Internet. The Author warrants that he/she is the au- thor to the Work, and warrants that the Work does not contain text, pictures or other material that violates copyright law. The Author shall, when transferring the rights of the Work to a third party (for ex- ample a publisher or a company), acknowledge the third party about this agreement. If the Author has signed a copyright agreement with a third party regarding the Work, the Author warrants hereby that he/she has obtained any necessary permission from this third party to let Chalmers University of Technology and University of Gothenburg store the Work electronically and make it accessible on the Internet. A Cilk Implementation of LTE Base-station uplink on the TILEPro64 Processor HAO LI © HAO LI, November 2011 Examiner: Sally A. McKee Chalmers University of Technology Department of Computer Science and Engineering SE-412 96 Goteborg¨ Sweden Telephone +46 (0)31-255 1668 Supervisor: Magnus Sjalander¨ Chalmers University of Technology Department of Computer Science and Engineering SE-412 96 Goteborg¨ Sweden Telephone +46 (0)31-772 1075 Cover: A Cilk Implementation of LTE Base-station uplink on the TILEPro64 Processor.