Multi-Core Programming
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Designing an Ultra Low-Overhead Multithreading Runtime for Nim
Designing an ultra low-overhead multithreading runtime for Nim Mamy Ratsimbazafy Weave [email protected] https://github.com/mratsim/weave Hello! I am Mamy Ratsimbazafy During the day blockchain/Ethereum 2 developer (in Nim) During the night, deep learning and numerical computing developer (in Nim) and data scientist (in Python) You can contact me at [email protected] Github: mratsim Twitter: m_ratsim 2 Where did this talk came from? ◇ 3 years ago: started writing a tensor library in Nim. ◇ 2 threading APIs at the time: OpenMP and simple threadpool ◇ 1 year ago: complete refactoring of the internals 3 Agenda ◇ Understanding the design space ◇ Hardware and software multithreading: definitions and use-cases ◇ Parallel APIs ◇ Sources of overhead and runtime design ◇ Minimum viable runtime plan in a weekend 4 Understanding the 1 design space Concurrency vs parallelism, latency vs throughput Cooperative vs preemptive, IO vs CPU 5 Parallelism is not 6 concurrency Kernel threading 7 models 1:1 Threading 1 application thread -> 1 hardware thread N:1 Threading N application threads -> 1 hardware thread M:N Threading M application threads -> N hardware threads The same distinctions can be done at a multithreaded language or multithreading runtime level. 8 The problem How to schedule M tasks on N hardware threads? Latency vs 9 Throughput - Do we want to do all the work in a minimal amount of time? - Numerical computing - Machine learning - ... - Do we want to be fair? - Clients-server - Video decoding - ... Cooperative vs 10 Preemptive Cooperative multithreading: -
2.5 Classification of Parallel Computers
52 // Architectures 2.5 Classification of Parallel Computers 2.5 Classification of Parallel Computers 2.5.1 Granularity In parallel computing, granularity means the amount of computation in relation to communication or synchronisation Periods of computation are typically separated from periods of communication by synchronization events. • fine level (same operations with different data) ◦ vector processors ◦ instruction level parallelism ◦ fine-grain parallelism: – Relatively small amounts of computational work are done between communication events – Low computation to communication ratio – Facilitates load balancing 53 // Architectures 2.5 Classification of Parallel Computers – Implies high communication overhead and less opportunity for per- formance enhancement – If granularity is too fine it is possible that the overhead required for communications and synchronization between tasks takes longer than the computation. • operation level (different operations simultaneously) • problem level (independent subtasks) ◦ coarse-grain parallelism: – Relatively large amounts of computational work are done between communication/synchronization events – High computation to communication ratio – Implies more opportunity for performance increase – Harder to load balance efficiently 54 // Architectures 2.5 Classification of Parallel Computers 2.5.2 Hardware: Pipelining (was used in supercomputers, e.g. Cray-1) In N elements in pipeline and for 8 element L clock cycles =) for calculation it would take L + N cycles; without pipeline L ∗ N cycles Example of good code for pipelineing: §doi =1 ,k ¤ z ( i ) =x ( i ) +y ( i ) end do ¦ 55 // Architectures 2.5 Classification of Parallel Computers Vector processors, fast vector operations (operations on arrays). Previous example good also for vector processor (vector addition) , but, e.g. recursion – hard to optimise for vector processors Example: IntelMMX – simple vector processor. -
Fiber Context As a first-Class Object
Document number: P0876R6 Date: 2019-06-17 Author: Oliver Kowalke ([email protected]) Nat Goodspeed ([email protected]) Audience: SG1 fiber_context - fibers without scheduler Revision History . .1 abstract . .2 control transfer mechanism . .2 std::fiber_context as a first-class object . .3 encapsulating the stack . .4 invalidation at resumption . .4 problem: avoiding non-const global variables and undefined behaviour . .5 solution: avoiding non-const global variables and undefined behaviour . .6 inject function into suspended fiber . 11 passing data between fibers . 12 termination . 13 exceptions . 16 std::fiber_context as building block for higher-level frameworks . 16 interaction with STL algorithms . 18 possible implementation strategies . 19 fiber switch on architectures with register window . 20 how fast is a fiber switch . 20 interaction with accelerators . 20 multi-threading environment . 21 acknowledgments . 21 API ................................................................ 22 33.7 Cooperative User-Mode Threads . 22 33.7.1 General . 22 33.7.2 Empty vs. Non-Empty . 22 33.7.3 Explicit Fiber vs. Implicit Fiber . 22 33.7.4 Implicit Top-Level Function . 22 33.7.5 Header <experimental/fiber_context> synopsis . 23 33.7.6 Class fiber_context . 23 33.7.7 Function unwind_fiber() . 27 references . 29 Revision History This document supersedes P0876R5. Changes since P0876R5: • std::unwind_exception removed: stack unwinding must be performed by platform facilities. • fiber_context::can_resume_from_any_thread() renamed to can_resume_from_this_thread(). • fiber_context::valid() renamed to empty() with inverted sense. • Material has been added concerning the top-level wrapper logic governing each fiber. The change to unwinding fiber stacks using an anonymous foreign exception not catchable by C++ try / catch blocks is in response to deep discussions in Kona 2019 of the surprisingly numerous problems surfaced by using an ordinary C++ exception for that purpose. -
Fiber-Based Architecture for NFV Cloud Databases
Fiber-based Architecture for NFV Cloud Databases Vaidas Gasiunas, David Dominguez-Sal, Ralph Acker, Aharon Avitzur, Ilan Bronshtein, Rushan Chen, Eli Ginot, Norbert Martinez-Bazan, Michael Muller,¨ Alexander Nozdrin, Weijie Ou, Nir Pachter, Dima Sivov, Eliezer Levy Huawei Technologies, Gauss Database Labs, European Research Institute fvaidas.gasiunas,david.dominguez1,fi[email protected] ABSTRACT to a specific workload, as well as to change this alloca- The telco industry is gradually shifting from using mono- tion on demand. Elasticity is especially attractive to op- lithic software packages deployed on custom hardware to erators who are very conscious about Total Cost of Own- using modular virtualized software functions deployed on ership (TCO). NFV meets databases (DBs) in more than cloudified data centers using commodity hardware. This a single aspect [20, 30]. First, the NFV concepts of modu- transformation is referred to as Network Function Virtual- lar software functions accelerate the separation of function ization (NFV). The scalability of the databases (DBs) un- (business logic) and state (data) in the design and imple- derlying the virtual network functions is the cornerstone for mentation of network functions (NFs). This in turn enables reaping the benefits from the NFV transformation. This independent scalability of the DB component, and under- paper presents an industrial experience of applying shared- lines its importance in providing elastic state repositories nothing techniques in order to achieve the scalability of a DB for various NFs. In principle, the NFV-ready DBs that are in an NFV setup. The special combination of requirements relevant to our discussion are sophisticated main-memory in NFV DBs are not easily met with conventional execution key-value (KV) stores with stringent high availability (HA) models. -
A Thread Performance Comparison: Windows NT and Solaris on a Symmetric Multiprocessor
The following paper was originally published in the Proceedings of the 2nd USENIX Windows NT Symposium Seattle, Washington, August 3–4, 1998 A Thread Performance Comparison: Windows NT and Solaris on A Symmetric Multiprocessor Fabian Zabatta and Kevin Ying Brooklyn College and CUNY Graduate School For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: [email protected] 4. WWW URL:http://www.usenix.org/ A Thread Performance Comparison: Windows NT and Solaris on A Symmetric Multiprocessor Fabian Zabatta and Kevin Ying Logic Based Systems Lab Brooklyn College and CUNY Graduate School Computer Science Department 2900 Bedford Avenue Brooklyn, New York 11210 {fabian, kevin}@sci.brooklyn.cuny.edu Abstract Manufacturers now have the capability to build high performance multiprocessor machines with common CPU CPU CPU CPU cache cache cache cache PC components. This has created a new market of low cost multiprocessor machines. However, these machines are handicapped unless they have an oper- Bus ating system that can take advantage of their under- lying architectures. Presented is a comparison of Shared Memory I/O two such operating systems, Windows NT and So- laris. By focusing on their implementation of Figure 1: A basic symmetric multiprocessor architecture. threads, we show each system's ability to exploit multiprocessors. We report our results and inter- pretations of several experiments that were used to compare the performance of each system. What 1.1 Symmetric Multiprocessing emerges is a discussion on the performance impact of each implementation and its significance on vari- Symmetric multiprocessing (SMP) is the primary ous types of applications. -
Computer Hardware Architecture Lecture 4
Computer Hardware Architecture Lecture 4 Manfred Liebmann Technische Universit¨atM¨unchen Chair of Optimal Control Center for Mathematical Sciences, M17 [email protected] November 10, 2015 Manfred Liebmann November 10, 2015 Reading List • Pacheco - An Introduction to Parallel Programming (Chapter 1 - 2) { Introduction to computer hardware architecture from the parallel programming angle • Hennessy-Patterson - Computer Architecture - A Quantitative Approach { Reference book for computer hardware architecture All books are available on the Moodle platform! Computer Hardware Architecture 1 Manfred Liebmann November 10, 2015 UMA Architecture Figure 1: A uniform memory access (UMA) multicore system Access times to main memory is the same for all cores in the system! Computer Hardware Architecture 2 Manfred Liebmann November 10, 2015 NUMA Architecture Figure 2: A nonuniform memory access (UMA) multicore system Access times to main memory differs form core to core depending on the proximity of the main memory. This architecture is often used in dual and quad socket servers, due to improved memory bandwidth. Computer Hardware Architecture 3 Manfred Liebmann November 10, 2015 Cache Coherence Figure 3: A shared memory system with two cores and two caches What happens if the same data element z1 is manipulated in two different caches? The hardware enforces cache coherence, i.e. consistency between the caches. Expensive! Computer Hardware Architecture 4 Manfred Liebmann November 10, 2015 False Sharing The cache coherence protocol works on the granularity of a cache line. If two threads manipulate different element within a single cache line, the cache coherency protocol is activated to ensure consistency, even if every thread is only manipulating its own data. -
Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++ Christopher S. Zakian, Timothy A. K. Zakian Abhishek Kulkarni, Buddhika Chamith, and Ryan R. Newton Indiana University - Bloomington, fczakian, tzakian, adkulkar, budkahaw, [email protected] Abstract. Library and language support for scheduling non-blocking tasks has greatly improved, as have lightweight (user) threading packages. How- ever, there is a significant gap between the two developments. In previous work|and in today's software packages|lightweight thread creation incurs much larger overheads than tasking libraries, even on tasks that end up never blocking. This limitation can be removed. To that end, we describe an extension to the Intel Cilk Plus runtime system, Concurrent Cilk, where tasks are lazily promoted to threads. Concurrent Cilk removes the overhead of thread creation on threads which end up calling no blocking operations, and is the first system to do so for C/C++ with legacy support (standard calling conventions and stack representations). We demonstrate that Concurrent Cilk adds negligible overhead to existing Cilk programs, while its promoted threads remain more efficient than OS threads in terms of context-switch overhead and blocking communication. Further, it enables development of blocking data structures that create non-fork-join dependence graphs|which can expose more parallelism, and better supports data-driven computations waiting on results from remote devices. 1 Introduction Both task-parallelism [1, 11, 13, 15] and lightweight threading [20] libraries have become popular for different kinds of applications. The key difference between a task and a thread is that threads may block|for example when performing IO|and then resume again. -
Broadwell Skylake Next Gen* NEW Intel NEW Intel NEW Intel Microarchitecture Microarchitecture Microarchitecture
15 лет доступности IOTG is extending the product availability for IOTG roadmap products from a minimum of 7 years to a minimum of 15 years when both processor and chipset are on 22nm and newer process technologies. - Xeon Scalable (w/ chipsets) - E3-12xx/15xx v5 and later (w/ chipsets) - 6th gen Core and later (w/ chipsets) - Bay Trail (E3800) and later products (Braswell, N3xxx) - Atom C2xxx (Rangeley) and later - Не включает в себя Xeon-D (7 лет) и E5-26xx v4 (7 лет) 2 IOTG Product Availability Life-Cycle 15 year product availability will start with the following products: Product Discontinuance • Intel® Xeon® Processor Scalable Family codenamed Skylake-SP and later with associated chipsets Notification (PDN)† • Intel® Xeon® E3-12xx/15xx v5 series (Skylake) and later with associated chipsets • 6th Gen Intel® Core™ processor family (Skylake) and later (includes Intel® Pentium® and Celeron® processors) with PDNs will typically be issued no later associated chipsets than 13.5 years after component • Intel Pentium processor N3700 (Braswell) and later and Intel Celeron processors N3xxx (Braswell) and J1900/N2xxx family introduction date. PDNs are (Bay Trail) and later published at https://qdms.intel.com/ • Intel® Atom® processor C2xxx (Rangeley) and E3800 family (Bay Trail) and late Last 7 year product availability Time Last Last Order Ship Last 15 year product availability Time Last Last Order Ship L-1 L L+1 L+2 L+3 L+4 L+5 L+6 L+7 L+8 L+9 L+10 L+11 L+12 L+13 L+14 L+15 Years Introduction of component family † Intel may support this extended manufacturing using reasonably Last Time Order/Ship Periods Component family introduction dates are feasible means deemed by Intel to be appropriate. -
Intel® Audience Impression Metric (AIM) Suite User Guide July 2011 2 Document Number: 465720-1.1
Intel® Audience Impression Metric (AIM) Suite User Guide July 2011 Revision 1.1 Document Number: 465720 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. -
Intel® E7501 Chipset for Embedded Computing
Product Brief Intel® E7501 Chipset for Embedded Computing Product Overview The Intel® E7501 chipset represents the next step in high-performance chipset technology, supporting single- and dual-processor platforms optimized for the Intel® Xeon™ processor and Low Voltage Intel® Xeon™ processor. It also supports uni-processor platforms optimized for the Intel® Pentium® M processor. The design delivers maximized system bus, memory, and I/O bandwidth to enhance performance, scalability, and end-user productivity while providing a smooth transition to next-generation technologies. Features that Maximize Performance of an Intel® E7501 Chipset-based Platform I Dual Intel Xeon processors and a 400/533 MHz I Single or dual DDR266 memory channels system bus provide up to 4.3 GB/s of available provide up to 4.3 GB/s of memory bandwidth bandwidth I Three hub interface 2.0 connections provide Intel in I Intel Pentium M processors and a 400 MHz multiple high-bandwidth I/O configuration Communications system bus provide up to 3.2 GB/s of available options, yielding up to 3.2 GB/s of I/O bandwidth bandwidth Features Benefits Supports one or two Intel® Xeon™ processors I Intel® NetBurst™ microarchitecture and the Hyper-Threading Technology of or Low Voltage Intel® Xeon™ processors the Intel Xeon processor combine to deliver world-class performance. Supports one Intel® Pentium® M processor I New microarchitecture supports low power and high performance. 400/533 MHz system bus capability I Up to 4.3 GB/s system bus bandwidth for increased memory and I/O throughput. Intel hub architecture 2.0 connection I Point-to-point connection between the MCH and up to three P64H2 hub to the Memory Controller Hub (MCH) devices provides up to 3.2 GB/s of bandwidth. -
Intel Unveils All New 2010 Intel® Core™ Processor Family
Intel Unveils All New 2010 Intel® Core™ Processor Family Industry's Smartest, Most Advanced Technology Available in Variety of Price Points LAS VEGAS, Jan 07, 2010 (BUSINESS WIRE) -- Intel Corporation: NEWS HIGHLIGHTS 1 ● Mainstream processors now offer Intel(R) Turbo Boost Technology , automatically adapting to an individual's performance needs ● First 32 nanometer processors and first time Intel is mass-producing a variety of chips at mainstream prices at start of new manufacturing process, reflecting last year's $7 billion investment during economic recession 4 ● Intel(R) Core(TM) i5 processors are about twice as fast as comparable existing PCs for visibly faster video, photo and music downloading experience ● Historic milestone: select processors integrate graphics directly on processors; also include Intel's second generation high-k metal gate transistors ● Beyond laptops and PCs, processors also target ATMs, travel kiosks, digital displays ● More than 10 new chipsets and new 802.11n WiFi and WiMAX products with new Intel(R) My WiFi features Intel Corporation introduced its all new 2010 Intel(R) Core(TM) family of processors today, delivering unprecedented integration and smart performance, including Intel(R) Turbo Boost Technology1 for laptops, desktops and embedded devices. The introduction of new Intel(R) Core(TM) i7, i5 and i3 chips coincides with the arrival of Intel's groundbreaking new 32 nanometer (nm) manufacturing process - which for the first time in the company's history - will be used to immediately produce and deliver processors and features at a variety of price points, and integrate high-definition graphics inside the processor. This unprecedented ramp and innovation reflects Intel's $7 billion investment announced early last year in the midst of a major global economic recession. -
Threading SIMD and MIMD in the Multicore Context the Ultrasparc T2
Overview SIMD and MIMD in the Multicore Context Single Instruction Multiple Instruction ● (note: Tute 02 this Weds - handouts) ● Flynn’s Taxonomy Single Data SISD MISD ● multicore architecture concepts Multiple Data SIMD MIMD ● for SIMD, the control unit and processor state (registers) can be shared ■ hardware threading ■ SIMD vs MIMD in the multicore context ● however, SIMD is limited to data parallelism (through multiple ALUs) ■ ● T2: design features for multicore algorithms need a regular structure, e.g. dense linear algebra, graphics ■ SSE2, Altivec, Cell SPE (128-bit registers); e.g. 4×32-bit add ■ system on a chip Rx: x x x x ■ 3 2 1 0 execution: (in-order) pipeline, instruction latency + ■ thread scheduling Ry: y3 y2 y1 y0 ■ caches: associativity, coherence, prefetch = ■ memory system: crossbar, memory controller Rz: z3 z2 z1 z0 (zi = xi + yi) ■ intermission ■ design requires massive effort; requires support from a commodity environment ■ speculation; power savings ■ massive parallelism (e.g. nVidia GPGPU) but memory is still a bottleneck ■ OpenSPARC ● multicore (CMT) is MIMD; hardware threading can be regarded as MIMD ● T2 performance (why the T2 is designed as it is) ■ higher hardware costs also includes larger shared resources (caches, TLBs) ● the Rock processor (slides by Andrew Over; ref: Tremblay, IEEE Micro 2009 ) needed ⇒ less parallelism than for SIMD COMP8320 Lecture 2: Multicore Architecture and the T2 2011 ◭◭◭ • ◮◮◮ × 1 COMP8320 Lecture 2: Multicore Architecture and the T2 2011 ◭◭◭ • ◮◮◮ × 3 Hardware (Multi)threading The UltraSPARC T2: System on a Chip ● recall concurrent execution on a single CPU: switch between threads (or ● OpenSparc Slide Cast Ch 5: p79–81,89 processes) requires the saving (in memory) of thread state (register values) ● aggressively multicore: 8 cores, each with 8-way hardware threading (64 virtual ■ motivation: utilize CPU better when thread stalled for I/O (6300 Lect O1, p9–10) CPUs) ■ what are the costs? do the same for smaller stalls? (e.g.