Dense Linear Algebra Factorization in OpenMP and Cilk Plus on Intel's MIC Architecture


Dense Linear Algebra Factorization in OpenMP and Cilk Plus on Intel's MIC Architecture: Development Experiences and Performance Analysis

John Eisenlohr (Ohio Supercomputer Center), David E. Hudak (Ohio Supercomputer Center), Karen Tomko (Ohio Supercomputer Center), Timothy C. Prince (Intel Corporation)

Abstract. Applications that generate large-scale task-vector parallelism are good candidates for general-purpose manycore architectures. We implemented variants of the communication-avoiding QR algorithm to test hardware utilization and run-time efficiency of OpenMP and Cilk Plus on Intel's MIC architecture. Preliminary results show similar performance for OpenMP and Cilk Plus and demonstrate a minimal impact of application-level restructuring on performance.

1 Introduction and Objectives

Intel's Many Integrated Core (MIC) is a general-purpose manycore architecture supporting open-source operating systems and language runtimes (e.g., Linux, OpenMP, and Cilk Plus). This run-time support makes initial application porting easier relative to special-purpose manycore alternatives; for example, existing OpenMP code need not be rewritten in CUDA. However, upcoming devices based on the MIC architecture will have in excess of fifty cores and support multiple threads per core, requiring hundreds of active threads for effective hardware utilization. Meeting this goal will require a combination of increased scalability at the OS/runtime level as well as new application tuning techniques. In order to begin studying these areas, we sought an algorithm that could create a large number of parallel tasks while providing flexibility in implementation. We selected QR factorization, since it allows specification of problem size and has been extensively studied for every major HPC architectural model.

There are a number of important research questions that must be answered for the MIC architecture. What impact do the run-time systems of OpenMP and Cilk Plus have on performance? How can application-level restructuring impact performance? To examine these questions, we have implemented a number of variants of the communication-avoiding QR algorithm in both OpenMP and Cilk Plus and performed a preliminary set of experiments measuring both wall-clock performance and event-based sampling. Initial results demonstrate similar performance for OpenMP and Cilk Plus.

2 Related Work and Algorithm Implementation

QR factorization is an important numerical technique, and a tremendous amount of effort has gone into developing efficient QR algorithms and implementations for every major HPC architectural model: vector (Linpack), SMP (LAPACK), single-core cluster (ScaLAPACK), multicore cluster (PLASMA), and GPU-accelerated cluster (MAGMA). Demmel et al. [3] describe the tall-skinny QR (TSQR) and communication-avoiding QR (CAQR) algorithms. Hadri et al. [5] describe the problems of QR factorization for TS matrices and develop both the semi-parallel tile CAQR (SP-CAQR) and the fully parallel tile CAQR (FP-CAQR). Kurzak et al. [6] utilize both the host CPU cores and a GPU for SP-CAQR. Agullo et al. [1] extend the work of Hadri and Kurzak. They compare the performance of Tile QR and Tile CAQR, noting that CAQR "introduces parallelism in the panel factorization itself", i.e., by adding a binary tree reduction of partial factors. They examine the impact of tile granularity on performance and show that a dynamic scheduling-based implementation improves performance.
Anderson et al. [2] describe the CAQR algorithm running entirely on GPUs, including the panel factorization steps previously performed on CPUs [6]. Song et al. [7] extend SP-CAQR and FP-CAQR [5] to a distributed-memory implementation with decentralized dynamic scheduling. The distributed-memory approach is similar to the multiple-GPU approach [1]: the matrix is treated as D domains, with each process responsible for computing D/P domains. Dongarra et al. [4] incorporate all prior approaches into a general-purpose hierarchical algorithm targeted toward distributed-memory clusters of SMP nodes.

2.1 Overview of the CAQR Algorithm

Our implementation of the communication-avoiding QR (CAQR) decomposition is written in C, with the factorization code offloaded to a Knights Ferry (KNF) coprocessor. The offload call transfers the input matrix to device memory, converts the matrix to blocked format, and stores the resulting factors on the device, on the assumption that downstream operations that make use of the decomposition will be offloaded to the KNF as well. At the highest level, the structure of the CAQR algorithm is to apply a set of four different kernels to the tiles that make up the matrix. These kernels are applied repeatedly, in a certain order, until the factorization is complete. Each kernel application is a task, and a great deal of research has been done to determine an ordering of the QR tasks that allows for maximum concurrency. The overall flow of the factorization is determined once we have fixed on the CAQR algorithm, but we are still free to make implementation choices that affect how well the program runs on the KNF. The implementation decisions are:

1. What is the best block size and shape to use? The lowest-level subroutines should take best advantage of the KNF's level-1 cache and 512-bit-wide vector registers.

2. What is the highest level of concurrency the KNF can sustain for these kernels? For example, is it beneficial to run four threads per core?

3. How should we implement task scheduling in parallel sections of the algorithm? To make this decision, we need to know how well the KNF runtime schedules tasks in parallel sections and how much improvement we can get by customizing task scheduling.

3 Implementing CAQR on the MIC Architecture

For this study we adopt the CAQR algorithm as presented by Anderson et al. in [2]. At the highest level, the structure of the CAQR algorithm is to apply a set of four different kernels, diagrammed in Figure 1, to the tiles that make up the matrix:

1. factor: compute Householder vectors for a single block of the matrix. factor is applied within a loop over all blocks in a column of blocks. This loop is fully parallel. Applying factor to all the blocks in a column is called a factor panel operation.

2. factor_tree: compute Householder vectors for a chunk of blocks. factor_tree is applied within a loop over all chunks of blocks in a column of blocks. This loop is fully parallel. Applying factor_tree to all the blocks in a column is called a factor panel tree operation.

3. apply_qt_h: apply a block of Householder vectors horizontally across the trailing matrix. apply_qt_h is applied within a loop over an MxN array of blocks. This loop is fully parallel. Applying apply_qt_h to an array of blocks is called a trailing matrix update operation.

4. apply_qt_tree: apply Householder vectors from a chunk of blocks horizontally across chunks of blocks in the trailing matrix. apply_qt_tree is applied within a loop over an array of chunks of blocks. This loop is fully parallel. Applying apply_qt_tree to an array of chunks is called a trailing matrix tree update operation.

These kernels are applied repeatedly, panel by panel, until the factorization is complete.

[Figure 1: CAQR algorithm steps. Figure adapted from [2].]
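To make this task structure concrete, the sketch below shows one plausible shape for the panel-by-panel driver loop. The kernel names follow the text, but every signature, loop bound, and the chunking scheme are our own illustrative assumptions, not the authors' code.

/* Illustrative kernel signatures; the real kernels operate on a matrix
 * already converted to blocked format on the coprocessor. */
void factor(double *A, int blk_row, int panel);
void factor_tree(double *A, int chunk, int panel);
void apply_qt_h(double *A, int blk_row, int blk_col, int panel);
void apply_qt_tree(double *A, int chunk, int blk_col, int panel);

/* Hypothetical CAQR driver: apply the four kernels panel by panel. */
void caqr(double *A, int block_rows, int block_cols, int chunks)
{
    for (int p = 0; p < block_cols; p++) {
        /* Factor panel: fully parallel loop over the blocks in column p. */
        for (int i = p; i < block_rows; i++)
            factor(A, i, p);

        /* Factor panel tree: fully parallel loop over chunks of blocks. */
        for (int c = 0; c < chunks; c++)
            factor_tree(A, c, p);

        /* Trailing matrix update: fully parallel over an MxN array of blocks. */
        for (int i = p; i < block_rows; i++)
            for (int j = p + 1; j < block_cols; j++)
                apply_qt_h(A, i, j, p);

        /* Trailing matrix tree update: fully parallel over chunks of blocks. */
        for (int c = 0; c < chunks; c++)
            for (int j = p + 1; j < block_cols; j++)
                apply_qt_tree(A, c, j, p);
    }
}

Each inner loop here corresponds to one of the "fully parallel" loops named above; Section 3.1 enumerates the OpenMP and Cilk Plus constructs used to parallelize them.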
3.1 Task Parallelism with OpenMP and Cilk Plus

To answer questions 1 and 2 posed in Sec. 2.1, we vary the block size and the maximum number of threads, then use Intel's SEP and VTune to measure cache reuse, vectorization efficiency, and cycles per instruction. To answer question 3, we explore the following implementations of task parallelism for the stages of the algorithm (illustrative code for two of these variants appears after Sec. 4):

1. OpenMP CAQR
(a) Simple OpenMP, using a single-level omp for in all four loops: factor panel, trailing matrix update, factor panel tree, and trailing matrix tree update.
(b) Nested OpenMP, using nested omp for in the two-dimensional loops trailing matrix update and trailing matrix tree update.
(c) Chunked OpenMP, comparing static and dynamic OMP task-scheduling options and schedule chunk sizes for trailing matrix update and trailing matrix tree update.

2. Cilk Plus CAQR
(a) Simple Cilk Plus, using a single cilk_for in factor panel and factor panel tree.
(b) Nested Cilk Plus, using a single cilk_for in factor panel and factor panel tree with a nested cilk_for in trailing matrix update and trailing matrix tree update.

[Figure 2: MIC Architecture Software Development System block diagram: Knights Ferry cards (cores with L1/L2 caches and device memory) attached by dual PCIe to dual-socket Xeon cluster nodes (6 cores each, L3 cache, RAM), with nodes linked by an InfiniBand interconnection network and MPI.]

4 Experimental Results

The experiments were carried out on an Intel Knights Ferry Software Development system with a dual-socket node containing two Intel Westmere (X5680) hex-core processors running at 3.33 GHz, 24 GB of memory, and two KNF coprocessors connected via PCIe, as shown in Figure 2. The host processors run Red Hat Enterprise Linux Server release 6.1 (Santiago), kernel version 2.6.32-131.0.15.el6.x86_64. Intel's Sampling Enabling Product (SEP) is used to determine clock cycles for routines by using the CPU_CLK_UNHALTED counter, and cache misses by using the DATA_READ_MISS_OR_WRITE_MISS and DATA_READ_OR_WRITE counters. One core is dedicated to system tasks, hence we restrict our experiments to 29 cores.
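As promised in Sec. 3.1, the two fragments below illustrate variants 1(c) (Chunked OpenMP) and 2(b) (Nested Cilk Plus) applied to the trailing matrix update. The kernel apply_qt_h and the loop bounds are the same illustrative assumptions used earlier; chunk stands for the schedule chunk size being tuned.

#include <cilk/cilk.h>

void apply_qt_h(double *A, int blk_row, int blk_col, int panel);

/* Chunked OpenMP (variant 1c): static vs. dynamic scheduling and the
 * chunk size are the parameters compared in the experiments. */
void trailing_update_omp(double *A, int p, int block_rows, int block_cols,
                         int chunk)
{
    #pragma omp parallel for schedule(dynamic, chunk)
    for (int i = p; i < block_rows; i++)
        for (int j = p + 1; j < block_cols; j++)
            apply_qt_h(A, i, j, p);
}

/* Nested Cilk Plus (variant 2b): nested cilk_for loops expose the full
 * two-dimensional task set to the work-stealing scheduler. */
void trailing_update_cilk(double *A, int p, int block_rows, int block_cols)
{
    cilk_for (int i = p; i < block_rows; i++)
        cilk_for (int j = p + 1; j < block_cols; j++)
            apply_qt_h(A, i, j, p);
}

The OpenMP variant fixes the work partition via its schedule clause, while the Cilk Plus variant leaves load balancing entirely to the runtime; comparing these two policies is exactly question 3 of Sec. 2.1.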
Recommended publications
  • Other APIs: What's Wrong with OpenMP?
    Threaded Programming: Other APIs

    What's wrong with OpenMP?
    • OpenMP is designed for programs where you want a fixed number of threads, and you always want the threads to be consuming CPU cycles: you cannot arbitrarily start/stop threads, and you cannot put threads to sleep and wake them up later.
    • OpenMP is good for programs where each thread is doing (more or less) the same thing.
    • Although OpenMP supports C++, it's not especially OO-friendly, though it is gradually getting better.
    • OpenMP doesn't support other popular base languages, e.g. Java or Python.

    Threaded programming APIs
    • Essential features: a way to create threads; a way to wait for a thread to finish its work; a mechanism to support thread-private data; some basic synchronisation methods, at least a mutex lock or atomic operations.
    • Optional features: support for tasks; more synchronisation methods, e.g. condition variables, barriers, ...; higher levels of abstraction, e.g. parallel loops and reductions.

    What are the alternatives? POSIX threads, C++ threads, Intel TBB, Cilk, OpenCL, Java (not an exhaustive list!).

    POSIX threads
    • POSIX threads (or Pthreads) is a standard library for shared-memory programming without directives, part of the ANSI/IEEE 1003.1 standard (1996).
    • The interface is a C library: there is no standard Fortran interface, and it can be used with C++ but is not OO-friendly.
    • Widely available, even for Windows; typically installed as part of the OS, and code is pretty portable.
    • Lots of low-level control over the behaviour of threads, but it lacks a proper memory consistency model.

    Thread forking:

    #include <pthread.h>

    int pthread_create(
        pthread_t *thread,
        const pthread_attr_t *attr,
        void *(*start_routine)(void *),
        void *arg);

    • Creates a new thread; the first argument returns a pointer to a thread descriptor.
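    The forking call above can be exercised with a minimal, self-contained sketch; the worker function and its argument here are our own illustration, not part of the course notes:

    #include <pthread.h>
    #include <stdio.h>

    /* Worker run by the new thread; receives the arg passed to pthread_create. */
    static void *hello(void *arg)
    {
        printf("hello from worker thread %d\n", *(int *)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;   /* thread descriptor filled in by pthread_create */
        int id = 1;

        pthread_create(&tid, NULL, hello, &id);  /* fork the new thread */
        pthread_join(tid, NULL);                 /* wait for it to finish */
        return 0;
    }

    Compile with, e.g., cc hello.c -pthread.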
  • Heterogeneous Task Scheduling for Accelerated OpenMP
    Heterogeneous Task Scheduling for Accelerated OpenMP
    Thomas R. W. Scogland (Department of Computer Science, Virginia Tech, Blacksburg, VA 24060 USA), Barry Rountree (Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA 94551 USA), Wu-chun Feng (Virginia Tech), Bronis R. de Supinski (LLNL)
    [email protected] [email protected] [email protected] [email protected]

    Abstract: Heterogeneous systems with CPUs and computational accelerators such as GPUs, FPGAs or the upcoming Intel MIC are becoming mainstream. In these systems, peak performance includes the performance of not just the CPUs but also all available accelerators. In spite of this fact, the majority of programming models for heterogeneous computing focus on only one of these. With the development of Accelerated OpenMP for GPUs, both from PGI and Cray, we have a clear path to extend traditional OpenMP applications incrementally to use GPUs. The extensions are geared toward switching from CPU parallelism to GPU parallelism. However they do not preserve the former while adding the latter. Thus computational potential is wasted since either the CPU cores or the GPU cores are left idle.

    [From the introduction:] ... currently requires a programmer either to program in at least two different parallel programming models, or to use one of the two that support both GPUs and CPUs. Multiple models however require code replication, and maintaining two completely distinct implementations of a computational kernel is a difficult and error-prone proposition. That leaves us with using either OpenCL or accelerated OpenMP to complete the task. OpenCL's greatest strength lies in its broad hardware support. In a way, though, that is also its greatest weakness. To enable one to program this disparate hardware
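    The incremental CPU-to-GPU extension path described in this excerpt was later standardized as OpenMP's target constructs. As a hedged sketch (ours, using OpenMP 4.x syntax, not the PGI/Cray Accelerated OpenMP extensions the paper discusses), a loop can gain an offload version while the CPU version is recovered simply by omitting the target line:

    /* Offload a saxpy-style loop to an accelerator with OpenMP 4.x target
     * directives; without the target line this is the ordinary CPU loop. */
    void saxpy(int n, float a, float *x, float *y)
    {
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }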
  • Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
    Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
    Christopher S. Zakian, Timothy A. K. Zakian, Abhishek Kulkarni, Buddhika Chamith, and Ryan R. Newton
    Indiana University - Bloomington, {czakian, tzakian, adkulkar, budkahaw, [email protected]

    Abstract. Library and language support for scheduling non-blocking tasks has greatly improved, as have lightweight (user) threading packages. However, there is a significant gap between the two developments. In previous work, and in today's software packages, lightweight thread creation incurs much larger overheads than tasking libraries, even on tasks that end up never blocking. This limitation can be removed. To that end, we describe an extension to the Intel Cilk Plus runtime system, Concurrent Cilk, where tasks are lazily promoted to threads. Concurrent Cilk removes the overhead of thread creation on threads which end up calling no blocking operations, and is the first system to do so for C/C++ with legacy support (standard calling conventions and stack representations). We demonstrate that Concurrent Cilk adds negligible overhead to existing Cilk programs, while its promoted threads remain more efficient than OS threads in terms of context-switch overhead and blocking communication. Further, it enables development of blocking data structures that create non-fork-join dependence graphs, which can expose more parallelism, and better supports data-driven computations waiting on results from remote devices.

    1 Introduction. Both task-parallelism [1, 11, 13, 15] and lightweight threading [20] libraries have become popular for different kinds of applications. The key difference between a task and a thread is that threads may block, for example when performing IO, and then resume again.
  • Parallel Programming
    Parallel Programming: Libraries and Implementations

    Outline: MPI, the distributed-memory de facto standard, and using MPI; OpenMP, the shared-memory de facto standard, and using OpenMP; CUDA, the GPGPU de facto standard, and using CUDA; others; hybrid programming; Xeon Phi programming; SHMEM; PGAS.

    MPI library: distributed, message-passing programming. Message-passing concepts: in message-passing, all the parallelism is explicit. The program includes specific instructions for each communication: what to send or receive, when to send or receive, and synchronisation. It is up to the developer to design the parallel decomposition and implement it: how will you divide up the problem, and when will you need to communicate between processes?

    Message Passing Interface (MPI): MPI is a portable library used for writing parallel programs using the message-passing model. You can expect MPI to be available on any HPC platform you use. It is based on a number of processes running independently in parallel; the HPC resource provides a command to launch multiple processes simultaneously (e.g. mpiexec, aprun). There are a number of different implementations, but all should support the MPI 2 standard. As with different compilers, there will be variations between implementations, but all the features specified in the standard should work. Examples: MPICH2, OpenMPI.

    Point-to-point communications: a message sent by one process and received by another. Both processes are actively involved in the communication, though not necessarily at the same time. A wide variety of semantics is provided: blocking vs. non-blocking; ready vs. synchronous vs. buffered; tags, communicators, wild-cards; built-in and custom data-types. Point-to-point messages can be used to implement any communication pattern; collective operations, if applicable, can be more efficient.

    Collective communications: a communication that involves all processes, where "all" means all within a communicator, i.e.
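    A minimal, self-contained sketch of the point-to-point pattern described above; the tag value and the payload are our own illustration:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Blocking send of one int to rank 1, tag 0. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Matching blocking receive from rank 0, tag 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

    Build with mpicc and launch with at least two processes, e.g. mpiexec -n 2 ./a.out.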
  • OpenMP API 5.1 Specification
    OpenMP Application Programming Interface, Version 5.1, November 2020.
    Copyright © 1997-2020 OpenMP Architecture Review Board. Permission to copy without fee all or part of this material is granted, provided the OpenMP Architecture Review Board copyright notice and the title of this document appear. Notice is given that copying is by permission of the OpenMP Architecture Review Board.

    Contents (excerpt):
    1 Overview of the OpenMP API
      1.1 Scope
      1.2 Glossary: Threading Concepts; OpenMP Language Terminology; Loop Terminology; Synchronization Terminology; Tasking Terminology; Data Terminology; Implementation Terminology; Tool Terminology
      1.3 Execution Model
      1.4 Memory Model: Structure of the OpenMP Memory Model; Device Data Environments; Memory Management; The Flush Operation; Flush Synchronization and Happens Before; OpenMP Memory Consistency
      1.5 Tool Interfaces: OMPT; OMPD
      1.6 OpenMP Compliance
      1.7 Normative References
      1.8 Organization of this Document
    2 Directives
      2.1 Directive Format: Fixed Source Form Directives; Free Source Form Directives; Stand-Alone Directives; Array Shaping; Array Sections
  • OpenMP Made Easy with Intel® Advisor
    OpenMP Made Easy with Intel® Advisor (Zakhar Matveev, PhD, Intel CVCG; November 2018, SC'18 OpenMP booth)

    Motivation (instead of an agenda):
    • Starting from 4.x, OpenMP introduces support for both levels of parallelism: multi-core (the omp parallel for directive) and SIMD (the omp simd directive). These are the two pillars of the OpenMP SIMD programming model.
    • Hardware with Intel® AVX-512 support gives you theoretically 8x speed-up over an SSE baseline (less, or even more, in practice).
    • Intel® Advisor assists in enabling SIMD parallelism with OpenMP (if not yet done), improving the performance of already vectorized OpenMP SIMD code, and optimizing for the memory subsystem (Advisor Roofline).

    Don't use a single vector lane! Un-vectorized and un-threaded software will underperform; threading and vectorization are both needed to fully utilize modern hardware.

    Vector parallelism in x86 has grown from SSE through AVX and AVX2 to AVX-512 (F, CD, DQ, BW, VL) across the NHM, SNB, HSW, and SKL microarchitecture generations: theoretically 8x more SIMD FLOP/S compared to your (-O2) optimized baseline. The significant leap to 512-bit SIMD is supported by the Intel® Compilers and Intel® Math Kernel Library, with strong compatibility with AVX (the added EVEX prefix enables additional functionality). Don't leave it on the table!

    Two-level parallelism decomposition with OpenMP, image-processing example:

    #pragma omp parallel for
    for (int y = 0; y < ImageHeight; ++y) {
        #pragma omp simd
        for (int x = 0; x < ImageWidth; ++x) {
            count[y][x] = mandel(in_vals[y][x]);
        }
    }

    Two-level parallelism decomposition with OpenMP, fluid-dynamics example:

    #pragma omp parallel for
    for (int i = 0; i < X_Dim; ++i) {
        #pragma omp simd
        for (int m = 0; m < n_velocities; ++m) {
            next_i = f(i, velocities(m));
            X[i] = next_i;
        }
    }

    Key components of Intel® Advisor; what's new in the "2019" release. Step 1.
  • Parallelism in Cilk Plus
    Cilk Plus: Language Support for Thread and Vector Parallelism (Arch D. Robison, Intel Sr. Principal Engineer)

    Outline: motivation for Intel® Cilk Plus; SIMD notations; fork-join notations; Karatsuba multiplication example; GCC branch.

    Multi-threading and vectorization are essential to performance. The latest Intel® Xeon® chip has 8 cores, 2 independent threads per core, and an 8-lane (single-precision) vector unit per thread: a 128-fold potential for a single socket. The Intel® Many Integrated Core architecture has >50 cores (KNC), multiple independent threads per core, and a 16-lane (single-precision) vector unit per thread: parallel heaven.

    Importance of abstraction: software outlives hardware, and recompiling is easier than rewriting; coding too closely to the hardware du jour makes moving to new hardware expensive. The C++ philosophy is abstraction with minimal penalty: do not expect the compiler to be clever, but let it do the tedious bookkeeping.

    The "three-layer cake" abstraction: message passing exploits multiple nodes; fork-join exploits multiple cores and parallelism at multiple algorithmic levels; SIMD exploits vector hardware. Composition: message-driven code composes via send/receive, fork-join composes via call/return, and SIMD composes sequentially.
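    As a hedged illustration of the fork-join and SIMD notations named in the outline (this sketch is ours, not a slide from the deck):

    #include <cilk/cilk.h>

    /* Fork-join notation: recursively sum an array, spawning one half. */
    long sum(const long *a, int n)
    {
        if (n < 1000) {               /* serial base case */
            long s = 0;
            for (int i = 0; i < n; i++)
                s += a[i];
            return s;
        }
        long s1 = cilk_spawn sum(a, n / 2);   /* runs in parallel...   */
        long s2 = sum(a + n / 2, n - n / 2);  /* ...with this half     */
        cilk_sync;                            /* join before combining */
        return s1 + s2;
    }

    /* SIMD (array) notation: c[i] = a[i] + b[i] for all i, one statement. */
    void add(float *c, const float *a, const float *b, int n)
    {
        c[0:n] = a[0:n] + b[0:n];
    }

    The spawn/sync pair expresses the fork-join layer of the cake; the array-section statement expresses the SIMD layer, which the compiler maps to the vector unit.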
  • Introduction to OpenMP (Paul Edmon, ITC Research Computing Associate)
    Introduction to OpenMP (Paul Edmon, ITC Research Computing Associate, FAS Research Computing)

    Overview: threaded parallelism; OpenMP basics; OpenMP programming; benchmarking.

    Threaded parallelism: shared memory, single node, non-uniform memory access (NUMA), one thread per core. Threaded languages and frameworks include PThreads, Python, Perl, OpenCL/CUDA, OpenACC, and OpenMP.

    What is OpenMP? OpenMP (Open Multi-Processing) is an application program interface (API) governed by the OpenMP Architecture Review Board. OpenMP provides a portable, scalable model for developers of shared-memory parallel applications; the API supports C/C++ and Fortran on a wide variety of architectures.

    Goals of OpenMP:
    • Standardization: provide a standard among a variety of shared-memory architectures/platforms, jointly defined and endorsed by a group of major computer hardware and software vendors.
    • Lean and mean: establish a simple and limited set of directives for programming shared-memory machines; significant parallelism can be implemented with just a few directives.
    • Ease of use: provide the capability to incrementally parallelize a serial program.
    • Portability: specified for C/C++ and Fortran; implemented on most major platforms, including Unix/Linux and Windows, and by all major compilers.

    OpenMP programming model. Shared-memory model: OpenMP is designed for multi-processor/core, shared-memory machines. Thread-based parallelism: OpenMP programs accomplish parallelism exclusively through the use of threads. Explicit parallelism: OpenMP provides explicit (not automatic) parallelism, offering the programmer full control over parallelization. Compiler-directive based: parallelism is specified through the use of compiler directives embedded in the C/C++ or Fortran code. I/O: OpenMP specifies nothing about parallel I/O.
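    A hedged illustration of the "ease of use" goal above: one directive incrementally parallelizes a serial loop (the loop itself is our example, not the course's):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;

        /* A single directive parallelizes the serial loop; the reduction
         * clause gives each thread a private partial sum and combines them. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / ((double)i * i);

        printf("pi^2/6 is approximately %f\n", sum);
        return 0;
    }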
  • Parallel Programming with OpenMP
    Parallel Programming with OpenMP

    Introduction: the OpenMP programming model. Thread-based parallelism is utilized on shared-memory platforms. Parallelization is either explicit, where the programmer has full control over parallelization, or achieved through compiler directives placed in the source code. A thread is a stream of a program's code being executed; a thread of execution is the smallest unit of processing. Multiple threads can exist within the same process and share resources such as memory.

    The master thread is a single thread that runs sequentially; parallel execution occurs inside parallel regions, and between two parallel regions only the master thread executes the code. This is called the fork-join model.

    OpenMP parallel computing hardware: shared memory allows immediate access to all data from all processors without explicit communication. In a shared-memory machine, multiple CPUs are attached to the bus, all processors share the same primary memory, and the same memory address on different CPUs refers to the same memory location. The CPU-to-memory connection becomes a bottleneck, so shared-memory computers cannot scale very well.

    OpenMP versus MPI. OpenMP (Open Multi-Processing): easy to use, with loop-level parallelism; non-loop-level parallelism is more difficult; limited to shared-memory computers; cannot handle very large problems. An alternative is MPI (Message Passing Interface), which requires low-level and more difficult programming.
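    A hedged sketch of the fork-join model just described (our own example): the master thread forks a team inside the parallel region and continues alone after the join.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("master thread before the fork\n");   /* sequential part */

        #pragma omp parallel                         /* fork: team of threads */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                                            /* join: implicit barrier */

        printf("master thread after the join\n");    /* sequential again */
        return 0;
    }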
  • A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization
    TREES: A CPU/GPU Task-Parallel Runtime with Explicit Epoch Synchronization
    Blake A. Hechtman, Andrew D. Hilton, and Daniel J. Sorin
    Department of Electrical and Computer Engineering, Duke University

    Abstract: We have developed a task-parallel runtime system, called TREES, that is designed for high performance on CPU/GPU platforms. On platforms with multiple CPUs, Cilk's "work-first" principle underlies how task-parallel applications can achieve performance, but work-first is a poor fit for GPUs. We build upon work-first to create the "work-together" principle that addresses the specific strengths and weaknesses of GPUs. The work-together principle extends work-first by stating that (a) the overhead on the critical path should be paid by the entire system at once and (b) work overheads should be paid co-operatively. We have implemented the TREES runtime in OpenCL, and we experimentally evaluate TREES applications on a CPU/GPU platform.

    [From the introduction:] Task-parallel runtimes targeting CPUs are a poor fit for GPUs. To understand why this mismatch exists, we must first understand the performance of an idealized task-parallel application (with no runtime) and then how the runtime's overhead affects it. The performance of a task-parallel application is a function of two characteristics: its total amount of work to be performed (T1, the time to execute on 1 processor) and its critical path (T∞, the time to execute on an infinite number of processors). Prior work has shown that the runtime of a system with P processors, TP, is bounded by TP ≤ T1/P + T∞ due to the greedy offline scheduler bound [3][10]. A task-parallel runtime introduces overheads and, for purposes of performance analysis, we distinguish
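    For intuition, a worked instance of the bound (our own numbers, not the paper's): a program with T1 = 80 s of total work and a critical path of T∞ = 2 s, run on P = 16 processors, is guaranteed TP ≤ 80/16 + 2 = 7 s, so a greedy scheduler achieves a speedup of at least 80/7 ≈ 11.4x.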
  • Intel Threading Building Blocks
    Praise for Intel Threading Building Blocks

    "The Age of Serial Computing is over. With the advent of multi-core processors, parallel-computing technology that was once relegated to universities and research labs is now emerging as mainstream. Intel Threading Building Blocks updates and greatly expands the 'work-stealing' technology pioneered by the MIT Cilk system of 15 years ago, providing a modern industrial-strength C++ library for concurrent programming. Not only does this book offer an excellent introduction to the library, it furnishes novices and experts alike with a clear and accessible discussion of the complexities of concurrency." — Charles E. Leiserson, MIT Computer Science and Artificial Intelligence Laboratory

    "We used to say make it right, then make it fast. We can't do that anymore. TBB lets us design for correctness and speed up front for Maya. This book shows you how to extract the most benefit from using TBB in your code." — Martin Watt, Senior Software Engineer, Autodesk

    "TBB promises to change how parallel programming is done in C++. This book will be extremely useful to any C++ programmer. With this book, James achieves two important goals: presents an excellent introduction to parallel programming, illustrating the most common parallel programming patterns and the forces governing their use; and documents the Threading Building Blocks C++ library, a library that provides generic algorithms for these patterns. TBB incorporates many of the best ideas that researchers in object-oriented parallel computing developed in the last two decades." — Marc Snir, Head of the Computer Science Department, University of Illinois at Urbana-Champaign

    "This book was my first introduction to Intel Threading Building Blocks.
  • An Overview of Parallel Computing
    An Overview of Parallel Computing (Marc Moreno Maza, University of Western Ontario, London, Ontario, Canada; CS2101)

    Plan: 1. Hardware; 2. Types of Parallelism; 3. Concurrency Platforms: Three Examples (Cilk, CUDA, MPI).

    Hardware: von Neumann architecture. In 1945, the Hungarian mathematician John von Neumann proposed this organization for hardware computers. The control unit fetches instructions and data from memory, decodes the instructions, and then sequentially coordinates operations to accomplish the programmed task. The arithmetic unit performs basic arithmetic operations, while input/output is the interface to the human operator. (Example: the Pentium family.)

    Parallel computer hardware: most computers today (including tablets, smartphones, etc.) are equipped with several processing units (control + arithmetic units). Various characteristics determine the types of computations: shared memory vs. distributed memory, single-core vs. multicore processors, data-centric vs. task-centric parallelism. Historically, shared-memory machines have been classified as UMA and NUMA, based upon memory access times.

    Uniform memory access (UMA): identical processors with equal access and access times to memory. In the presence of cache memories, cache coherency is accomplished at the hardware level: if one processor updates a location in shared memory, then all the other processors know about the update. UMA architectures were first represented by symmetric multiprocessor (SMP) machines. Multicore processors follow the same architecture and, in addition, integrate the cores onto a single circuit die.

    Non-uniform memory access (NUMA): often made by physically linking two or more SMPs (or multicore processors).