Introduction to Intel Cilk


Intel Software and Services, 2014/3/5. Copyright © 2014, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Agenda
• Cilk keywords
• Load balancing
• Reducer
• Summary

Cilk keywords
• Cilk adds three keywords to C and C++: _Cilk_spawn, _Cilk_sync, and _Cilk_for.
• If you #include <cilk/cilk.h>, you can write the keywords as cilk_spawn, cilk_sync, and cilk_for.

cilk_spawn and cilk_sync
• cilk_spawn (or _Cilk_spawn) gives the runtime permission to run a child function asynchronously.
  – No second thread is created or required!
  – If no worker is available, the child executes as an ordinary serial function call.
  – The scheduler may steal the parent and run it in parallel with the child function.
  – The parent is not guaranteed to run in parallel with the child.
• cilk_sync (or _Cilk_sync) waits for all children to complete.

Anatomy of a spawn
• In the spawning (parent) function, the statements after a cilk_spawn are the continuation; the spawned (child) function may run in parallel with that continuation until the cilk_sync:

    void f()             // spawning function (parent)
    {
        work;
        cilk_spawn g();  // spawn: g() is the spawned function (child)
        work;            // continuation: may run in parallel with g()
        cilk_sync;       // sync: wait for g() to complete
        work;
    }

    void g()             // spawned function (child)
    {
        work;
    }

Work stealing when another worker is available
• While Worker A executes the spawned child g(), an idle Worker B can steal the parent f() and run its continuation in parallel; if no other worker is available, f() simply resumes after g() returns.

Load balancing
• The work-stealing scheduler automatically load-balances:
  – An idle worker will find work to do.
  – If the program has enough parallelism, then all workers will stay busy.

Quicksort example

    void qsort(int* begin, int* end)
    {
        if (begin != end) {
            int* pivot = end - 1;
            int* middle = std::partition(begin, pivot,
                                         std::bind2nd(std::less<int>(), *pivot));
            using std::swap;
            swap(*pivot, *middle);            // move pivot to middle
            cilk_spawn qsort(begin, middle);  // divide-and-conquer:
            qsort(middle + 1, end);           // asynchronous recursion
            cilk_sync;
        }
    }

Parallelism
• Experiments show that this qsort on 100,000,000 integers gets linear speedup up to about 7 processors.
• Why doesn't the speedup continue to 8 or more processors?
• qsort has only enough parallelism to keep about 7 processors busy.
  – The spawned recursion adds parallelism, but...
  – the serial partition increases the span: the Θ(n) partition at each level runs serially, so the span is Θ(n) while the work is Θ(n log n), leaving only about Θ(log n) parallelism.
• Formally: parallelism = the total work divided by the work along the longest serial path (the span).
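• To make the spawn/sync mechanics concrete before turning to overheads, here is a minimal sketch using the classic Fibonacci example; it is not from the original deck, and assumes a Cilk Plus-capable compiler:

    #include <cilk/cilk.h>

    // Illustrative sketch (not from the deck): the spawned child fib(n - 1)
    // may run in parallel with the continuation fib(n - 2); cilk_sync joins them.
    long fib(long n)
    {
        if (n < 2) return n;
        long a = cilk_spawn fib(n - 1);  // child, may run asynchronously
        long b = fib(n - 2);             // continuation, runs in the parent
        cilk_sync;                       // wait for the spawned child
        return a + b;
    }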
Work-stealing overheads
• Spawning is cheap (3-5 times the cost of a function call).
  – Spawn early and often.
  – Optimal scheduling requires that the parallelism be about an order of magnitude greater than the actual number of cores.
• Stealing is much more expensive (it requires locks and memory barriers).
• Most spawns do not result in steals.
• The more balanced the workload, the less stealing there is, and hence the less overhead.

cilk_for loop
• Looks like a normal for loop:

    cilk_for (int x = 0; x < 1000000; ++x) { ... }

• Any or all iterations may execute in parallel with one another.
• All iterations complete before the program continues.
• Constraints:
  – Limited to a single control variable.
  – The runtime must be able to jump to the start of any iteration at random.
  – Iterations should be independent of one another.

Implementation of cilk_for
• cilk_for (int i = 0; i < 8; ++i) f(i); runs by recursive bisection: the iteration range 0-7 is split into 0-3 and 4-7, one half spawned and the other kept as the continuation, and each half is split again (0-1, 2-3, 4-5, 6-7) until single iterations remain.

cilk_for vs. serial for with spawn
• Compare the following loops:

    for (int x = 0; x < n; ++x) { cilk_spawn f(x); }

    cilk_for (int x = 0; x < n; ++x) { f(x); }

• The two loops have similar semantics, but they have very different performance characteristics.

Serial for with spawn: unbalanced
• The serial loop spawns one iteration at a time, so each steal transfers only the loop's continuation: Worker B steals iterations 1-7, then 2-7, then 3-7, and so on, stealing again and again.
• If the work per iteration is small, the steal overhead can be significant.

cilk_for: divide and conquer
• With recursive bisection, a single steal transfers half of the remaining range (e.g., Worker B steals 4-7 while Worker A keeps 0-3), and each worker then subdivides its half locally.
• Divide and conquer results in few steals and less overhead.

cilk_for examples (see the sketch after this list for a loop that meets every constraint)

    cilk_for (int x = 0; x < 1000000; x += 2) { ... }                          // OK
    cilk_for (vector<int>::iterator x = y.begin(); x != y.end(); ++x) { ... }  // OK
    cilk_for (list<int>::iterator x = y.begin(); x != y.end(); ++x) { ... }    // not allowed

• The list version fails the constraints:
  – The loop count cannot be computed in constant time for a list (y.end() - y.begin() is not defined).
  – There is no random access to the elements of a list (y.begin() + n is not defined).
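• As a hypothetical sketch (not from the deck), here is a loop that satisfies all three constraints: one control variable, a random-access iteration space, and independent iterations:

    #include <cilk/cilk.h>

    // Hypothetical sketch: every iteration writes a distinct y[i], so the
    // iterations are independent and may run in any order, in parallel.
    void saxpy(int n, float a, const float* x, float* y)
    {
        cilk_for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }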
Grain size
• If the work per iteration of a cilk_for is sufficiently small, even the spawn overhead can become noticeable.
• To reduce the overhead, cilk_for chunks the loop into "grains" that execute serially.
• The default grain size will yield good performance in most cases.
  – Default grain size heuristic: N/8p, where N is the number of loop iterations and p is the number of workers.
  – This heuristic produces sufficient parallel slackness for the work-stealing scheduler on loops that are not radically unbalanced.

#pragma cilk grainsize
• The programmer may choose a grain size explicitly:

    #pragma cilk grainsize = expression
    cilk_for (...)

• The pragma is most useful for setting the grain size to 1 for large, unbalanced loops.
• If the grain size is set too small for short loops, spawn overhead reduces performance.
• If the grain size is set too large, parallelism is lost.

Serialization
• Every Cilk program has an equivalent serial program called the serialization.
• The serialization is obtained by removing the cilk_spawn and cilk_sync keywords and replacing cilk_for with for.
  – The compiler will produce the serialization for you if you compile with /Qcilk-serialize (Windows) or -cilk-serialize (Linux).
• Running with only one worker is equivalent to running the serialization.

Serial semantics
• A deterministic Cilk program will have the same semantics as its serialization:
  – Easier regression testing.
  – Easier to debug: run with one core, or run the serialization.
  – Composable.
  – Strong analysis tools (Cilk-specific versions will be posted on WhatIf): a race detector and a parallelism analyzer.

Implicit syncs
• A cilk_sync is implied at each of the points marked below:

    void f()
    {
        cilk_spawn g();
        cilk_for (int x = 0; x < lots; ++x) {
            ...
        }                // at the end of a cilk_for body (does not sync g())
        try {            // before entering a try block containing a sync
            cilk_spawn h();
        } catch (...) {  // at the end of a try block containing a spawn
            ...
        }
    }                    // at the end of a spawning function

Reducer library
• A reducer is a variable that can be safely used by multiple strands running in parallel.
• Cilk's hyperobject library contains many commonly used reducers:
  – reducer_list_append, reducer_list_prepend
  – reducer_max, reducer_max_index
  – reducer_min, reducer_min_index
  – reducer_opadd, reducer_ostream, reducer_basic_string
• The slide's accompanying example begins:

    int main(int argc, char* argv[])
    {
        unsigned int n = 1000000;
        cilk::reducer_opadd<unsigned ...
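• The slide's example breaks off mid-declaration above. A plausible completion, assuming the standard cilk::reducer_opadd interface from <cilk/reducer_opadd.h> and widening the sum to unsigned long long so the roughly 5×10^11 total does not overflow (this reconstruction is mine, not the original slide's):

    #include <iostream>
    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    // Reconstruction under stated assumptions: each strand updates its own
    // private view of `total`, and the runtime merges the views with +.
    int main(int argc, char* argv[])
    {
        unsigned int n = 1000000;
        cilk::reducer_opadd<unsigned long long> total(0);
        cilk_for (unsigned int i = 1; i <= n; ++i) {
            total += i;  // race-free: a reducer, not a shared variable
        }
        std::cout << "Sum = " << total.get_value() << std::endl;
        return 0;
    }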