Threads ICS332 — Operating Systems

Henri Casanova, Spring 2018

Concurrent Computing

Concurrent computing: several computations are performed during overlapping time periods, as opposed to sequential execution, in which one computation runs to completion, then the next one runs to completion, and so on. Concurrency is a feature of a program that can do multiple things "at the same time": a program is concurrent if it consists of units that can be executed independently, or partially out of order, without changing the output of the program.

Concurrent Computing: Example

Consider an input array of 10,000 integers with values between 0 and 100, e.g., {23, 56, 7, 68, 68, ...}. We want to output a boolean array in which each element is true if and only if the corresponding integer in the input array is > 50, e.g., {false, true, false, true, true, ...}. Assume that it takes one millisecond to test an integer value and update the output array.

Sequential programming: iterating through the array takes 10,000 milliseconds.
Concurrent programming: if we create 10 "units of execution", each computing 1,000 output values (1/10th of the work), each unit takes 1,000 milliseconds. If these 10 units can execute independently (on 10 CPUs), the whole execution takes 1,000 milliseconds, i.e., it is 10 times faster. In practice we cannot go exactly 10 times faster, due to various overheads and bottlenecks, but we will go much faster than the sequential version, provided we have multiple CPUs (which we all do in this multicore era!).
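A minimal sketch of this example using POSIX threads (not part of the original slides; the 10-way split and the pthread API are choices made here for illustration):

    /* compile with: cc -pthread threshold.c */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define N 10000          /* number of input values          */
    #define NUM_THREADS 10   /* number of "units of execution"  */

    static int  input[N];    /* values in [0, 100]              */
    static bool output[N];   /* output[i] == (input[i] > 50)    */

    /* Each thread processes one contiguous 1/10th of the array. */
    struct range { int start; int end; };

    static void *worker(void *arg) {
        struct range *r = arg;
        for (int i = r->start; i < r->end; i++)
            output[i] = (input[i] > 50);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++)      /* some arbitrary input data */
            input[i] = (i * 37) % 101;

        pthread_t threads[NUM_THREADS];
        struct range ranges[NUM_THREADS];
        int chunk = N / NUM_THREADS;

        for (int t = 0; t < NUM_THREADS; t++) {
            ranges[t].start = t * chunk;
            ranges[t].end   = (t == NUM_THREADS - 1) ? N : (t + 1) * chunk;
            pthread_create(&threads[t], NULL, worker, &ranges[t]);
        }
        for (int t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);

        printf("output[3] = %d\n", output[3]);  /* input[3] = 111 mod 101 = 10, so 0 */
        return 0;
    }

On a machine with at least 10 idle cores, the 10 workers run concurrently and the wall-clock time approaches one tenth of the sequential time, minus thread-creation and scheduling overhead.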
What is Concurrency Used For?

To make programs faster: running multiple activities at once can use the machine more effectively because there are multiple hardware components. For example, while one activity computes on the CPU, another activity sends data to the network card; or while one activity computes on one CPU, another computes on another CPU (this is the example in the previous slide).

To make programs more responsive: structuring a program as concurrent activities can make it more responsive because while one activity blocks waiting for some event, another can do something. For example, a server can spawn a new activity to answer each client request, and in a GUI, while one activity is updating the display, another can be listening for mouse clicks.

Concurrent activities could be implemented as separate processes. But because the OS virtualizes memory, processes do not share memory naturally. You can make them share memory with special system calls, but it is cumbersome, and this can make it difficult to program processes that have complicated cooperative behaviors. And so come threads...
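To illustrate that last point, here is a small sketch (again not from the slides): two threads update the same global counter, which works because they share one address space. With fork()'d processes, each child would update its own private copy unless shared memory were set up explicitly (e.g., via mmap or shmget).

    #include <pthread.h>
    #include <stdio.h>

    static int counter = 0;   /* lives in the single address space shared by all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* shared data still needs synchronization */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, increment, NULL);
        pthread_create(&t2, NULL, increment, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %d\n", counter);   /* both threads updated the same counter: 200000 */
        return 0;
    }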
Recommended publications
  • Events, Co-routines, Continuations and Threads: OS (and Application) Execution Models (Kevin Elphinstone)
    Lecture slides on system structuring. General-purpose systems must deal with many, potentially overlapping and interdependent activities, some of which wait on external phenomena (e.g., a disk read) or react to external triggers (e.g., interrupts), so a systematic approach to system structuring is needed. The slides compare four construction approaches: events, co-routines, threads, and continuations. In the event model, external entities post events (keyboard presses, mouse clicks, system calls), an event loop waits for events and calls the appropriate handler, and each handler runs to completion and returns to the loop; only a single stack is required, handlers never block, yield, or get preempted, and there are no concurrency issues within a handler. The slides then look at an event-based kernel on a CPU with memory protection, which keeps user-level state in PCBs and starts on a fresh kernel stack at each trap, before introducing co-routines.
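    As a toy illustration of the event model summarized above (this sketch is not from the slides, and the event types and handlers are invented for the example):

        #include <stdio.h>

        /* Hypothetical event types for the example. */
        enum event_type { EV_KEY, EV_CLICK, EV_QUIT, EV_COUNT };

        typedef void (*handler_fn)(int data);

        static void on_key(int data)   { printf("key %d pressed\n", data); }
        static void on_click(int data) { printf("click at x=%d\n", data); }
        static void on_quit(int data)  { (void)data; printf("quitting\n"); }

        /* One handler per event type; each runs to completion on the single stack. */
        static handler_fn handlers[EV_COUNT] = { on_key, on_click, on_quit };

        int main(void) {
            /* A canned queue of posted events stands in for keyboard/mouse/syscalls. */
            struct posted { enum event_type type; int data; } queue[] = {
                { EV_KEY, 65 }, { EV_CLICK, 120 }, { EV_QUIT, 0 },
            };

            /* The event loop: take the next event, call its handler, repeat. */
            for (size_t i = 0; i < sizeof queue / sizeof queue[0]; i++) {
                handlers[queue[i].type](queue[i].data);
                if (queue[i].type == EV_QUIT)
                    break;
            }
            return 0;
        }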
  • Designing an Ultra Low-Overhead Multithreading Runtime for Nim
    Talk slides by Mamy Ratsimbazafy on Weave (https://github.com/mratsim/weave), a multithreading runtime for Nim that grew out of refactoring the internals of a tensor library. The talk maps out the design space (concurrency vs. parallelism, latency vs. throughput, cooperative vs. preemptive scheduling, IO-bound vs. CPU-bound work), reviews kernel threading models (1:1, one application thread per hardware thread; N:1, N application threads on one hardware thread; M:N, M application threads scheduled onto N hardware threads, distinctions that also apply at the language or runtime level), states the core problem as scheduling M tasks on N hardware threads, and then covers parallel APIs, sources of runtime overhead, and a plan for a minimum viable runtime.
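    A rough illustration of that "M tasks on N hardware threads" problem (this sketch is mine, not Weave's, and uses plain POSIX threads rather than Nim): a fixed pool of N workers pulls M tasks off a shared counter until none remain.

        #include <pthread.h>
        #include <stdio.h>

        #define M_TASKS   40   /* tasks to run              */
        #define N_WORKERS  4   /* worker (hardware) threads */

        static int next_task = 0;                 /* index of the next unclaimed task */
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static void run_task(int id) {
            printf("task %d\n", id);              /* stand-in for real work */
        }

        /* Each worker repeatedly claims the next task until all M are done. */
        static void *worker(void *arg) {
            (void)arg;
            for (;;) {
                pthread_mutex_lock(&lock);
                int id = (next_task < M_TASKS) ? next_task++ : -1;
                pthread_mutex_unlock(&lock);
                if (id < 0)
                    return NULL;
                run_task(id);
            }
        }

        int main(void) {
            pthread_t workers[N_WORKERS];
            for (int i = 0; i < N_WORKERS; i++)
                pthread_create(&workers[i], NULL, worker, NULL);
            for (int i = 0; i < N_WORKERS; i++)
                pthread_join(workers[i], NULL);
            return 0;
        }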
  • Threads Chapter 4
    Textbook-style slides (reading: 4.1, 4.4, 4.5). A process combines two roles that some operating systems treat independently: the unit of resource ownership (a virtual address space holding the process image, plus control of resources such as files and I/O devices), usually called the process or task, and the unit of dispatching (an execution path through one or more programs, with an execution state and a dispatching priority, possibly interleaved with other processes), usually called the thread or lightweight process. Multithreading means the OS supports multiple threads of execution within a single process, while a single-threaded OS does not recognize the concept of a thread: MS-DOS supports a single user process with a single thread, classic UNIX supports multiple user processes but only one thread per process, and Solaris and Windows NT support multiple threads per process. Context switches between processes are expensive because each process has its own protected address space, whereas the threads of a process execute in a single address space and share global variables; the excerpt closes with a small Java fragment illustrating that sharing.
  • Lecture 10: Introduction to OpenMP (Part 2)
    Lecture slides on OpenMP performance issues. C/C++ stores matrices in row-major order, so loop interchange can improve cache locality, and the outermost loop is the one to parallelize. Synchronization points should be moved outwards: parallelizing only the inner loop creates a new parallel region in every iteration of the outer loop, which adds parallelization overhead. The if clause (e.g., #pragma omp parallel for if(M > 800)) avoids parallel overhead at low iteration counts. The slides also show that loops over C++ random-access iterators can be parallelized, that conditional compilation on _OPENMP lets one source file serve as both the sequential and the parallel program, and that data dependences need care: whenever two statements access the same memory location and at least one of them writes it, there is a data dependence on that location between the two statements.
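    A short C sketch of the first two recommendations (written for this summary, not taken from the slides): parallelize the outer loop of a row-major matrix sum and guard against low iteration counts with an if clause.

        /* compile with: cc -fopenmp matsum.c (the pragma is ignored otherwise) */
        #include <stdio.h>

        #define N 400
        #define M 400

        static double A[N][M], B[N][M], C[N][M];

        int main(void) {
            /* Outer loop parallelized; the row-major inner loop preserves cache locality.
               The if clause skips the parallel overhead when there is too little work. */
            #pragma omp parallel for if(N > 100)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < M; j++)
                    A[i][j] = B[i][j] + C[i][j];

            printf("A[0][0] = %f\n", A[0][0]);
            return 0;
        }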
  • The Problem with Threads
    Edward A. Lee, The Problem with Threads, Technical Report UCB/EECS-2006-1, EECS Department, University of California at Berkeley, January 10, 2006 (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.html). From the abstract: threads are a seemingly straightforward adaptation of the dominant sequential model of computation to concurrent systems; languages require little or no syntactic change to support them, operating systems and architectures have evolved to support them efficiently, and many technologists are pushing for increased use of multithreading in software in order to take advantage of the predicted increases in parallelism in computer architectures.
  • An Ideal Match?
    Talk by James Cooper (24 November 2020) investigating how well-suited Concurrent ML is to implementing Belief Propagation for stereo matching. Stereo matching finds correspondences (disparities) between two images of the same scene captured simultaneously, from which depth can be estimated; it is an ill-posed problem that cannot be solved perfectly in the general case, only approximated. The talk introduces generic stereo matching and Belief Propagation, gives an overview of Concurrent ML with an investigation of alternatives and comparative benchmarks, and then combines the two, using the widely used 'Tsukuba' benchmark image pair from the University of Tsukuba, together with its ground-truth disparity map, as the running example.
  • Tcl and Java Performance
    H. John Reekie, Christopher Hylands, and Edward A. Lee (University of California at Berkeley), Tcl and Java Performance (http://ptolemy.eecs.berkeley.edu/~cxh/java/tclblend/scriptperf/scriptperf.html). Combining scripting languages such as Tcl with lower-level languages such as Java offers new opportunities for flexible and rapid software development; the paper benchmarks various combinations of Tcl and Java against the two languages alone, with some comparisons to JavaScript, and finds that performance can vary by well over two orders of magnitude. The authors also uncovered threading issues that affect performance on the Solaris platform. The work is set against Ousterhout's controversial white paper arguing that scripting, i.e., using a high-level, untyped, interpreted language to glue together components written in a lower-level language, provides greater reuse benefits than other reuse technologies.
  • Modifiable Array Data Structures for Mesh Topology
    Dan Ibanez and Mark S. Shephard (February 27, 2016). Topological data structures underlie the mesh data structures used in finite element applications. Starting from a generic modifiable graph data structure, the paper applies successive optimizations to obtain a family of compact, array-based mesh data structures that can be modified in constant time, with implementations for finite elements and graphics studied in detail and compared to the state of the art. The long-term goals are a mesh data structure that can handle evolving meshes, scales on distributed-memory computers, represents the conforming meshes used by finite element and finite volume methods, minimizes memory use, maximizes locality of storage, and (in future work) exploits parallelism inside hybrid supercomputer nodes.
  • CUDA Binary Utilities
    NVIDIA application note DA-06762-001_v11.4 (September 2021) on the CUDA binary utilities. It covers what a CUDA binary is, the differences between cuobjdump and nvdisasm, the usage and command-line options of each tool, and an instruction set reference (starting with the Kepler instruction set).
  • Lithe: Enabling Efficient Composition of Parallel Libraries
    HotPar 2009 slides (Berkeley, CA, March 31, 2009) by Heidi Pan, Benjamin Hindman, and Krste Asanović (MIT and UC Berkeley). Parallel applications need both programmer productivity and performance, and composability is key to productivity: the same library implementation should be reusable across applications and interchangeable with other implementations, both functionally and without destroying performance. The motivating example is sparse QR factorization (SPQR, Tim Davis, University of Florida), whose software stack composes MKL, TBB, and OpenMP; out of the box, the composed libraries oversubscribe the cores of a 16-core machine and underperform. The talk argues that efficient parallel composability is hard and proposes harts and Lithe as the solution, followed by an evaluation.
  • Thread Scheduling in Multi-core Operating Systems (Redha Gouicem)
    Redha Gouicem, Thread Scheduling in Multi-core Operating Systems: How to Understand, Improve and Fix your Scheduler. Ph.D. thesis in Computer Science, Sorbonne Université, defended 23 October 2020 in Paris (HAL Id: tel-02977242, https://hal.archives-ouvertes.fr/tel-02977242), advised by Gilles Muller (Inria) and Julien Sopena (Sorbonne Université). The thesis addresses schedulers for multi-core architectures from several perspectives: design (simplicity and correctness), performance improvement, and the development of application-specific schedulers.
  • Performance and Programmability Trade-Offs in the OpenCL 2.0 SVM and Memory Model
    Talk by Brian T. Lewis (Intel Labs) on his experience working on the OpenCL 2.0 shared virtual memory (SVM) and memory models. The central observation is a tension between performance and programmability, where programmability means productivity, ease of use, simplicity, and error avoidance, while for most programmers and architects today performance is paramount. As background, the talk reviews why GPUs are programmed the way they are: discrete versus integrated GPUs as different points in the performance-energy design space (a discrete NVIDIA Tesla K40 delivers 4.3 TFLOPs at 235 W, while more than 90% of processors shipping at the time include a low-power GPU on die, as in Intel's Haswell and AMD's Kaveri), how GPUs differ from CPUs, GPU performance considerations, and GPGPU programming. It then presents OpenCL 2.0 and a few of its features, along with the compromises and trade-offs behind them.