Boost.SIMD: Generic Programming for Portable Simdization

Pierre Estérie (LRI, Université Paris-Sud XI, Orsay, France) · Mathias Gaunard (Metascale, Orsay, France) · Joel Falcou (LRI, Université Paris-Sud XI, Orsay, France) · Jean-Thierry Lapresté (LASMEA, Université Blaise Pascal, Clermont-Ferrand, France) · Brigitte Rozoy (LRI, Université Paris-Sud XI, Orsay, France)

ABSTRACT

SIMD extensions have been a feature of choice for processor manufacturers for a couple of decades. Designed to exploit data parallelism in applications at the instruction level, these extensions still require a high level of expertise or the use of potentially fragile compiler support or vendor-specific libraries. While a large fraction of their theoretical acceleration can be obtained using such tools, exploiting this hardware becomes tedious as soon as application portability across hardware is required. In this paper, we describe Boost.SIMD, a C++ template library that simplifies the exploitation of SIMD hardware within a standard C++ programming model. Boost.SIMD provides a portable way to vectorize computation on Altivec, SSE or AVX while providing a generic way to extend the set of supported functions and hardware. We introduce a C++ standard compliant interface for users which increases expressiveness by providing a high-level abstraction to handle SIMD operations, an extension-specific optimization pass, and a set of SIMD-aware standard compliant algorithms which allow the reuse of classical C++ abstractions for SIMD computation. We assess Boost.SIMD performance and applicability by providing an implementation of BLAS and image processing algorithms.

Keywords: SIMD, C++, Generic Programming, Template Metaprogramming

1. INTRODUCTION

Since the late 90's, processor manufacturers have provided specialized processing units called multimedia extensions or Single Instruction Multiple Data (SIMD) extensions. The introduction of this feature has allowed processors to exploit the latent data parallelism available in applications by executing a given instruction simultaneously on multiple data stored in a single special register. With a constantly increasing need for performance in applications, today's processor architectures offer rich SIMD instruction sets working with larger and larger SIMD registers (Table 1). For example, the AVX extension introduced in 2011 enhances the x86 instruction set for the Intel Sandy Bridge and AMD Bulldozer micro-architectures by providing a distinct set of 16 256-bit registers. Similarly, the forthcoming Intel MIC architecture will embed 512-bit SIMD registers. Use of SIMD processing units can also be mandatory for performance on embedded systems, as demonstrated by the NEON and NEON2 ARM extensions [7] or the CELL-BE processor by IBM [9], whose SPUs were designed as SIMD-only systems.

Table 1: SIMD extensions in modern processors

    Manufacturer    Extension   Register size & number   Instructions
    Intel           SSE         128 bits - 8             70
    Intel           SSE2        128 bits - 8/16          214
    Intel           SSE3        128 bits - 8/16          227
    Intel           SSSE3       128 bits - 8/16          227
    Intel           SSE4.1      128 bits - 8/16          274
    Intel           SSE4.2      128 bits - 8/16          281
    Intel           AVX         256 bits - 8/16          292
    AMD             SSE4a       128 bits - 8/16          231
    IBM/Motorola    VMX         128 bits - 32            114
    IBM/Motorola    VMX128      128 bits - 128           -
    IBM/Motorola    VSX         128 bits - 64            -
    ARM             NEON        128 bits - 16            100+
However, programming applications that take advantage of the SIMD extension available on the current target remains a complex task. Programmers who use low-level intrinsics have to deal with a verbose programming style, because SIMD instruction sets cover only a few common functionalities, forcing the initial algorithms to be buried in architecture-specific implementation details. Furthermore, these efforts have to be repeated for every extension one may want to support, making the design and maintenance of such applications very time consuming.

Different approaches have been suggested to limit these shortcomings. On the compiler side, autovectorizers implement code analysis and transformation phases to generate vectorized code as part of the code generation process [15, 18] or as an offline process [13]. By this method, compilers are able to detect code fragments that can be vectorized. Autovectorizers find their limits when the code does not present a clear vectorizable pattern (complex data dependencies, non-contiguous memory accesses, aliasing or control flow). This results in fragile SIMD code generation, where performance and SIMD usage opportunities are uncertain. Other compiler-based systems use code directives to guide the vectorization process by enforcing loop vectorization; the ICC and GCC extension #pragma simd is such a system (see the sketch below). To use this mechanism, developers explicitly introduce directives into their code, giving them fine-grained control over where to apply SIMDization. However, the quality of the generated code greatly depends on the compiler used. Another approach is to use libraries like Intel MKL [6] or its AMD equivalent, ACML [1]. These libraries offer a set of domain-specific routines (usually linear algebra and/or signal processing) that are optimized for a given architecture. This solution suffers from a lack of flexibility, as the proposed routines are optimized for specific use cases that may not fulfill arbitrary code constraints.
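As an illustration, directive-based vectorization amounts to annotating a loop. This is a minimal sketch, not an example from the paper; the exact pragma spelling and semantics depend on the compiler:

    // Asking the compiler to vectorize a SAXPY-style loop using the
    // ICC-style #pragma simd directive mentioned above. Newer compilers
    // expose the same idea as #pragma omp simd.
    void saxpy(float a, const float* x, float* y, int n)
    {
        #pragma simd
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }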
In this paper, we present Boost.SIMD, a high-level C++ library for programming SIMD architectures. Designed as an Embedded Domain Specific Language, Boost.SIMD provides both expressiveness and performance by using generic programming to handle vectorization in a portable way. The paper is organized as follows: Section 2 describes the library API, its essential functionalities and its interaction with standard C++ idioms. Section 3 details the implementation of the code generation engine and its extension systems. In Section 4, experimental results obtained with a representative set of algorithms are shown and commented upon. Lastly, Section 5 sums up our contributions.

2. BOOST.SIMD API

The main issue of SIMD programming is the lack of proper abstractions over the usage of SIMD registers. This abstraction should not only provide a portable way to use hardware-specific registers but also enable the use of common programming idioms when designing SIMD-aware algorithms. Boost.SIMD solves these problems by providing two components:

• An abstraction of SIMD registers for portable algorithm design, coupled with a large set of functions covering the classical set of operators along with a sensible number (200+) of mathematical and utility functions,

• Support for C++ standard components like iterators over SIMD ranges, SIMD-aware allocators and SIMD-aware STL algorithms.

2.1 SIMD register abstraction

The first level of abstraction introduced by Boost.SIMD is the pack class. For a given type T and a given static integral value N (N being a power of 2), a pack encapsulates the best type able to store a sequence of N elements of type T. For arbitrary T and N, this type is simply std::array<T,N>, but when T and N match the type and width of a SIMD register, the architecture-specific type used to represent this register is used instead. This semantic provides a way to use arbitrarily large SIMD registers on any system and lets the library select the best vectorizable type to handle them. By default, if N is not provided, pack automatically selects a value that triggers the selection of the native SIMD register type. Moreover, by carrying information about its underlying scalar type, pack enables proper instruction selection even when used on extensions (like SSE2 and above) that map all integral types to a single SIMD type (__m128i for SSE2).

pack handles these low-level SIMD register types as regular objects with value semantics, which includes the ability to be constructed or copied from a single scalar value, a list of scalar values, an iterator or a range. In each case, the proper register loading strategy (splat, set, load or gather) is issued. pack also takes care of issues like boolean predicate support, and offers a range and tuple-like interface.
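To make this concrete, here is a minimal sketch of pack in use. It assumes the API as described in the paper (namespace boost::simd, splat and element-wise constructors, overloaded operators and an array-like element interface); the exact header path varied across Boost.SIMD versions, so the include below is an assumption:

    #include <boost/simd/sdk/simd/pack.hpp>   // assumed header location
    #include <cstddef>
    #include <iostream>

    int main()
    {
        using boost::simd::pack;

        // N = 4 is fixed here for reproducibility; omitting N would let
        // the library pick the native SIMD width instead.
        pack<float, 4> x(2.0f);                   // splat: all lanes = 2.0f
        pack<float, 4> y(1.f, 2.f, 3.f, 4.f);     // set: one value per lane

        // Operators act lane-wise on the whole register.
        pack<float, 4> r = x * y + x;

        // Range/tuple-like interface: packs can be traversed like arrays.
        for (std::size_t i = 0; i != r.size(); ++i)
            std::cout << r[i] << ' ';             // prints: 4 6 8 10
    }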
Moreover, the AST-based evaluation process is on the lexicographical order thus making pack usable able to merge multiple function calls into a single in- as an element of standard containers while SIMD com- lined one, contrary to solutions like MKL where each parisons are done through a set of predicate functions function can only be applied on the whole data range like is_equal, is_greater, etc.