User-Defined Execution Relaxations for Enhanced Programmability in High-Performance Parallel Computing

Total Pages: 16

File Type: PDF, Size: 1020 KB

UNIVERSIDAD COMPLUTENSE DE MADRID, FACULTAD DE INFORMÁTICA

Doctoral Thesis (Tesis Doctoral)

User-defined Execution Relaxations for Enhanced Programmability in High-Performance Parallel Computing
(Relajaciones de ejecución definidas por el usuario para la mejora de la programabilidad en computación paralela de altas prestaciones)

Dissertation presented to qualify for the degree of Doctor in Computer Science by Andrés Antón Rey Villaverde, directed by Francisco Daniel Igual Peña and Manuel Prieto Matías.

Facultad de Informática, Universidad Complutense de Madrid. Madrid, 2019.

© Andrés Antón Rey Villaverde, 2019

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

I hereby declare that all the content presented in this thesis, entitled "User-defined Execution Relaxations for Enhanced Programmability in High-Performance Parallel Computing", has been developed by me, and all other content has been appropriately referenced.

Andrés Antón Rey Villaverde

This work has been supported by the Spanish Ministry of Innovation, Science and Universities under grants TIN2015-65277-R, RTI2018-093684-B-I00 and BES-2016-076806, and by the Government of Madrid under grant S2018/TCS-4423. The associated research internships have been supported by the Erasmus+ International Programme and the HiPEAC Network.

To my family. (A mi familia.)
Acknowledgements

I would like to thank my advisors, Fran and Manuel, for helping me and counting on me during these years, for showing me their trust and support, and for giving me the right advice at the right moments. I am also very grateful for their support during my research internships abroad, in which I learned so much, and for their trust in letting me explore some rather tangential topics with respect to the core discipline of the thesis, which nevertheless proved crucial during the exploratory scientific process. Thanks to Fran, especially for the valuable feedback received throughout this thesis and for helping me so much to achieve its objectives, encouraging me to get things done at the most important moments. Thanks to Manuel, also for his accurate advice, and especially for opening the ArTeCS doors for me, both the first and the second time, for showing me his trust throughout all of these years, not only during the Ph.D., and for counting on me when Ph.D. sponsorship opportunities appeared.

Thanks to my family for their constant support, not only during the development of this thesis, but also in those personal decisions in which I prioritized learning, personal introspection and scientific training over options presumably more ordinary, expected and less precarious (and probably more boring). I want to thank my mother and father, from whom I learned personality traits that are very important in life and were crucial to finishing this Ph.D., such as the culture of effort, the importance of education, persistence and honesty; and also for having prioritized their sons and daughter above everything else, always respecting our independence. I thank my brother for his influence on my education and for passing his ambition on to me, so important for visualizing the big picture and keeping my motivation alive. I also thank my sister for her support from the very beginning of these Ph.D. studies and for always passing her positivity on to me. Thanks to Amparo, especially for supporting me and putting up with me during these last stressful months, and for sharing such amazing vacations with me.

Thanks to my Cisneros friends Javier, Juan Miguel, Antonio, Alejandro, Xabier, Pablo and Eliseo for staying loyal to our (increasingly rare) meetings and (increasingly frequent) weddings. Thanks also to my childhood friends Victoria, Manuel, Valentín, Jaime, Iago and Domingo for remaining closer rather than more distant after so many years. Thanks to Agathe for having unconditionally stayed with me at the beginning of this thesis, in both the good and the worse moments, and to Dominic, especially for those conversations (started at the end of the world five years ago and still ongoing), for passing his idealism and motivation on to me, and for his interest in the developments of this thesis. Thanks to my industry friends Ángel Rosso, Elena Saiz, Elena Garzón, Mar Robledo, Alberto Palomar and Laura Vallejo, and to my Impanati / Jamadan friends Marco, Davide and Javier, for those meetings, beers and rehearsals that helped me so much to get away from the thesis when I needed it the most. Thanks also to the Impanati guys for bearing with patience the rehearsal interruptions during my internships abroad.
Thanks to the people in the Computer Architecture division, starting with Iñaki and Manuel, who opened the ArTeCS doors for me, and especially to all the people I had the pleasure to meet throughout these years, such as Daniel Tabas, Jorge Quintás, Roberto Cano, Luis Costero, Nacho Gómez, Javier Setoain, Edgardo Mejía, Juan Carlos Sáez, Luis Piñuel, Christian Tenllado, Fernando Castro, Guillermo Botella, Katzalin Olcoz, Rafael Sánchez, Joaquín Recas and María Guijarro.

Thanks to Jan Prins for accepting to be my advisor in Chapel Hill, for inviting me to his home, and for our discussions and his accurate insights, which identified the limitations of my ideas and helped me address them. Thanks to the people I met in North Carolina, especially Joshua, Christian and Camila, for the interesting conversations and for making my internship so much fun. Thanks to the Codeplay people, Ruyman, Marya, Peter, Marios, Alex, Gordon, Uwe and Christopher, for giving me the opportunity to work with and learn from them, and for such an excellent research experience in Edinburgh, which helped me so much to focus my further developments.

Recalling my beginnings in the worlds of computation and simulation, I want to thank the professors who helped me gain experience in numerical methods applied to computational physics. I want to thank Carlos Spa for giving me the opportunity to work with and learn from him in Chile, and for encouraging me to start the Ph.D. I also want to thank Víctor Martín, who initiated me into the fascinating world of statistical mechanics (to which I will return), and Leo González, who initiated me into computational fluid dynamics, valuing my motivation over my previous experience. I also want to thank the best university professors I had, who reinforced my passion for learning, initiated me into the ways of science, and greatly inspired me during my studies of engineering and physics. Although several years have passed, I still have vivid memories of their lectures and of the sensations they awakened in me back then, which years later somehow guided me toward starting the Ph.D. They are Enrique Maciá, Estrella Alonso, Ángela Jiménez, Juan Pedro Villaluenga, Luis Garay, Felipe Llanes, José Ramón Peláez and, again, Víctor Martín.

I want to thank the reviewers, Aleksandar and Ricardo, for their valuable suggestions, which, together with those of my advisors, have greatly contributed to enhancing the quality of this dissertation. Moreover, I want to thank the anonymous reviewers of the published articles, as their feedback has also guided the research conducted in this thesis. I also want to thank, in a general sense, the developer communities that are constantly pushing computer technology to new heights in a passionate and idealistic way, whether by developing open-source tools, contributing to programming-language standardization, or producing documentation and disseminating it in open and free media. Specifically, the developments in this thesis build to a great extent on the work of the C++ Standards Committee, and some of the ideas proposed in this thesis would not have been possible without the new functionality incorporated into modern C++ standards.

Agradecimientos (Spanish acknowledgements): I would like to thank my advisors Fran and Manuel for helping me so much and counting on me during these years, giving me their trust, their support and the right advice at the right moments.
Recommended publications
  • Integration of CUDA Processing Within the C++ Library for Parallelism and Concurrency (HPX)
Integration of CUDA Processing within the C++ Library for Parallelism and Concurrency (HPX). Patrick Diehl, Madhavan Seshadri, Thomas Heller, Hartmut Kaiser.

Abstract—Experience shows that on today's high-performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores is expected to increase and memory hierarchies are expected to become deeper. One important aspect for distributed applications is to guarantee high utilization of all available resources, including local or remote acceleration cards on a cluster, while fully using all available CPU resources, and to integrate the GPU work into the overall programming model. For the integration of CUDA code we extended HPX, a general-purpose C++ runtime system for parallel and distributed applications of any scale, and enabled asynchronous data transfers from and to the GPU device and the asynchronous invocation of CUDA kernels on this data. Both operations are well integrated into the general programming model of HPX, which allows any GPU operation to be seamlessly overlapped with work on the main cores. Any user-defined CUDA kernel can be launched on any (local or remote) GPU device available to the distributed application. We present asynchronous implementations of the data transfers and kernel launches for CUDA code as part of an HPX asynchronous execution graph. Using this approach we can combine all remotely and locally available acceleration cards on a cluster to utilize its full performance capabilities.
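The pattern described above — asynchronous device transfers and kernel launches exposed as futures that compose with CPU work — can be sketched with standard C++ facilities alone. The sketch below is illustrative only: it stands in for HPX's future-based integration and uses plain std::async with placeholder lambdas where real code would issue asynchronous CUDA copies and kernel launches; all names in it are hypothetical.

```cpp
// Minimal sketch of overlapping "device" work with CPU work via futures.
// Standard C++17 only; the device steps are placeholders for what an
// HPX + CUDA integration would express as asynchronous transfers/kernels.
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> host(1 << 20, 1.0);

    // Placeholder for an asynchronous host-to-device copy plus a kernel launch.
    std::future<std::vector<double>> device_result =
        std::async(std::launch::async, [host]() mutable {
            for (double& x : host) x *= 2.0;   // stands in for a GPU kernel
            return host;                       // stands in for a device-to-host copy
        });

    // CPU work proceeds concurrently with the "device" work above.
    double cpu_sum = std::accumulate(host.begin(), host.end(), 0.0);

    // Synchronize only when the result is actually needed.
    std::vector<double> result = device_result.get();
    std::cout << "cpu sum: " << cpu_sum
              << ", device[0]: " << result[0] << '\n';
}
```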
  • The Importance of Data
The Landscape of Parallel Programming Models, Part 2: The Importance of Data. Michael Wong and Rod Burns, Codeplay Software Ltd. IXPUG 2020. © 2020 Codeplay Software Ltd.

Michael Wong, Distinguished Engineer at Codeplay: Chair of the SYCL Heterogeneous Programming Language working group; member of the C++ Directions Group; ISOCPP.org Director and VP (http://isocpp.org/wiki/faq/wg21#michael-wong); Head of Delegation for the C++ Standard for Canada; Chair of Programming Languages for the Standards Council of Canada; Chair of WG21 SG19 (Machine Learning) and WG21 SG14 (Games Development / Low Latency / Financial Trading / Embedded); Editor of the C++ SG5 Transactional Memory Technical Specification and the C++ SG1 Concurrency Technical Specification; contributor to MISRA C++ and AUTOSAR; Chair of Standards Council Canada TC22/SC32 Electrical and Electronic Components (SOTIF); Chair of UL4600 Object Tracking; author of a C++11 book in Chinese (https://www.amazon.cn/dp/B00ETOV2OQ); http://wongmichael.com/about.

Codeplay builds GPU compilers for semiconductor companies, has ported TensorFlow to open standards using SYCL, builds LLVM-based compilers for accelerators, implements OpenCL and SYCL for accelerator processors, releases open-source, open-standards-based AI acceleration tools (SYCL-BLAS, SYCL-ML, VisionCpp), and is now working to make AI/ML heterogeneous acceleration safe for autonomous vehicles.

Acknowledgement and Disclaimer: Numerous people, internal and external to the original C++/Khronos group, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback to form part of this talk. But I claim all credit for errors and stupid mistakes. These are mine, all mine! You can't have them.
  • Introduction to GPU Computing
Introduction to GPU Computing. J. Austin Harris, Scientific Computing Group, Oak Ridge National Laboratory. ORNL is managed by UT-Battelle for the US Department of Energy.

Performance development in the Top500: the yardstick for measuring performance in HPC is solving Ax = b and measuring floating-point operations per second (Flop/s). The U.S. is targeting an exaflop (1 Exaflop/s) system as early as 2022, building on the recent trend of using GPUs (https://www.top500.org/statistics/perfdevel).

Hardware trends: the number of cores per chip is being scaled instead of the clock speed. Power is the root cause, since power density limits clock speed. The goal has shifted to performance through parallelism, and performance is now a software concern. (Figure from Kathy Yelick, "Ten Ways to Waste a Parallel Computer"; data from Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović.)

GPUs for computation: GPUs are excellent at graphics rendering, which requires fast computation (e.g., at TV refresh rates), a high degree of parallelism (millions of independent pixels), and high memory bandwidth. Latency is often sacrificed, but this can be ameliorated. This computation pattern is common in many scientific applications.

CPU strengths: large memory, fast clock speeds, large caches for latency optimization, and a small number of threads that run very quickly. CPU weaknesses: low memory bandwidth, costly cache misses, and low performance per watt. GPU strengths: high memory bandwidth, latency tolerance via parallelism, more compute resources (cores), and high performance per watt. GPU weaknesses: low memory capacity and low per-thread performance. (Slide from Jeff Larkin, "Fundamentals of GPU Computing".)
  • Bench - Benchmarking the State-of-the-Art Task Execution Frameworks of Many-Task Computing
MATRIX: Bench - Benchmarking the state-of-the-art Task Execution Frameworks of Many-Task Computing. Thomas Dubucq, Tony Forlini, Virgile Landeiro Dos Reis, and Isabelle Santos. Illinois Institute of Technology, Chicago, IL, USA. {tdubucq, tforlini, vlandeir, isantos1}@hawk.iit.edu

Abstract — Technology trends indicate that exascale systems will have billion-way parallelism, and each node will have about three orders of magnitude more intra-node parallelism than today's peta-scale systems. The majority of current runtime systems focus a great deal of effort on optimizing inter-node parallelism by maximizing the bandwidth and minimizing the latency of interconnection networks and storage, but they suffer from the lack of scalable solutions to expose intra-node parallelism. Many-task computing (MTC) is a distributed fine-grained paradigm that aims to address the challenges of managing the parallelism and locality of exascale systems. MTC applications are typically structured as directed acyclic graphs of loosely coupled short tasks with explicit input/output data dependencies.

From the paper's introduction: "… Stanford University. Finally, HPX is a general-purpose C++ runtime system for parallel and distributed applications of any scale, developed by Louisiana State University, and STAPL is a framework for developing parallel programs from Texas A&M. MATRIX is a many-task computing job scheduling system [3]. There are many resource-management systems aimed at data-intensive applications. Furthermore, distributed task scheduling in many-task computing is a problem that has been considered by many research teams. In particular, Charm++ [4], Legion [5], Swift [6], [10], Spark [1][2], HPX [12], STAPL [13] and MATRIX [11] offer solutions to this problem and have …"
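As a concrete illustration of the MTC structure described above — tasks as nodes of a DAG with explicit input/output dependencies — the following is a minimal, hypothetical sketch in standard C++ that wires a small diamond-shaped task graph together with futures; it is not code from MATRIX or from any of the frameworks mentioned.

```cpp
// A tiny diamond-shaped task DAG expressed with standard C++ futures:
//        produce
//        /      \
//   square      negate
//        \      /
//        combine
#include <future>
#include <iostream>

int main() {
    // Source task.
    std::future<int> produced = std::async(std::launch::async, [] { return 7; });
    int value = produced.get();  // explicit data dependency on the source task

    // Two independent tasks consuming the same input; they may run in parallel.
    std::future<int> squared = std::async(std::launch::async, [value] { return value * value; });
    std::future<int> negated = std::async(std::launch::async, [value] { return -value; });

    // Sink task: depends on both intermediate results.
    int combined = squared.get() + negated.get();
    std::cout << "combined = " << combined << '\n';  // 49 + (-7) = 42
}
```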
  • HPX – a Task Based Programming Model in a Global Address Space
HPX – A Task Based Programming Model in a Global Address Space. Hartmut Kaiser (1), Thomas Heller (2), Bryce Adelstein-Lelbach (1), Adrian Serio (1), Dietmar Fey (2). (1) Center for Computation and Technology, Louisiana State University, Louisiana, U.S.A. (2) Computer Science 3, Computer Architectures, Friedrich-Alexander-University, Erlangen, Germany.

ABSTRACT. The significant increase in complexity of Exascale platforms due to energy-constrained, billion-way parallelism, with major changes to processor and memory architecture, requires new energy-efficient and resilient programming techniques that are portable across multiple future generations of machines. We believe that guaranteeing adequate scalability, programmability, performance portability, resilience, and energy efficiency requires a fundamentally new approach, combined with a transition path for existing scientific applications, to fully explore the rewards of today's and tomorrow's systems. We present HPX – a parallel runtime system which extends the C++11/14 standard to facilitate distributed operations, enable fine-grained constraint-based parallelism, and support run…

1. INTRODUCTION. Today's programming models, languages, and related technologies that have sustained High Performance Computing (HPC) application software development for the past decade are facing major problems when it comes to programmability and performance portability of future systems. The significant increase in complexity of new platforms due to energy constraints, increasing parallelism and major changes to processor and memory architecture requires advanced programming techniques that are portable across multiple future generations of machines [1]. A fundamentally new approach is required to address these challenges.
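Because HPX deliberately mirrors the standard C++ asynchrony interfaces, the "fine-grained, constraint-based parallelism" described above can be sketched as futures composed into a small dependency graph. The snippet below is a rough sketch, not taken from the paper: it assumes an HPX installation, the runtime setup boilerplate is omitted, and header paths vary between HPX versions.

```cpp
// Sketch only: assumes it runs inside an initialized HPX runtime
// (hpx_main / hpx::init boilerplate omitted; header paths vary by version).
#include <hpx/future.hpp>
#include <iostream>
#include <utility>

int triple(int x) { return 3 * x; }

void constraint_based_example() {
    // Work is expressed as tasks; dependencies are constraints between futures.
    hpx::future<int> a = hpx::async(triple, 7);
    hpx::future<int> b = hpx::async([] { return 21; });

    // hpx::dataflow runs the continuation only when both inputs are ready,
    // without blocking the calling thread.
    hpx::future<int> sum = hpx::dataflow(
        [](hpx::future<int> x, hpx::future<int> y) { return x.get() + y.get(); },
        std::move(a), std::move(b));

    std::cout << "sum = " << sum.get() << '\n';
}
```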
  • Intel® oneAPI Programming Guide
Intel® oneAPI Programming Guide. Intel Corporation, www.intel.com.

Contents: Notices and Disclaimers. Chapter 1: Introduction — oneAPI Programming Model Overview; Data Parallel C++ (DPC++); oneAPI Toolkit Distribution; About This Guide; Related Documentation. Chapter 2: oneAPI Programming Model — Sample Program; Platform Model; Execution Model; Memory Model (Memory Objects; Accessors; Synchronization; Unified Shared Memory); Kernel Programming …
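To make the programming-model pieces listed in that table of contents concrete — a queue tied to a device, unified shared memory, and a kernel submitted via parallel_for — here is a small, generic DPC++/SYCL-style sketch; it is not taken from the guide, and the header name and minor API details differ between SYCL implementations and versions.

```cpp
// Minimal DPC++/SYCL-style sketch: queue + USM + parallel_for kernel.
// Header naming differs across implementations (<sycl/sycl.hpp> vs <CL/sycl.hpp>).
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q;                       // selects a default device (execution model)
    constexpr size_t n = 1024;

    // Unified shared memory: visible to both host and device (memory model).
    float* data = sycl::malloc_shared<float>(n, q);
    for (size_t i = 0; i < n; ++i) data[i] = static_cast<float>(i);

    // Kernel programming: one work-item per element.
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        data[i] *= 2.0f;
    }).wait();                           // synchronization

    std::cout << "data[10] = " << data[10] << '\n';  // expect 20
    sycl::free(data, q);
}
```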
  • Pattern Matching
Functional Programming. Steven Lau, March 2015.

Before functional programming... https://www.youtube.com/watch?v=92WHN-pAFCs

Models of computation: the Turing machine, invented by Alan Turing in 1936; the lambda calculus, invented by Alonzo Church in the 1930s; and more.

Turing machine: a machine that operates on an infinite tape (memory) and executes a stored program. It may read a symbol, write a symbol, move to the left cell, move to the right cell, change the machine's state, or halt.

Have some fun:
http://www.google.com/logos/2012/turing-doodle-static.html
http://www.ioi2012.org/wp-content/uploads/2011/12/Odometer.pdf
http://wcipeg.com/problems/desc/ioi1211

Turing machine example — a binary incrementer, specified by a transition table (state, symbol, action, next state) and stepped through on several example tapes. Its transitions include: in state 0, reading blank or 0, write 1 and go to state 1; in state 0, reading 1, write 0 and go to state 2; in state 1, reading blank, move left and go to state 0; in state 1, reading 0 or 1, move right and stay in state 1; in state 2, reading 0, move left and go to state 0.

λ-calculus — beware: think mathematically, not in C++/Pascal; (parentheses) are for grouping; and variables cannot be mutated (x = 1 is fine, but a later x = 2 or x = x + 1 is not).

λ-calculus, simplification 1 of 2: only anonymous functions are used. For example, f(x) = x²+1 with f(1) = 1²+1 = 2 is written as (λx.x²+1)(1) = 1²+1 = 2; note that f = λx.x²+1.

λ-calculus, simplification 2 of 2: only unary functions are used; a binary function can be written as a unary function that returns another unary function. For example, (λ(x,y).x+y)(1,2) = 1+2 = 3 is written as [(λx.(λy.x+y))(1)](2) = [(λy.1+y)](2) = 1+2 = 3. This technique is known as currying (after Haskell Curry).

λ-calculus: a lambda term has three forms — x, λx.A, and A B — where x is a variable and A and B are lambda terms.
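Currying, as described in the slides above, maps directly onto modern C++ lambdas; the following small, standalone example (mine, not from the slides) shows a two-argument addition rewritten as a unary function returning another unary function.

```cpp
// Currying with C++ lambdas: add(x, y) becomes curried_add(x)(y).
#include <iostream>

int main() {
    // Uncurried: one function of two arguments.
    auto add = [](int x, int y) { return x + y; };

    // Curried: a unary function that returns another unary function,
    // mirroring (λx.(λy.x+y)) from the lambda-calculus notation above.
    auto curried_add = [](int x) {
        return [x](int y) { return x + y; };
    };

    std::cout << add(1, 2) << '\n';            // 3
    std::cout << curried_add(1)(2) << '\n';    // 3, i.e. [(λx.(λy.x+y))(1)](2)
}
```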
  • OpenCL SYCL 2.2 Specification
SYCL™ Provisional Specification: SYCL integrates OpenCL devices with modern C++ using a single-source design. Version 2.2, revision date 2016/02/15. Khronos OpenCL Working Group — SYCL subgroup. Editor: Maria Rovatsou. Copyright 2011-2016 The Khronos Group Inc. All Rights Reserved.

Contents: 1 Introduction. 2 SYCL Architecture: 2.1 Overview; 2.2 The SYCL Platform and Device Model (2.2.1 Platform Mixed Version Support); 2.3 SYCL Execution Model (2.3.1 Queues, Command Groups and Contexts; 2.3.2 Command Queues; 2.3.3 Mapping work-items onto an nd_range; 2.3.4 Execution of kernel-instances; 2.3.5 Hierarchical Parallelism; 2.3.6 Device-side enqueue; 2.3.7 Synchronization); 2.4 Memory Model (2.4.1 Access to memory; 2.4.2 Memory consistency; 2.4.3 Atomic operations); 2.5 The SYCL programming model (2.5.1 Basic data parallel kernels; 2.5.2 Work-group data parallel kernels; 2.5.3 Hierarchical data parallel kernels; 2.5.4 Kernels that are not launched over parallel instances; 2.5.5 Synchronization; 2.5.6 Error handling; 2.5.7 Scheduling of kernels and data movement; 2.5.8 Managing object lifetimes; 2.5.9 Device discovery and selection; 2.5.10 Interfacing with OpenCL); 2.6 Anatomy of a SYCL application …
  • Lambda Calculus and Functional Programming
Lambda Calculus and Functional Programming. Anahid Bassiri, Mohammad Reza Malek, Pouria Amirian. Global Journal of Researches in Engineering, Vol. 10, Issue 2 (Ver 1.0), June 2010, page 47. GJRE Classification (FOR): 080299, 010199, 010203, 010109.

Abstract — The lambda calculus can be thought of as an idealized, minimalistic programming language. It is capable of expressing any algorithm, and it is this fact that makes the model of functional programming an important one. This paper is focused on introducing lambda calculus and its applications. As an application, Dijkstra's algorithm is implemented using lambda calculus. As the program shows, the algorithm is more understandable using lambda calculus than in imperative languages.

I. INTRODUCTION. Lambda calculus (λ-calculus) is a useful device to make theories realizable. Lambda calculus, introduced by Alonzo Church and Stephen Cole Kleene in the 1930s, is a formal system designed to investigate function definition, function application and recursion in mathematical logic and … The present-day Von Neumann computer is based on the concept of the Turing machine; conceptually, these are Turing machines with random-access registers. Imperative programming languages such as FORTRAN, Pascal and so on, as well as all the assembler languages, are based on the way a Turing machine is instructed: by a sequence of statements. In addition, functional programming languages, like Miranda, ML and so on, are based on the lambda calculus. Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast with the imperative programming style that emphasizes changes in state. Lambda calculus provides a theoretical framework for describing functions and their evaluation.
  • The Threading Building Blocks Flow Graph API, a C++ API for Expressing Dependency, Streaming and Data Flow Applications
Episode 15: A Proving Ground for Open Standards. Host: Nicole Huesman, Intel. Guests: Mike Voss, Intel; Andrew Lumsdaine, Northwest Institute for Advanced Computing.

Nicole Huesman: Welcome to Code Together, an interview series exploring the possibilities of cross-architecture development with those who live it. I'm your host, Nicole Huesman. In earlier episodes, we've talked about the need to make it easier for developers to build code for heterogeneous environments in the face of increasingly diverse and data-intensive workloads, and about the industry shift to modern C++ with the C++11 release. Today, we'll continue that conversation, exploring parallelism and heterogeneous computing from a user's perspective with Andrew Lumsdaine. As Chief Scientist at the Northwest Institute for Advanced Computing, Andrew wears at least two hats: Laboratory Fellow at Pacific Northwest National Lab, and Affiliate Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. By spanning a university and a national lab, Andrew has the opportunity to work on research questions, and then translate those results into practice. His primary research interest is High Performance Computing, with particular attention to scalable graph algorithms. I should also note that Andrew is a member of the oneAPI Technical Advisory Board. Andrew, so great to have you with us!

Andrew Lumsdaine: Thank you. It's great to be here.

Nicole Huesman: And Mike Voss. Mike is a Principal Engineer at Intel, and was the original architect of the Threading Building Blocks flow graph API, a C++ API for expressing dependency, streaming and data flow applications.
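As a small illustration of the flow graph API mentioned above, here is a minimal sketch (mine, not from the episode) that builds a two-node graph in which one node transforms messages and feeds a second node; it assumes oneTBB, and the header path differs between classic TBB and oneTBB.

```cpp
// Minimal TBB flow graph: values flow through a doubler node into a printer node.
// Assumes oneTBB; classic TBB uses <tbb/flow_graph.h> instead.
#include <oneapi/tbb/flow_graph.h>
#include <iostream>

int main() {
    using namespace oneapi::tbb::flow;

    graph g;

    // Transforms each incoming int; 'unlimited' allows concurrent invocations.
    function_node<int, int> doubler(g, unlimited, [](int v) { return 2 * v; });

    // Serial node so printed output is not interleaved.
    function_node<int, int> printer(g, serial, [](int v) {
        std::cout << "result: " << v << '\n';
        return v;
    });

    make_edge(doubler, printer);   // dependency: printer consumes doubler's output

    for (int i = 1; i <= 5; ++i)
        doubler.try_put(i);        // inject messages into the graph

    g.wait_for_all();              // wait until all in-flight messages are processed
}
```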
  • High-Level and Efficient Stream Parallelism on Multi-Core Systems
WSCAD 2017 - XVIII Simpósio em Sistemas Computacionais de Alto Desempenho

High-Level and Efficient Stream Parallelism on Multi-core Systems with SPar for Data Compression Applications. Dalvan Griebler (1), Renato B. Hoffmann (1), Junior Loff (1), Marco Danelutto (2), Luiz Gustavo Fernandes (1). (1) Faculty of Informatics (FACIN), Pontifical Catholic University of Rio Grande do Sul (PUCRS), GMAP Research Group, Porto Alegre, Brazil. (2) Department of Computer Science, University of Pisa (UNIPI), Pisa, Italy. {dalvan.griebler, renato.hoffmann, junior.loff}@acad.pucrs.br

Abstract. The stream processing domain is present in several real-world applications running on multi-core systems. In this paper, we focus on data compression applications, an important subset of this domain. Our main goal is to assess the programmability and efficiency of the domain-specific language SPar. It was specially designed for expressing stream parallelism, and it promises higher-level parallelism abstractions without significant performance losses. We therefore parallelized the Lzip and Bzip2 compressors with SPar and compared them with state-of-the-art frameworks. The results revealed that SPar is able to efficiently exploit stream parallelism as well as provide suitable abstractions with less code intrusion and code refactoring.

1. Introduction. Over the past decade, vendors realized that increasing clock frequency to gain performance was no longer possible. Companies were then forced to slow the clock frequency and start adding multiple processors to their chips. Since then, software has had to rely on parallel programming to increase performance [Sutter 2005]. However, exploiting parallelism in such multi-core architectures is a challenging task that is still too low-level and complex for application programmers.
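SPar itself expresses the stage structure through C++ annotations, but the underlying pattern it generates — a pipeline of stages connected by queues — can be sketched by hand. The sketch below is a generic, hypothetical illustration of such a pipeline in plain C++ threads (not SPar syntax and not the paper's code), with a trivial "compress" stage standing in for the Lzip/Bzip2 work.

```cpp
// Hand-rolled 3-stage pipeline (read -> "compress" -> write) in plain C++.
// This is the kind of structure a stream-parallel annotation would generate.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>

// A tiny blocking queue used as the channel between pipeline stages.
template <typename T>
class Channel {
public:
    void push(std::optional<T> item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    std::optional<T> pop() {              // empty optional signals end-of-stream
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        auto item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::optional<T>> q_;
};

int main() {
    Channel<std::string> to_compress, to_write;

    std::thread reader([&] {              // stage 1: produce blocks
        for (int i = 0; i < 4; ++i) to_compress.push("block-" + std::to_string(i));
        to_compress.push(std::nullopt);   // end-of-stream marker
    });

    std::thread compressor([&] {          // stage 2: placeholder "compression"
        while (auto block = to_compress.pop())
            to_write.push("compressed(" + *block + ")");
        to_write.push(std::nullopt);
    });

    std::thread writer([&] {              // stage 3: consume results in order
        while (auto block = to_write.pop())
            std::cout << *block << '\n';
    });

    reader.join(); compressor.join(); writer.join();
}
```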
  • Purity in Erlang
Purity in Erlang. Mihalis Pitidis (1) and Konstantinos Sagonas (1,2). (1) School of Electrical and Computer Engineering, National Technical University of Athens, Greece. (2) Department of Information Technology, Uppsala University, Sweden.

Abstract. Motivated by a concrete goal, namely to extend Erlang with the ability to employ user-defined guards, we developed a parameterized static analysis tool called PURITY that classifies functions as referentially transparent (i.e., side-effect free with no dependency on the execution environment and never raising an exception), side-effect free with no dependencies but possibly raising exceptions, or side-effect free but with possible dependencies and possibly raising exceptions. We have applied PURITY to a large corpus of Erlang code bases and report experimental results showing the percentage of functions that the analysis definitely classifies in each category. Moreover, we discuss how our analysis has been incorporated into a development branch of the Erlang/OTP compiler in order to allow extending the language with user-defined guards.

1. Introduction. Purity plays an important role in functional programming languages, as it is a cornerstone of referential transparency, namely that the same language expression produces the same value when evaluated twice. Referential transparency helps in writing robust, comprehensible and easy-to-test code, makes equational reasoning possible, and aids program analysis and optimisation. In pure functional languages like Clean or Haskell, any side effect or dependency on the state is captured by the type system and is reflected in the types of functions. In a language like Erlang, which has been developed primarily with concurrency in mind, pure functions are not the norm, and impure functions can freely be used interchangeably with pure ones.
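The three-way classification described above is language-agnostic; purely as an illustration (written in C++ rather than Erlang, and unrelated to the PURITY tool's actual output), the functions below fall into the three categories the analysis distinguishes.

```cpp
// Illustration of the purity categories discussed above, in C++ terms.
#include <cstdlib>
#include <stdexcept>

// (1) Referentially transparent: no side effects, no dependency on the
//     environment, never throws; same input always yields the same result.
int square(int x) { return x * x; }

// (2) Side-effect free and environment-independent, but may raise an
//     exception for some inputs.
int checked_divide(int num, int den) {
    if (den == 0) throw std::invalid_argument("division by zero");
    return num / den;
}

// (3) Side-effect free in the sense of not mutating program state, but
//     dependent on the execution environment, and possibly throwing.
int threshold_from_env() {
    const char* v = std::getenv("THRESHOLD");          // environment dependency
    if (!v) throw std::runtime_error("THRESHOLD not set");
    return std::atoi(v);
}

int main() { return square(2) == 4 ? 0 : 1; }
```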