Evaluation of Programming Models for Manycore and/or Heterogeneous Architectures for Monte Carlo Neutron Transport Codes


Tao Chang. Evaluation of programming models for manycore and/or heterogeneous architectures for Monte Carlo neutron transport codes. Computer science. Institut Polytechnique de Paris, 2020. English. NNT: 2020IPPAX099. tel-03086536.

HAL Id: tel-03086536 (https://tel.archives-ouvertes.fr/tel-03086536), submitted on 22 Dec 2020. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Doctoral thesis of the Institut Polytechnique de Paris, prepared at the École polytechnique. Doctoral school no. 626, École doctorale de l'Institut Polytechnique de Paris (ED IP Paris). Doctoral specialty: Computer Science, Data and Artificial Intelligence. Thesis presented and defended at Saclay on 01/12/2020 by TAO CHANG.

Composition of the jury:
Michael HEROUX, Professor, Sandia National Laboratories and St. John's University (Reviewer)
Jean-François MEHAUT, Professor, Université Grenoble Alpes (Reviewer)
Raymond NAMYST, Professor, Université de Bordeaux (Examiner)
Marc VERDERI, Research Director, Laboratoire Leprince Ringuet de l'IPP (President)
Emmanuelle SAILLARD, Research Scientist, INRIA Bordeaux Research Center (Examiner)
Francieli ZANON BOITO, Associate Professor, Université de Bordeaux (Examiner)
Christophe CALVIN, International Expert, CEA Saclay (Thesis supervisor)
Emeric BRUN, Research Engineer, CEA Centre Marcoule (Advisor)

This thesis is dedicated to my family, who have raised me to be the person I am today. I hope that this achievement makes you happy.

Acknowledgements

First and foremost, I would like to express my deep and sincere gratitude to my thesis supervisor, Christophe CALVIN, for the continuous support of my Ph.D. study throughout the past three years, and for his motivation, enthusiasm and vision, which have deeply inspired me every time I got stuck in my research. I am extremely grateful for the opportunity to study and work under his guidance. I would like to thank my advisor, Emeric BRUN, for his patience, insightful comments and hardware support. I am grateful for his help in carrying out tests on the Cobalt-hybrid and Cobalt-V100 platforms. Moreover, I thank him and Philippe Thierry for providing me with an Intel NUC machine to accomplish my research work. I thank the rest of my thesis committee: Michael HEROUX and Jean-François MEHAUT, for being my reviewers and writing the review reports, and Raymond NAMYST, Marc VERDERI, Emmanuelle SAILLARD and Francieli ZANON BOITO, for being the examiners at my thesis defense.
My sincere thanks also go to Fausto MALVAGI, the director of my laboratory SERMA/LTSD at CEA Saclay, and Edouard Audit, the director of Maison de la Simulation, for allowing me to study and work in both places throughout this research. I would also like to express my gratitude to my fellow labmates in these two laboratories for their administrative, academic and hardware support, as well as for their insightful talks with me. I would like to thank David Chamont for offering me access to the GridCL platform and being my academic advisor. I thank Pierre Kestener for his useful comments on the implementation of Kokkos in the Monte Carlo neutron transport code. I thank Serge G. Petiton for guiding my master's internship at Maison de la Simulation and giving me my first glimpse of research in the field of HPC. I am very thankful to my parents and twin brother for their love and the sacrifices they have made to support me continuously in chasing my dream.

Title: Evaluation of programming models for manycore and/or heterogeneous architectures for Monte Carlo neutron transport codes

Keywords: Manycore architectures, Heterogeneous architectures, Particle transport

Abstract: In this thesis we propose to evaluate the different programming models available for addressing manycore and/or heterogeneous architectures within the framework of Monte Carlo transport codes. A simple but representative application test case will first be considered in order to cover a fairly wide range of solutions and to compare them in terms of performance, portability of performance, ease of implementation and maintainability. The target architectures are 'classic' CPUs, the Intel Xeon Phi and GPUs. The most relevant programming models will then be set up in a Monte Carlo transport code.

Institut Polytechnique de Paris, 91120 Palaiseau, France

Summary

Monte Carlo neutron transport simulation is a stochastic method widely used in the nuclear field for reference calculations. Instead of introducing mathematical and physical approximations to simulate a real-world system, the Monte Carlo method captures particle behaviors through statistical sampling, which entails an enormous computational cost. To ease this cost, the use of supercomputers for Monte Carlo simulations has become a trend among many researchers and developers.
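The statistical sampling at the heart of the method can be made concrete with a small example. The following C++ sketch is purely illustrative (it is not code from PATMOS or slabAllNuclides; the constant cross section, absorption probability and slab thickness are invented values): each particle history samples exponential free-flight distances and collision outcomes until the particle is absorbed or leaks from a one-dimensional slab.

    #include <cmath>
    #include <iostream>
    #include <random>

    // Minimal 1D slab transport sketch: mono-energetic particles enter a slab
    // of thickness L with a fixed total cross section. Each history samples a
    // free-flight distance d = -ln(xi)/Sigma_t and, on collision, either
    // absorbs or scatters isotropically (all values are illustrative).
    int main() {
        const double sigma_t = 1.0;   // total macroscopic cross section (1/cm)
        const double p_absorb = 0.3;  // absorption probability per collision
        const double L = 5.0;         // slab thickness (cm)
        const int histories = 1'000'000;

        std::mt19937_64 rng(42);
        std::uniform_real_distribution<double> u(0.0, 1.0);

        long transmitted = 0;
        for (int h = 0; h < histories; ++h) {
            double x = 0.0, mu = 1.0;  // position and direction cosine
            while (true) {
                x += mu * (-std::log(u(rng)) / sigma_t);  // sample free flight
                if (x >= L) { ++transmitted; break; }     // leaked out the back
                if (x < 0.0) break;                       // leaked out the front
                if (u(rng) < p_absorb) break;             // absorbed
                mu = 2.0 * u(rng) - 1.0;                  // isotropic scatter
            }
        }
        std::cout << "transmission = "
                  << double(transmitted) / histories << '\n';
    }

Averaging one tally over a million independent histories like this is exactly the statistical sampling that drives the method's cost, and each history is independent, which is what makes the method a natural fit for massive parallelism.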
However, as more and more modern architectures (multi-core/manycore and heterogeneous architectures) emerge at a rapid pace, it is far from trivial to develop applications that are fully portable across all of them, and furthermore to keep optimizing and maintaining such applications so as to obtain good performance on each.

On the one hand, the development effort and lifetime of an application far exceed the development cycle of computing architectures. Designing an application for one given architecture therefore makes little sense: by the time the application has been developed, there is a strong chance that the architecture has already evolved. On the other hand, applications need to be scalable and maintainable by as many people as possible, allowing those who are not familiar with HPC to develop codes capable of delivering good performance. To address performance portability, high-level programming models such as OpenACC and OpenMP offload have been proposed to let an application run well on a wide range of architectures without large development and maintenance costs. Most of these programming models target accelerator-based architectures, following the massive increase in multi-level parallelism that is the underlying trend of computing architectures.

In this thesis, we focus on developing a portable Monte Carlo neutron transport code targeting hybrid architectures (CPU + GPU) through the use of hybrid programming models. Concerning the evaluation of performance portability, several metrics have already been proposed to quantify the trade-off between performance and portability. However, little research has addressed the evaluation of Monte Carlo neutron transport codes on supercomputers in terms of performance portability, let alone of productivity. We therefore intend to give an explicit evaluation in terms of portability, performance and productivity for a Monte Carlo neutron transport simulation mini-benchmark.

The slabAllNuclides mini-benchmark was implemented in a Monte Carlo neutron transport prototype developed at CEA, called PATMOS. A heterogeneous offload version of the code was developed via a history-based method and a pseudo-event-based method. Several programming models (OpenMP thread, OpenMP offload, OpenACC, CUDA, Kokkos and SYCL) were implemented for the evaluation, and tests were carried out on different architectures (x86 or OpenPower + GPU). A set of metrics was adopted to evaluate our Monte Carlo simulation of …
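As an illustration of the history-based offload approach with one of the directive-based models named above, here is a hedged C++/OpenMP-offload sketch. It is not PATMOS code: the kernel body is a placeholder for per-history physics, and the array and variable names are invented for the example.

    #include <cstdio>
    #include <vector>

    // History-based offload sketch with OpenMP target directives: one device
    // thread per particle history. The per-history work is a placeholder; a
    // real code would sample free flights and collisions against per-nuclide
    // cross-section tables.
    int main() {
        const int n_histories = 1 << 20;
        std::vector<double> tally(n_histories, 0.0);
        double* t = tally.data();

        #pragma omp target teams distribute parallel for map(tofrom: t[0:n_histories])
        for (int h = 0; h < n_histories; ++h) {
            // Placeholder per-history work; a real history loops until the
            // particle is absorbed or leaks.
            double acc = 0.0;
            for (int k = 1; k <= 64; ++k) acc += 1.0 / (k * (h + 1.0));
            t[h] = acc;
        }

        std::printf("tally[0] = %f\n", t[0]);
    }

With OpenACC, the same loop would carry an analogous "#pragma acc parallel loop" annotation instead, which is precisely the portability appeal of these directive-based models; the loop falls back to host execution when no device is present.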
Recommended publications
  • A Case for High Performance Computing with Virtual Machines
Wei Huang and Dhabaleswar K. Panda (Computer Science and Engineering, The Ohio State University, Columbus, OH 43210) and Jiuxing Liu and Bulent Abali (IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532).

Abstract: Virtual machine (VM) technologies are experiencing a resurgence in both industry and research communities. VMs offer many desirable features such as security, ease of management, OS customization, performance isolation, checkpointing, and migration, which can be very beneficial to the performance and the manageability of high performance computing (HPC) applications. However, very few HPC applications are currently running in a virtualized environment due to the performance overhead of virtualization. Further, using VMs for HPC also introduces additional challenges such as management and distribution of OS images. In this paper we present a case for HPC with virtual ma…

From the introduction: VM technologies were first introduced in the 1960s [9], but are experiencing a resurgence in both industry and research communities. A VM environment provides virtualized hardware interfaces to VMs through a Virtual Machine Monitor (VMM), also called a hypervisor. VM technologies allow running different guest VMs in a physical box, with each guest VM possibly running a different guest operating system. They can also provide secure and portable environments to meet the demanding requirements of computing resources in modern computing systems. Recently, network interconnects such as InfiniBand [16], Myrinet [24] and Quadrics [31] are emerging, which provide very low latency (less than 5 µs) and very high bandwidth (multiple Gbps).
  • Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++
Christopher S. Zakian, Timothy A. K. Zakian, Abhishek Kulkarni, Buddhika Chamith, and Ryan R. Newton (Indiana University, Bloomington).

Abstract: Library and language support for scheduling non-blocking tasks has greatly improved, as have lightweight (user) threading packages. However, there is a significant gap between the two developments. In previous work, and in today's software packages, lightweight thread creation incurs much larger overheads than tasking libraries, even on tasks that end up never blocking. This limitation can be removed. To that end, we describe an extension to the Intel Cilk Plus runtime system, Concurrent Cilk, where tasks are lazily promoted to threads. Concurrent Cilk removes the overhead of thread creation on threads which end up calling no blocking operations, and is the first system to do so for C/C++ with legacy support (standard calling conventions and stack representations). We demonstrate that Concurrent Cilk adds negligible overhead to existing Cilk programs, while its promoted threads remain more efficient than OS threads in terms of context-switch overhead and blocking communication. Further, it enables development of blocking data structures that create non-fork-join dependence graphs, which can expose more parallelism and better support data-driven computations waiting on results from remote devices.

1 Introduction. Both task-parallelism [1, 11, 13, 15] and lightweight threading [20] libraries have become popular for different kinds of applications. The key difference between a task and a thread is that threads may block, for example when performing IO, and then resume again.
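The overhead gap that motivates lazy promotion is easy to observe directly. The following C++ sketch (a generic illustration of the point, not Concurrent Cilk or its API) times N trivial work items run each on a freshly created OS thread versus drained by a single pre-created worker:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    // Times N trivial work items run (a) each on its own OS thread and (b) as
    // plain tasks drained by one reused worker. The gap illustrates why
    // tasking runtimes avoid eagerly giving every task a full thread.
    int main() {
        const int n = 2000;
        std::atomic<long> sink{0};
        auto work = [&] { sink.fetch_add(1, std::memory_order_relaxed); };

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i) std::thread(work).join();  // (a) thread per task
        auto t1 = std::chrono::steady_clock::now();

        std::thread worker([&] {                               // (b) one reused worker
            for (int i = 0; i < n; ++i) work();
        });
        worker.join();
        auto t2 = std::chrono::steady_clock::now();

        using us = std::chrono::microseconds;
        std::printf("one thread per task: %lld us, one reused worker: %lld us\n",
                    (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<us>(t2 - t1).count());
    }

Lazy promotion aims at the best of both columns: tasks pay only the cheap path unless they actually block, at which point the runtime upgrades them to real threads.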
  • Parallel Computing a Key to Performance
Dheeraj Bhardwaj, Department of Computer Science & Engineering, Indian Institute of Technology, Delhi 110 016, India. http://www.cse.iitd.ac.in/~dheerajb. August 2002.

Introduction. Traditional science rests on observation, theory and experiment, with experiment the most expensive of the three; experiments can increasingly be replaced by computer simulation, the third pillar of science. If your applications need more computing power than a sequential computer can provide, you might suggest improving the operating speed of processors and other components. There is no disagreement with that suggestion, but how far can it go? Can you go beyond the speed of light, thermodynamic laws and high financial costs?

Performance. There are three ways to improve performance: work harder (use faster hardware); work smarter (do things more efficiently, through better algorithms and computational techniques); get help (use multiple computers to solve a particular task).

Parallel computer. Definition: a parallel computer is a "collection of processing elements that communicate and co-operate to solve large problems fast". Driving forces and enabling factors: the desire and prospect for greater performance; users have ever bigger problems and designers have ever more gates.
  • Exploring Massive Parallel Computation with GPU
Ian Bond, Massey University, Auckland, New Zealand. 2011 Sagan Exoplanet Workshop, Pasadena, July 25-29, 2011.

Assumptions/purpose: you are all involved in microlensing modelling and you have (or are working on) your own code; this lecture shows how to get started on getting code to run on a GPU, and then it is over to you.

Outline: 1. Need for parallelism; 2. Graphical Processor Units; 3. Gravitational Microlensing Modelling.

Parallel computing. Parallel computing is the use of multiple computers, or computers with multiple internal processors, to solve a problem at a greater computational speed than using a single computer (Wilkinson 2002). How does one achieve parallelism?

Grand challenge problems. A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers. Examples: modelling large DNA structures; global weather forecasting; the N-body problem (N very large); brain simulation. Has microlensing …
  • Dcuda: Hardware Supported Overlap of Computation and Communication
Tobias Gysi, Jeremia Bär and Torsten Hoefler (Department of Computer Science, ETH Zurich).

Abstract: Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side remote memory access with target notification. To hide instruction pipeline latencies, CUDA programs over-decompose the problem and over-subscribe the device by running many more threads than there are hardware execution units. Whenever a thread stalls, the hardware scheduler immediately proceeds with the execution of another thread ready for execution. This latency hiding technique is key to making best use of the available hardware.

From the introduction: … utilization of the costly compute and network hardware. To mitigate this problem, application developers can implement manual overlap of computation and communication [23], [27]. In particular, there exist various approaches [13], [22] to overlap the communication with the computation on an inner domain that has no inter-node data dependencies. However, these code transformations significantly increase code complexity, which results in reduced real-world applicability. High-performance system design often involves trading off sequential performance against parallel throughput. The architectural difference between host and device processors perfectly showcases the two extremes of this design space.
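The manual overlap of computation and communication mentioned in the introduction can be sketched in plain C++ with double buffering: while one buffer's data is in flight, the next chunk is computed. The asynchronous "send" below is a stand-in invented for the example; a real application would post a non-blocking MPI or RDMA transfer instead.

    #include <chrono>
    #include <cstdio>
    #include <future>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Double-buffered overlap: compute chunk c while chunk c-1 is "in flight".
    // The sleep stands in for a network transfer.
    static void fake_send(const std::vector<double>& buf) {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        (void)buf;
    }

    int main() {
        const int chunks = 8, n = 1 << 16;
        std::vector<double> buf[2] = {std::vector<double>(n),
                                      std::vector<double>(n)};
        std::future<void> in_flight;

        for (int c = 0; c < chunks; ++c) {
            auto& cur = buf[c % 2];
            std::iota(cur.begin(), cur.end(), double(c));  // "compute" chunk c
            if (in_flight.valid()) in_flight.wait();       // previous send done?
            in_flight = std::async(std::launch::async, fake_send, std::cref(cur));
        }
        if (in_flight.valid()) in_flight.wait();
        std::printf("sent %d chunks with compute/communication overlap\n", chunks);
    }

dCUDA's point is that the GPU's hardware scheduler already performs this kind of overlap automatically for over-subscribed threads, so extending the same latency-hiding idea to remote memory access removes the hand-written pipelining shown above.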
  • Control Replication: Compiling Implicit Parallelism to Efficient SPMD with Logical Regions
Elliott Slaughter (Stanford University and SLAC National Accelerator Laboratory), Wonchan Lee (Stanford University), Sean Treichler (Stanford University and NVIDIA), Wen Zhang (Stanford University), Michael Bauer (NVIDIA), Galen Shipman (Los Alamos National Laboratory), Patrick McCormick (Los Alamos National Laboratory) and Alex Aiken (Stanford University).

Abstract: We present control replication, a technique for generating high-performance and scalable SPMD code from implicitly parallel programs. In contrast to traditional parallel programming models that require the programmer to explicitly manage threads and the communication and synchronization between them, implicitly parallel programs have sequential execution semantics and by their nature avoid the pitfalls of explicitly parallel programming. However, without optimizations to distribute control overhead, scalability is often poor. Performance on distributed-memory machines is especially sensitive to communication and synchronization in the program, and thus optimizations for these machines require an intimate un…

The paper's opening figure contrasts the two loop structures:

    (a) Original program:
        for t = 0, T do
          for i = 0, N do          -- Parallel
            B[i] = F(A[i])
          end
          for j = 0, N do          -- Parallel
            A[j] = G(B[h(j)])
          end
        end

    (b) Transposed program:
        for i = 0, N do            -- Parallel
          for t = 0, T do
            B[i] = F(A[i])
            -- Synchronization needed
            A[i] = G(B[h(i)])
          end
        end
  • Regent: a High-Productivity Programming Language for Implicit Parallelism with Logical Regions
A dissertation submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Elliott Slaughter, August 2017. © 2017 by Elliott David Slaughter. All rights reserved. Re-distributed by Stanford University under license with the author; this work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License (http://creativecommons.org/licenses/by-nc/3.0/us/). This dissertation is online at http://purl.stanford.edu/mw768zz0480. Reading committee: Alex Aiken (primary adviser), Philip Levis, Oyekunle Olukotun. Approved for the Stanford University Committee on Graduate Studies: Patricia J. Gumport, Vice Provost for Graduate Education.

Abstract: Modern supercomputers are dominated by distributed-memory machines. State of the art high-performance scientific applications targeting these machines are typically written in low-level, explicitly parallel programming models that enable maximal performance but expose the user to programming hazards such as data races and deadlocks.
  • On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the Gvirtus Framework
Montella, R., Giunta, G., Laccetti, G., Lapegna, M., Palmieri, C., Ferraro, C., Pelliccia, V., Hong, C-H., Spence, I., & Nikolopoulos, D. (2017). On the Virtualization of CUDA Based GPU Remoting on ARM and X86 Machines in the GVirtuS Framework. International Journal of Parallel Programming, 45(5), 1142-1163. https://doi.org/10.1007/s10766-016-0462-1. Peer-reviewed version, via the Queen's University Belfast Research Portal. Publisher rights: © 2016 Springer Verlag; the final publication is available at Springer via http://dx.doi.org/10.1007/s10766-016-0462-1.

Full author list from the manuscript: Raffaele Montella, Giulio Giunta, Giuliano Laccetti, Marco Lapegna, Carlo Palmieri, Carmine Ferraro, Valentina Pelliccia, Cheol-Ho Hong, Ivor Spence, Dimitrios S. Nikolopoulos.
  • HPVM: Heterogeneous Parallel Virtual Machine
Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Vikram Adve and Sarita Adve (Department of Computer Science, University of Illinois at Urbana-Champaign) and Rakesh Komuravelli (Qualcomm Technologies Inc.).

Abstract: We propose a parallel program representation for heterogeneous systems, designed to enable performance portability across a wide range of popular parallel hardware, including GPUs, vector instruction sets, multicore CPUs and potentially FPGAs. Our representation, which we call HPVM, is a hierarchical dataflow graph with shared memory and vector instructions. HPVM supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling; previous systems focus on only one of these capabilities. As a compiler IR, HPVM aims to enable effective code generation and optimization for heterogeneous systems. … hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Overall, we conclude that the HPVM representation is a promising basis for achieving performance portability and for implementing parallelizing compilers for heterogeneous parallel systems.

CCS Concepts: Computer systems organization → Heterogeneous (hybrid) systems. Keywords: Virtual ISA, Compiler, Parallel IR, Heterogeneous Systems, GPU, Vector SIMD.
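As a rough, invented illustration of what a hierarchical dataflow graph can look like as a data structure (HPVM's actual IR is a compiler-level representation and differs substantially), consider this C++ toy:

    #include <cstdio>
    #include <functional>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    // Toy hierarchical dataflow graph: a node is either a leaf carrying
    // computational work or an internal node containing a subgraph; edges
    // carry data between sibling nodes. All types are invented for
    // illustration only.
    struct DFNode {
        std::string name;
        std::function<void()> work;                     // leaf computation, if any
        std::vector<std::unique_ptr<DFNode>> children;  // subgraph (hierarchy)
        std::vector<std::pair<DFNode*, DFNode*>> edges; // dataflow between children

        void execute() {
            if (children.empty()) { if (work) work(); return; }
            // A real runtime would schedule children across devices while
            // respecting edge dependencies; here we run them in order.
            for (auto& c : children) c->execute();
        }
    };

    int main() {
        DFNode root{"root", nullptr, {}, {}};
        root.children.push_back(std::make_unique<DFNode>(
            DFNode{"leaf", [] { std::puts("leaf ran"); }, {}, {}}));
        root.execute();
    }

The hierarchy is what lets one representation map the same program onto different targets: an internal node can be lowered to a GPU kernel, a vectorized loop, or a multicore task depending on the backend.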
  • A Fast, Verified, Cross-Platform Cryptographic Provider
Jonathan Protzenko, Chris Hawblitzel, Joonwon Choi, Antoine Delignat-Lavaud, Cédric Fournet, Tahina Ramananandro, Aseem Rastogi, Nikhil Swamy, Christoph M. Wintersteiger and Santiago Zanella-Beguelin (Microsoft Research); Bryan Parno and Aymeric Fromherz (Carnegie Mellon University); Marina Polubelova, Karthikeyan Bhargavan, Benjamin Beurdouche and Natalia Kulatova (Inria); Joonwon Choi is also affiliated with MIT.

Abstract: We present EverCrypt: a comprehensive collection of verified, high-performance cryptographic functionalities available via a carefully designed API. The API provably supports agility (choosing between multiple algorithms for the same functionality) and multiplexing (choosing between multiple implementations of the same algorithm). Through abstraction and zero-cost generic programming, we show how agility can simplify verification without sacrificing performance, and we demonstrate how C and assembly can be composed and verified against shared specifications. We substantiate the effectiveness of these techniques with …

From the introduction: … prone (due in part to Intel and AMD reporting CPU features inconsistently [78]), with various cryptographic providers invoking illegal instructions on specific platforms [74], leading to killed processes and even crashing kernels. Since a cryptographic provider is the linchpin of most security-sensitive applications, its correctness and security are crucial. However, for most applications (e.g., TLS, cryptocurrencies, or disk encryption), the provider is also on the critical …
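Agility and multiplexing, as defined in the abstract, amount to two layers of dispatch: over algorithms, then over implementations of the chosen algorithm. A minimal C++ sketch of that layering follows; all interfaces, feature checks and function bodies are invented placeholders (EverCrypt's real API is verified C and assembly and differs).

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Agility: one "hash" functionality, several algorithms.
    enum class HashAlg { SHA2_256, BLAKE2b };

    // Multiplexing: several implementations of the same algorithm, chosen by
    // (stubbed) CPU feature detection. Bodies are placeholders, not real hashes.
    static bool cpu_has_sha_extensions() { return false; }  // stub

    static uint32_t sha2_256_portable(const std::vector<uint8_t>& m) {
        uint32_t h = 0x6a09e667u;
        for (uint8_t b : m) h = h * 31u + b;  // placeholder, not SHA-256
        return h;
    }
    static uint32_t sha2_256_shaext(const std::vector<uint8_t>& m) {
        return sha2_256_portable(m);  // would use SHA-NI instructions instead
    }
    static uint32_t blake2b_portable(const std::vector<uint8_t>& m) {
        uint32_t h = 0x1010u;
        for (uint8_t b : m) h = (h ^ b) * 16777619u;  // placeholder, not BLAKE2b
        return h;
    }

    // The agile entry point: callers pick an algorithm; the provider picks the
    // best implementation for the current CPU, avoiding illegal instructions.
    uint32_t hash(HashAlg alg, const std::vector<uint8_t>& msg) {
        switch (alg) {
            case HashAlg::SHA2_256:
                return cpu_has_sha_extensions() ? sha2_256_shaext(msg)
                                                : sha2_256_portable(msg);
            case HashAlg::BLAKE2b:
                return blake2b_portable(msg);
        }
        return 0;
    }

    int main() {
        std::vector<uint8_t> msg{'a', 'b', 'c'};
        std::printf("%08x %08x\n", hash(HashAlg::SHA2_256, msg),
                    hash(HashAlg::BLAKE2b, msg));
    }

EverCrypt's contribution is proving that this whole dispatch tower is correct and secure, including the CPU-feature checks that decide which implementation is safe to run.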
  • Productive High Performance Parallel Programming with Auto-Tuned Domain-Specific Embedded Languages
By Shoaib Ashraf Kamil. A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley. Committee in charge: Professor Armando Fox (Co-Chair), Professor Katherine Yelick (Co-Chair), Professor James Demmel, Professor Berend Smit. Fall 2012. Copyright © 2012 Shoaib Kamil.

Abstract: As the complexity of machines and architectures has increased, performance tuning has become more challenging, leading to the failure of general compilers to generate the best possible optimized code. Expert performance programmers can often hand-write code that outperforms compiler-optimized low-level code by an order of magnitude. At the same time, the complexity of programs has also increased, with modern programs built on a variety of abstraction layers to manage complexity, yet these layers hinder efforts at optimization. In fact, it is common to lose one or two additional orders of magnitude in performance when going from a low-level language such as Fortran or C to a high-level language like Python, Ruby, or Matlab. General-purpose compilers are limited by the inability of program analysis to determine programmer intent, as well as the lack of detailed performance models that always determine the best executable code for a given computation and architecture.
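Auto-tuning, as used in the title, means selecting among code variants empirically on the target machine rather than trusting a static compiler model. A minimal, self-contained C++ illustration of that selection loop (the two variants and the timing harness are invented for the example):

    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <utility>
    #include <vector>

    // Empirical auto-tuning in miniature: time each candidate variant of a
    // kernel on the actual machine and keep the fastest. Variants here are
    // two loop orders for a row-sum over a dense matrix.
    int main() {
        const int n = 512;
        std::vector<double> a(n * n, 1.0), out(n);

        auto rowmajor = [&] {  // streams through memory contiguously
            for (int i = 0; i < n; ++i) {
                double s = 0.0;
                for (int j = 0; j < n; ++j) s += a[i * n + j];
                out[i] = s;
            }
        };
        auto colmajor = [&] {  // strided access, usually slower
            for (int i = 0; i < n; ++i) out[i] = 0.0;
            for (int j = 0; j < n; ++j)
                for (int i = 0; i < n; ++i) out[i] += a[i * n + j];
        };

        std::pair<const char*, std::function<void()>> variants[] = {
            {"rowmajor", rowmajor}, {"colmajor", colmajor}};

        const char* best = nullptr;
        double best_us = 1e300;
        for (auto& [name, run] : variants) {
            auto t0 = std::chrono::steady_clock::now();
            for (int rep = 0; rep < 50; ++rep) run();
            double us = std::chrono::duration<double, std::micro>(
                            std::chrono::steady_clock::now() - t0).count();
            std::printf("%s: %.0f us\n", name, us);
            if (us < best_us) { best_us = us; best = name; }
        }
        std::printf("auto-tuner selects: %s\n", best);
    }

A domain-specific embedded language in the dissertation's sense generates such variants automatically from high-level code, so the productivity of Python-like abstractions need not cost the orders of magnitude the abstract describes.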
  • Extracting and Mapping Industry 4.0 Technologies Using Wikipedia
Filippo Chiarello and Andrea Bonaccorsi (Department of Energy, Systems, Territory and Construction Engineering, University of Pisa), Leonello Trivelli (Department of Economics and Management, University of Pisa), Gualtiero Fantoni (Department of Mechanical, Nuclear and Production Engineering, University of Pisa). Computers in Industry 100 (2018) 244-257. Journal homepage: www.elsevier.com/locate/compind.

Keywords: Industry 4.0; Digital industry; Industrial IoT; Big data; Digital currency; Programming languages; Computing; Embedded systems; IoT.

Abstract: The explosion of interest in Industry 4.0 generated a hype in both academia and business: the former is attracted by the opportunities given by the emergence of such a new field, the latter is pulled by incentives and national investment plans. The Industry 4.0 technological field is not new, but it is highly heterogeneous (it is actually the aggregation point of more than 30 different fields of technology). For this reason, many stakeholders feel uncomfortable since they do not master the whole set of technologies; they manifest a lack of knowledge and problems of communication with other domains. Actually, the problem is twofold: on one side, a common vocabulary that helps domain experts to have a mutual understanding is missing (Riel et al. [1]); on the other side, an overall standardization effort would be beneficial to integrate existing terminologies in a reference architecture for the Industry 4.0 paradigm (Smit et al. …)