Research Collection

Doctoral Thesis

Building Abstractions for Staged DSLs in Performance-Oriented Program Generators

Author(s): Stojanov, Alen
Publication Date: 2019-05
Permanent Link: https://doi.org/10.3929/ethz-b-000372536
Rights / License: In Copyright - Non-Commercial Use Permitted

Diss. ETH No. 26058

A dissertation submitted to attain the degree of Doctor of Sciences of ETH Zurich (Dr. sc. ETH Zurich)

presented by
Alen Stojanov
Dipl., Eidgenössisches Polytechnikum
born on 1 September 1987
citizen of North Macedonia and Bulgaria

accepted on the recommendation of
Prof. Dr. Markus Püschel, examiner
Prof. Dr. Tiark Rompf, co-examiner
Prof. Dr. Zhendong Su, co-examiner

2019

Alen Stojanov: Building Abstractions for Staged DSLs in Performance-Oriented Program Generators, © 2019

To Gordana, my mother. To this very day, you would still say: “Ajde Alen, što čekaš?” I will forever remember your words. These simple words, that hold the meaning of: “Come on, Alen, what are you waiting for?”, have been the driving force throughout this journey. Well mom, I won’t be waiting anymore.

Abstract

Developing high-performance code for numerical domains is challenging, as it requires hand-in-hand specialization with the continuous evolution of modern hardware. Program generators based on domain-specific languages (DSLs) can provide a solution to the challenge of re-specializing programs and libraries as new architectures emerge. However, such code generators are difficult to design as they require DSLs that reason about high-level mathematical domains, and employ analysis and transformation steps to map these domains to low-level hardware instructions.
The difficulty comes from both the problem of designing an extensible approach to perform all the optimizations the compiler is unable to do and the actual implementation of the generator. The Lightweight Modular Staging (LMS) framework delivers the promise of “abstracting without regret” through the use of multi-stage programming (MST). LMS implements MST based on types and has been shown to be an effective tool for designing and staging DSLs, providing means to abstract DSLs and thus improve the state of generators.

The work in this thesis focuses on the design of abstractions for staged DSLs that simplify analyses and transformations to deliver high performance. First, we demonstrate a systematic approach to enable data parallelism by automatically generating low-level instruction set architecture (ISA) DSLs, providing complete support from MMX to AVX-512 ISAs. We then show the immediate benefit of these DSLs when used in managed language runtimes as embedded DSLs, allowing for tight integration with the host language and significant improvements in performance. Then, we use the low-level DSLs as building blocks for low-level abstractions to provide an efficient implementation of variable-precision computations on state-of-the-art CPUs. Further, we use the concept of stage polymorphism to provide abstractions for data parallelism and perform all relevant optimizations in a prototypical generator for high-performance convolutional operations. Finally, we propose the metacontainer abstraction to stage a more general, dynamically typed numerical DSL in the LMS framework, which relies heavily on static type information. We demonstrate this work by staging a subset of MATLAB for compilation to low-level C code, and use the metacontainer abstraction to simplify type and shape inference and to handle many dynamic aspects of the language.
Résumé

Developing high-performance numerical programs is difficult, as it requires specialization that must keep pace with the constant evolution of computer hardware. Program generators based on domain-specific languages (DSLs) address this problem by making it possible to adapt a program or library directly to a new hardware architecture as soon as it appears. These generators are nevertheless complex, as they require DSLs capable of representing high-level mathematical concepts, and several manipulation and transformation steps to produce the corresponding low-level hardware instructions. Two difficulties then arise: on the one hand, finding a modular approach to perform all the optimizations that a compiler would not be able to perform, and on the other hand, implementing such a generator. Lightweight Modular Staging (LMS) is a form of multi-stage programming that enables “abstraction without regret”. LMS implements multi-stage programming via types, and proves to be a very good tool for implementing DSLs, thereby easing the development of the different stages of code generators.

In this dissertation, we study the design of abstractions for staged DSLs, in order to simplify the analyses and transformations that lead to the production of high-performance code. First, we present the automatic generation of low-level DSLs representing complete instruction sets (ISAs), ranging from MMX to AVX-512, thereby enabling parallel programming. We then show the direct benefits of these DSLs when they are implemented within a high-level virtual machine, allowing fine-grained integration with the host language and a significant increase in performance. We then use the low-level DSLs as building blocks for low-level abstraction, enabling an efficient implementation of variable-precision numerical computation on recent processors. Furthermore, we use the concept of stage polymorphism to enable abstractions for parallelism and the corresponding optimizations in a prototype generator of high-performance convolutions. Finally, we introduce the concept of the metacontainer, which makes it possible to offer numerical DSLs with more general, dynamic typing than that originally offered by LMS, which mainly exploits information obtained through static typing. We show the value of this new concept by implementing a staged generator that transforms a subset of MATLAB into low-level C code. Metacontainers then simplify type and shape inference and make it possible to handle several dynamic aspects of the language.

Acknowledgements

First, I would like to express my utmost gratitude to my advisor, Prof. Dr. Markus Püschel, for all his support, patience and guidance; for helping me grow as a scholar and for teaching me to perform scientific research with the highest level of detail and devotion. I would like to thank my co-advisor, Prof. Dr. Tiark Rompf, for being supportive and always ready to give insightful feedback on my work, and for helping me improve as a researcher. I have greatly enjoyed our collaboration, which I hope we continue in the future. I would like to thank Prof. Dr. Zhendong Su for agreeing to be part of my PhD committee.

I would like to thank Prof. Dr. Michael Franz. The short collaboration during my master's thesis had a profound impact on my career and eventually led to the decision to pursue a PhD degree.

I had the pleasure and luck to work with many great colleagues and would like to thank all my collaborators: Francois, Georg, Daniele, Victoria, Luca, Michael, Remi, Gagandeep and Tyler. It was such a rewarding experience working with all of you.

I would like to thank Ivaylo Toskov for choosing me as his master's thesis co-advisor and for the impeccable results that his work delivered.

I would also like to use this opportunity to thank my closest friends: Jana, Darko, Vladimir, Mihajlo, Manuel, Kaveh, Endri, Kaan, Rostislav, Krenar, Eray, Ivana, Viktor, Riste, Mite, Ace, Jovanche, Ilija, Petre and Sanja. Your friendship has been an immense pleasure throughout all these years.

I doubt that I will ever find enough words to thank my parents, Gordana and Timko. You did an exceptional job raising me as a person and setting me on the right path. None of my success would ever have materialized without your presence as role models in my life.

Finally, I would like to thank my wife, Katerina. Thank you for the courage that you lent me, for believing in me and for being there for me all these years. Much of my success is thanks to you.

Contents

1 Introduction
  1.1 Goal of This Dissertation
  1.2 Contributions of this Dissertation
  1.3 Background
    1.3.1 Scala
    1.3.2 Lightweight Modular Staging
  1.4 Organization of the Dissertation
2 Vectorization in LMS
  2.1 Motivation
  2.2 Background
    2.2.1 Intrinsics
    2.2.2 Java Virtual Machines
  2.3 SIMD Intrinsics on Managed Language Runtimes
    2.3.1 Automatic Generation of ISA-specific DSLs
    2.3.2 Developing Explicitly Vectorized Code in the JVM
  2.4 Results
    2.4.1 Evaluation
    2.4.2 Limitations
  2.5 Related Work
  2.6 Summary
3 Low Level Abstractions
  3.1 Motivation
  3.2 Background
  3.3 Low Precision Arithmetic
    3.3.1 4-bit Arithmetic
    3.3.2 8, 16 and 32-bit Arithmetic
    3.3.3 Low-Level Abstraction in LMS
  3.4 Evaluation
    3.4.1 Performance of Individual Routines
    3.4.2 Evaluation of the Low-Level Abstractions in LMS
    3.4.3 Evaluation of Quantized IHT and Quantized GD
  3.5 Related Work
  3.6 Summary
4 Abstracting Data Parallelism using Staged Polymorphism
  4.1 Motivation
  4.2 FGen
    4.2.1 Tiling

CS 110 Discussion 15 Programming with SIMD Intrinsics

PGI Compilers

Intel Hardware Intrinsics in .NET Core

Optimizing Subroutines in Assembly Language an Optimization Guide for X86 Platforms

Automatic SIMD Vectorization of Fast Fourier Transforms for the Larrabee and AVX Instruction Sets

Micro Focus Visual COBOL 6.0 for Visual Studio

In the GNU Fortran Compiler

Micro Virtual Machines: a Solid Foundation for Managed Language Implementation

Tricore C Compiler, Assembler, Linker Reference Manual

Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms

Intrinsic Functions ►Development Tools ►Performance and Optimizations

Power Vector Intrinsic Programming Reference