
Diss. ETH No. 26716

The concept of transprecision computing

A thesis submitted to attain the degree of DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich)

presented by
FLORIAN MICHAEL SCHEIDEGGER
MSc ETH EEIT, ETH Zurich
born on April 4th, 1990
citizen of Unterramsern SO, Switzerland

accepted on the recommendation of
Prof. Dr. Luca Benini, examiner
Prof. Dr. Norbert Wehn, co-examiner
Dr. Cristiano Malossi, supervisor

2020

Acknowledgments

First of all, I would like to thank my supervisors, Luca Benini and Cristiano Malossi. I appreciate that I got the exceptional chance to pursue a doctorate within IBM Research Zurich. In that context, I thank Luca for his prompt and constructive feedback and his guidance towards higher-level goals. But most importantly, I thank Luca for opening the doors to this career path in its initial stages — without that, I would not be in my current position. I would like to thank Cristiano for his endless support, his detailed feedback, and fruitful discussions. Especially, I thank Cristiano for the practical help and guidance required in the daily business, which ranges from improving presentation skills up to successfully leading me through the complete process of filing a disclosure as first author.

I would like to thank Costas Bekas, our group manager, for pushing through my contract and for providing motivating and constructive feedback to our team. I would also like to thank Norbert Wehn, the co-examiner of this thesis, for his interest in my work and his feedback.

I thank all my collaborators, Goran Flegar, Vedran Novakovic, Giovani Mariani, Andres Tomas, Enrique Quintana-Ortí, Lukas Cavigelli, Michael Schaffner, Roxana Istrate, Dimitrios Nikolopoulos, Youri Popoff, Michael Gautschi, Frank Gürkaynak, Hubert Kaeslin, Aljoscha Smolic, Manuel Eggimann, Christelle Gloor, Atin Sood, Benjamin Elder, Benjamin Herta, Chao Xue, Debashish Saha, Ganesh Venkataraman, Gegi Thomas, Hendrik Strobelt, Horst Samulowitz, Martin Wistuba, Matteo Manica, Mihir Choudhury, Rong Yan, Ruchir Puri, and Tejaswini Pedapati, for the constructive discussions and inspiration that led to joint contributions.


I thank all members of the OPRECOMP consortium for their impact on our research. Especially, I would like to thank Enrique Quintana-Ortí, Andrew Emerson, Fabian Schuiki, Stefan Mach, Guiseppe, Davide Rossi, Giuseppe Tagliavini, Andrea Borghesi, JunKyu Lee, Umar Minhas, Igor Neri, Eric Flamand, and Christian Weis for their valuable project input, critical feedback, and long discussions that helped to shape this work. I would like to thank the additional team members at IBM, Nico Gorbach, Panagiotis Chatzidoukas, and Andrea Bartezzaghi, who were involved in productive technical discussions. Special thanks go to Roxana, with whom I shared the office, for technical discussions, for helping me out with many practical problems, and for the joint work that we authored.

I would like to thank Luca's research group and IBM's department for providing the facilities and the required administrative help. Special thanks go to Christoph Hagleitner for managing and providing IBM's computing infrastructure to researchers. I highly appreciate the proficient help of Lilli-Marie Pavka in copy-editing our latest publication. I thank Dionysios Diamantopoulos for interesting technical out-of-the-box discussions and shared OPRECOMP project efforts. I would like to thank Folashade Ajala, Austin Brauser, and Hari Mistry for volunteering to proofread fragments of this work.

Finally, I would like to thank my family and my friends for supporting me during this time. I highly appreciate the down-to-earth inspiration of Sarah Battige, which calmed me down in stressful times towards the end of writing this work. I am particularly grateful to my parents, Marianne and Thomas Scheidegger, who provide unconditional support in every situation in life. Especially, I would like to thank them for encouraging my interests and for giving me access to university in the first place, which enabled my career path.

Abstract

For many years, computing systems have relied on guaranteed numerical precision of each step in complex computations. Moore's law has sustained exponential improvements in the semiconductor industry over several decades for building computing infrastructure, from tiny Internet-of-Things nodes, through personal smartphones, laptops, and workstations, up to large high performance computing (HPC) server centers. With the paradigm of the "power wall", the achievable improvements start to saturate. To that end, the concept of transprecision computing emerged, in which existing over-conservative "precise" computing assumptions are relaxed and replaced with more flexible and efficient policies to gain performance.

Unfortunately, it is not straightforward to adopt and integrate general transprecision concepts into the variety of today's computing infrastructure. The main challenge consists of leveraging domain-specific knowledge and providing full solutions that cover everything from the physical foundations over the circuit level up through the full software stack to the application level.

This work focuses on how transprecision concepts improve general computing. We identify and elaborate the standard number representations, especially the one defined in the IEEE 754 floating-point standard, as the enabler of low precision computing. We developed lightweight libraries that allow integrating transprecision concepts into algorithms. Finally, we focus on building automated workflows for specific problems, where the solution space is enlarged by multiple orders of magnitude due to the various configurations of low precision. We demonstrate how heuristic optimization strategies applied on top of transprecision computing find near-optimal configurations of approximated kernels in a short time.

Zusammenfassung

For many years, complex computations on computing systems have relied on guaranteed numerical precision in every step. Moore's law has sustained exponential growth in the semiconductor industry over several decades. This has enabled continuous progress in computing infrastructure, from the Internet of Things, smartphones, laptops, and workstations up to high-performance computers in data centers. The paradigm of the power wall limits the achievable improvements. The concept of transprecision computing relaxes existing and overly conservative assumptions regarding computational precision. Instead, these assumptions are replaced by more flexible and more efficient policies in order to improve computational performance.

Unfortunately, it is not easy to integrate transprecision concepts directly into the variety of today's computing systems. The greatest challenge consists of leveraging domain-specific knowledge to reach a complete solution that takes all aspects into account, from physical foundations and circuit details over software up to the specifics of the application.

This work focuses on how general computations can be improved through transprecision concepts. We identify and establish that the standard representations of numbers, in particular the one defined in the IEEE 754 floating-point standard, enable computing with reduced precision.

We develop lightweight software libraries that enable the integration of transprecision concepts into algorithms. Finally, we build automated workflows for specific problem settings whose solution spaces are enlarged by several orders of magnitude due to the many configurations of the adjustable precision. We show how heuristic optimization strategies, applied to the precision configurations of transprecision computations, deliver near-optimal configurations of the approximated kernels in a short time.

Contents

1 Introduction
  1.1 Trends in deep learning
  1.2 Research directions in deep learning
  1.3 Contributions
  1.4 Thesis structure
  1.5 List of publications

2 Benchmarking today's systems
  2.1 Benchmarks for transprecision computing
    2.1.1 PageRank
    2.1.2 BLSTM
    2.1.3 GLQ
  2.2 Summary and conclusion

3 Approximate computing
  3.1 Approximate computing techniques
    3.1.1 The use of datatypes
    3.1.2 Loop perforation
    3.1.3 Task skipping and memoization
    3.1.4 Using multiple inexact program versions
    3.1.5 Stochastic computing
  3.2 Applying approximate computing
  3.3 Selected results on benchmarks
    3.3.1 PageRank
    3.3.2 BLSTM
    3.3.3 GLQ
  3.4 Summary and conclusion

4 Core transprecision concepts
  4.1 Number formats
    4.1.1 Fixed-point
    4.1.2 IEEE 754 floating-point
    4.1.3 Logarithmic number system (LNS)
  4.2 The transprecision system view
    4.2.1 Transprecision concepts
    4.2.2 Reduced precision as root-cause
    4.2.3 Transprecision computing in current solutions
  4.3 Summary and conclusion

5 Emulating numerical behavior of applications
  5.1 The floatx library
    5.1.1 Related work
    5.1.2 and design goals
    5.1.3 The choice of C++
    5.1.4 The floatx class
    5.1.5 Operations on floatx objects
    5.1.6 The floatxr class template
    5.1.7 Notes on concurrency
    5.1.8 Advanced properties and performance of floatx
  5.2 Numerical analysis of applications
    5.2.1 PageRank
    5.2.2 BLSTM
    5.2.3 GLQ
  5.3 Summary and conclusion

6 Floatx for deep learning
  6.1 Integrating TP into deep learning
    6.1.1 Arithmetic free and elementary kernels
    6.1.2 High precision accumulator assumption
    6.1.3 The intrinsic versus extrinsic approach
  6.2 Numerical analysis of deep learning models
    6.2.1 Reference models
    6.2.2 Numerical analysis
  6.3 Summary and conclusion

7 Optimization for transprecision configurations
  7.1 Searching transprecision configurations
  7.2 Search heuristics
  7.3 Results on reference problem instance
  7.4 Heuristic search performance
  7.5 Summary and conclusion

8 Optimization for IoT devices with given constraints
  8.1 Related work for network architecture search
  8.2 Narrow-space architecture search
    8.2.1 Narrow-space and sampling law definition
    8.2.2 Precision analysis
    8.2.3 Performance characterization on hardware
    8.2.4 Fast cognitive design algorithms
    8.2.5 Statistical properties of generated networks
    8.2.6 Training setup
  8.3 Results
  8.4 Summary and conclusion

9 Conclusions
  9.1 Summary of main results
  9.2 Outlook and future work

A Use case: Efficient video classification
  A.1 Related work
  A.2 Practical video classification systems
    A.2.1 Global video descriptor (GVD)
    A.2.2 Frame based classification (FBC)
    A.2.3 Long short-term memory (LSTM)
  A.3 Computational complexity
  A.4 Temporal subsampling
  A.5 Evaluation and results
  A.6 Results and discussion
  A.7 Summary and conclusion

B Notation and acronyms
  Symbols
  Operators
  Acronyms

Bibliography

Curriculum Vitae

Chapter 1

Introduction

Driven by Moore's law, technology scaling has consistently supported energy-aware computing over the last decades [1]. However, physical brick walls such as heat removal and signal propagation delays limit the system operating frequency. Figure 1.1 shows the number of transistors for central processing units (CPUs), graphical processing units (GPUs), and field programmable gate arrays (FPGAs) depending on the year of introduction¹. Impressively, the scaling follows a steady exponential growth that has been maintained over the past four decades. In contrast, Figure 1.2 shows the operating frequency, which starts to saturate². Until the turn of the millennium, the frequency of early computing devices improved exponentially. More recent systems exceed an operating frequency of 1 GHz, but nowadays common systems typically operate at frequencies between 2 GHz and 4 GHz.

The quest for performance and energy efficiency remains. Various forms of parallelism (i.e., multicore, single instruction multiple data (SIMD)/vector, single instruction multiple threads (SIMT)), specialized application-specific integrated circuit (ASIC) accelerators, and FPGA implementations are developed to meet the requirements.

¹ Available at https://en.wikipedia.org/wiki/Transistor_count (Accessed November 2019)
² Available at https://en.wikipedia.org/wiki/Microprocessor_chronology (Accessed November 2019)



Figure 1.1: Moore's law. The number of transistors present in integrated circuits grows exponentially over time. The same observation holds for regular processors, GPUs, and FPGAs.

Orthogonal to such approaches, transprecision computing considers the number representation as an additional degree of freedom to gain efficiency from more compact floating-point formats. Transprecision computing improves systems by reducing the width of the number representation and the related arithmetic. Simultaneously, transprecision computing systems ensure the quality of the final results.

In this work, we elaborate on how transprecision concepts affect general computing applications. We provide promising results on three applications based on our developed transprecision concepts, including search algorithms and reduced precision libraries. We demonstrate that PageRank [2], an iterative algorithm, can perform the majority of its computations in a compact floating-point representation while still converging to the same quality of the final result as the baseline. The bidirectional LSTM (BLSTM) [3] profits from compact floating-point formats with negligible impact on the final accuracy. The Gauss-Legendre quadrature (GLQ) implements a typical routine used in scientific computing. Even without degrading the results, reduced precision can be used for intermediate representations.


Figure 1.2: The system clock frequency scaled well before the millennium. Nowadays, the system frequency saturates between 2 GHz and 4 GHz due to physical and economical limitations.

Materializing a vision by developing components of a new computing paradigm is an exciting and challenging task. The true value of transprecision computing is achievable when multiple parties collaborate by independently developing products that follow the new paradigm. We expect that the expertise and efforts of various contributors multiply the success of transprecision computing. However, the initial steps bring several engineering and research challenges before new approaches become mainstream. We formulate long- and short-term goals of transprecision computing.

In the long term, transprecision computing exploits approximation in both hardware and software to boost energy efficiency [4]. Moore's law drives the semiconductor industry for general-purpose processors independently from traditional software development. In contrast, transprecision computing relies on a tight coupling between hardware and software to further improve the energy efficiency of the final application or system. The rapidly developing research area known as approximate computing [5–7] pursues the same goals. Malossi et al. [4] envision that transprecision computing progresses the state-of-the-art along several axes:

1. It controls approximation in space and time (when and where) at a fine grain through multiple hardware and software feedback control loops.

2. It does not imply reduced precision at the application level, even though it is also possible to exploit application-level softening of precision requirements for extra benefits.

3. It takes inspiration from nature by defining computing architectures that operate with a smooth and wide range of precision vs. cost trade-off curves.

The key short-term goals that we address in this thesis are:

1. Formulate and define the concepts of transprecision computing;

2. Demonstrate the success of reduced precision for various com- puting domains;

3. Improve aspects of generality, scalability, and simplicity in applying the concepts;

4. Elaborate detailed considerations of transprecision computing in the domain of deep learning.

Progressing the paradigm of transprecision computing consists of achieving these high-level goals. In Section 1.4 we outline the structure of the thesis by explaining how the content contributes to reaching these meta goals. In the next section, we present a dedicated introduction to deep learning. We decided to highlight that topic for the following reasons. First, domain knowledge is assumed to better understand the contributions in Chapter 6 and Chapter 8. Second, even by considering regular arithmetic only, we achieved an outstanding contribution in Chapter 8 by developing a constrained neural network search.

Table 1.1: Key components of a supervised machine learning workflow.

T: Task            | D: Dataset    | A: Algorithm       | M: Model
Classification     | Images        | ML algorithm       | Building blocks
Segmentation       | Videos        | DL algorithm       | DNN topology
Object detection   | Audio         | Transfer learning  | Parameters
Forecasting        | Text          | Few-shot learning  | Preprocessing
[...]              | Tabular data  | [...]              | [...]
                   | Time series   |                    |
                   | [...]         |                    |

Third, to better understand the scaling and generality properties of transprecision computing, it is as important to ensure that the concepts are applied and evaluated with the latest state-of-the-art as it is to develop the concepts themselves. The fast pace and ongoing progress in deep learning favor the development of modular and reusable transprecision computing components.

1.1 Trends in deep learning

Deep learning is considered one of the most promising solutions for solving machine learning problems. At the time of writing, a search with the keyword "deep learning" returns over four and a half million publications on Google Scholar, out of which a substantial fraction of about 10% were published within the current year. The extensive literature, the numerous deep learning (DL) competitions, and the available open-source resources further demonstrate the wide interest in the topic. This introduction covers the full deep learning era to clarify the environment in which the contributions of this thesis are embedded.

We focus on supervised learning problems that are based on a labeled dataset that provides ground truth. Table 1.1 states the key elements that compose a machine learning workflow and lists common choices for each element. T denotes a deep learning task on an input dataset D, solved by running an algorithm A and resulting in a trained model M. The dataset is a collection of paired input x and output y samples, D := {(x, y)}, and the model M is a mapping from the input to the output space, M : x → y, optimized to solve the given task T. Supervised ML algorithms learn relations in the annotated dataset between the raw input data and the assigned labels. Non-neural-network ML approaches include linear classifiers [8], K-nearest neighbors (KNN) [9], support vector machines (SVM) [10], Random Forests [11], and similar methods. DL approaches are the subset of ML approaches that consist of neural networks.

Over the last years, two key enablers have driven the success of deep learning. First, the availability of large scale datasets with known ground truth [12–20] enables supervised learning for complex tasks such as face recognition [12, 13], action recognition [14, 15], or video classification [19–21]. Second, the availability of increased computational performance in today's computing systems, typically achieved with GPUs, enables training large scale models. Figure 1.3 states the launch year and performance of Nvidia Tesla products, Nvidia's product line that is tailored to serve the high-performance computing market for general-purpose GPU computing. The Tesla product line omits some of the traditional graphics peripherals in favor of using the full chip area for computing. Such GPU device configurations are widely used in large scale data centers, cloud computing environments, supercomputers, and research cluster settings. Figure 1.3 highlights three products, the K80, P100, and V100 GPUs, which were used to perform the GPU related workloads presented in this thesis. The historical analysis shows that GPU hardware has followed a steady improvement over the past decade. The S870 GPU, the best option available back in 2007 when the Tesla product line started, achieves a single-precision performance of 1.38 TFLOPS. Ten years later, in 2017, the V100 GPU was launched, which is able to deliver 14.03 TFLOPS, amounting to a 10× improvement in performance over a decade.

At the same time, deep learning research has evolved rapidly. Motivated by improved results and success, researchers extended the focus to specialized subdomains of deep learning. Deep learning methods rely on the availability of a full ecosystem including hardware, a deep learning framework, and the core work of the data scientist achieving solutions for specific tasks. The common interest of different stakeholders, such as industries working on applications powered by deep learning, hardware designers, manufacturers, and traditional research, fuels the motivation and drives the current progress in the field into various directions.


Figure 1.3: Single-precision performance [TFLOPS] of the NVIDIA Tesla GPU series versus launch year.

We look at current and historical results achieved on the CIFAR10 dataset [22] to provide an overview of the trend in the field of deep learning. CIFAR10 is an instance of an image classification problem with 10 classes. The dataset provides 50,000 training samples and marks 10,000 samples for testing. We measure the progress based on 63 publications. Figure 1.4 states the achieved Top1 accuracy on CIFAR10 and the year of publication. The per-year mean Top1 accuracy increased from 80.5% to 98.5% between 2011 and 2019. In other words, the quality increased by about two percentage points per year. By definition, the accuracy cannot exceed 100%, which builds a natural upper bound and causes accuracy to saturate. Due to the short interval for which state-of-the-art results remain valid, development and research become challenging, since any performed comparison is likely to become outdated in a short time. In the next sections, we outline the research subdomains of deep learning and we highlight the scope in which we have extensively studied transprecision computing concepts.


Figure 1.4: Top1 accuracy achieved by 63 reference methods on CIFAR10 versus year of publication. Developers and researchers constantly improved results over one decade.

1.2 Research directions in deep learning

The main development steps through which a data scientist works to implement deep learning solutions are:

1. Data acquisition and collection,

2. Data annotation and cleaning (labor-intensive step),

3. Model definition (performance-critical decision),

4. Algorithm definition (performance-critical decision),

5. Model training (resource- and time-consuming step),

6. Model evaluation (including hyperparameter optimization),

7. Repeat and refine steps.

Steps 1 and 2 are related to obtaining a large dataset. Step 1 deals with how data is captured, including sensor resolution and exposure times. Collection deals with how multiple samples are obtained, i.e., how many different sources are used and how the data was merged. Step 2 deals with annotating and cleaning the data. In many cases, researchers can access a full dataset that has been collected, cleaned, annotated, and provided to the public, such as the CIFAR10 dataset [22]. Depending on the use-case specific settings, annotating and cleaning the data is a labor-intensive process. In this thesis, we assume that the annotated dataset is available and we focus on the active side of deep learning.

In the following, we summarize key research subdomains that are all related to the common deep learning workflow. The core machine learning part consists of designing a model in Step 3, selecting an algorithm in Step 4, and training the model on the given data in Step 5. Final results are evaluated and optimized in Step 6. Typically, depending on the results, parts of this workflow are fine-tuned and repeated to improve results. The following sections summarize research areas that affect the pipeline at different stages.

Model design

Model design solves the problem of designing or selecting a convolutional neural network architecture that defines a parametric function mapping inputs to outputs. The field evolves, and manually designed network architectures have improved results over the past years. For example, first success with convolutional neural networks was achieved with VGG [23], which follows a simple pattern of a stack of convolutional layers. Later, ResNets [24] were introduced, which use residual connections inside the topology that enable training deeper neural networks that are more accurate. More recent architectures, such as Inception [25], dual-path networks (DPN) [26], or DenseNets [27], all follow more complex design patterns with high fan-out and fan-in branching occurring in the network graph. Different networks introduced novel design patterns for different purposes; for example, MobileNet [28] was designed to reduce memory usage and execution time. MobileNet achieves that by factorizing traditional 3 × 3 convolutions into a depthwise and a 1 × 1 pointwise convolution, which reduces the computational cost by about 8 to 9 times and empirically demonstrates good accuracies. Over time, smaller changes in the network structure were proposed and heavily used, such as different activation functions, global-max pooling operations, and regularization operations that are coupled with the training algorithm, such as dropout and batch normalization layers.
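To make the cost argument concrete, the following sketch (PyTorch, with assumed channel counts that are not taken from this thesis) contrasts a standard 3 × 3 convolution with the depthwise-separable factorization described above; the comment at the end works out the roughly eight- to nine-fold reduction in multiply-accumulate operations.

import torch.nn as nn

# Standard 3x3 convolution versus the MobileNet-style factorization into a
# depthwise 3x3 convolution followed by a 1x1 pointwise convolution.
# Channel counts are illustrative assumptions.
cin, cout = 128, 128

standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),  # depthwise
    nn.Conv2d(cin, cout, kernel_size=1),                        # pointwise
)

# Multiply-accumulate operations per output pixel (biases ignored):
#   standard:  3*3*cin*cout       = 147456
#   separable: 3*3*cin + cin*cout =  17536   (about 8.4x fewer)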

Neural architecture search (NAS)

Since defining a good model is non-trivial, early work on automated neural architecture search (NAS) was motivated by the potential to discover better models [29–35]. However, traditional approaches require a vast amount of computing resources or cause excessive execution times due to the full training of candidate networks [36]. Follow-up work optimizes the required search time by shortening or avoiding expensive candidate network training [37–39]. However, due to the success of manually designed topologies that are developed independently, architecture searches face the common challenge of defining the search space. Method evaluations are criticized since even for experienced researchers it becomes difficult to understand whether performance improvements are achieved due to the search algorithm within a given space or due to extensions of the space [40]. The latest developments go in the direction of adapting the search algorithm and its setup to optimize for a specific task, for example to speed up the inference time of a neural network running on a smartphone [41, 42].

Learning algorithms

The predominant learning algorithms for neural networks are (variants of) stochastic gradient descent. Implementations thereof rely on computing the derivative of the loss with respect to the trainable parameters such that they can be updated to minimize the error. Research deals with how to initialize weights and how to add regularization to the learning [43]. Optimizers such as Adadelta [44], Adagrad [45], Adam, or AdaMax [46] improve the learning behaviour by adapting learning rates and using different variants of weight updates. Different researchers looked into 2nd order methods and applied them in the context of image classification, such as the Tonga [47] algorithm. However, traditional 2nd order methods, such as Newton's method or the conjugate gradient method, require computing the inverse Hessian, and the additional complexity causes larger memory requirements and an intractable computational burden [48]. Overall, surprisingly many research papers use the vanilla stochastic gradient descent (SGD) implementation with a learning rate schedule to achieve the most accurate results, even though it is recommended to use an adaptive method for fast convergence [49].
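To make explicit what such an optimizer computes, the sketch below performs a single vanilla SGD step by hand in PyTorch; the model, batch, and learning rate are illustrative assumptions. Adaptive methods such as Adagrad or Adam replace the constant step size with per-parameter step sizes derived from running gradient statistics.

import torch

# One manual stochastic-gradient-descent step on a toy linear model.
w = torch.randn(10, requires_grad=True)        # trainable parameters
x, y = torch.randn(32, 10), torch.randn(32)    # one mini-batch

loss = ((x @ w - y) ** 2).mean()               # squared-error loss
loss.backward()                                # backpropagation fills w.grad

lr = 0.1                                       # illustrative learning rate
with torch.no_grad():
    w -= lr * w.grad                           # w <- w - lr * dloss/dw
    w.grad.zero_()

# The same update with a built-in (adaptive) optimizer:
#   opt = torch.optim.Adam([w], lr=1e-3); opt.zero_grad(); loss.backward(); opt.step()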

Hyperparameter optimization

Hyperparameter optimization (HPO) deals with the problem of optimizing non-trainable parametric values that are used throughout the deep learning workflow. Early work in the field considered selecting the correct network topology as an HPO problem; however, we think it is worth studying the NAS problem separately, given the high interest and the dedicated solutions targeting that problem explicitly, as explained before. However, even if the network architecture is fixed, there are many settings a data scientist has to decide on for which no clear answer about what works best can be given upfront. For example, the training algorithm, as previously explained, can be considered as a categorical choice for an HPO algorithm. More typical settings that could be exposed to an HPO algorithm include the learning rate, optimizer-specific settings, learning rate scheduling strategies, settings for preprocessing, and data augmentation policies. Simple HPO algorithms operate as grid or random search [50]. Bayesian optimization [51, 52] aims to first test configurations in unknown domains that are likely to provide good results. Hyperband optimization [53] improves the time cost to solution quality by balancing the number of tried configurations against the time spent testing one configuration, reducing the number of alive configurations during training and removing bad candidates early in the process. HPO might improve model performance, but it is limited by the choice of parametric values to which it is applied, and it consumes a large amount of resources for a single optimization task.
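The simplest of these strategies fits in a few lines. The sketch below is a hypothetical random search over three hyperparameters; train_and_score is a placeholder for the expensive training-plus-validation step and does not refer to any existing library function.

import random

# Hypothetical search space: each entry draws one value for a hyperparameter.
space = {
    "lr":        lambda: 10 ** random.uniform(-4, -1),      # log-uniform learning rate
    "batch":     lambda: random.choice([32, 64, 128, 256]),
    "optimizer": lambda: random.choice(["sgd", "adam"]),
}

def random_search(train_and_score, budget=20):
    """Try `budget` random configurations and keep the best-scoring one."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(budget):
        cfg = {name: draw() for name, draw in space.items()}
        score = train_and_score(cfg)           # e.g., validation accuracy
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score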

Data augmentation

Data augmentation aims to extend the training dataset by modifying existing samples to improve the generalization performance of the model. Standard techniques include on-the-fly applied image transformations such as random flips, random crops, color, brightness, and sharpness adaptions, blurring, adding noise, stretching, and rotating images. Different methods account for dealing with unbalanced datasets. Majority class under-sampling, minority class over-sampling, or using generative adversarial networks (GANs) to synthesize augmented data are among the common solutions to tackle class imbalance [54]. Some datasets are provided with a fixed resolution (for example, CIFAR10 [22] is provided with 32 × 32 pixels) and others are not. In the latter case, using resizing and cropping combinations to obtain the resolution that is fed into the neural network provides additional degrees of freedom regarding the scale at which the deep learning model is operated.
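A typical realization of such on-the-fly transformations, sketched with torchvision and common parameter values (assumed here, not the settings used in this thesis) for 32 × 32 images:

from torchvision import transforms

# Random augmentations are drawn anew every time a training sample is loaded.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),       # random crop of the padded image
    transforms.RandomHorizontalFlip(),          # random flip
    transforms.ColorJitter(brightness=0.2),     # brightness adaption
    transforms.ToTensor(),
])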

Domain adaptation

Domain adaptation deals with questions around multiple datasets and multiple deep learning workflows. The term transfer learning [55] refers to research problems related to carrying insights from experiments or results obtained on a source dataset D_S over to a new target dataset D_T. Reusing a model trained on the source dataset on the target dataset, by initializing the weights or by freezing layers and only fine-tuning a few dense layers at the tail of the network, potentially provides three advantages over a model trained from scratch: first, the warm start already provides more accurate results; second, during training the model follows a learning curve that outperforms the baseline; and third, the model potentially saturates at a higher accuracy. Especially for smaller datasets, freezing layers that have learned to extract useful features avoids over-fitting and helps to improve the generalization error on the target dataset. In a similar context, few-shot classification [56], the problem of learning to generalize to classes unseen during training from a few annotated samples, has gained popularity to account for the fact that large annotated datasets for domain-specific problems are often not available, or are time-consuming and expensive to gather.
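The freeze-and-fine-tune recipe described above can be sketched in a few lines of PyTorch; the pretrained ResNet-18 backbone and the ten target classes are assumptions chosen for illustration.

import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)        # weights learned on a source dataset

for param in model.parameters():                # freeze the feature extractor
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 10)  # new, trainable head for the target task
# Only the parameters of model.fc are then passed to the optimizer for fine-tuning.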

Optimizing for speed

State-of-the-art neural networks (NNs) for image classification typically have 10-200 million parameters and require 10-25 billion arithmetic operations to perform inference for a single image [57]. Such deep networks achieve high classification accuracies, but also require long training times [58]. While the race to improve accuracy on challenges such as the ImageNet large scale visual recognition challenge (ILSVRC) [59] drives the community to develop ever more complex models, this trend is likely to continue with the increasing availability of video-based datasets. For these NNs to remain economically viable, it is important to keep the computational effort in mind to reduce the costs involved when building practical systems for large-scale inference, such as energy and infrastructure expenditures [60]. Research directions aiming to improve the time to solution or to outperform a baseline in at least one key cost factor, such as energy to solution, are manifold. For example, model parallelism and distributed training deal with the question of how to extend algorithms across multiple compute resources. Additionally, there is heavy industrial interest in building specialized hardware for deep learning, from ASIC designs and FPGA based solutions up to providing and extending machine learning solutions on existing general-purpose computing systems.

ML ecosystems

ML research is performed at the tip of the iceberg of a full software and hardware stack. The problem under consideration implies relying on an underlying environment that implements and runs standard functionality. Common deep learning frameworks include TensorFlow [61], PyTorch [62], Chainer [63], and MXNet [64], among many more. All of them implement common standard functionality, such as defining neural networks, setting up optimizers, and training models with a variety of algorithms, and they support GPU acceleration for all core functionality. Since deep learning research, especially work in the subdomain of the NAS problem, involves many training runs, the availability of multi-node resources equipped with a queueing, scheduling, and managing system is required. Many cloud providers offer to rent customized hardware with pre-installed software that is tailored to serve the needs of the deep learning market.

1.3 Contributions

The key contributions of this thesis are summarized below:

• Providing an abstraction of transprecision computing.

We formalize transprecision computing and related problems, including characterization, configuration search, and design space exploration. We focus on the key notation of quality versus performance trade-offs.

• Designing, benchmarking, and implementing a reduced precision library.

We contributed the core quantization routines that emulate the low-level behavior of reduced floating-point formats. We verified the behavior of floatx by exhaustive tests that compare floatx half implementations against third-party implementations. A simplified illustration of such a quantization step is sketched after this list.

• Integrating transprecision into PyTorch.

We discussed the integration of reduced precision into PyTorch [62], a commonly used deep learning framework. We provide analytical and empirical arguments that compare intrinsic and extrinsic emulation approaches. We implemented several utility functions that help to traverse, insert, and modify computational graphs to produce all results presented in this work.

• Producing error-resilience results for reference models.

We claim the generality of transprecision computing. We support that argument by running numerical experiments on 30 well-established reference models. In accordance with the reported error-resilience of deep learning models, our experiments confirm the general applicability of transprecision computing.

• A constrained NAS algorithm for the IoT.

We contributed a narrow-space NAS that synthesizes models meeting constraints. With a simple device-specific calibration approach, we are able to generate models that meet inference time requirements on low-cost internet of things (IoT) devices. Our search enables producing customized models with a limited budget during the synthesis for a resource-limited IoT device.

• A large-scale reduced precision study for IoT.

We used our NAS approach to fully train over 3,000 baseline models. Applying transprecision computing to the full set of models enables global insights. We observed that, in terms of weight memory footprint versus accuracy, transprecision models outperform regular models by a wide margin.

• Discussing and developing practical aspects around transprecision computing.

We contribute research and engineering efforts to developing transprecision computing and related practical aspects. We provide heuristic approaches to study the configuration problem that enhance the search speed significantly. Additionally, we ensured that our implementation of the quantization operation integrated into PyTorch [62] executes on the GPU. Joining those efforts allows performing more experiments in a shorter time. Short evaluation cycles are a key factor for applying the discussed concepts in practice to novel problems.
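As referenced in the contribution on the reduced precision library above, the following is a simplified sketch of the kind of quantization step that emulates a reduced floating-point format: the significand of a double-precision value is rounded to fewer bits, while exponent-range clipping and special values are omitted. It illustrates the idea only and is not the floatx implementation.

import numpy as np

def quantize_significand(x, frac_bits):
    """Round x to a float with `frac_bits` fraction bits (illustration only)."""
    x = np.asarray(x, dtype=np.float64)
    m, e = np.frexp(x)                    # x = m * 2**e  with 0.5 <= |m| < 1
    scale = 2.0 ** (frac_bits + 1)        # one extra bit for the implicit leading 1
    return np.ldexp(np.round(m * scale) / scale, e)

print(quantize_significand(1.0 / 3.0, 10))   # behaves like the float16 significand
print(quantize_significand(1.0 / 3.0, 23))   # behaves like the float32 significand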

1 Introduction
2 Benchmarks: 2.1.1) PageRank, 2.1.2) BLSTM, 2.1.3) GLQ
3 Approximate computing: 3.1) Techniques, 3.2) Comparison, 3.3) Results on BMs
4 Transprecision computing: 4.1) Number formats, 4.2) Systems
5 Emulation: 5.1) Floatx library, 5.2) Results on BMs
6 Emulation for deep learning: 6.1) Integration, 6.2) Numerical results
7 Optimization for transprecision configurations
8 Optimization for IoT devices
9 Conclusions

Figure 1.5: Thesis structure.

1.4 Thesis structure

Figure 1.5 shows an overview of the structure of the thesis. Chapter 2 starts the thesis by summarizing some of the existing benchmarks and their use. We identify representative candidates from three different computing domains to cover the variability of applications that occur in general-purpose computing. Section 2.1.1, Section 2.1.2, and Section 2.1.3 present PageRank, BLSTM, and GLQ as examples of Big Data, deep learning, and scientific computing. Selecting applications from three different domains contributes towards the generality claim of transprecision computing as stated in Item 3 and realizes the goal defined in Item 2. We use those applications throughout the thesis to demonstrate high-level considerations of transprecision computing.

Chapter 3 summarizes established approximate computing approaches. We identified the most promising technique in Section 3.2 and we apply it to the applications in Section 3.3. We identified common challenges among such approaches, such as the strict data dependency and the notion of application quality. We observe that the definition of acceptable quality plays a major role in revealing the potential for performance improvements. Additionally, we observe that inherently existing hyper-parameters already provide a competitive way to trade off quality versus performance in many cases.

Chapter 4 defines the concepts of transprecision computing and focuses on the goal stated in Item 1. We discuss number representations in Section 4.1 and define transprecision computing on a conceptual level in Section 4.2.1. We focus on the notation of quality and performance depending on a transprecision configuration space. The abstraction leaves room for future systems to be exploited with the same mentality, while we already provide answers to related engineering problems. We define the configuration space design, the characterization, and the configuration optimization as reusable meta problems.

Chapter 5 designs and implements a C++ library that allows emulating reduced floating-point formats. We explain the design choices of floatx in Section 5.1.2. Section 5.1.2 provides low-level implementation details and Section 5.1.8 discusses the scaling and performance of the library. Section 5.2 uses floatx to provide in-depth numerical studies of the three applications.

Chapter 6 deals with the integration of floatx into PyTorch [62], a commonly used deep learning framework. We discuss emulation aspects and conditions under which a fast and efficient numerical emulation is achievable in Section 6. We demonstrate the scaling and generality of numerical evaluations by defining 30 well-established reference image classification models in Section 6.2.1 and we report numerical results in Section 6.2.2. The advances reported are specific to the domain of deep learning, covering the goal stated in Item 4. The concept of transprecision demonstrates to be model agnostic and follows similar behaviors for different reference models, which supports the claimed generality of transprecision computing requested in Item 3.

Chapter 7 and Chapter 8 focus on the automation and optimization of workflows. Section 7.2 introduces specific search heuristics to efficiently find good enough transprecision configurations in reasonable time. Our proposed algorithms demonstrate in Section 7.4 how the search times are significantly reduced. Those insights build the foundation of the simplification needed to quickly apply transprecision computing to new problems, as requested in Item 3. In Chapter 8 we develop a constrained neural network architecture search tailored for the Internet of Things. Even in the regular domain of using full 32-bit precision, our approach is novel and outperforms alternatives. Additionally, the search involves the training of over 3,000 models, which enables a large-scale exploration of reduced precision computing. To the best of our knowledge, we are the first to demonstrate the excellent behavior of models operating with reduced precision. Even when opportunity costs are considered in the comparison, which involves synthesizing models operating with full precision, the reduced precision models provide a better overall trade-off.

Chapter 9 concludes the thesis.
We summarize the main findings in Section 9.1 and we provide an outlook on future work in Section 9.2. The additional Chapter A in the appendix discusses the system design of a video classification system. We evaluate three state-of-the-art neural-network-based approaches for large-scale video classification, where the computational efficiency of the inference step is of particular importance due to the ever-increasing amount of data throughput for video streams. Our evaluation focuses on finding good efficiency vs. accuracy trade-offs by evaluating different network configurations and parametrizations.

1.5 List of publications

Key material covered in this thesis has been published:

[65] F. Scheidegger, L. Benini, C. Bekas, and C. Malossi, “Constrained deep neural network architecture search for IoT devices accounting for hardware calibration,” in Advances in Neural Information Processing Systems, 2019, to appear

[66] G. Flegar, F. Scheidegger, V. Novakovic, G. Mariani, A. Tomas, C. Malossi, and E. Quintana-Ortí, “FloatX: A C++ library for customized floating-point arithmetic,” ACM Trans. Math. Softw., 2019, to appear

[67] F. Scheidegger, L. Cavigelli, M. Schaffner, A. Malossi, C. Bekas, and L. Benini, “Impact of temporal subsampling on accuracy and performance in practical video classification,” in Signal Processing Conference (EUSIPCO), 2017 25th European. IEEE, 2017, pp. 996–1000

We have worked and contributed in the domain of deep learning with IBM colleagues on the following publications:

[68] F. Scheidegger, R. Istrate, G. Mariani, L. Benini, C. Bekas, and C. Malossi, “Efficient image dataset classification difficulty estimation for predicting deep-learning accuracy,” arXiv preprint arXiv:1803.09588, 2018

[54] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi, “Bagan: Data augmentation with balancing gan,” arXiv preprint arXiv:1803.09655, 2018

[69] R. Istrate, F. Scheidegger, G. Mariani, D. Nikolopoulos, C. Bekas, and A. C. I. Malossi, “Tapas: Train-less accuracy predictor for architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3927–3934

[70] A. Sood, B. Elder, B. Herta, C. Xue, C. Bekas, A. C. I. Malossi, D. Saha, F. Scheidegger, G. Venkataraman, G. Thomas et al., “Neunets: An automated synthesis engine for neural network design,” arXiv preprint arXiv:1901.06261, 2019

Those works are not explicitly covered in this thesis. However, through long discussions I gained the expertise in deep learning that shapes and supports key contributions of this thesis. Without that prior learning, the key IoT contribution in Chapter 8 would not have been possible. Prior to joining IBM, while being employed at ETH, I have worked and contributed to the following publications:

[71] Y. Popoff, F. Scheidegger, M. Schaffner, M. Gautschi, F. K. Gürkaynak, and L. Benini, “High-efficiency logarithmic number unit design based on an improved cotransformation scheme,” in Proceedings of the 2016 Conference on Design, Automation & Test in Europe. EDA Consortium, 2016, pp. 1387–1392

[72] M. Schaffner, F. Scheidegger, L. Cavigelli, H. Kaeslin, L. Benini, and A. Smolic, “Towards edge-aware spatio-temporal filtering in real-time,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 265–280, 2018

[73] M. Eggimann, C. Gloor, F. Scheidegger, L. Cavigelli, M. Schaffner, A. Smolic, and L. Benini, “Hydra: An accelerator for real-time edge-aware permeability filtering in 65nm cmos,” in Circuits and Systems (ISCAS), 2018 IEEE International Symposium on. IEEE, 2018, pp. 1–5

Chapter 2

Benchmarking today’s systems

Benchmarks are developed to reliably measure applications, systems, software, and hardware components. Well-established benchmarks such as Whetstone [74] and Dhrystone [75] measure and compare early computing systems. Those benchmarks are synthetic, which means that they execute a random instruction mix that follows the distribution of instructions caused by real applications. Whetstone is designed for benchmarking floating-point performance, while Dhrystone benchmarks integer arithmetic.

The standard performance evaluation corporation (SPEC)¹ maintains and releases various benchmark suites. They cover domains including cloud services, CPU performance, workstation performance, e-mail server operation, and Java client/server applications. In contrast to synthetic benchmarks, the SPEC benchmarks [76] perform useful payload computations. The setup validates results generated during measurement. Recent versions of SPEC, such as SPEC CPU 2017², include 43 tasks to cover a wide variety of applications and use-cases. Other benchmarks, such as the linear algebra package (LINPACK) benchmark [77], assess floating-point performance by solving a system of linear equations.

¹ Available at https://www.spec.org/ (October 2019)
² Available at https://www.spec.org/cpu2017 (October 2019)

The LINPACK benchmark attracts attention due to its use in ranking supercomputers, published as the Top500³. Its simplicity and its easy adaptation to new systems explain the success of the LINPACK benchmark. However, the LINPACK benchmark has several limitations; for example, its regular memory access patterns do not fully stress the caching and memory subsystems. Additionally, relying on benchmarking one application involves the risk that systems are engineered to over-tune the benchmark metrics. Instead, benchmarks should assess the behavior of the real applications for which systems are built.

Similar to the Top500, the Green500⁴ ranks today's supercomputers based on energy efficiency [78]. The high performance conjugate gradient (HPCG)⁵ benchmark [79] was introduced to overcome some of the limitations of LINPACK. HPCG performs Conjugate Gradient iterations as core operations for measuring performance. The occurring sparsity patterns ensure that non-trivial memory access patterns stress the memory subsystem. The authors claim that HPCG performs more meaningful computations that are better correlated with workload distributions appearing in real-world applications.

2.1 Benchmarks for transprecision computing

Approximate computing (see Chapter 3) and transprecision computing (see Chapter 4) rely on two characteristics: the performance and the quality of the results. To benchmark and quantify such systems, it is essential to measure both. Traditional benchmarks aim to measure the performance of different computing systems. We require benchmarks that additionally report the quality of the results. Even though the newer benchmarks all include result validation procedures, their purpose is to cross-check the full workflow. In contrast, we need a way to measure the output quality of systems that are built with the intent of changing the working precision.

³ Available at https://www.top500.org/ (October 2019)
⁴ Available at http://www.green500.org/ (December 2019)
⁵ Available at http://www.hpcg-benchmark.org/ (October 2019)

Benchmarking transprecision systems is challenging since it covers interacting aspects of the software, hardware, and compiler stack at multiple granularity levels. Well-written benchmarks aim to disentangle influences stemming from different components such as hardware, software, and compilers. Traditional benchmarks are distributed as plain source code, whereas the required tooling, such as the compilation, is left to the target system. That way, those benchmarks use fixed code and minimal compilation requirements to benchmark the hardware systems. In contrast, transprecision benchmarks focus on assessing applications and kernels directly. They should characterize the effect of reduced precision and identify reduced precision operation without or with only minor quality degradation. Benchmarking should strive for a top-down approach including end-to-end behaviors that help to understand the collaborative operation of multiple components.

To demonstrate the concept of transprecision computing, high-level application scalability studies assess the generality and error resilience of applications. Insights and conclusions should be collected before triggering the engineering and design of customized hardware implementations. From that perspective, the first key question is how much precision reduction applications allow without harming quality. Addressing this precision-based quality characterization is novel when compared against traditional performance benchmarks. Methodologically, the simplest approach is building an end-to-end system and measuring quality and performance thereon. However, designing a full transprecision system including hardware, programming languages, and compilers is out of the scope of this thesis. Instead, we split the evaluation process into two modular and decoupled conceptual steps. First, we use emulation to judge the quality of solutions independent of the performance. Second, we use proxy metrics to estimate the performance. That approach allows postponing the full development of transprecision hardware. Still, it provides a strong methodology to systematically assess transprecision concepts. The modular approach allows us to focus on specific components and to reuse invested engineering time. This key insight motivates us to write a reduced precision floating-point library, as explained in Chapter 5.

Since understanding numerical effects requires application knowledge, we discuss transprecision on three kernels. We selected representative candidates from three different domains that compute results with domain-specific meanings. Section 2.1.1, Section 2.1.2, and Section 2.1.3 present PageRank, BLSTM, and GLQ as examples of Big Data, deep learning, and scientific computing.

2.1.1 PageRank

PageRank [2] is one of the early building blocks used in web search engines and enabled the success of Google [80]. The idea behind PageRank is that the citation graph of web pages intuitively defines the importance of web pages. Serving web queries has become an integral part of professional and private life, covering domains such as research, industry, and private use. The IEEE international conference on data mining (ICDM) selected PageRank in December 2006 as one of the top-ten data mining algorithms [81]. The fact that Google annually handles more than two trillion queries (the measurement was performed in 2016)⁶ impressively demonstrates the relevance of large-scale ranking systems.

PageRank iteratively computes the node scores given the topology of a directed graph, stored as a sparse adjacency matrix of size n × n with z nonzeros. PageRank is known to be numerically stable [82]. Sparsity-aware implementations (n < z < n²) reach a time complexity of O(Iz) and a memory complexity of O(z), where I denotes the input-data-dependent number of iterations until convergence. PageRank is memory-bound since each iteration accesses z matrix entries while performing constant work O(1) per entry. PageRank is improved with a RAM-aware implementation [83] or by changing the algorithm to a graph aggregation technique [84]. Algebraic methods enhance the convergence rate [85].

Algorithm 1 implements PageRank. Line 2 initializes the iteration vector of length n with a uniform distribution, where we use the notation e = [1, 1, 1, ..., 1]. The main loop of PageRank iterates until convergence. The iteration vector p is updated in line 5, where a damping factor d ∈ (0, 1] is used to stabilize the convergence. Convergence is reached when the iteration vector remains unchanged up to the stopping threshold ε. The final result is a normalized distribution that ranks nodes according to their importance. The number of iterations of PageRank depends on the input data A, the damping factor d, and the stopping threshold ε.

⁶ Available at https://searchengineland.com/google-now-handles-2-999-trillion-searches-per-year-250247 (November 2019)

Algorithm 1 PageRank
1: procedure PageRank(A, d, ε)        ▷ graph is stored in an n × n sparse and normalized adjacency matrix A
2:     p_0 ← e/n
3:     k ← 1
4:     repeat
5:         p_k ← (1 − d)e + d Aᵀ p_{k−1}
6:         k ← k + 1
7:     until ‖p_k − p_{k−1}‖_1 < ε
8:     return p_k                     ▷ resulting ranking
9: end procedure
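For reference, Algorithm 1 translates almost line by line into Python. The dense NumPy version below is a sketch for clarity; a sparse matrix from scipy.sparse works with the same code, and the default damping factor of 0.85 is a common choice rather than a value prescribed here.

import numpy as np

def pagerank(A, d=0.85, eps=1e-10):
    """Algorithm 1: A is the n x n normalized adjacency matrix."""
    n = A.shape[0]
    e = np.ones(n)
    p = e / n                                   # line 2: p_0 <- e / n
    while True:
        p_new = (1.0 - d) * e + d * (A.T @ p)   # line 5: iteration update
        if np.abs(p_new - p).sum() < eps:       # line 7: L1 stopping criterion
            return p_new                        # line 8: resulting ranking
        p = p_new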

2.1.2 BLSTM

BLSTM stands for Bidirectional Long Short-Term Memory and refers to topology structures that are used for classifying sequential data. In our context, we refer with BLSTM to the use-case that implements a specific BLSTM topology for solving the optical character recognition (OCR) of old German text (Fraktur) [86]. The use-case with the original reference C implementation is provided by Rybalkin and Wehn [3], who implemented the first field-programmable gate array (FPGA) acceleration of BLSTM. OCR converts handwritten or printed images of text automatically into machine-encoded text. OCR has applications in various fields: first, it acts as data entry for business documents, such as recognizing passports, invoices, bank statements, and receipts. Second, OCR is used in automatic number plate detection systems [87] to identify cars, for example, to provide restricted access to parking lots. Third, OCR helps to ingest information into computer systems in a searchable form by recognizing the text of scanned books [88].

Deep learning models rose to become the most successful methods for solving such tasks [89]. Long short-term memories (LSTMs) [90] enable the handling of sequential data, occurring in video classification (see Appendix A), language learning, audio processing, and optical character recognition. The bidirectional variant for OCR [91] is improved by a forward-backward algorithm, known as connectionist temporal classification (CTC) [92]. CTC is capable of recognizing text lines without requiring any pre-segmentation. The BLSTM kernel solves the OCR problem on the old German text [3].

Algorithm 2 states the recurrent equations of an LSTM cell [90], which resembles a finite state machine (FSM). The operator ⊙ denotes element-wise multiplication, and σ, h, and g denote element-wise applied activation functions. The logistic sigmoid function is used for σ and the hyperbolic tangent for h and g. The input, forget, and output gates control the data flow by either passing or attenuating the corresponding values. The gates themselves depend on the current input x^t, the last computed output y^{t−1}, and the last state c^{t−1} (or the current state in the case of the output gate). In line 5, the main recurrence updates c^t based on the previous state c^{t−1}, the currently processed input z^t, the input gate activations i^t, and the forget gate activations f^t. The final output y^t is based on the processed state c^t and the output gate activations o^t; see line 7.

Algorithm 2 LSTM cell update
▷ Trainable parameters of the LSTM cell:
    - Weight matrices: W_z, W_i, W_f, W_o, R_z, R_i, R_f, and R_o
    - Bias vectors: b_z, b_i, b_f, and b_o
    - Peephole selection vectors: p_i, p_f, and p_o
▷ Inputs of the LSTM cell:
    - Input vector at time step t is x^t
    - Internal state of previous time step c^{t−1}
    - Output of previous time step y^{t−1}
1: procedure LSTMCellUpdate(x^t, c^{t−1}, y^{t−1})
2:     z^t ← g(W_z x^t + R_z y^{t−1} + b_z)                          ▷ input
3:     i^t ← σ(W_i x^t + R_i y^{t−1} + p_i ⊙ c^{t−1} + b_i)          ▷ input gate
4:     f^t ← σ(W_f x^t + R_f y^{t−1} + p_f ⊙ c^{t−1} + b_f)          ▷ forget gate
5:     c^t ← z^t ⊙ i^t + c^{t−1} ⊙ f^t                               ▷ internal state update
6:     o^t ← σ(W_o x^t + R_o y^{t−1} + p_o ⊙ c^t + b_o)              ▷ output gate
7:     y^t ← h(c^t) ⊙ o^t                                            ▷ block output
8:     return c^t, y^t                                               ▷ return current state and output
9: end procedure
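A compact C++ sketch of one cell update following the equations of Algorithm 2 is shown below. The dense-matrix helper, the struct layout, and all names are assumptions for this sketch; g and h are taken as tanh and σ as the logistic sigmoid, as stated in the text, and the output gate peeks at the freshly updated state.

#include <cmath>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // row-major dense matrix

static Vec matVec(const Mat& W, const Vec& x) {
    Vec y(W.size(), 0.0);
    for (std::size_t i = 0; i < W.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += W[i][j] * x[j];
    return y;
}

static double sigmoid(double v) { return 1.0 / (1.0 + std::exp(-v)); }

// One peephole LSTM cell update; c and y are the previous state/output and
// are overwritten with the current state/output.
struct LstmCell {
    Mat Wz, Wi, Wf, Wo, Rz, Ri, Rf, Ro;
    Vec bz, bi, bf, bo, pi, pf, po;

    void update(const Vec& x, Vec& c, Vec& y) const {
        Vec z = matVec(Wz, x), i = matVec(Wi, x), f = matVec(Wf, x), o = matVec(Wo, x);
        Vec rz = matVec(Rz, y), ri = matVec(Ri, y), rf = matVec(Rf, y), ro = matVec(Ro, y);
        for (std::size_t k = 0; k < c.size(); ++k) {
            double zt = std::tanh(z[k] + rz[k] + bz[k]);               // input
            double it = sigmoid(i[k] + ri[k] + pi[k] * c[k] + bi[k]);  // input gate
            double ft = sigmoid(f[k] + rf[k] + pf[k] * c[k] + bf[k]);  // forget gate
            c[k] = zt * it + c[k] * ft;                                // state update
            double ot = sigmoid(o[k] + ro[k] + po[k] * c[k] + bo[k]);  // output gate on updated state
            y[k] = std::tanh(c[k]) * ot;                               // block output
        }
    }
};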

Algorithm 3 implements BLSTM. It accepts as input a variable-length sequence of fixed-dimension vectors, such as the pixel values of the column-wise split input image. First, the LSTM cells are instantiated by loading their internal parameters as used in Algorithm 2. BLSTM performs a forward and a backward pass through the two LSTM cells over the scanned input image. Finally, the CTC merges the independent partial sequences into the final output. CTC first concatenates the two outputs and applies a dense layer to compute the logits, see line 14. Second, the softmax is applied to produce the normalized probabilities and the index of the maximum entry is identified. Third, each index corresponds to a character of the alphabet A, and line 16 maps indices to characters.

Algorithm 3 BLSTM
1: procedure BLSTM(x^1, x^2, x^3, ..., x^n)
2:     cellFW ← InitFWLSTMCell()                                      ▷ load model
3:     cellBW ← InitBWLSTMCell()
4:     W_CTC, A ← Init()
                                                                      ▷ forward pass over sequence
5:     c, y_FW^0 ← 0
6:     for t ← 1, n do
7:         y_FW^t, c ← cellFW.LSTMCellUpdate(x^t, c, y_FW^{t−1})
8:     end for
                                                                      ▷ backward pass over sequence
9:     c, y_BW^{n+1} ← 0
10:    for t ← 1, n do
11:        y_BW^{n+1−t}, c ← cellBW.LSTMCellUpdate(x^{n+1−t}, c, y_BW^{n+2−t})
12:    end for
                                                                      ▷ CTC pass over sequence
13:    for t ← 1, n do
14:        l^t ← W_CTC [y_FW^t; y_BW^t]                               ▷ compute logits
15:        j^t ← arg max_j softmax(l^t)                               ▷ maximize index
16:        α^t ← A(j^t)                                               ▷ map index to character
17:    end for
18:    return α^1, α^2, α^3, ..., α^n                                 ▷ return character sequence
19: end procedure
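The CTC pass of Algorithm 3 (lines 13 to 17) reduces, for each column, the concatenated forward and backward outputs to one character index. A self-contained C++ sketch of that per-column decoding is given below; the weight layout and the alphabet representation are assumptions, and since softmax is monotonic, the argmax is taken directly on the logits.

#include <cstddef>
#include <string>
#include <vector>

// Greedy per-column decoding: logits -> argmax -> character.
std::string ctcDecode(const std::vector<std::vector<double>>& yFw,   // forward LSTM outputs, one vector per column
                      const std::vector<std::vector<double>>& yBw,   // backward LSTM outputs, one vector per column
                      const std::vector<std::vector<double>>& Wctc,  // dense layer: rows = alphabet size, cols = 2 * hidden size
                      const std::string& alphabet) {
    std::string out;
    for (std::size_t t = 0; t < yFw.size(); ++t) {
        std::size_t best = 0;
        double bestLogit = -1e300;
        for (std::size_t j = 0; j < Wctc.size(); ++j) {          // one logit per alphabet entry
            double logit = 0.0;
            for (std::size_t k = 0; k < yFw[t].size(); ++k)       // [yFW ; yBW] concatenation
                logit += Wctc[j][k] * yFw[t][k] + Wctc[j][k + yFw[t].size()] * yBw[t][k];
            if (logit > bestLogit) { bestLogit = logit; best = j; }
        }
        out.push_back(alphabet[best]);                            // map index to character
    }
    return out;
}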

In our evaluation setting, the pre-trained BLSTM model infers 3,401 images of text fragments. Samples have a fixed height of 25 pixels and a variable width between 64 and 732 pixels, with a mean value of 520.5 pixels and a standard deviation of 100.7 pixels. The BLSTM achieves a Levenshtein distance [93] based accuracy of 98.2337%.

2.1.3 GLQ

Finite element methods (FEMs) provide numerical solutions to partial differential equations occurring in various applications [94]. FEM applications include magnetic potential modeling [95], heat transfer [96], fluid flow [97], or structural analysis [98]. Those methods form an integral part of industrial engineering. Simulations provide insights into problems where no analytic solution exists. Finite element methods rely on numerical integration. The GLQ solves the 1D numerical integration problem [99–101]. GLQ extends to the 2D case and can be limited to triangular integration regions [102], which occur as the typical use-case in FEM kernels. In this work, we focus on the 1D GLQ kernel to isolate a relevant part of FEM simulations that covers various applications. GLQ solves a numerical integration over an arbitrary integrable function. It approximates the integral by a weighted sum of function evaluations at a set of discrete points. The basic GLQ rule is formulated over the domain [−1, 1] and given as follows:

\int_{-1}^{1} f(x)\, dx \approx \sum_{i=1}^{n} w_i f(x_i),    (2.1)

where the weights w_i and positions x_i are precomputed. x_i is the i-th root of the associated Legendre polynomial P_n(x) and w_i is defined through:

w_i = \frac{2}{(1 - x_i^2)\,(P_n'(x_i))^2}.    (2.2)

With a change of variable, the rule extends to work on arbitrary integration domains [a, b]. Alternatively, the arbitrary domain integral can be segmented into multiple smaller intervals. Independent GLQ computations are applied to shifted versions of the original function. We based our reference code on a freely available implementation⁷. We use the six Genz functions [103] to evaluate the quality of the GLQ implementation. First, Genz functions are analytically integrable such that a closed form of the result exists. Second, they provide several levels of difficulty, such as smooth, less smooth, continuous, and discontinuous functions. Those use-cases cover typical issues with numerical integration routines. Third, Genz functions are well-established and they are designed for evaluating quadrature procedures [103]. The Genz functions are defined as follows:

f_1(x) = \cos\left(2\pi u_1 + \sum_{i=1}^{n} a_i x_i\right)

f_2(x) = \prod_{i=1}^{n} \left(a_i^{-2} + (x_i - u_i)^2\right)^{-1}

f_3(x) = \left(1 + \sum_{i=1}^{n} a_i x_i\right)^{-(n+1)}

f_4(x) = \exp\left(-\sum_{i=1}^{n} a_i^2 (x_i - u_i)^2\right)

f_5(x) = \exp\left(-\sum_{i=1}^{n} a_i |x_i - u_i|\right)

f_6(x) = \begin{cases} 0 & \text{if } x_1 > u_1 \text{ or } x_2 > u_2 \\ \exp\left(\sum_{i=1}^{n} a_i x_i\right) & \text{otherwise} \end{cases}    (2.3)

where the parameters a, u, and input x are vectors of dimension n.
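The following self-contained C++ sketch illustrates the GLQ rule of Equations (2.1) and (2.2), computing the nodes and weights with the standard Newton iteration on the Legendre three-term recurrence and applying the change of variable to an arbitrary interval split into subintervals. It is an illustration under these assumptions, not the thesis' reference implementation (footnote 7); the example integrates the one-dimensional Genz function 4 with hypothetical parameters a = 5 and u = 0.5.

#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

// n-point Gauss-Legendre nodes x_i and weights w_i on [-1, 1].
void gaussLegendre(int n, std::vector<double>& x, std::vector<double>& w) {
    const double pi = std::acos(-1.0);
    x.assign(n, 0.0);
    w.assign(n, 0.0);
    for (int i = 0; i < n; ++i) {
        double xi = std::cos(pi * (i + 0.75) / (n + 0.5));   // initial guess for the i-th root
        for (int it = 0; it < 100; ++it) {
            // Evaluate P_n(xi) and its derivative via the three-term recurrence.
            double p0 = 1.0, p1 = xi;
            for (int k = 2; k <= n; ++k) {
                double pk = ((2.0 * k - 1.0) * xi * p1 - (k - 1.0) * p0) / k;
                p0 = p1;
                p1 = pk;
            }
            double dp = n * (xi * p1 - p0) / (xi * xi - 1.0);
            double dx = p1 / dp;
            xi -= dx;                                         // Newton step
            if (std::fabs(dx) < 1e-15) {
                w[i] = 2.0 / ((1.0 - xi * xi) * dp * dp);     // Equation (2.2)
                break;
            }
        }
        x[i] = xi;
    }
}

// Integrate f over [a, b] with NI subintervals and NO support points each.
double glq(const std::function<double(double)>& f, double a, double b, int NI, int NO) {
    std::vector<double> x, w;
    gaussLegendre(NO, x, w);
    double sum = 0.0, h = (b - a) / NI;
    for (int s = 0; s < NI; ++s) {
        double lo = a + s * h;
        for (int i = 0; i < NO; ++i)
            sum += 0.5 * h * w[i] * f(lo + 0.5 * h * (x[i] + 1.0));   // change of variable
    }
    return sum;
}

int main() {
    auto genz4 = [](double t) { return std::exp(-25.0 * (t - 0.5) * (t - 0.5)); };
    std::printf("%.15f\n", glq(genz4, 0.0, 1.0, 2, 8));
    return 0;
}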

2.2 Summary and conclusion

The most important findings of this chapter are the following:

⁷Available at http://www.holoborodko.com/pavel/numerical-methods/numerical-integration/l (November 2019)

• We discussed existing benchmarks, such as Whetstone [74] and Dhrystone [75]. They are traditional synthetic benchmarks that measure floating-point and integer arithmetic performance. The SPEC benchmarks [76] consist of suites of kernels that were designed to provide workloads that occur in practical applications. The linpack benchmark [77] solves a linear system of equations. Due to its simplicity and the freedom of fine-tuning source code, it is widely used to rank high-performance systems. However, the benchmark has been criticized due to its continuous memory access patterns. The HPCG benchmark [79] performs sparse conjugate gradient iterations that cause non-trivial memory access patterns. HPCG has been proposed as a replacement for linpack. HPCG stresses the memory subsystem more and therefore better measures the performance that matters for real applications.

• Traditional benchmarks measure the performance of a computing system. To study transprecision concepts, we focus on applications and their algorithms. In addition to performance, application-specific quality needs to be assessed. We assess three kernels in detail: PageRank, BLSTM, and GLQ.

• PageRank. PageRank is memory-bound and iteratively computes the node scores given the topology of a directed graph as a sparse adjacency matrix. It became popular for ranking web pages by using hyperlink information.

• BLSTM. BLSTM solves the OCR problem for old German text (Fraktur).

• GLQ. The GLQ numerically computes the integral over a function. GLQ is inherently approximative by nature: even when running with full precision, the quality of the integration routine depends on the number of support points.

Chapter 3

Approximate computing

Approximate computing refers to techniques that trade off achieved quality against the effort spent to obtain results [5–7, 104]. Potentially inaccurate results are produced; hence, approximation techniques should be applied to applications where approximate results are acceptable. Given the growing performance demand of applications, approximate computing offers a promising option that is orthogonal to the traditional advances in technology. Approximate computing techniques are effective at various levels. First, approximate circuits study hardware design trade-offs. Simple circuits predict the output of traditionally slow operating paths, allowing the clock period to be reduced below the worst-case propagation delay [105]. Pruned adders are designed to shorten long carry chains with minimal statistical effects on results [106, 107]. Second, approximate storage considers trade-offs obtained for memory systems. For example, reducing the number of programming pulses improves the write performance of solid-state memories [108]. Similarly, reducing the refresh rate of dynamic random access memories (DRAMs) lowers the standby power consumption of those memories if a small number of failures is allowed [109]. Third, software-level approximation changes algorithmic behavior to achieve quality versus performance trade-offs. For example, loop perforation reduces the computation workload by skipping some iterations [110]. Memoization stores the results of function calls for later

reuse with identical function inputs. Approximate computing results rely on the fact that one pre-computed result is used to approximate the result for similar input values. Hence, the time that would be required to evaluate the function is saved [111]. Using multiple inexact program versions provides a choice to trade off quality versus performance [112]. Fourth, approximate systems apply approximation concepts to various components at the same time. For example, a smart camera system simultaneously applied approximation techniques at the sensing, memory-system, and computational levels to significantly reduce the consumed energy [113].

In this chapter, we provide the insight that the simpler approximation techniques are, the easier they integrate into applications. Additionally, the same ideas apply to platforms such as GPU, FPGA, or ASIC implementations. Using the correct datatype or loop perforation requires little modification to the source. In principle, this generality allows those techniques to be applied to any application. However, in some cases, the degeneration of the approximate result might be so large that it produces unacceptable results.

We observed that many applications inherently depend on input parameters that provide a tuning knob between quality and execution speed. For example, the stopping criterion used in iterative refinements, the number of epochs in deep learning training, or the number of intervals and the order of the polynomial used in GLQ provide parameters that control the achieved quality. Control-loop-related parameters are already present in the baseline implementation. Hence, tuning the quality trade-off point cannot be attributed to a specific approximation technique. Still, changing the control-loop-related operating point of the baseline represents an effective and attractive way to obtain Pareto-optimal solutions. Those first-order effects provide alternative solutions for comparing approximation results.

We conclude that the choice of numerical representation constitutes an effective and general approach. We formalize the concept of transprecision computing in Chapter 4, where we provide an overview of numerical formats in Section 4.1. We identify reduced precision as the root cause for transprecision systems, as explained in Section 4.2.2.

3.1 Approximate computing techniques

3.1.1 The use of datatypes

Choosing suitable datatypes for computations improves program execution time. For example, Yeh et al. [114] evaluated the effect of a hierarchical floating-point unit for physical simulations. They concluded that significant improvements are achievable for three reasons: First, quantizing data in reduced representations increases the amount of fast-executing trivial operations. Second, coarser quantized data improves the operation of memoization techniques, and in the case of narrow formats, look-up tables outperform the alternative implementations. Third, smaller and faster floating-point units (FPUs) can serve reduced precision computations. Tian et al. [115] approximate memory accesses by using reduced precision. They demonstrate energy savings with a negligible impact on the quality of clustering algorithms.

Floating-point and integer data cover the predominant binary representations of numeric values. In most cases, the former results in using the IEEE 754 64-bit double or 32-bit float standard [116]. Compared with integer arithmetic, floating-point representations provide a high dynamic range. Knowing the upper bounds of integer data allows values to fit into smaller datatypes. However, determining the effect of reducing the floating-point width is non-trivial since the precision and the dynamic range are simultaneously affected. Most systems (partially) support integer widths of 8, 16, 32, or 64 bits by densely packing values together to reduce memory footprints and transmission costs. Some systems provide vectorized instructions to support the most common arithmetic operations. In contrast, floating-point arithmetic is either not supported for efficiency reasons (for example, on some microprocessors) or it follows the strict 32-bit or 64-bit IEEE 754 standard on most CPU-based systems. Recent GPU hardware supports the IEEE 754 16-bit half datatype and demonstrates how improved floating-point computations scale to the mass market. However, due to the additional programming effort and the exponential growth of the configuration space, developers normally avoid writing mixed-precision variants of their applications. Instead, they rely on the 32-bit or 64-bit IEEE 754 floating-point standard.

The conservative policy forgoes the potential gains offered by the half format. Choosing number formats in applications is a general approach that applies to almost all kernels operating with numerical values.
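To make the precision trade-off concrete, the following self-contained C++ snippet, which is an illustration and not taken from the thesis, accumulates the same constant in float and in double. The reduced significand of float causes a visible drift of the sum, while double stays close to the exact value.

#include <cstdio>

int main() {
    float sumF = 0.0f;
    double sumD = 0.0;
    for (int i = 0; i < 10000000; ++i) {
        sumF += 0.1f;   // 32-bit accumulation: rounding error grows with the sum
        sumD += 0.1;    // 64-bit accumulation: error stays negligible at this scale
    }
    std::printf("float : %.1f\n", sumF);   // noticeably off from 1000000.0
    std::printf("double: %.1f\n", sumD);   // close to 1000000.0
    return 0;
}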

3.1.2 Loop perforation

Loop perforation reduces the total workload of a kernel by executing only a subset of the iterations of a loop. Perforating source code produces different results when compared against the original baseline. Thus, loop perforation reduces computation time at the cost of changing the quality of the result. For applications where computations are error-resilient to some degree, loop perforation produces a family of optimal trade-offs. Following the classification by Sidiroglou-Douskos et al. [110], we distinguish between critical and tunable loops. Critical loops must be left unmodified since any change causes unacceptable behavior of the code, such as crashing the application. Tunable loops instead offer the potential for perforation. Typically, basic compute patterns that work well with loop perforation are summations, min, max, argmin, and argmax patterns [117, 118]. The literature provides examples of those with the probabilistic guarantee that the approximated result is likely to be close to the original result [110, 117, 118]. Loop perforation demonstrated success with the following five global computational patterns [110]:

• search space enumeration,

• search metric,

• Monte-Carlo simulation,

• iterative improvement,

• data structure update.

Notation and definition

Without loss of generality, loops are assumed to be in canonical form: they start counting at zero, increment by one, and repeat the loop body n times. Hence, the canonical loop in the baseline has the form for(i=0; i<n; i++). Loop perforation with a stride factor s executes only every s-th iteration, i.e., the increment becomes i+=s.
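A minimal C++ illustration of stride perforation on a summation pattern follows. The stride factor s and the rescaling of the partial sum are choices made for this sketch, not prescribed by the thesis.

// Baseline canonical loop: all n iterations are executed.
double sumAll(const double* data, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

// Perforated variant: only every s-th iteration is executed, reducing the
// workload roughly by a factor of s; rescaling by s extrapolates the partial
// sum (an illustrative compensation).
double sumPerforated(const double* data, int n, int s) {
    double acc = 0.0;
    for (int i = 0; i < n; i += s)
        acc += data[i];
    return acc * s;
}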

Control loops in baseline algorithms

We observe that many applications consist of loops that we classify as control loops. The number of iterations is either controlled by a hyper-parameter or dependent on the data. PageRank consists of a residual-controlled main loop that acts as a data-dependent control loop. GLQ consists of two nested hyper-parameter-controlled loops: one loop iterates over the subintervals, the second depends on the polynomial order used to approximate the integral. Control loops inherently affect quality and performance. Hence, they offer a direct trade-off. We argue that, instead of applying loop perforation to control loops, they should be studied independently. First, since control loops exist in baseline algorithms, understanding their effect is part of the configuration study of kernels. Second, we think that it is unfair to attribute gains obtained by tuning control loops to loop perforation. Third, in some cases it does not make sense to apply loop perforation. For example, loops with data-dependent termination conditions are unable to profit from skipping iterations. In other cases, it is better to limit the total number of iterations rather than skipping iterations. For example, instead of using a GLQ run of order n with a stride factor of s = 2, it is more natural to use the GLQ of order n/2 without skipping iterations. Often, users do not fine-tune control-loop-related parameters. They rely on default values, values proposed by others, or conservative values. However, control loops provide a very effective way of generating competitive solutions. First, no additional modifications are required and, second, performance gains are achieved on any hardware. We think that it is essential to study control loops jointly with other approximation techniques due to their advantages. Tuning control loops builds a trivial alternative solution that approximation techniques must outperform to demonstrate success. In other words, approximations obtained with control loops act as opportunity costs.

3.1.3 Task skipping and memoization

Task skipping and memoization approximation techniques skip memory accesses, subsample the input, or avoid the computation of several tasks. These methods rely on the fact that some applications operate on large data chunks that obey similar statistics. Hence, subsampling the input might provide acceptable quality at better performance. Similarly, stencil-based approximation techniques avoid reading neighboring cells in an image-processing algorithm and reuse a close value instead. Those techniques succeed since in images over 75% of neighboring pixels differ by less than 10% (Figure 5 in [119]). Task skipping and memoization techniques have demonstrated an average speedup of 2.7× on a GPU for error-resilient, data-parallel applications, with quality metrics guaranteed to have degenerations of less than 10% [119]. The authors have identified six relevant data-parallel patterns that are suitable candidates for approximation:

• Map: a function applied independently to elements of an array.

• Scatter/Gather: like map, but generates non-homogeneous memory access patterns.

• Reduction: combines elements of an array to a single output.

• Scan: associative function applied to an input array generating an output array where the result at the n-th position depends on the intermediate state of position (n − 1).

• Stencil: a function that is applied to a neighborhood of elements.

• Partition: like stencils, but partitions are present where outputs can be computed independently for partitioned inputs.

In the same study [119], four different approximation techniques were proposed for the six patterns:

• Memoization with lookup tables (Map, Scatter/Gather): instead of evaluating functions, an offline-computed lookup table stores results for an application-relevant range (see the sketch after this list).

• Subsampling input (Reduction): applies the reduction to a subset of the input data, like loop perforation.

• Scan: applies partial scans and omits the computation of several partial scan operations.

• Memory access skipping (Stencil and Partition): based on the assumption that neighboring elements have the same value, multiple memory reads can be approximated by fewer memory accesses.
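The memoization-with-lookup-tables pattern can be sketched as follows in C++. The approximated function (the logistic sigmoid), the grid range, and the table size are illustrative assumptions; a real design would be tuned to the application-relevant range.

#include <cmath>
#include <vector>

// Precompute the sigmoid on a uniform grid once; later calls return the
// nearest precomputed entry instead of evaluating exp().
class SigmoidLut {
public:
    SigmoidLut(double lo, double hi, std::size_t entries)
        : lo_(lo), step_((hi - lo) / (entries - 1)), table_(entries) {
        for (std::size_t i = 0; i < entries; ++i)
            table_[i] = 1.0 / (1.0 + std::exp(-(lo_ + i * step_)));
    }

    double operator()(double x) const {
        if (x <= lo_) return table_.front();                 // clamp below the range
        std::size_t idx = static_cast<std::size_t>((x - lo_) / step_ + 0.5);
        if (idx >= table_.size()) return table_.back();      // clamp above the range
        return table_[idx];                                  // nearest-entry approximation
    }

private:
    double lo_, step_;
    std::vector<double> table_;
};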

3.1.4 Using multiple inexact program versions

For some applications, alternative implementations exist. Each implementation provides a different trade-off in terms of quality and performance. The Pareto-optimal choices are suitable candidates for serving the needs of users. For example, based on user requirements and system status, the runtime selects the most suited kernel. In this way, it is possible to switch to a low-quality version when a mobile device is operating in low-power mode to extend the battery lifetime. The identification of the best operating point becomes a run-time optimization problem. For example, Samadi et al. [112] implemented a framework that operates in two phases. First, in a calibration phase, data is collected to map quality of service to kernel compositions. Second, the calibration data is used to select, at runtime, the best operating point statistically satisfying target quality requests. The authors propose to recalibrate during regular operation to account for statistical changes occurring in the data. The additional required time is small and makes the technique more robust. Similarly, Baek and Chilimbi [120] approximate functions and loops. They suggest a framework that calibrates and operates alternative solutions that provide statistical quality guarantees.

The offline characterization and run-time optimization concepts are orthogonal to the approximation techniques applied to a single kernel. They become interesting for larger applications where alternatives of composed modules are evaluated. However, they rely on established approximation techniques used within the kernels. Hence, they do not provide new concepts of basic approximation techniques.

3.1.5 Stochastic computing

Stochastic computing represents numbers as probabilities on random bitstreams [121]. The unconventional approach allows building low-complexity hardware that results in low-cost implementations and low power requirements. Since computations in stochastic computing rely on random bitstreams, solutions are inherently error-resilient. However, they are not deterministic. Results are either fully correct or distorted to different quality levels. Caused by the randomness, all possible outputs are obtained with certain probabilities. To understand stochastic computing systems, they must be characterized by the probability density function over the output quality. Stochastic computing systems are based on three blocks:

• Binary-to-stochastic: stochastic number generator (SNG).

• Computation unit: very low complexity single-gate implementations.

• Stochastic-to-binary: back conversion to a traditional binary number with counters.

Stochastic number generators produce randomized bitstreams that represent processed values in the form of the probability that the value one occurs within the bitstream. The design of stochastic number generators includes studying full applications since correlations between number generators might affect or degenerate results. Surprisingly, although not using true randomness, pseudorandom number generators have been demonstrated to work well. They are traditionally based on linear feedback shift register (LFSR) [122] implementations. Reproducibility, efficient implementations, and the stochastic properties required to ensure integrity in crypto applications triggered extensive research around LFSR generators [123] and variants of alternative hardware-friendly random number implementations [124].

Arithmetic operations with stochastic bitstreams operate at low costs. For example, multiplication simplifies to an AND gate that merges two bitstreams into a bitstream that represents the product. Stochastic computing requires normalized values in the interval [0, 1] to map to probabilities. Since adding two numbers results in a larger range, special add operations are defined that scale the result by a factor of two. This procedure ensures the closure of the addition operation. Stochastic computing supports elementary operations such as addition, subtraction, and multiplication. Additionally, it has been applied to division and square roots. General-purpose stochastic computing progressed by demonstrating that any given function can be evaluated. This is achieved by approximating a Bernstein polynomial [125] where the coefficients and evaluations of the polynomial are computed with stochastic computing.

Resulting bitstreams require conversion blocks to translate the result back into the traditional binary representation of the number. Simple circuitry achieves that functionality by implementing a counter. Counting is a way to get a statistical estimate of the expected value that conveys the represented value.

The inherently stochastic behavior of stochastic computing provides error tolerance but generates results following a characteristic probability density function over all possible results. Consequently, the output is correct only with a given probability or it might be approximate. In traditional binary representations (fixed-point or floating-point representations, see Section 4.1), the positions of bit-flip errors affect the error of the result, e.g., changing the most significant bit has a larger impact than changing the least significant bit. In contrast, bit flips altering stochastic computing bitstreams cause homogeneous error effects and are not position-dependent. By construction, stochastic computing systems tolerate a few bit-flip errors with only negligible effects on the output. The core arithmetic that implements the computing circuitry has constant complexity, independent of the bitstream length. Since the length affects the probabilities, more accurate results are achieved when using longer sequences. However, the required length tends to grow exponentially with increased precision requirements, leading to large execution times.
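The following self-contained C++ sketch illustrates the three blocks on the example of a multiplication: a software stochastic number generator encodes two values in [0, 1] as random bitstreams, a single AND per bit performs the multiplication, and a counter converts the result back to binary. The bitstream length, the RNG, and the operand values are illustrative assumptions.

#include <cstdio>
#include <random>

int main() {
    const int length = 4096;           // bitstream length (accuracy grows with length)
    const double a = 0.75, b = 0.40;   // operands, both in [0, 1]

    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);

    int ones = 0;
    for (int i = 0; i < length; ++i) {
        bool bitA = uni(rng) < a;      // stochastic number generator for a
        bool bitB = uni(rng) < b;      // independent generator for b
        ones += (bitA && bitB);        // a single AND gate performs the multiplication
    }
    double product = static_cast<double>(ones) / length;   // stochastic-to-binary counter
    std::printf("estimated a*b = %f (exact %f)\n", product, a * b);
    return 0;
}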

Table 3.1: Overview of approximation techniques. We classify methods according to whether they are deterministic and whether they are supported by current software and hardware, and we rank them overall, including expected applicability and performance gains.

Method          DT^1   LP^2   TSM^3   IPV^4   SC^5
Deterministic   X      X      X       X       -
SW/HW support   **     ***    **      **      -
Rating          ***    **     *       *       *

Kernel      Control loops   Tunable loops   Pattern
PageRank    1               2               itr.
BLSTM       0               6               sum
GLQ         2               0               -

^1 Using data types   ^2 Loop perforation   ^3 Task skipping and memoization   ^4 Multiple inexact program versions   ^5 Stochastic computing
*** very good   ** good   * moderate

3.2 Applying approximate computing

Approximate computing techniques follow the same spirit of serving different use-cases. However, there are major differences. Some techniques are applied at the algorithm level, others require an exotic hardware setup. Moreover, all methodologies rely on the statistics of the input data. Variations of data, different applications, and the benchmarking setup hinder a direct and fair comparison. Quality metrics are application-specific, and user-acceptable quality might be defined differently.

Table 3.1 presents an overview of all methods that are explained in detail below. We rank the methods based on how easily they are supported in current software and hardware environments. Loop perforation is simple and applicable at the algorithmic level and hence improves performance on all systems. Depending on the native support of data types and on the software implementation, the TSM and IPV techniques are supported on current systems. However, stochastic computing is not supported at all and always requires customized hardware designs.

We rate the methods based on the overall applicability, generality, expected quality degenerations, and expected performance gains. We think framework-based approaches such as TSM and IPV are successful on larger systems composed of many elementary kernels. However, they do not help in designing fundamental approximation techniques for specific elementary kernels. Loop perforation is general; however, it substantially degenerates quality in some cases. Correctly using data types is adequate in most cases. Mixed-precision approximation techniques grow in importance as more formats become natively supported on new hardware.

The use of datatypes Choosing number formats in applications is a general approach that applies to all applications operating with numbers. However, we omit a numerical analysis at this stage, since we study common formats, such as IEEE 754 half, float, and double, in Chapter 5. The common formats are considered as a corner case of the introduced reduced precision formats.

Loop perforation Loop perforation is applicable in general contexts. We identify that core loops of PageRank and GLQ are control loops depending on input configurations. GLQ has no remaining loops left. PageRank has two tunable loops left that iterate over the matrix-multiplication. BLSTM has six loops that stem from linear operations of the kernel.

Task skipping and memoization The identified data parallel patterns [119] inside the kernels follow the summation-and-reduction pattern that is common to applications 42 CHAPTER 3. APPROXIMATE COMPUTING relying on linear algebra. Patterns, such as stencil, scatter, or gather operations do not occur in the considered three kernels. In some cases, applying loop perforation is equivalent to identifying the typical pattern and applying task skipping. For example, studying loop perforation in summation-and-reductions occurring in dot products covers the task skipping approaches. We identified the map pattern occurring in BLSTM for activation function evaluations including sigmoid and trigonometric functions. Lookup table-based designs are applicable, however, we do not expect relevant overall performance gains since they only constitute a small part of the total workload.

Using multiple inexact program versions

The approaches described in Section 3.1.4 apply to larger systems composed of modular kernels. Since PageRank and GLQ are closed components, a further decomposition does not allow the proposed methods to be applied. BLSTM can be forced into a decomposition of three parts: the computations performing the forward, backward, and CTC operations. Still, no further gains are expected since the approximation levels are already studied with loop perforation, which considers all loops.

Stochastic Computing

Stochastic computing requires two aspects that limit this approximation technique. First, the underlying application needs to be demonstrated to be error-resilient and suitable for stochastic computing. Second, stochastic computing requires the development of substantial parts of specialized hardware. Those two aspects make stochastic computing interesting for specific use-cases, but not suitable for general-purpose computing.

Stochastic computing demonstrated the following successes. It reduced the complexity of an FPGA implementation that controls an induction motor [126]. In addition, it improves low-density parity-check (LDPC) decoding [127, 128]. The widespread applicability of error correction in WiFi communications justifies the development of customized hardware in this case. A narrow research branch specialized stochastic computing for neural networks. Back in 1993, gate-level simulations solved an optical character recognition problem [129]. Low-level technical improvements are discussed in detail [130] and led to solutions in 2001 [131] that used a two-layered network. Compared to a floating-point reference, the model yielded quality results within 10 percent while it achieved an order of magnitude improvement in terms of required clock cycles. In 2016, stochastic computing was applied to the MNIST dataset [132] with accuracy degenerations lower than one percent [133]. Even though these results are promising, we are not aware of recent work that applies stochastic computing to today's state-of-the-art networks that comprise hundreds of layers and are exercised on more complex classification tasks.

3.3 Selected results on benchmarks

We present the results of assessing control loops and of applying loop perforation to the tunable loops of the three applications.

3.3.1 PageRank

The outer loop of PageRank is a control loop that iterates until convergence. The inner loops perform matrix-vector-product-like updates. The middle loop iterates over independent matrix rows and the innermost loop iterates over compressed sparse row (CSR) encoded sparse row entries. Due to the presence of the outer control loop, the approximation of inner loops changes the behavior of the outer loop. If the inner approximation degenerates results above some level, the outer control loop never meets its target and the kernel gets stuck in an infinite loop. In such cases, applying loop perforation does not work at all.

Control loop

The choice of the stopping criterion ε affects the number of iterations of the control loop. The total performance is directly proportional to the iteration count. Figure 3.1 shows the total number of iterations.


Figure 3.1: Influence of stopping criterion on the number of iterations. Different problem instances obey different convergence speeds, but all of them can similarly profit from relaxing the required error residual.

The number of iterations depends on the stopping criterion ε and the convergence speed obtained on a specific input. Changing ε by one order of magnitude causes the iteration count to change by a constant. Figure 3.2 shows how stricter relative stopping criteria ε lead to improved results, measured as the absolute L2 norm of the difference between the current iterate and the full-precision converged reference result. To better assess the quality of the result relevant to users, Figure 3.3 shows the quality in terms of correctly ranked positions of nodes. After a minor part of the iterations, results start to stabilize and there is no need to iterate further since the ranking is no longer affected. By tuning the operating point we achieve speed-ups in the order of 4× at less than one percentage point accuracy loss on the easy datasets.


Figure 3.2: Influence of stopping criterion on achieved quality measured as L2 norm of the difference against the converged result. In all cases, the relative residual is reduced proportionally to the absolute residual measured against a full precision implementation.

Tunable loops

First, we observed that applying loop perforation to the sum pattern in the innermost loop does not work. By construction of the problem and the observation that the iteration vector is a valid probability distribution, all involved values are positive. Hence, skipping entries causes the approximated sum to be smaller than the correct one. Introducing approximations that strongly bias results in one direction causes the kernel to fail to converge. Applying loop perforation to the middle loop works if the following two criteria are met. First, since skipped iterations avoid dot product operations, we ensure to approximate the missing value with the value of the previous iteration.


Figure 3.3: Influence of stopping criterion on achieved quality, measured as the percentage of ranks that are equivalent to the converged reference.

Second, we ensured different offsets in the perforated loops for subsequent control loop iterations. The latter point prevents generating a strictly equivalent data access pattern that would converge to the wrong result. If convergence is reached, skipping row updates and replacing missing values with previous results becomes an error-free operation. We run PageRank with conservative settings that impose a stopping criterion of ε = 10⁻¹⁴. We evaluate the algorithm on two synthetic datasets and one dataset extracted from real data. We used a random Bernoulli distribution with parameters p1 = 0.01 and p2 = 0.001 to generate the sparse input graphs in the synthetic case.


Figure 3.4: Total number of control loop iterations of PageRank. The stronger loop perforation is applied, the more additional control loop iterations are triggered.

Since loop perforation of the middle loop approximates intermediate results, the outer quality control loop converges more slowly. However, the outer quality control loop ensures achieving the same strict error reductions as in the baseline. In that way, the outer control loop recovers, at the cost of performance, the precision that is lost due to applying loop perforation in the middle loop. Figure 3.4 shows the increased number of iterations of the control loop for the easy, hard, and real datasets. Depending on the specific input instances, we observe high fluctuations in the iteration counts. In all cases, applying loop perforation causes the control loop to iterate longer. To assess whether applying loop perforation is beneficial, we measure the total workload. Overall performance is improved if the loop perforation overcompensates the slower convergence. Loop perforation with stride factor s performs s times less workload in the PageRank body.


Figure 3.5: Total workload for PageRank when applying loop perforation. The local gains are large enough to overcompensate the additional control loop iterations and lead to a total reduction of the workload.

Overall, the reduction is strong enough to successfully overcompensate the additional control iterations. Figure 3.5 shows the total workload against the reference for the easy, hard, and real datasets. Average gains of 35%, 25%, and 58% are achieved in terms of overall FLOP reductions without degenerating results.

3.3.2 BLSTM

We applied loop perforation to nested loops in the time-relevant parts of the BLSTM kernel. Three loops occur when the LSTM forward and backward passes are evaluated. The outermost loop iterates over the columns of an input image, the middle loop iterates over the number of neurons related to the current computation, and the innermost loop performs a dot product operation aggregating temporary results into a single neuron.


Figure 3.6: Effect on quality of BLSTM when applying loop perforation. Results quickly degenerate. Only loop L4 can be perforated without significant degenerations.

The three loops L1, L2, and L3 are perforated with stride factors ranging from one to ten. Similarly, in the CTC step, three nested loops iterate over the image columns (L4), the number of classes (L5), and the related summations of dot product operations (L6). Figure 3.6 shows accuracy results when running the benchmark with different loop perforation stride factors in the range [1, 10]. In all experiments, only one loop was perforated at a time; the other loops operated as in the baseline (i.e., with a stride equal to one). Perforating loops causes significant accuracy drops in most cases. Accuracies below 90% are not acceptable in this benchmark. Only L4 obeys a smoother behavior: the obtained accuracy is 98.07% and 95.97% for strides two and three, which is close to the reference accuracy of 98.23%. Since L4 is the outermost of three loops, the total workload of the merging block is reduced by the loop perforation stride. However, the merging block is a smaller fragment of the total benchmark, contributing 18.6% to the overall workload of the full kernel. Since the loop perforation of L4 does not reduce the heavier computation block related to the LSTM, the total performance gains are limited. Indeed, we measured a reduced run time of 4% and 5.7% for the full benchmark when running L4 with strides of two and three, respectively.


Figure 3.7: Effect of the control parameters NI and NO on the final quality of GLQ results. Depicted is the mean of the absolute error obtained for ten runs for each of the six Genz functions. The discontinuity challenges integration routines for function 6.

3.3.3 GLQ

GLQ does not have tunable loops but consists of two nested control loops. The outer loop splits the original interval into NI partial intervals and the inner loop evaluates the quadrature by using a fixed number of support points NO. The choice of the parameters NI and NO strongly affects the quality of the result and determines the running time, which is of complexity O(NI NO). Figure 3.7 shows the quality obtained when sweeping both parameters independently. The quality is measured as the absolute error, expressed in base-ten digits, averaged over ten runs with uniformly sampled function parameter instances. For Genz functions 1 to 4, low orders and few interval decompositions reach machine precision error floors of about 15 to 16 digits. For example, for Genz test function 1, the configuration NI = 2 and NO = 8 achieves machine precision five orders of magnitude faster than a conservative configuration of NI = NO = 1024. More challenging functions, such as non-smooth functions (Genz function 5) or functions with discontinuities (Genz function 6), require higher orders and a finer splitting to improve quality. Computing with the conservative configuration of NI = NO = 1024 and full 64-bit double precision, the final results are up to 13 and 6 decimal digits precise for Genz functions 5 and 6, respectively.

3.4 Summary and conclusion

The most important findings of this chapter are the following:

• Approximate computing techniques cover a broad range of systems, different granularity levels, and various applications. Mixed-precision approaches benefit from using the smallest format that serves the needs of applications. Loop perforation improves performance by skipping iterations. Since that technique is applied at the algorithm level, no special hardware features are required.

That makes loop perforation generally applicable in different software stacks and deployable on various systems. Task skipping and memoization gain by subsampling the input or by avoiding the triggering of computations; instead, they reuse a precomputed value of a similar input value. Using multiple inexact program versions provides interesting trade-offs for larger systems. However, this technique relies on the availability of alternative implementations for modular building blocks. Stochastic computing performs operations by using randomized bitstreams. Efficient low-cost implementations demonstrated success for specific tasks. However, stochastic computing involves customized hardware design, which limits the generality and scalability of the approach.

• We evaluated the applicability of the five approximation techniques for PageRank, BLSTM, and GLQ. First, we observed that methods that are developed for larger systems, such as using multiple inexact program versions, do not provide insights on how to design the internals of elementary kernels. Second, some methods operate well for specific patterns, but those patterns might not occur in the performance-critical sections of kernels. That limits some methods to specific cases, such as neighborhood approximation in stencil computations. We conclude that mixed-precision computing and loop perforation are simple, but the most general approaches.

• Since loop perforation is the most promising approximation technique, we reimplemented and evaluated it on the three kernels. In PageRank, loop iteration reductions in the inner loops cause longer iterations of the outer control loop. We empirically demonstrated that the reduction effect is stronger than the additionally triggered iterations, such that loop perforation is applied with success. However, in some cases, such as BLSTM, loop perforation strongly degenerates results such that the output turns useless.

• We identified that control loops often rely on externally provided hyper-parameters that control the operating point of the algorithm. The operating point strongly affects the quality and performance of algorithms.

Therefore, exploiting the hyper-parameter setting provides competitive solutions as a trade-off between quality and performance.

Chapter 4

Core transprecision concepts

In this chapter, we develop the core concepts of transprecision computing. Section 4.1 explains standard and less common representations of numerical values in a computer system. Section 4.2 abstracts a computing system to allow for a formal definition of transprecision computing. In the last subsections, current design methodologies and existing solutions are identified that follow the transprecision paradigm.

4.1 Number formats

Standard number representations include fixed-point and standard floating-point representations [134]. Fixed-point representations have a long history in FPGA and ASIC design due to the simple availability of integer arithmetic inside digital signal processors (DSPs) or highly optimized designs for elementary arithmetic operations. Still, mapping general-purpose applications to work with a limited dynamic range is non-trivial and requires careful consideration of the design space. In contrast, to serve the seamless design flow of complex applications in various fields, the majority of general-purpose computing processors implement the IEEE 754 floating-point standard [116] and

provide hardware support for elementary arithmetic operations for the 32-bit and 64-bit float and double-precision data types. In addition to the standard number formats, research elaborates more exotic choices of number representations such as the logarithmic number system (LNS). Even though related work addresses the design effects on algorithms [135] and implementation details of functional units [136, 137], they are practically not available in the mass-production portfolio of today's processors.

4.1.1 Fixed-point

We denote with Q^{unsigned}_{n,f} the unsigned and with Q^{signed}_{n,f} the signed fixed-point format with a bit-width n and f fractional bits. A binary word consisting of bits x_{n−1} x_{n−2} ... x_2 x_1 x_0 of length n represents the following value:

v_{unsigned} = \frac{1}{2^f} \sum_{i=0}^{n-1} 2^i x_i,    (4.1)

for the unsigned representation, and the same bit pattern is interpreted as:

v_{signed} = \frac{1}{2^f} \left( -2^{n-1} x_{n-1} + \sum_{i=0}^{n-2} 2^i x_i \right),    (4.2)

for signed representations. In both cases, the precision is given by \Delta = \frac{1}{2^f}, which defines the smallest difference between the closest representable values and is equally spaced over the full range. The ranges for the unsigned and signed representations amount to:

v_{unsigned} \in \left[ 0,\; 2^{n-f} - \frac{1}{2^f} \right]
                                                              (4.3)
v_{signed} \in \left[ -2^{n-f-1},\; 2^{n-f-1} - \frac{1}{2^f} \right].

Signed representations use the two's complement to map the upper half of the unsigned representations to the negative range by adding a constant value of −2^{n−1}. Since the presence of the most significant

bit x_{n−1} determines the sign of the number, it is often called the sign bit. Signed representations expose roughly half the range of unsigned representations in terms of absolute values for a fixed bit-width, or they require an additional bit to cover the same range of positive values. Fixed-point arithmetic follows standard integer arithmetic if the position of the binary point of inputs and outputs is matched. This allows reusing the same hardware components as for regular integer arithmetic, which can be understood as the corner case of fixed-point arithmetic with f = 0. By correctly considering the input and output formats, addition, subtraction, and multiplication can be performed without error: by adding one bit (for a potential carry) in the case of addition and subtraction, and by using an output width equal to the sum of the input bit-widths in the case of multiplication. For example, a 16-bit multiplier takes two 16-bit inputs and produces a

32-bit output and can perform any operation of the form Q_{16,f_1} · Q_{16,f_2} = Q_{32,f_1+f_2}. The availability of optimized pre-designed computing components promotes fixed-point formats as preferred candidates for implementing unrolled data paths in FPGA or ASIC designs.
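A small self-contained C++ sketch of such a fixed-point scheme is shown below for illustration; the chosen Q(16, 8) layout, the helper names, and the example values are assumptions.

#include <cstdint>
#include <cstdio>

// Signed Q(n=16, f=8) helpers: a value is stored as a 16-bit integer holding value * 2^f.
constexpr int F = 8;

int16_t toQ16_8(double v)    { return static_cast<int16_t>(v * (1 << F)); }
double  fromQ16_8(int16_t q) { return static_cast<double>(q) / (1 << F); }

// Exact multiplication: two 16-bit inputs produce a 32-bit product whose
// fractional part has f1 + f2 = 16 bits, as described in the text.
int32_t mulQ(int16_t a, int16_t b) { return static_cast<int32_t>(a) * b; }

int main() {
    int16_t a = toQ16_8(3.25), b = toQ16_8(-1.5);
    int32_t p = mulQ(a, b);                                        // Q(32, 16) result
    std::printf("%f\n", static_cast<double>(p) / (1 << (2 * F)));  // prints -4.875
    return 0;
}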

4.1.2 IEEE 754 floating-point

We denote with T_{w,t} the IEEE 754 conformant numeric representation of a floating-point value, consisting of w exponent bits and t trailing significand bits. A value v is represented by a sign s, an exponent e, and the significand m as follows:

v_{float} = (-1)^s \cdot 2^e \cdot m.    (4.4)

The exponent e is stored as an integer represented with w bits and the significand m is stored as Q^{unsigned}_{t,t}, a fixed-point number containing only fractional bits. The IEEE 754 standard [116] defines five basic formats: three binary formats and two decimal formats that are based on radix two and ten, respectively. The three basic binary floating-point formats, single, double, and quad precision, are mapped to 32-bit, 64-bit, and 128-bit representations. Most hardware natively supports the basic single and double formats. The larger quad formats are rarely supported in hardware, and most applications either fully rely on the intrinsic double precision or completely switch to numerical software libraries that might be very slow but, in contrast, support arbitrary-length number representations.

Table 4.1: IEEE 754 floating-point configurations.

Name              Formula                       half            float            double
                                                binary16        binary32         binary64
                                                (T_{5,10})      (T_{8,23})       (T_{11,52})
e_max             2^{w-1} - 1                   15              127              1023
e_min             1 - e_max                     -14             -126             -1022
∆_min             2^{-t}                        9.8 · 10^{-4}   1.2 · 10^{-7}    2.2 · 10^{-16}
∆_max^{norm}      2 - ∆_min                     1.99902         1.99999988       1.999[...]
∆_max^{sub}       1 - ∆_min                     0.99902         0.99999988       0.999[...]
Largest           2^{e_max} · ∆_max^{norm}      6.6 · 10^4      3.4 · 10^{38}    1.8 · 10^{308}
Smallest, norm.   2^{e_min}                     6.1 · 10^{-5}   1.2 · 10^{-38}   2.2 · 10^{-308}
Largest, sub.     2^{e_min} · ∆_max^{sub}       6.1 · 10^{-5}   1.2 · 10^{-38}   2.2 · 10^{-308}
Smallest, sub.    2^{e_min} · ∆_min             6.0 · 10^{-8}   1.4 · 10^{-45}   5.0 · 10^{-324}

Even though the 16-bit half representation is not considered a basic format, it was introduced in the 2008 revision of the IEEE 754 standard [116]. Since recent GPUs provide full support for that data type, we consider binary16, binary32, and binary64 as the standard formats, and we refer to the latter two with the C/C++ keywords float and double. Table 4.1 summarizes the parametrizations and maximum absolute positive values of the standard IEEE 754 formats. Note that the representations are fully symmetrical for negative values, which are represented by setting a leading sign bit such that the storage width amounts to 1 + w + t bits. The IEEE 754 standard [116] further defines special cases (not a number (NaN), Inf) and the rounding behavior of arithmetic operations. Typical implementations follow a round-to-nearest, ties-to-even rounding policy, which is used as the standard throughout this work. In contrast to fixed-point units, typical floating-point hardware units [138–140] are complex and operate in three stages: a) they equalize the input numbers based on the exponents, b) they perform the core operation, and c) they re-normalize the output if required. In contrast to fixed-point formats, the operations are not exact, due to the quantization that occurs in the final rounding step where the internal mantissa is shortened to the output significand width.
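The bit-level layout of the binary32 format can be inspected directly, as the following self-contained C++ illustration shows; the example value and the printed field names are assumptions for this sketch.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Decompose an IEEE 754 binary32 value into its sign, exponent, and
// trailing significand fields (w = 8, t = 23).
int main() {
    float v = -6.25f;
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);          // reinterpret the bit pattern

    uint32_t sign = bits >> 31;
    uint32_t expField = (bits >> 23) & 0xFF;      // stored (biased) exponent
    uint32_t mantissa = bits & 0x7FFFFF;          // 23 trailing significand bits

    int e = static_cast<int>(expField) - 127;     // remove the bias
    std::printf("sign=%u exponent=%d significand=0x%06X\n", sign, e, mantissa);
    return 0;
}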

4.1.3 Logarithmic number system (LNS)

An LNS represents numbers similarly to the floating-point standard but without a mantissa. The number consists only of an exponent e:

v_{LNS} = (-1)^s \cdot 2^e.    (4.5)

The exponent is represented in the Q^{signed}_{n,f} fixed-point format with a dominant fractional part to obtain a finer granularity over the representable range. Special cases, such as NaN, Inf, and zero, are encoded with specific bit patterns. Table 4.2 summarizes the maximum positive values that cover ranges similar to their standard floating-point equivalent types. We denote the LNS format by L_{n,f}, where the parameters n and f denote the bit-width and the number of fractional bits used in the underlying fixed-point exponent representation. The LNS gained research attention already in the 1970s through the observation that some operators massively simplify due to the setup of the number system [141, 142]. For example, integer addition, integer subtraction, and bit-shift operations can compute multiplications, divisions, and square roots of LNS numbers: let v_1 = (-1)^{s_1} \cdot 2^{e_1} and v_2 = (-1)^{s_2} \cdot 2^{e_2} denote the LNS representations of two values v_1 and v_2. Then, Equation (4.6) shows how the operations map to the underlying representation. Modified arithmetic logic units (ALUs) with special-case handling allow power-efficient, area-efficient, and fast implementations of those operations.

v1 · v2 = (−1)^{s1+s2} · 2^{e1+e2}
v1 / v2 = (−1)^{s1+s2} · 2^{e1−e2}    (4.6)
√|v1| = (2^{e1})^{0.5} = 2^{0.5·e1}

In contrast, addition and subtraction involve the evaluation of non-linear functions as follows:

v3 = v1 ± v2    (4.7)
e3 = max(e1, e2) + log2(1 ± 2^{−|e1−e2|}).

Table 4.2: LNS configurations similar to half and float data types.

Name       Formula                16-bit (L_{15,10})   32-bit (L_{31,23})
e_max      (2^{n−1} − 1) / 2^f    16 − 2^−10           128 − 2^−23
e_min      (−2^{n−1} + 1) / 2^f   −16 + 2^−10          −128 + 2^−23
Largest    2^e_max                6.5 · 10^4           3.4 · 10^38
Smallest   2^e_min                1.5 · 10^−5          2.9 · 10^−39

LNS research focuses on the two non-linear functions F^+(r) = log2(1 + 2^−r) and F^−(r) = log2(1 − 2^−r) with r = |e1 − e2|, which build the core of addition and subtraction units. All hardware implementations are based on lookup and interpolation methods [135]. Since F^−(r) has a singularity for r → 0, critical intervals of r close to zero are decomposed with so-called cotransformations into functions that are simpler to approximate [143–145]. Authors of early papers concluded that LNS addition/subtraction units wider than about 12 bits are infeasible due to the exponential requirements of the lookup tables. Recent implementations demonstrated the feasibility of 32-bit LNS units that deliver roughly the same dynamic range and precision as IEEE 754 32-bit float implementations [71, 146]. However, due to the exponential scaling of the lookup tables for higher precisions, 64-bit LNS formats cannot be designed competitively against traditional IEEE 754 64-bit double implementations.
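To make the mapping in Equations (4.6) and (4.7) concrete, the following sketch emulates LNS values in software. It is an illustrative toy of ours: the exponent is kept in a double instead of a Q^signed_{n,f} fixed-point word, special cases are ignored, and F^+ is evaluated directly rather than by the table lookup a hardware unit would use.

#include <cmath>
#include <cstdio>

// Toy LNS value: a sign bit plus a real-valued exponent e, so that v = (-1)^s * 2^e.
struct Lns {
    bool neg;   // sign s
    double e;   // exponent (a fixed-point word in a real implementation)
};

Lns from_double(double v) { return {v < 0.0, std::log2(std::fabs(v))}; }
double to_double(const Lns& x) { return (x.neg ? -1.0 : 1.0) * std::exp2(x.e); }

// Multiplication and division reduce to exponent addition/subtraction, Eq. (4.6).
Lns mul(const Lns& a, const Lns& b) { return {a.neg != b.neg, a.e + b.e}; }
Lns div(const Lns& a, const Lns& b) { return {a.neg != b.neg, a.e - b.e}; }

// Addition of two same-signed values uses F+(r) = log2(1 + 2^-r), Eq. (4.7).
Lns add_same_sign(const Lns& a, const Lns& b) {
    const double r = std::fabs(a.e - b.e);
    return {a.neg, std::fmax(a.e, b.e) + std::log2(1.0 + std::exp2(-r))};
}

int main() {
    Lns a = from_double(3.0), b = from_double(5.0);
    std::printf("3*5 = %f\n", to_double(mul(a, b)));
    std::printf("3/5 = %f\n", to_double(div(a, b)));
    std::printf("3+5 = %f\n", to_double(add_same_sign(a, b)));
}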

4.2 The transprecision system view

Before we discuss transprecision methodologies, we first give an abstract definition of an arbitrary computing system. We require a general, yet simple enough, description to classify existing work and to address complex interacting tasks that cover all aspects from physical foundations up to algorithmic implementations. To define an abstract computing system, we first provide a list of aims we want to cover with the abstraction.

• The abstraction should be general enough to cover any existing computing system.

• The abstraction should be extensible to cover new computing systems.

• The abstraction should be usable to define and explain the methodologies of transprecision computing.

• The abstraction should allow deriving the methodologies needed for transprecision computing.

We define three key concepts: components, interfaces, and actions.

Component. Identifies parts of a computing system. Components can be encapsulated or replicated. Components express structure and help to identify the granularity level of considerations.

Interface. Identifies relations among components by defining expected inputs and outputs.

Action. Explains events that occur. Actions might be triggered by other actions, receive input, and create output.

The three concepts are defined in a sufficiently general fashion such that the terms component, interface, and action can refer to both software and hardware.

Definition 1 (Abstract Computing System). We define an abstract computing system as consisting of two particular components: a) a descriptive component and b) a physical component, where relations are defined through interfaces. Actions define the purpose of the system. Synonyms: personal computers, microcontrollers, servers, dedicated systems including ASICs and FPGAs. Typical: IBM Power8/9, NVIDIA GPU K80, P100, V100.

The above definition splits a computing system into two specific parts that we assume are always present: a descriptive part that defines or implements actions, and a physical system that can process information and solve a task. This notation fits classical computing systems, FPGA workflows, and ASIC designs.

Classical computing system. Physical components define the traditional hardware, including the processor, the cache memory hierarchy, the DRAM, hard drives, and peripherals. Descriptive components are defined through the classic software parts that define the behavior of the classical computing system, including the operating system, software libraries, and specific implementations.

FPGA. Physical components refer to the specific board or subcomponents, such as cells, DSPs, or specific circuits. Descriptive parts refer to hardware description language (HDL) abstractions that define the behavior of the board once configured.

ASIC. Physical components refer to the manufactured integrated circuits and subcomponents thereof; descriptive parts, similarly to FPGAs, refer to the source files that describe the hardware.

To further abstract a computing system, we use the key concepts of micro, macro, and system levels to distinguish different granularity considerations of a system. The terms micro, macro, and system relate to different components depending on the focus of consideration. For example, to study a compiler, the terms micro, macro, and system level refer to assembly code, single-line statements, and function routines, respectively. However, in a large software project, the same terms refer to function routines, libraries, and applications.

Definition 2 (Matching work into the abstraction). To understand novel work, we match it against the definition of a computing system and identify three concepts:
a) Structure: by identifying (1) components, (2) interfaces, and (3) actions.
b) Granularity: by ranking components into (1) micro, (2) macro, and (3) system levels.
c) Importance: by classifying components into (1) core components, (2) tooling components, and (3) side components.

Definition 2 builds a guideline for how related work can be classified. Structure and granularity define the scope of the work. Additionally, components are classified according to their importance: core components define the focus of the work; tooling components refer to components that were used during development but whose internal operation is out of scope; side components denote components that are inherently assumed to be present but outside the focus of the work. For illustration, in a classical software project the core component is the source code, the tooling components are the compiler and the profiler, and the side component is the hardware used. In contrast, the core component of a classic ASIC design is the hardware design of a unit. Tooling includes the full stack of specialized software products to perform compilation, estimation, place and route, and validation of designs. Side components of an ASIC work refer to units present in the final system that are not part of the current design considerations. The abstract view enables us to quickly characterize existing work by identifying the structure, the granularity, and the importance of descriptive and physical components. The abstraction is general and extendable, as required by the first two statements at the beginning of this section. We use this conceptual idea to introduce transprecision computing and related methodologies in the next section.

4.2.1 Transprecision concepts

To demonstrate transprecision computing as a general concept that improves the next generation of computing systems, we establish key conceptual metrics. They assess different solutions of computing systems to solve a given task. We define quality and performance as follows:

Definition 3 (Quality). Quantifies a measurable, reproducible success of a solution obtained with a computing system for a given task. The quality is solely a property of the obtained result. Synonyms: accuracy, algorithmic performance, result performance. Antonyms: result costs, residual, error. Typical: top-k accuracy, L1 or L2 residual error.

Definition 4 (Performance). Quantifies a measurable, reproducible amount of work accomplished by a computing system to achieve any solution to a given task. The performance is solely a property of the system and of how it behaves to obtain a solution. Synonyms: execution speed, throughput, bandwidth. Antonyms: low latency, short time, low complexity, low power. Typical: runtime measurements, energy-to-solution measurements.

The provided definitions allow the output to be quantitatively assessed independently from the process that generated the output.

In some literature, the terms quality and performance are defined differently to express the same statements. To avoid ambiguity, consider the statement [...] algorithm A outperforms algorithm B in all cases [...]. If the statement is made in the context that the results produced by A are better than the results produced by B, it refers to quality. However, if the statement means that A finishes about x times faster than B, it refers to performance, which characterizes the process that produced the result. With the definitions of quality and performance we are able to state the following transprecision computing definition:

Definition 5 (Transprecision Computing). A transprecision computing system is a computing system characterized through quality, performance, and trade-offs thereon, S := (Q, P, R_{P,Q}). The configuration space Θ defines transprecision root-cause settings, and the transprecision system is defined through a quality characterization Q(·): Θ → ℝ, θ ↦ q, a performance characterization P(·): Θ → ℝ, θ ↦ p, and the trade-off thereon R_{P,Q}: Θ → ℝ², θ ↦ (p, q)_θ for p = P(θ) and q = Q(θ). A transprecision system based on the configuration space Θ is a family of systems parametrized by the transprecision-specific setting θ, which affects both the quality of the results it produces and the performance of the process that produces the results.

A transprecision system is a parametrizable family of systems. Specific settings θ in the transprecision configuration space Θ affect both the quality and the performance of the final solution. The characterization functions Q and P map the configuration θ onto scalar quality and performance metrics. Trade-offs T_{P,Q} between quality and performance are obtained by stating both for an operating point at a specific setting θ ∈ Θ. The root-cause is the effect that causes different configurations to impact quality and performance. We define a transprecision system generally, including various effects that act as root-causes. This generality causes some of the defined concepts to overlap with established approximate computing approaches. Henceforth, we define a list of features that distinguish a system as transprecise:

• The configuration space Θ is non-trivial.

• Either the quality or the performance characterization function (Q or P) is non-trivial.

• The evaluation of either the quality or the performance characterization function spans multiple granularity levels of a computing system.

• The main effects of the quality and the performance characterization are caused by different components of a computing system.

The first criterion excludes cases where, for example, only two algorithms are compared, such as a reference against an approximate formulation. The second and third criteria are satisfied if the root-cause effect propagates non-trivially through different granularity levels of a computing system. For example, adjusting the precision of a variable meets those requirements: micro effects introduce a scalar quantization error, macro effects cause software library routines to produce different numerical results, and system effects impact the final application quality. The fourth criterion highlights that the effects on the quality and the performance of a system propagate through different components of a computing system. For example, loop perforation satisfies the first three criteria. However, it fails on the fourth criterion since performance gains and quality degradations are caused inside the same component, namely the software component that defines the loop under consideration. In contrast, using reduced precision typically affects quality through the software stack while performance is gained through the hardware stack.

Next, we define common tasks related to transprecision systems:

Configuration space design. Defines the transprecision configuration space Θ. Especially if the root-cause has its impact at the micro level, typical configurations at the system level are composed of many sub-configurations, defining large configuration spaces.

Characterization. Evaluates quality and performance for different configurations. Depending on the root-cause, the analysis is performed analytically, by estimation, with statistical models, or by empirically emulating the transprecision system.

Configuration optimization. Deals with finding the best configurations θ ∈ Θ. The following equation states the best-performing solution for a given quality constraint,

θ_opt1 = arg max_{θ∈Θ} P(θ)   s.t.   Q(θ) ≥ q_ref − ∆q,    (4.8)

where the quality constraint is expressed as a maximal allowed degradation ∆q from a known reference quality q_ref. Similarly, the following equation states the optimal-quality solution that achieves a fixed speed-up η:

θ_opt2 = arg max_{θ∈Θ} Q(θ)   s.t.   P(θ) ≤ η · p_ref.    (4.9)

A common task for transprecision systems is to find the Pareto-optimal frontier [147] of configurations with respect to the quality-versus-performance trade-off. A configuration θ1 dominates another configuration θ2 if, and only if, it is better in both aspects, the quality and the performance delivered by that solution. We denote that θ1 is the preferred solution by T_{p,q}(θ1) ≻ T_{p,q}(θ2). Formally, the Pareto-optimal frontier is given as the set of all configurations for which no better configuration exists:

Θ_opt = { θ ∈ Θ : { θ* ∈ Θ : T_{p,q}(θ*) ≻ T_{p,q}(θ) } = ∅ }.    (4.10)
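As an illustration of Equation (4.10), the following sketch (illustrative only; the sample values and the assumption that larger is better for both metrics are ours) extracts the non-dominated configurations from a set of characterized operating points:

#include <cstdio>
#include <vector>

// One characterized operating point (p, q)_theta; larger is better for both.
struct Point { double p; double q; int theta; };

// a dominates b if it is at least as good in both metrics and strictly better in one.
bool dominates(const Point& a, const Point& b) {
    return a.p >= b.p && a.q >= b.q && (a.p > b.p || a.q > b.q);
}

// Theta_opt: all points for which no dominating point exists, Eq. (4.10).
std::vector<Point> pareto_frontier(const std::vector<Point>& pts) {
    std::vector<Point> frontier;
    for (const Point& cand : pts) {
        bool dominated = false;
        for (const Point& other : pts)
            if (dominates(other, cand)) { dominated = true; break; }
        if (!dominated) frontier.push_back(cand);
    }
    return frontier;
}

int main() {
    std::vector<Point> pts = {{1.0, 0.99, 0}, {2.0, 0.97, 1}, {1.5, 0.96, 2}, {4.0, 0.90, 3}};
    for (const Point& x : pareto_frontier(pts))
        std::printf("theta=%d  P=%.2f  Q=%.2f\n", x.theta, x.p, x.q);   // keeps 0, 1, 3
}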

4.2.2 Reduced precision as root-cause

Definition 5 defines transprecision computing as a system that performs tasks with a configuration-dependent quality and performance. However, the root-causes that affect quality and performance are not specified in the abstraction, in order to cover all cases that obey the same philosophy. Well-suited root-causes for transprecision computing satisfy the following criteria:

• Are applicable at different components of a computing system;

• Have the potential for performance gains;

• Allow defining clear interfaces between components for modular developments;

• Allow configuration spaces to be well defined and searched;

• Allow the search for good configurations to be automated.

Number format representations and their arithmetic deliver a well-suited transprecision root-cause to enhance computing systems. We formalize the idea of reduced precision by assuming a number representation and arithmetic denoted as a data type T, parametrizable with parameters θ. Integrating reduced precision into applications and hardware causes the following chained effects.

First, quality is affected by the numeric representation and by how variations propagate through an application. We denote the quality characterization caused by reduced precision as the numerical behavior of an application. The numerical behavior propagates directly through the algorithmic granularity levels. On the micro level, different data types cause quantization effects, such as representing scalar values differently. On the macro level, micro effects are applied to larger data chunks or chains of arithmetic computations; they accumulate and propagate quantization effects through a system.

Second, reducing the precision allows operating the hardware with higher performance, as it has to perform less work per data item. We assume that the reduced-precision type T has a smaller bit-width representation than the data type used in the original implementation. The reduction has several implications and allows improving multiple hardware components. First, if the hardware supports transprecision computing features, performance gains are directly measurable. However, in some cases only a subset of all potential configurations might be supported in hardware. That restriction might limit performance gains that would be achievable with finer-granularity considerations. For example, most traditional computing cores include floating-point support for the 32-bit and 64-bit IEEE 754 number representations. Recently, especially in GPUs, the IEEE 754 half type with a bit-width of 16 bits is natively supported and extends the granularity of available precisions. Second, the feasibility of better-performing hardware is well motivated. For example, moving from a single 32-bit format to a two-way vectorized operation on 16-bit representations extends the instruction set similarly to four-way vectorization and leads to corresponding performance gains. Additionally, bandwidth-limited, contiguous memory accesses scale proportionally to the transferred data volume, and reductions in data types translate into a reduced overall data volume. The critical target for developers is to demonstrate that the application delivers the required quality with reduced-precision computations.

Reducing precision implements transprecision concepts in a modular way. By defining the reduced-precision representations and arithmetic operations, the transprecision root-cause is specified and quantified. The scalar-level definition of the data type and elementary operations builds the modular micro-level abstraction of reduced-precision computing. Software libraries extend and emulate the numeric behavior at the macro and system level of applications. The scalar-level abstractions of number format representations act as specifications for designing reduced-precision units in FPGAs or ASICs.

The availability of a few primitives for non-standard data types allows analyzing complex numerical behavior at the application level without direct concerns about hardware implementations. The common use of reduced-precision emulation libraries implies the transprecision configuration space Θ.
Emulations allow the quality of applications, Q(θ), to be evaluated empirically. Analyzing the numerical behavior powers the development and optimization of specific problem instances.
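As a minimal illustration of such an empirical evaluation (a toy example of ours, not code from the thesis), the following sketch sweeps a configuration space of significand widths and reports the resulting quality of a dot product relative to a double-precision reference; exponent-range effects are ignored:

#include <cmath>
#include <cstdio>
#include <vector>

// Quantize x to a significand of 1 + t bits (round-to-nearest), emulating a
// reduced-precision load/store; the exponent range is left untouched.
double quantize(double x, int t) {
    if (x == 0.0) return 0.0;
    int e;
    double m = std::frexp(x, &e);   // m in [0.5, 1)
    return std::ldexp(std::nearbyint(std::ldexp(m, t + 1)), e - (t + 1));
}

int main() {
    std::vector<double> a(1000), b(1000);
    for (int i = 0; i < 1000; ++i) { a[i] = std::sin(i); b[i] = std::cos(i); }

    double ref = 0.0;
    for (int i = 0; i < 1000; ++i) ref += a[i] * b[i];

    // Sweep the configuration space Theta = {4, 8, ..., 52} significand bits and
    // record the quality Q(theta) as the relative error of the dot product.
    for (int t = 4; t <= 52; t += 4) {
        double acc = 0.0;
        for (int i = 0; i < 1000; ++i)
            acc = quantize(acc + quantize(a[i], t) * quantize(b[i], t), t);
        std::printf("t = %2d  rel. error = %.3e\n", t, std::fabs(acc - ref) / std::fabs(ref));
    }
}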

4.2.3 Transprecision computing in current solutions

Due to the general formulation of transprecision computing, we discuss to what extent it overlaps with existing products or implementations. Additionally, we highlight the key features that we envision for transprecision computing over the next decade. First, we observe the following:

Observation 1 (FPGAs and ASICs). Well-engineered FPGA and ASIC designs follow the concept of transprecision computing.

A typical hardware design flow includes transprecision concepts at different stages: a) at gate-level synthesis, b) through categorical design choices, and c) at register-transfer level (RTL) data-path design.

First, we understand the synthesis of gate-level netlists from HDL abstractions as a corner case of transprecision computing. Proficient commercial tools, such as the Synopsys Design Compiler¹, are highly specialized to automate this process. The configuration space Θ, as in our transprecision formulation, maps to the space of options and flags that can be passed to the synthesis process. The standard analysis of hardware designs states costs in terms of silicon area against the worst-case propagation delay. The concept of the so-called AT-plots maps to transprecision concepts by identifying the area and the delay as two orthogonal performance metrics P_area(θ) and P_delay(θ). Since valid synthesized implementations obey a defined behavior, the quality is fixed: results produced by the circuit are equivalent.

Second, designs include the opportunity for categorical choices of subcomponents or architectural choices of functional units. For example, using alternative DesignWare components affects performance metrics, e.g., choosing an 8-bit ripple-carry adder versus an 8-bit carry-lookahead adder. Quality and performance are affected when using different adder variants, e.g., a 10-bit adder instead of an 8-bit adder. High-level architectural choices, for example sequential operation or binary-tree-like reductions, might impact quality as well; for example, quantization caused by rounding leads to non-commutative behavior of summations.

Third, data-path design involves defining the bit-widths of computational units and related registers, which directly impact the quality and performance. Designs that rely on look-up tables (LUTs) profit highly from slim input address spaces due to the exponential area scaling. The choices made for the LUT design affect the quality and performance of the final solution.

Hardware designs provide highly performant implementations, very small designs, and low-power operation. Those solutions are obtained through investments including long development time, human expert knowledge, and engineering, licensing, and manufacturing expenses. Driven by those factors, engineers make decisions to develop and achieve reasonable solutions. Depending on the effort spent on design-space elaborations that include numerical aspects, such implementations follow the transprecision computing paradigm.

However, most designs rely on expert knowledge to make reasonable assumptions. They miss a systematic elaboration of the numerical configuration space with functions describing the effects on quality and performance. Moreover, most design questions are considered for specific circuits, which impedes generalizing and reusing
similar considerations for new problems. In our vision, transprecision computing scales as a general concept that deploys efficiently to new settings.

¹ Available at https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test.html (November 2019)

Observation 2 (Vision of transprecision (TP)). Transprecision computing aims to co-design hardware and software for general-purpose systems. The provided solutions share the flexibility known from general-purpose computing but aim to deliver performance gains comparable to those known from ASIC development.

In this chapter, we have defined and abstracted the paradigm of transprecision computing. The definition is general enough to cover aspects from physics and hardware design up to regular system and application development. The long-term vision is a computing system that is as easy to use as known from software development. At the same time, transprecision computing delivers performance superior to today's available hardware. In contrast to regular programming, the achieved performance gains stem from quality-versus-precision considerations that affect multiple software and hardware components. Those effects materialize in performance improvements similar to those known from FPGA or ASIC design flows. Since the full extent of this vision is out of the scope of this thesis, we focus on the core components. To that end, Chapter 5 presents the implementation of a reduced-precision library, followed by the application-level integration of deep learning use-cases presented in Chapter 7.

4.3 Summary and conclusion

The most important findings of this chapter are the following:

• We present the two predominant number representations, the two's-complement fixed-point representation and the IEEE 754 floating-point standard. We highlight the importance of the availability of high-dynamic-range computations, which justifies the use of floating-point formats. We summarize the LNS as an alternative to the IEEE 754 standard that delivers similar dynamic ranges and precision levels up to 32-bit implementations.

Research around LNS focuses on the implementation of units for addition and subtraction, which are inherently based on lookup and interpolation methods.

• We define a computing system as consisting of descriptive and physical components. Further, we define a transprecision computing system as a configurable system that produces results with measurable performance and measurable quality. We identify the concept of reduced precision in numeric representations as a root-cause for transprecision computing. Developing and evaluating a transprecision system involves the following steps:

1. Identify the transprecision root-cause.
2. Abstract the impact of the root-cause on application quality.
3. Abstract the impact of the root-cause on processing performance.
4. Evaluate the transprecision system by implementing, emulating, or estimating core components.
5. Define and solve specific goals (design the configuration space, use configuration search algorithms, solve for specific quality constraints).

• We observe that the concept of transprecision computing is already applied with success in certain FPGA or ASIC design flows that include systematic numerical considerations to improve performance. We envision that future general-purpose systems can profit from similar considerations.

Chapter 5

Emulating numerical behavior of applications

The conventional IEEE 754 single- and double-precision floating-point formats dominate current hardware architectures, while a few recent GPUs extend full support for arithmetic operations to the IEEE 754 half format. The discussion in Section 4.2.2 suggests revealing the hidden benefits of customized FPUs that operate with non-conventional formats at a reduced precision. Such considerations are especially important in applications that justify the development of ASIC or FPGA designs. However, since traditional software is written for classic processors, most of the existing code performs numeric computations with natively supported data types only. While engineering work for ASIC and FPGA designs includes numerical studies, these are mostly performed in application-specific settings for conventional fixed-point formats that are directly supported on the system for which they are developed. To demonstrate the generality, feasibility, and scalability of transprecision computing, a simple software integration becomes necessary: the numerical behavior of non-conventional reduced-precision number formats has to be studied at the application level. To that end, we present a flexible C++ emulation framework, named floatx, to investigate the effect of reduced-precision floating-point formats in numerical

applications. Here, reduced or low precision refers to a data type that does not have more significand or exponent bits than the IEEE 754 double-precision format.

5.1 The floatx library

5.1.1 Related work

Many software packages exist that allow operating numeric computations with non-native floating-point precision. Among these, the GNU multiple-precision floating-point reliable (MPFR) [148] library¹ is a de-facto standard for arbitrary, higher-than-regular precision, used for example in gcc. Similarly to GNU MPFR, many other floating-point emulation tools (see, e.g., the survey at https://www.mpfr.org/) provide software support for higher-than-native precision and, therefore, serve a different purpose. In the following review, we target floating-point emulation tools that focus on reduced floating-point precision.

INTerval LABoratory (INTLAB) [149] is a Matlab package that offers the fl-numbers, a concept similar to floatx. Fl-numbers have at most 26 significand bits (including the implied one); the maximal exponent range is [−241, 242] (both lower than in floatx numbers); and global variables exist that control the numbers of significand and exponent bits in effect². That last trait makes INTLAB slightly less flexible than floatx, where the precision is a property of the data itself.

The Sipe [150] C library³ is a header-only tool for experimenting with floating-point algorithms at very low precision. It supports the correctly-rounded basic arithmetic operations, except divisions and square roots, but with fused multiply-adds (FMAs). Compared with floatx, it stores the number's value either in a native backend floating-point type or as a pair of two integers (for the non-normalized significand and the exponent parts, respectively), while the precision is a property of an operation on the data.

¹ Available at http://www.mpfr.org/ (version 4.0.1, February 2018)
² Available at http://www.ti3.tuhh.de/intlab/demos/html/dfl.html (accessed September 2019)
³ Available at http://www.vinc17.net/research/sipe/ (accessed September 2019)

Precimonious [151] is a program analysis tool based on the low-level virtual machine (LLVM) that analyzes floating-point program variables in an attempt to lower their precision. This tool recommends the smallest data type for each variable that produces an accurate enough answer for a representative set of program inputs.

Universal numbers (UNUMs) and their newer version, posits [152], are variable-size alternatives to the IEEE 754 formats. The floatx library also provides variable-size formats, but those are based on the IEEE 754 standard and the number of bits is fixed prior to any computations. In contrast, the unum formats could grow automatically if required by the computation. Furthermore, there is no easy mapping between the unum and IEEE 754 formats.

Berkeley's SoftFloat [153] is a library that provides a floating-point IEEE 754 implementation using only integer operations. This implementation is limited to the standard types plus the legacy 80-bit extended format from Intel/Motorola. In contrast, floatx uses the floating-point hardware to greatly reduce code size with respect to SoftFloat as well as to support any format smaller than the underlying base format. Also, floatx leverages C++ overloaded operators to ease the integration into existing programs.

FlexFloat [154] is a C software library enhanced with C++ wrappers that enables the exploration of numerical effects by tuning both the precision and the dynamic range of program variables. The purpose of FlexFloat is, therefore, very similar to that of our floatx. Nonetheless, as we discuss in the following section, floatx presents several programming advantages, due to the adoption of C++ as the backend framework language, that are difficult to attain with a solution based on C.

5.1.2 Interface and design goals

The floatx library was developed to simplify the transition from existing code to reduced precision. To that end, it is designed to offer an intuitive and easy-to-use interface whenever possible, and it implements the core functionality of emulating reduced-precision floating-point formats coherent with the IEEE 754 standard. The library is designed around the following goals:

• The floatx types should be an extension of the standard binary floating-point types (float and double), and their interface should be equivalent to them. This means that, for any two floatx objects a and b, the following expressions/operations should offer the expected semantics, as in the case of float and double:
a + b, a / b, ...
a < b, a >= b, ...
a = b, a += b, ...

• The floatx types should also be interoperable with built-in numeric types (signed and unsigned integers, built-in floating-point types) — the expressions above should be valid even if a and b are of different floatx types, or if one of them is a built-in type. This implies that floatx types should support a set of implicit conversions compatible with the standard numeric promotions and numeric conversions of built-in types.

• The size of a floatx object should never be larger than the size of a double. This simplifies the porting of existing codes to operate on top of floatx, as it is then possible to embed a floatx value into the storage space originally used for a double. This ensures that parts of the code which just move or read data, but perform no floating-point operations, do not have to be modified.

Listing 5.1 provides a small example that demonstrates the features and interface of floatx. Lines 1, 2, and 7 show how floatx numbers can be constructed from built-in types (floating-point numbers and integers) and read from C++ streams. Lines 8 and 9 show how these objects are used to perform basic arithmetic and relational operations. Lines 10–13 demonstrate the interoperability between different floatx and built-in types. The comments on the right specify the return type of each operation. Note that T == U, where T and U are types, is used to convey that these two types are the same, i.e., that std::is_same<T, U>::value evaluates to true. Lines 8 and 11–13 also show that floatx types can be implicitly converted to other floatx types or built-in

 1 flx::floatx<7, 12> a = 1.2;   // 7 exponent bits, 12 sign. bits
 2 flx::floatx<7, 12> b = 3;     // 7 exponent bits, 12 sign. bits
 3 flx::floatx<10, 9> c;         // 10 exponent bits, 9 sign. bits
 4 float d = 3.2;
 5 double e = 5.2;
 6
 7 std::cin >> c;
 8 c = a + b;        // decltype(a+b) == floatx<7, 12>
 9 bool t = a < b;   // decltype(a<b) == bool
10 // ... (mixed floatx/built-in operations; not recoverable from the extracted source)
11 // ...
12 e = c - d;        // decltype(c-d) == floatx<10, 23>
13 c = a * e;        // decltype(a*e) == floatx<11, 52>
14 std::cout << c;

Listing 5.1: Sample code using floatx.

types. Finally, line 14 shows how floatx types can be written to an output stream.

5.1.3 The choice of C++

Given the restrictions imposed by the design goals stated at the beginning of this section, the language of choice needs to support a powerful data type system, which enables programmers to define their own types and operators on those types. Besides, the language has to support some sort of type arithmetic that is general enough to specify numeric promotion and implicit conversion rules for custom types. Furthermore, the design goal of supporting custom types that are almost as precise as double, and that fit into the same storage space as double, means that custom types are not allowed to incur any memory overhead. The information about the precision of the type cannot be maintained as additional data and has to become part of the type itself. These requirements discard C as a possible candidate. In contrast, the C++ programming language supports both user-defined types and operator overloading, and the abstractions built in C++ do not impose any memory overhead. C++ templates provide an easy means to incorporate precision information into the type as well as to define arbitrary conversion rules using this information. The latter is possible since C++ templates (are believed to) form a Turing-complete language [155] that is evaluated during compilation. Additionally, most C++ compilers are capable of inlining function calls and optimizing expressions that can be evaluated at compile time, which improves the performance of floatx types.

Applications using C and Fortran 77 can be ported to C++ (and thus integrated with floatx) with a reasonable programming effort. Since C++ maintains a good level of compatibility with C, applications written in the latter language can usually be compiled as C++ applications with minimal modifications. Fortran 77 applications (unlike more modern Fortran variants) can be translated into C using the tool f2c and then compiled with a C++ compiler. In Chapter 6 we detail how we integrate floatx behind PyTorch [62] to provide the functionality in a Python-based deep learning framework and to ease the integration into any deep learning model.

5.1.4 The floatx class template

To achieve the third design goal specified in the introduction of this section, and to satisfy the first and second goals, different precisions in floatx are implemented as distinct specializations of the floatx class template. The numbers of exponent and significand bits used in the format are encoded into the type via integral template parameters. The floatx library uses a hardware-supported floating-point type as a backend, which in turn is used to store the data, simplify the implementation, and improve the performance of arithmetic and relational operations. The binary representation of any floatx value is equivalent to the rounded and (whenever possible) normalized representation of the same value in the backend type. Note that the set of all representable values in a backend floating-point type T_{wB,tB}, with wB exponent bits and tB significand bits, is a superset of all representable values in any floating-point type with w ≤ wB exponent bits and t ≤ tB significand bits. Thus, T_{wB,tB} can be used as a backend for any

floatx type that is, at most, as precise as T_{wB,tB}. There are advantages and disadvantages to this approach. On the positive side, a comparison of two floatx values amounts to a comparison of the underlying backend values, which yields a considerable performance benefit. The textual output is likewise straightforward. Additionally, all arithmetic operations are performed on those backend values. The result is then rounded following the precision of the floatx type and again stored (exactly) in the backend type.

Figure 5.1: The effect of double rounding can cause different results. Left: directly rounding to a precision of p mantissa bits truncates due to the 0 at position p + 1. Right: double rounding rounds up twice due to a carry bit in the intermediate representation.

Values such as ±Inf or NaN are directly stored in the backend type without any conversion. For efficiency, floating-point status flags are not supported by floatx. Also, floatx multiplication and addition are not fused in the current implementation, as this optimization is rarely worth the extra template meta-programming effort.

All arithmetic operations are performed in the backend type, using the corresponding machine floating-point arithmetic instruction. The result of any operation is inevitably rounded back to that type, using the machine's rounding mode in effect (usually round-to-nearest, ties-to-even), before it is rounded again to the target floatx type; this is called double rounding [156, 157]. Double rounding can fail when the first step results in a value that is exactly at the same distance from two values in the reduced precision. In the following discussion, X stands for an arbitrary bit and arrows indicate roundings. Figure 5.1 shows one example of double rounding in which the bit addition carries up to the last bit before the reduced significand. An additional example of a double-rounding failure is when the original value is close to the middle point in high precision, as depicted in Figure 5.2. In this case, at least one bit after q + 1 must be set in the original value. The same situation arises if the last bits after q in the original value are exactly 10...0. In both examples the error of the double-rounded value is |x − Q_p(Q_q(x))| ≤ ∆_p + ∆_q, which

Figure 5.2: The effect of double rounding can cause different results. Left: directly rounding to a precision of p mantissa bits rounds up due to the non-zero tail of the high-precision value. Right: double rounding truncates twice since the non-zero part is lost in the first rounding step.

is approximately the expected error for a significand of p bits, that is ∆_p, when p ≪ q. This should be the typical case for floatx as, in most cases, we expect the user to be interested in formats that are smaller than single precision. Formats close to double precision (more than 44 significand bits) are not quite as interesting because they provide much lower benefits from the hardware point of view.

The correct result without double rounding can be obtained by exploiting only backend arithmetic [158]. Also, the iterative correction proposed in [159] could be used for any elementary mathematical function. The floatx library does not employ either approach because both have a large overhead that is not worth the extra accuracy. Another way to avoid double rounding is to employ round-to-odd [160]. This rounding mode is very efficient and without bias, but the maximum error is the same as with truncation, and twice as large as for round-to-nearest. The floatx library does not implement round-to-odd because it is not in the IEEE 754 standard. The rounding routine currently integrated into floatx has a limitation in that it only supports double as the backend type, and only the round-to-nearest, ties-to-even rounding mode. A more general rounding routine with support for different rounding modes and backend types is planned for the future.
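The double-rounding discrepancy described above can be reproduced with plain double arithmetic. The following small sketch (our own illustration, independent of the floatx code) rounds a value first to q = 11 and then to p = 4 fractional significand bits and compares the result against rounding directly to p bits:

#include <cmath>
#include <cstdio>

// Round x to a significand of 1 + t bits (1 implicit + t fractional bits),
// using the default round-to-nearest, ties-to-even mode of nearbyint().
double round_sig(double x, int t) {
    if (x == 0.0) return 0.0;
    int e;
    double m = std::frexp(x, &e);   // m in [0.5, 1)
    return std::ldexp(std::nearbyint(std::ldexp(m, t + 1)), e - (t + 1));
}

int main() {
    // x = 1.000010000001 in binary: a tie at p = 4 once the trailing bit is lost.
    const double x = 1.0 + std::ldexp(1.0, -5) + std::ldexp(1.0, -12);
    const double direct = round_sig(x, 4);                  // rounds up: 1.0625
    const double twice  = round_sig(round_sig(x, 11), 4);   // two ties broken to even: 1.0
    std::printf("direct: %.6f  double-rounded: %.6f\n", direct, twice);
}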

1 template <int exp_bits, int sig_bits, class backend_float>
2 class floatx {
3 private:
4     backend_float data;
5 };

Listing 5.2: Basic structure of floatx.

Listing 5.2 shows the outline of the floatx class template and demonstrates how the desired binary representation is achieved. Listing 5.3 is a verbatim copy of floatx's top-level rounding routine, which demonstrates the steps required to transform a value of the backend type into a value of the floatx type. The rounding process constructs the new value by converting the exponent and mantissa from the original value (the sign is unchanged). No conversion is required for Infinity and NaN values, nor if the floatx type is configured with the same parameters as the backend type. The first step in the rounding process is to extract the bits for the mantissa, exponent, and sign using bitwise operations. This process is more complicated for the exponent because of the bias. Next, each part of the number is fitted into the target format. If the exponent is too small, the number will be denormalized; if it is too large, the result will be Infinity. The mantissa is rounded to the nearest value and corrected if it is too large. Finally, the rounded value is constructed from the original sign and the revised mantissa and exponent.

5.1.5 Operations on floatx objects

As already noted in Section 5.1.4, all operations on floatx values are trivially implemented in terms of operations on the backend data type and the rounding routine. In consequence:

• Comparing two floatx values is equivalent to comparing their representations in the backend type.

• Any arithmetic operation on two floatx values is equivalent (up to issues with double rounding) to the same arithmetic operation on their backend representations, followed by a rounding of the result.

1 backend_float enforce_rounding(backend_float value) noexcept { 2 const auto exp_bits = get_exp_bits(self()); 3 const auto sig_bits = get_sig_bits(self()); 4 5 if (exp_bits == float_traits::exp_bits && 6 sig_bits == float_traits::sig_bits) 7 return value; 8 9 bits_type bits = reinterpret_as_bits(value); 10 auto sig = (bits & backend_sig_mask) >> backend_sig_pos; 11 auto raw_exp = bits & backend_exp_mask; 12 const auto sgn = bits & backend_sgn_mask; 13 14 int exp = (raw_exp >> backend_exp_pos) - backend_bias; 15 const int emax = (1 << (exp_bits - 1)) - 1; 16 const int emin = 1 - emax; 17 18 if (!is_nan_or_inf(bits)) 19 { 20 if (is_small(exp, emin)) 21 convert_subnormal_mantissa_and_exp(bits, sig_bits, emin, exp, sig, raw_exp); 22 else 23 sig = round_nearest(sig, backend_sig_bits - sig_bits); 24 if (significand_is_out_of_range(sig)) 25 fix_too_large_mantissa(sig_bits, exp, sig, raw_exp); 26 if (exponent_is_out_of_range(exp, emax)) 27 bits = assemble_inf_number(sgn); 28 else 29 bits = assemble_regular_number(sgn, sig, raw_exp); 30 } 31 return reinterpret_bits_as(bits); 32 } Listing 5.3: Main floatx rounding routine.

• Printing a floatx value to a stream is equivalent to printing its backend representation. • Reading a floatx value from a stream is equivalent to reading it as a backend value and rounding the result. • Converting a numeric type into a floatx type is equivalent to converting it into the backend type and rounding the result. • Converting a floatx type into a numeric type is equivalent to converting its backend representation into that numeric type. • Converting between two floatx types is equivalent to re-rounding the backend representation.

As a result, the non-trivial part of floatx remains hidden in the implementation, exposing an intuitive interface to the user which makes

floatx behave as expected in terms of semantics and supports a flat learning curve. Creating floatx objects and casting between floatx objects (as well as between built-in and floatx objects) should be as easy as creating and casting between built-in objects. To attain this, floatx defines a set of converting constructors for creating floatx objects and casting to floatx objects, as well as a set of conversion operators for casting from floatx objects to built-in types. To support arithmetic and relational operations using the same syntax as that of built-in types, the library provides a complete set of arithmetic and relational operator overloads for floatx numbers. In addition, stream input and output operator overloads are also provided to simplify writing and reading floatx objects to and from C++ streams.

Supporting interoperability between distinct types (i.e., accommodating operations involving operands of different types) is more difficult. This requires extending the standard numeric promotion and conversion rules to include floatx numbers. The C++ definition of these rules relies on the concept of a common type, a rigorous definition of which, including the set of implicit conversions and promotions, can be found in the C++ standard [161]. In short, the common type of two floating-point types F and G is the more precise of the two, and the common type of an integral type I and a floating-point type F is the floating-point type F. All binary operations are performed by first converting both operands to their common type and then performing the operation on the converted values. The result of the operation is of the common type. In addition, if the integer being converted is out of the representation range of the common type (this can happen when converting a large integer to a low-precision floating-point type), the result of the operation is unspecified.

Applying these rules directly to floatx is impossible, since there are pairs of floatx types such that neither of them can be considered more precise (e.g., floatx<7, 9> and floatx<10, 6>). Thus, floatx has to extend these rules in a way that does not modify their definition for standard types, while maintaining the desirable properties of those types for floatx's emulated types. Some of the properties that we consider crucial are the following:

• The common type of two operands of the same type T is the type T itself; that is, common_type(T,T ) = T .

• For any two floating-point types R and S, and objects a and b of those types, respectively, consider the statement c = a + b; (or, equivalently, any basic arithmetic operation). If c is of type T, the final result stored in c is equivalent (up to effects of double rounding) to the infinitely precise result of a + b, properly rounded to type T.

• For any two floating-point types R and S, common_type(R,S) is the smallest type which has at least as many exponent and significand bits as both R and S. To remove ambiguities with the standard specification of a common type, floatx uses the following definition.

Definition 6 (Common type for floatx). Let e_S and e_T denote the number of bits reserved for the biased representation of the exponent in the floating-point types S and T, and let m_S and m_T stand for the number of bits in the significand (mantissa) of the same types. For two floating-point types S and T, floatx defines common_type(S, T) as a type with max(e_S, e_T) bits reserved for the exponent and max(m_S, m_T) bits for the significand. For a floating-point type S and an integral type I, common_type(S, I) = S.

Note that this definition satisfies all of the above-mentioned desirable properties and keeps the original semantics of a common type unchanged for the built-in types. Since the floatx definition extends the original C++ definition, floatx only needs to implement the extensions for operations where at least one of the operands is of a type provided by floatx. This is done by providing more general operator overloads that can take any operands, as long as at least one of them is a floatx-provided type, and then converting them into their common type. The common type is determined at compile time, using a combination of meta-programming techniques including trait classes for built-in and floatx-provided types, substitution failure is not an error (SFINAE) [161], and the std::enable_if standard C++ utility.

1 template <class backend_float>
2 class floatxr {
3 private:
4     backend_float data;
5     metadata_type exp_bits;
6     metadata_type sig_bits;
7 };

Listing 5.4: Basic structure of floatxr.

For implementation details, refer to the floatx source code⁴. The common type resulting from such an overload is always a floatx-provided type with the significand and exponent bits set to the values specified in Definition 6. The backend type is set to the common type of the backend types of the operands (if one of the operands is a built-in type T, its backend type is considered to also be T).
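To illustrate the flavor of such a compile-time common-type computation, consider the following simplified sketch of the idea (not the actual floatx implementation; the names fp and common_fp are ours):

#include <type_traits>

// Stand-in for a reduced-precision type parametrized by exponent and
// significand bits, in the spirit of flx::floatx<E, S>.
template <int E, int S>
struct fp {};

// Common type following Definition 6: take the maximum number of exponent
// bits and the maximum number of significand bits of the two operands.
template <class T, class U>
struct common_fp;

template <int E1, int S1, int E2, int S2>
struct common_fp<fp<E1, S1>, fp<E2, S2>> {
    using type = fp<(E1 > E2 ? E1 : E2), (S1 > S2 ? S1 : S2)>;
};

// Neither fp<7, 9> nor fp<10, 6> is "more precise"; the common type is fp<10, 9>.
static_assert(
    std::is_same<common_fp<fp<7, 9>, fp<10, 6>>::type, fp<10, 9>>::value,
    "common type takes the element-wise maximum of the bit counts");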

5.1.6 The floatxr class template

One downside of floatx is that the range and the precision of a type need to be known at compile time. Since the goal of floatx is to ease and accelerate experimentation with low precision, this limit can sometimes be too restrictive. There are applications (e.g., Jacobi linear solvers [162]) for which it is interesting to start with low precision and increase the number of bits in small steps as the algorithm progresses. In this case, having a different type for each step would make the actual implementation very complex. Besides that, the intent to evaluate an algorithm with all significand sizes between S1 and S2 and return the optimal one is difficult to express, as it would either require the user to recompile the code for each size or to use template meta-programming to write the compile-time equivalent of a loop. For this reason, we introduce another class template, floatxr, which allows the user to set the precision at runtime. Listing 5.4 displays the data layout of this type.

Specifying the precision at runtime comes at the cost of an increased memory footprint, and as such, floatxr does not fulfill the third

design goal specified at the beginning of this section. However, floatxr and floatx can be implemented to use the same code base, so there is only one rounding routine and one set of operators to maintain (and the compiler simply optimizes the same generic function templates more efficiently when instantiating them for floatx than for floatxr). Also, floatxr types support the same operations and conversions as floatx types, and they interoperate with floatx and built-in types. As stated in the following definition, the latter requires a minor revision of the common type, since the precision of floatxr is not known at compile time.

⁴ Available at https://github.com/oprecomp/FloatX (accessed September 2019)

Definition 7 (Common type for floatxr). For two types S and T:

• If at least one of S and T is a floatxr type, their common type is a floatxr type whose backend type is the common type of S and T’s backend types.

• Otherwise, the common type is deduced as specified in Defini- tion 6.

5.1.7 Notes on concurrency

Floatx variables maintain no state (i.e., there are no routines for setting, nor global variables holding, the current working precision, since this is either a property of the types involved in an operation and/or a property of the data). The rounding operation itself also depends only on the properties of the backend and destination types. An important advantage of the stateless floatx compared with the stateful floatxr is that the framework can be safely used from many concurrent threads of execution, in the sense that no arithmetic or relational operation on floatx variables involves accessing any other non-constant data. The threads in question can either be operating-system threads on a CPU or CUDA threads on a GPU. It should also be noted that the rounding operation and the functions employed in floatx consist of the same code in both the CPU and the GPU cases, up to certain integer compiler intrinsics and the necessary CUDA device function attributes. Since changing the precision of a floatxr variable at runtime requires modifying its two metadata components and re-rounding the value in the backend component, neither read nor write access to the variable from another thread should be allowed while the operation is in progress. In general, there are no special guarantees of atomicity for any floatx type or operation. For example, an assignment of a floatx variable b to a involves creating a temporary floatx with the backend value set to the one of b, re-rounding that value in place, and finally replacing a's backend component with the temporary's (up to any compiler optimizations that might be applied). Users should, therefore, rely on standard mutual exclusion primitives to avoid data races during concurrent access to floatx variables of any type.

5.1.8 Advanced properties and performance of floatx

An additional advantage of the stateless floatx is that the memory size and the alignment requirements are both identical to those of the backend type. Thus, in certain read-only scenarios, an array of floatx variables can be directly passed, with just a pointer typecast, to a routine expecting an array of the backend type. A use case might be a pretty-printing routine, or a writer routine for a custom file format, possibly written in a different language.

The easiest way to compute mathematical functions for floatx is to obtain the result using the backend function and round it into the target floatx type. Of course, that could sporadically introduce results which are not correctly rounded, due to double rounding, where the routine itself is expected to be correctly rounded (e.g., sqrt). When a native floating-point type is represented as a floatx type (in particular, floatx<11,52> represents double, and floatx<8,23> is equivalent to float), the final rounding operation is unnecessary (a no-op) when that machine type is equivalent to the backend one, or it can otherwise be simplified by employing two native data-type conversions. In both cases, the simplification can be triggered at compile time. The former case yields arithmetic performance close or equal to using the native backend type directly (zero overhead achieved by code inlining), including the possibility of automatic vectorization and other optimizations not generally applicable to floatx types. The latter case has not yet been implemented.

Overhead. It is evident from the control flow in Listing 5.3 that the complexity of rounding depends on the value being rounded as well as on the type constraints. Therefore, the performance of the code is inherently dependent on the data itself. For example, the rounding operation from Listing 5.3, fully optimized and inlined after an addition a + b, with the variables declared as floatx<7,12> a and floatx<10,9> b, requires around 80 assembly instructions on an Intel Haswell architecture with the gcc 7.2.1 C++ compiler. However, not all of those instructions are executed in each rounding operation, due to possibly different code paths taken in each case. Furthermore, the alternative code paths make code vectorization almost impossible. Our next experiments provide an evaluation of the practical overhead introduced by floatx when using non-native data types, using the routines for the legacy⁵ basic linear algebra subprograms (BLAS)-1 dot product (DOT) and BLAS-3 general matrix-matrix multiplication (GEMM) converted to C++ and integrated with floatx. The testing machine for all performance experiments presented in this section (except when the target is a GPU) was an Intel Xeon E5-2630 v3 (Haswell) system, running at 2.40 GHz, with the gcc 7.3.0 C++ compiler and a 64-bit GNU/Linux.

Table 5.1 summarizes the performance of the DOT and GEMM operations for various data types. The two input vectors to the DOT kernel are of length n = 10^9, filled with pseudo-random elements in [−1, 1]. The GEMM kernel performs C = 2AB − C with input matrices of order n = 3000, which leads to a total of 2n^3 + O(n^2) floating-point operations. We observe that all floatx types exhibit a similar (but not identical) performance hit compared with the backend type, except when the floatx type is equivalent to it and the optimization described above is in effect. However, when the exponent range is the same as that of one of the backend types (11 bits for double), the rounding procedure from Listing 5.3 is faster than in the other cases. The measurements confirm that the complexity of the rounding operation varies with the data type of the parameters as well as with the data values themselves.

5 Available at http://www.netlib.org/blas (accessed September 2019)

Table 5.1: Runtime performance in MFLOP/s for DOT BLAS-1 and GEMM BLAS-3 operations with various floatx and native types.

Type    T5,10    T10,13   T11,44   T8,23    float     T11,52    double
DOT     128.8    127.7    149.4    132.6    1999.1    1714.8    1715.2
GEMM    125.1    125.7    149.8    128.4    5751.0    2316.8    2652.0

5.2 Numerical analysis of applications

5.2.1 PageRank

The memory-bound PageRank [2] iteratively computes the node scores given the topology of a directed graph, represented as a sparse adjacency matrix of size n × n with z non-zeros, as detailed in Section 2.1.1. Due to the iterative nature of the algorithm and its known numerical stability [82], it is a preferred candidate for computations based on reduced precision floating-point formats. Sparsity-aware implementations (n < z < n^2) reach a time complexity of O(Iz) and a memory complexity of O(z), where I denotes the input-dependent number of iterations until convergence. PageRank is memory-bound since each iteration accesses z matrix entries while performing constant work O(1) per entry. In this section, we demonstrate two features of transprecision computing: First, we empirically demonstrate that the algorithm converges with the same number of iterations and reduces the residual to the same quality level as the reference implementation. Second, we estimate the potential gains of using reduced precision over the baseline.

We aim to use a fixed precision T_{wi,ti} for all operations (loading, storing, and arithmetic computations) in the i-th iteration to reduce the bandwidth bottleneck. Memory access improves through bit-width reductions of the loaded data. Such an approach leads to a faster execution of a single iteration if the underlying hardware supports transprecision. However, due to computing with reduced precision, intermediate results change because of quantization effects. It is non-trivial to predict how these errors propagate through the computations. Additionally, PageRank's iteration count is controlled with a data-dependent residual. If quantization effects cause the algorithm to stay longer inside the iteration loop, the potential execution performance gains per iteration might quickly be lost.

An excursion on forward error bounds

Even though we can analytically compute upper error bounds caused by quantization and propagation effects, we show that they diverge so quickly that they do not provide results of practical interest. For example, for a matrix-vector multiplication with a matrix of size n × n and a vector of dimension n, each output value is produced by a dot product of length n. Since the matrix is sparse, only n_row ∈ [0, n] non-zero entries contribute to the summation. We denote by x̂ and â the exact scalar values of the iteration vector and the system matrix, and we refer to the quantized values as x and a, respectively. We define δ_x and δ_a as the absolute quantization errors of one scalar, i.e., x = x̂ + δ_x and a = â + δ_a. ∆_x and ∆_a refer to the maximal per-scalar quantization errors. For each scalar multiplication we have the following error propagation:

x \cdot a = (\hat{x} + \delta_x) \cdot (\hat{a} + \delta_a) = \hat{x}\hat{a} + \hat{x}\delta_a + \delta_x\hat{a} + \delta_x\delta_a.   (5.1)

Using the per-scalar quantization error bounds |δ_x| ≤ ∆_x and |δ_a| ≤ ∆_a, and exploiting that in PageRank the iteration vector and the system matrix are normalized such that |x| ≤ 1 and |a| ≤ 1 hold (without loss of generality), we obtain the following per-product error bound:

|x \cdot a - \hat{x} \cdot \hat{a}| \le \Delta_x + \Delta_a + \Delta_x \Delta_a.   (5.2)

Equation (5.2) holds for each scalar multiplication. Even without accounting for additional errors introduced by the summation, the per-product bound adds up linearly in the length n_row of the dot product. The iterative nature of PageRank causes the errors to propagate, such that we can upper bound the per-value error with the following recursion:

\Delta_x[i + 1] \le n_{\mathrm{row}} \left( \Delta_x[i] + \Delta_a + \Delta_x[i]\,\Delta_a \right),   (5.3)

where ∆_x[i] denotes the per-value error bound in the i-th iteration. Note that, since the system matrix A is constant, the error bound ∆_a does not grow. Explicitly, the given upper error bound grows at least exponentially in the number of iterations,

\Delta_x[i] \ge (n_{\mathrm{row}})^i \, \Delta_x[0],   (5.4)

if we assume ∆_a = 0 to simplify resolving the recursion. Even for a toy example with at most n_row = 10 entries per row, only 10 iterations, and floating-point 32-bit precision, which results in ∆_x[0] := 2^{-24}, we would obtain an upper bound of ∆_x[10] = 10^{10} · 2^{-24} ≈ 596.05. Since we expect the final result to be normalized and we need the values to be sufficiently distinguishable so as not to change the sorted ranking, upper error bounds would have to be at least four to five orders of magnitude smaller to be of practical interest. This consideration highlights the importance of the methodology of using floatx instead to study the numerical behavior of reduced precision on PageRank.

Transprecision design for PageRank

PageRank is numerically stable [82]. Hence, even in the presence of numerical perturbations, the algorithm converges in all practical test cases. Those conditions allow the algorithm to work with any precision level and to converge to the machine precision imposed by the numerical representation. To that end, we adaptively increase the precision during subsequent iterations. Since precision changes cost n casts for the iteration vector and z casts for the adjacency matrix, we aim to keep the number of precision adaptations low. Hence, we fix the number of precision changes such that T_{i+1} = T_i holds for most consecutive iterations, where T_i denotes the data type used in the i-th iteration. If we choose a constant set of working precisions with a limited cardinality, the total casting overhead amounts to a complexity of O(z), which is small compared to the full algorithm complexity O(Iz). The critical question to be studied is how to adapt the precision and whether overall performance gains are achievable. In other words, the number of iterations achieved in practice with reduced precision computing has to compete with the baseline.

Table 5.2: Impact of Tw,t on the iteration count of PageRank when using the GPU-supported data types.

Data     Ref      #Itr.    T5,10 / T8,23 / T11,52    Cost     Saving
Synth1   24.3     31.1     3.0 – 17.1 – 11.0         20.3     16.4%
Synth2   148.6    150.4    10.7 – 66.8 – 72.8        108.9    26.7%
Real     251.8    251.6    19.3 – 77.8 – 154.5       198.2    21.3%

Table 5.3: Impact of Tw,t on the iteration count of PageRank when using 8 data types, each aligned to a multiple of 8 bits.

Data     #Itr.    T4,3 / T5,10 / T6,17 / T7,24 / T8,31 / T9,38 / T10,45 / T11,52    Cost     Saving
Synth1   51.5     1.0 – 7.1 – 6.4 – 12.9 – 9.9 – 11.2 – 2.0 – 1.0                   28.1     -15.7%
Synth2   157.8    1.0 – 12.6 – 36.5 – 33.2 – 29.7 – 31.5 – 12.3 – 1.0               87.5     36.8%
Real     252.8    1.0 – 19.9 – 50.5 – 28.7 – 31.7 – 31.9 – 20.3 – 68.7              168.6    33.0%

The baseline PageRank stops whenever the relative residual error ξ_i = ||p_{i-1} − p_i||_1 drops below a given threshold ε. To automatically switch precision, we monitor the convergence rate ξ_{i-1}/ξ_i and increase the working precision whenever this rate is slower than a given threshold ρ. The iteration vector p is re-normalized after precision changes to recover the distortions of the invariant caused by the previous (inexact) representation. Our changes allow moving major parts of the computation to lower precision while still achieving the same (strict) precision requirement ε as the baseline.
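The following minimal sketch illustrates this adaptive control loop on a tiny dense example. Reduced precision is emulated with a simple mantissa-truncating helper rather than floatx, and the three-level precision ladder, the damping factor, and the thresholds eps and rho are illustrative assumptions, not the exact configuration behind Tables 5.2 to 5.4.

    // Sketch of the adaptive precision control for PageRank on a tiny dense
    // matrix. Precision is emulated by keeping only t mantissa bits.
    #include <cmath>
    #include <cstdio>
    #include <limits>
    #include <vector>

    double quantize(double x, int t) {               // keep t explicit mantissa bits
        if (x == 0.0 || !std::isfinite(x)) return x;
        int e; double m = std::frexp(x, &e);
        double s = std::ldexp(1.0, t + 1);
        return std::ldexp(std::nearbyint(m * s) / s, e);
    }

    int main() {
        const int n = 3;                              // tiny column-stochastic link matrix
        double A[3][3] = {{0.0, 0.5, 1.0}, {0.5, 0.0, 0.0}, {0.5, 0.5, 0.0}};
        const double d = 0.85, eps = 1e-12, rho = 1.05;

        std::vector<int> mant = {10, 23, 52};         // half -> float -> double mantissas
        size_t level = 0;

        std::vector<double> p(n, 1.0 / n), q(n);
        double prev_res = std::numeric_limits<double>::infinity();

        for (int it = 0; it < 200; ++it) {
            int t = mant[level];
            double res = 0.0, sum = 0.0;
            for (int i = 0; i < n; ++i) {
                double acc = 0.0;
                for (int j = 0; j < n; ++j)
                    acc = quantize(acc + quantize(A[i][j] * p[j], t), t);
                q[i] = quantize(d * acc + (1.0 - d) / n, t);
                res += std::fabs(q[i] - p[i]);        // residual xi = ||p_{i-1} - p_i||_1
                sum += q[i];
            }
            for (int i = 0; i < n; ++i) p[i] = q[i] / sum;   // re-normalize after inexact casts
            std::printf("iter %3d  t = %2d  residual = %.3e\n", it, t, res);
            if (res < eps) break;
            // Increase the working precision when the convergence rate stalls.
            if (prev_res / res < rho && level + 1 < mant.size()) ++level;
            prev_res = res;
        }
        return 0;
    }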

Table 5.4: Measured GPU kernel performance for dense PageRank iterations using floating-point types half, float, and double.

                     half      float     double
Opt. speedup         4×        2×        1×
Measured speedup     3.44×     1.91×     1× (reference)

Transprecision results for PageRank

Table 5.2 shows the average iteration counts for the baseline and the GPU-supported types, whereas Table 5.3 states the results when running with eight formats. Linear interpolation between half and double determines the configurations of w and t used in the 8-bit-aligned intermediate formats. We run on two synthetic datasets (the existence of link pairs is i.i.d. Bernoulli distributed with p1 = 0.01 and p2 = 0.001) and on real graphs extracted from the web [163]. Even in the baseline, input-dependent convergence causes different average iteration counts. Extra iterations caused by the precision control adaptation and by quantization-induced perturbations in the data explain the slightly increased iteration counts of the transprecision versions. Costs and improvements are estimated based on linear bit-width reduction gains per iteration. For execution restricted to the GPU-supported types, gains of 20% are realistic, where higher gains are achieved for slower converging instances. Using a finer granularity for intermediate formats helps to improve the gains (from 21.3% to 33% for real data), where major workloads occur in formats between float and double. Table 5.4 presents measured per-iteration GPU timings obtained for implementations based on half types and the compute unified device architecture (CUDA) intrinsic functions supporting that type. The measurements come close to the postulated optimal gain factors.

5.2.2 BLSTM

BLSTM is an instance of a deep learning model trained to perform the task of optical character recognition, as explained in Section 2.1.2. With this example we demonstrate how transprecision computing applies at three conceptually different levels: First, the model parameters can be compressed to reduce the overall memory footprint; second, the model input can be provided with reduced precision; and third, the actual arithmetic computations can be performed with reduced precision.

Transprecision design for BLSTM

Matrix-vector operations dominate the BLSTM [3]. The BLSTM updates two LSTM cells in a forward and a backward pass, merges the results with a dense layer, and finally predicts with a softmax output layer. Even though LSTMs are evaluated in a recurrent fashion, they are fed with new inputs at each iteration, and we consider them as fully loop-unfolded, plain feed-forward computations. In contrast to PageRank, artifacts caused by numerical quantization errors might propagate to the output and cause wrong results. However, the known error resilience of deep learning methods [3, 164] allows recovering from the imprecision caused by numerical representations. Addressing the granularity of the number format exploration, we decided to study the following aspects separately: inputs, weights, and computation. For the arithmetic computations, we considered the following assumptions: First, all weights belonging to a chunk of data should have the same data type, and second, the output data type of a module has to match the input data type of the next module. Since the LSTM cell passes values recursively through the same module, the stated assumptions essentially result in using one precision for the inputs and outputs of the LSTM cells. Since BLSTM consists of two independent LSTM cells responsible for the forward and backward computations only, we decided to use precision levels that are homogeneous among all LSTM cells. The dense layer that merges the LSTM outputs uses different internal precisions for computations that are performed independently of the operation of the LSTM cells. Since the parameters of the merging dense layer account for less than 5% of the model parameters, we decided to assign them the same precision as used for the majority of all parameters stemming from the LSTM cells.

Transprecision results for BLSTM

The first experiment executes all parts, including the model parameters, the inputs, and all intermediate computations, with a global reduced precision data type Tw,t. Figure 5.3 shows the final achieved accuracy when performing a full grid search among all global configurations Tw,t, using a variable bit-width of w bits to encode the exponent and t bits for the mantissa. The full grid search reveals a sharp transition between good and bad operation. Since the original BLSTM is designed to run with IEEE 754 32-bit float values, increasing range or precision above T8,23 does not alter the result, which explains the plateau between T8,23 and T11,52.


Figure 5.3: Top1 accuracy of BLSTM when operating with type Tw,t. The baseline runs with T8,23, which explains part of the plateau at around 98% accuracy. BLSTM operates close to perfectly even with further reduced formats.

Interestingly, configurations with data types narrower than 32 bits hardly exhibit a smooth transition in performance; instead, a sharp transition to a complete failure of the algorithm is present. We identify T6,6 as an optimal global configuration that operates well. T6,6 is considerably narrower than the T8,23 (float) used to execute the baseline. Table 5.5 presents the accuracies obtained for selected operating points. Operating with a 16-bit format coded as T8,7 [165] outperforms the T5,10 (half) format. Individually reducing the input encoding of the weights (W) or of the image (I) down to T3,1 has a marginal effect on the result. Since simultaneously using T3,1 for weights and images reduces the accuracy, we used T4,1, which enables decent accuracy. Profiling shows that multiply-accumulate (MAC) operations from dot products contribute more than 80% of all executed operations. We suggest using a T5,1 arithmetic for the multiplications and a T5,10 arithmetic for the accumulations. The last configuration shows that replacing the remaining parts with narrow formats allows us to compute with an average bit-width of 12.2 bits and still obtain an accuracy of about 98%.
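The mixed-precision MAC pattern, products in a narrow format and accumulation in a wider one, can be sketched as follows. The quantizer only models the mantissa width (exponent-range effects are ignored), and the chosen bit-widths are illustrative placeholders for the proposed configuration.

    // Sketch of a mixed-precision multiply-accumulate: narrow products, wider
    // accumulator. Precision is emulated by a mantissa-truncating helper.
    #include <cmath>
    #include <cstdio>

    double quantize(double x, int t) {                   // keep t explicit mantissa bits
        if (x == 0.0 || !std::isfinite(x)) return x;
        int e; double m = std::frexp(x, &e);
        double s = std::ldexp(1.0, t + 1);
        return std::ldexp(std::nearbyint(m * s) / s, e);
    }

    // t_mul: mantissa bits of the product format, t_acc: of the accumulator format.
    double dot_mixed(const float* w, const float* x, int n, int t_mul, int t_acc) {
        double acc = 0.0;
        for (int i = 0; i < n; ++i) {
            double prod = quantize(double(w[i]) * double(x[i]), t_mul);
            acc = quantize(acc + prod, t_acc);
        }
        return acc;
    }

    int main() {
        float w[4] = {0.25f, -0.5f, 0.125f, 1.0f};
        float x[4] = {1.0f, 2.0f, -4.0f, 0.5f};
        double exact = 0.25 * 1.0 - 0.5 * 2.0 + 0.125 * -4.0 + 1.0 * 0.5;
        std::printf("mixed MAC: %g (exact %g)\n", dot_mixed(w, x, 4, 2, 10), exact);
        return 0;
    }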

5.2.3 GLQ

The GLQ numerically computes the integral over a function. Due to its inherently approximate nature, even when running with full precision, the quality of the integration routine improves when splitting the integration domain into more intervals or when employing a higher-order polynomial during the GLQ routine.

Transprecision design for GLQ

The GLQ kernel offers different options to introduce reduced precision data types. Since our evaluation is based on the Genz functions [103], for which we can compute the analytical solution to evaluate the numerical quality, we do not want to over-tune the solutions for those specific functions.

Table 5.5: Impact of reduced precision on BLSTM Top1 accuracy: Reference results, global scans, input quantization effects, and proposed configurations.

Setting                                Weights        Img.¹    Operations                                       Accuracy
Ref [3]                                T8,23          T8,23    100% in T8,23                                    98.2337%
Ref [3], Fig. 6                        16-bit fixed            see [3] IV-C                                     97.9794%
Ref [3], Fig. 6                        5-bit fixed             see [3] IV-C                                     97.5821%
T5,10 (half)                           T5,10          T5,10    100% in T5,10                                    21.4392%
T6,6, see Figure 5.3                   T6,6           T6,6     100% in T6,6                                     98.0536%
T8,7, see [165]                        T8,7           T8,7     100% in T8,7                                     98.1890%
Quantized W                            T3,1           T8,23    100% in T8,23                                    98.0692%
Quantized I                            T8,23          T3,1     100% in T8,23                                    98.1181%
Quantized W&I                          T3,1           T3,1     100% in T8,23                                    96.7730%
Quantized W&I                          T4,1           T4,1     100% in T8,23                                    98.1660%
Modified MAC (avg. width 15.7-bit)     T4,1           T4,1     40.7% in T5,2, 40.7% in T5,10, 18.6% in T8,23    98.1905%
Proposed (avg. width 12.2-bit)         T4,1           T4,1     40.7% in T5,2, 40.7% in T5,10, 18.6% in T6,6     97.9969%

¹ data type applied to the input images

Table 5.6: Errors of GLQ for IEEE 754 standard formats and the 16Alt format.

Genz function    double T11,52    float T8,23    half T5,10    16Alt T8,7
1                7.0e-16          2.6e-07        1.0e-02       1.5e-01
2                3.3e-16          1.5e-07        1.5e-02       1.3e-01
3                4.3e-16          2.6e-07        1.5e-02       1.8e-01
4                4.6e-16          4.1e-07        2.0e-02       1.1e-01
5                5.2e-14          1.6e-07        2.1e-02       2.5e-01
6                4.2e-07          4.8e-07        8.1e-03       1.0e-01

That is the reason why we have decided to use one single data type throughout, including all coefficients of the GLQ, all input function parameters, the function input argument, the output of each function evaluation, all internal variables used to perform the summation and the splitting of the interval borders, and the final result. Since GLQ computes a numerical integration and we know the analytical solutions in the test setup, we characterize the normal numerical behavior of the GLQ kernel based on primary parameters, such as the order of the GLQ kernel and the number of subintervals the original domain is split into. We argue that if transprecision can deliver results of similar quality as in the floating-point case, which has an error inherently present due to the approximate nature of the integration, then both solutions achieve similar quality. In other words, the reduced precision result is required to be close to the analytically computed exact value, and not to the full precision GLQ-approximated value.
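A minimal sketch of this single-type policy is given below. It assumes a mantissa-truncating quantizer as a stand-in for floatx, a composite 2-point rule (far below the order used in our experiments), and a simple oscillatory test integrand; it only illustrates how every coefficient, argument, function value, and partial sum is kept in the same reduced precision.

    // Composite 2-point Gauss-Legendre quadrature with all intermediates
    // quantized to a single reduced-precision (t mantissa bits) type.
    #include <cmath>
    #include <cstdio>

    double quantize(double x, int t) {
        if (x == 0.0 || !std::isfinite(x)) return x;
        int e; double m = std::frexp(x, &e);
        double s = std::ldexp(1.0, t + 1);
        return std::ldexp(std::nearbyint(m * s) / s, e);
    }

    template <class F>
    double glq2(F f, double a, double b, int NI, int t) {
        const double node = quantize(1.0 / std::sqrt(3.0), t);   // 2-point nodes +/- 1/sqrt(3)
        double sum = 0.0;
        double h = quantize((b - a) / NI, t);
        for (int i = 0; i < NI; ++i) {
            double c  = quantize(a + (i + 0.5) * h, t);          // subinterval midpoint
            double x0 = quantize(c - 0.5 * h * node, t);
            double x1 = quantize(c + 0.5 * h * node, t);
            double part = quantize(0.5 * h * (quantize(f(x0), t) + quantize(f(x1), t)), t);
            sum = quantize(sum + part, t);
        }
        return sum;
    }

    int main() {
        auto f = [](double x) { return std::cos(10.0 * x); };    // oscillatory test integrand
        double exact = std::sin(10.0) / 10.0;                    // integral over [0, 1]
        for (int t : {10, 23, 52})                               // half / float / double mantissas
            std::printf("t = %2d  abs. error = %.2e\n", t,
                        std::fabs(glq2(f, 0.0, 1.0, 64, t) - exact));
        return 0;
    }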

Transprecision results for GLQ

In contrast to PageRank and BLSTM, we can test the quality of the GLQ routine with test functions that are constructed such that the solution is given by an analytical expression. Since we know the exact solution, we do not compare reduced precision data types against the full precision code variant; rather, we compare all obtained results of the GLQ routine against the exact solution.


Figure 5.4: Global effect on GLQ when using the reduced precision data type Tw,t. Shown is the mean of the absolute error obtained over ten runs for each of the six Genz functions (panels (a) to (f)). Gray colors indicate infinity or NaN results in at least one test case. GLQ does not rely on large dynamic ranges; the results stabilize when using four or more exponent bits.

Two parameters of the GLQ kernel affect the performance and quality trade-off in our setting: the number of intervals NI the integration domain is split into, and the number of support points NO used in the core GLQ routine. Since the GLQ kernel is implemented with two nested control loops that perform work proportional to the defined parameters, the total time complexity is of order O(NI · NO). We evaluate the numerical effects of transprecision computing for the conservative configuration of control parameters NI = NO = 1024.

Table 5.6 states the obtained errors for the IEEE 754 formats and the proposed 16Alt format [154]. Errors are measured against the analytical solution as the absolute error, expressed as the base-ten logarithm of the average obtained when running with ten uniformly sampled function parameter instances. The discontinuity in Genz function 6 challenges the GLQ routine, and the final precision is limited by how the intervals are split relative to the position of the discontinuity, which explains the worse result compared to the other Genz test functions in the double precision case. However, when computing with 32-bit float, the machine precision of about 6 decimal digits is achieved in all test cases. Note that the challenging Genz function 6 performs similarly to the other functions. In the GLQ kernel, the IEEE 754 half format outperforms the 16Alt format because the internal routines do not require an extended dynamic range; hence the reduced number of mantissa bits of 16Alt compared to half harms the results.

Figure 5.4 shows the global behavior of the result when performing all computations with a specific format. As observed in Table 5.6, the discontinuity of Genz test function 6 limits the final accuracy independently of the precise choice of the number format. Gray colored fields indicate that at least one of the performed computations either triggered an overflow to infinity or a NaN. Very small formats tend to behave that way, independently of the function used to test the behavior of the GLQ routine. Again, Genz test function 6 constitutes the strictest test case. As postulated, increasing the number of exponent bits increases the dynamic range and keeps more of the computations in the valid range. In the GLQ kernel, increasing the exponent beyond four bits does not further increase the precision of the result. Increasing the number of mantissa bits increases the obtained accuracy, and that holds over the full range of exponent encodings. Note, as a corner case, that even when the exponent is one single bit, increasing the mantissa enlarges the dynamic range due to subnormally encoded numbers, which recovers the computation from infinities in some cases.

5.3 Summary and conclusion

In the first part of this chapter, we design and implement floatx to emulate reduced precision floating-point arithmetic and data types. Floatx features the following:

• Programming interface. Floatx is easy to use and minimally intrusive to facilitate an incremental transformation of numerical applications.

• Performance. Floatx relies on hardware-supported floating-point types in the backend to preserve efficiency. Furthermore, it incurs no storage overhead by maintaining the size of the emulated data type shorter than (or equal to) that of the backend data type.

• Expected semantics. The arithmetic in floatx adheres to the round-to-nearest, ties-to-even rule of the IEEE 754 standard whenever feasible, and the interoperability of variables follows the data type casting conventions of C++.

In the second part of the chapter, we apply numerical evaluations with floatx to showcase transprecision computing. We achieved the following results:

• PageRank. The iterative nature of the kernel allows adaptively changing the precision as the algorithm converges towards the result. We demonstrate that the residual can be reduced to the same quality level without increasing the total number of iterations. That way, with linear bit-width models, we expect above 30% performance gain from transprecision computing without quality loss in the final result.

• BLSTM. Even though neural network based approaches potentially propagate errors towards the output and might degrade the accuracy of the model, their inherent error resilience allows for coarse-grained quantization with only minor degradation of the final quality. In the case of BLSTM, computing with an average of 12.2 bits achieves an accuracy similar to the 32-bit reference precision. The different causes of quantization error can be tuned independently; we demonstrate how we can compress the input weights down to 6 bits while achieving an accuracy similar to the reference.

• GLQ. The numerical nature of the GLQ kernel causes different approximation errors, depending on the parametric settings, even in the reference computed with full double 64-bit precision. Our transprecision analysis demonstrates a smooth error behavior. Since most intermediate results are normalized, as few as 4 exponent bits are enough to compute results of reference quality.

In this chapter, we developed the floatx library to perform numerical studies on applications. The improvements on three kernels empirically demonstrate the success of transprecision computing by achieving accurate results with reduced bit-widths. The floatx library is self-contained and decouples the complexity away from the applications under consideration. The modular approach allows reusing floatx and enables numerical studies in any application. We judge our findings as a strong indication that similar improvements are achievable for iterative solvers, stencil-based computations, weather, physics, or mechanical simulations, among many more. A detailed numerical study opens the opportunity to identify overly conservative segments of algorithms or applications. Those insights help to map current applications onto systems that support multiple precision levels with controlled quality. Additionally, fine-grained emulations allow further improving the next generation of computing systems. In the next chapter, we extend the results to the domain of deep learning, focusing on scalability, ease of use, and generalization of transprecision computing.

Chapter 6

Floatx for deep learning

In Chapter 5 we presented floatx, a C++ header-only library that allows emulating reduced precision formats. In this chapter, we explain concepts and technical details on how we integrate floatx into PyTorch, a common deep learning framework. The main purpose of the integration of floatx into PyTorch [62] is to enable numerical experimentation. Flexibility, re-usability, and modularity of the code are important design considerations, since deep learning is a fast-moving field and similar experiments should be repeatable on novel, third-party-defined networks. At this point, we anticipate that the natural error resilience of neural networks allows trimming the numerical representations. We are interested in understanding the numerical behavior and the mapping between the used number formats and the achieved accuracy of specific models. This chapter provides modular building blocks that are further used in the next chapter to draw conclusions on how the trade-offs achieved with transprecision compare against strong baseline methods obtained by solving the NAS problem for specific constraints. Integrating floatx into deep learning enables experiments to demonstrate and to seek opportunities including model weight compression, activation compression to reduce transfer times, and experimentation with the numerical behavior of specialized computing devices featuring reduced precision.


6.1 Integrating TP into deep learning

Before discussing the internals of the floatx integration, we define which aspects matter when applying transprecision computing. Table 6.1 defines the key aspects, with options that conceptually characterize how a reduced precision solution operates. We define three use cases of transprecision computing.

• First, the case TP at inference introduces transprecision at inference for a trained model to achieve better execution performance.

• Second, the case Training accounting for TP modifies the training routine to account for the precision changes at inference and to optimize the model with respect to how transprecision computing affects intermediate results.

• Third, the case TP at training performs the full training with reduced precision in order to accelerate the training routine itself.

Transprecision computing occurs in two modes, extrinsic and intrinsic. Similar to reference work [166], we define extrinsic quantization as applying quantization to full tensors at the input and output level of operations. Namely, this includes weight compression and activation map compression. Intrinsic quantization refers to the situation where quantization is applied after each scalar operation of a specific kernel implementation. Extrinsic quantization is simple, modular, and efficiently exploitable by implementing the quantization operator and reusing the existing framework operations. An intrinsic implementation, however, requires reimplementing the low-level kernels to have full control over all arithmetic precision aspects. Finally, the available hardware at runtime determines whether the considered transprecision computations are emulated or run natively. In this section, we focus on integrating floatx as an emulation into code that runs on traditional hardware without reduced precision support. We define a model as a directed acyclic graph (DAG) M = (V, E), where V is the set of nodes defining the kernels and E is the set of edges that defines the flow of tensor data between kernels.

Table 6.1: Concepts to integrate TP into deep learning

case
• TP at inference: TP features are used when the model is deployed during inference. How the original model was trained is not relevant; it might be provided by a third party or trained with regular arithmetic.
• Training accounting for TP: TP features are integrated to affect the forward computations of the compute graph; however, the training routine accounts for the reduced precision. Requires a specialized training routine.
• TP at training: TP features are integrated into the forward and backward nodes of the compute graph to accelerate the training.

mode
• extrinsic: Quantization at the tensor level of the inputs and outputs of kernels. Reuses natively implemented kernels at full precision.
• intrinsic: Emulates quantization at all arithmetic levels. Requires reimplementing kernels to have full precision control of the internal arithmetic operations.

runtime
• emulated: TP behaviour is not supported on the current HW. The emulation evaluates the numerical aspects of the reduced precision proposal.
• real HW: The HW supports TP features and achieves measurable execution performance gains.

Each vertex v ∈ V defines a kernel operation that is characterized by a list of inputs, internal trainable weights, and an output. Typical kernels are element-wise activation functions, elementary operations (such as addition, subtraction, element-wise multiplication, ...), dense layers (resulting in a matrix multiplication between inputs and weights), convolutional layers, as well as reshaping, shuffling, indexing, cropping, merging, and stacking operations, among more specialized kernels used in recent models. Next, we define the quantization operator Qw,t(.) as element-wise casting of values to a reduced precision type Tw,t with w exponent bits and t mantissa bits.

Table 6.2: Introducing TP into the inference compute graph.

• Weight compression (TP-mode: extrinsic; Impl.¹: Qs,t). Compresses the memory consumption of the trainable parameters.

• Activation compression (TP-mode: extrinsic; Impl.¹: Qs,t). Compresses the dataflow between kernels. Allows reducing local transfer times, improves cache efficiency, and shortens communication times in distributed systems.

• TP arithmetic (TP-mode: intrinsic, extrinsic; Impl.¹: Qs,t, operation-specific). All benefits as above, plus gains due to customized low precision arithmetic.

¹ Implementation requirements. Qs,t states that the parameterized quantization operator serves that case.

The quantization operator is applied to data in the natively stored format, and the output is returned in the native format, namely the standard floating-point 32-bit format. Table 6.2 shows the available options to modify the compute graph of a model in order to obtain transprecision variants thereof. Transprecision applies at three levels: by compressing weights, by compressing activations, and by changing the arithmetic in the computing units. The extrinsic approach, with the implementation of Qw,t(.), covers the first two use cases. However, the third use case requires a closer look at the requirements of the numerical evaluation framework. To keep the design simple and clean, extrinsic evaluations are modular and lead to efficiently running emulations. However, they might not fully adequately emulate the final bit-true behavior of the target system. The only way to enforce deterministic and bit-true emulations is to rewrite the kernels with customized code. Since that involves writing, debugging, and maintaining a large amount of code and adds computational overhead, we favor emulations that are performed with the extrinsic approach. In the following, we study the conditions that guarantee that the intrinsic and the extrinsic approach deliver bit-true equivalent results.
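A hedged sketch of such an element-wise quantization operator is shown below: values remain stored in the native format, but each one is rounded (to nearest, ties to even) as if it had w exponent and t mantissa bits, including subnormals and overflow to infinity. It is an illustrative stand-in, not the actual floatx or PyTorch integration code.

    // Element-wise Q_{w,t}: round a native value as if it were stored in a
    // format with w exponent bits and t mantissa bits, then return it natively.
    #include <cmath>
    #include <cstdio>
    #include <limits>
    #include <vector>

    double quantize_wt(double x, int w, int t) {
        if (x == 0.0 || !std::isfinite(x)) return x;
        const int bias = (1 << (w - 1)) - 1;
        const int emin = 1 - bias;                       // smallest normal exponent
        const int emax = bias;                           // largest normal exponent
        int e; std::frexp(std::fabs(x), &e);
        e -= 1;                                          // |x| = m * 2^e with m in [1, 2)
        int scale = (e < emin) ? emin : e;               // subnormals share the emin spacing
        double ulp = std::ldexp(1.0, scale - t);         // spacing of representable values
        double r = std::nearbyint(std::fabs(x) / ulp) * ulp;   // round to nearest, ties to even
        double maxval = std::ldexp(2.0 - std::ldexp(1.0, -t), emax);
        if (r > maxval) r = std::numeric_limits<double>::infinity();   // overflow
        return std::copysign(r, x);
    }

    // Extrinsic use: quantize a whole tensor at a kernel boundary.
    void quantize_tensor(std::vector<float>& v, int w, int t) {
        for (float& x : v) x = static_cast<float>(quantize_wt(x, w, t));
    }

    int main() {
        std::vector<float> act = {0.1f, 3.14159f, -1e-6f, 7000.0f};
        quantize_tensor(act, 5, 10);                     // emulate T5,10 (half-like)
        for (float x : act) std::printf("%.6g\n", x);
        return 0;
    }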

6.1.1 Arithmetic-free and elementary kernels

We define the operation-internal DAG as the input-to-output scalar-level dependency graph of a specific implementation of a kernel. We refer to the internal chain length as the number of arithmetic operations that occur on the longest input-to-output path. We claim equivalence of the intrinsic and the extrinsic approach if, and only if, the internal chain length is zero or one. Into the class of zero internal chain length operations fall kernels that do not perform arithmetic operations on the data; typical instances are reshaping, concatenation, and shuffling operations. Unit chain length kernels do not rely on internal intermediate results, hence the extrinsic quantization provides enough control over the precision; typical instances are all element-wise operations. In contrast, the key kernel operations such as convolutions and dense layers have an internal chain length that is larger than one, leading to intermediate results whose precision cannot be controlled with an extrinsic approach. On those operations, sequence ordering and double-rounding effects might cause different results between the intrinsic and the extrinsic approach.

6.1.2 High precision accumulator assumption

To study and discuss the difference between the intrinsic and the extrinsic approach, we perform the full analysis on the dot product kernel. We define the dot product of two one-dimensional tensors w and x of length n as follows:

\mathrm{DOT}(w, x) = \sum_{i=1}^{n} w_i \cdot x_i.   (6.1)

Using the extrinsic approach with customized precision data types for the weights, input, and output (named T_W, T_{IN}, and T_{OUT}, respectively) leads to the following:

Q_O(\mathrm{DOT}(Q_W(w), Q_I(x))) = Q_O\!\left( \sum_{i=1}^{n} Q_W(w_i) \cdot Q_I(x_i) \right).   (6.2)

In (6.2), the summation stems from the existing kernel and is performed in the native full precision format. In contrast, when using the intrinsic approach, full control of all computing aspects is left to the developer. Equations (6.3), (6.4), and (6.5) define a fully specified transprecision implementation with an intermediate precision T_U used for the multiplier output and an accumulation precision T_V used in the summation:

\forall i \in [1, n]: \quad t_i = Q_U\!\left( Q_W(w_i) \cdot Q_{IN}(x_i) \right),   (6.3)

\mathrm{SUM\_TP}(t_1, \dots, t_n) = Q_V\!\left( \dots \left( Q_V(t_1 + t_2) + t_3 \right) + \dots + t_n \right),   (6.4)

\mathrm{DOT\_TP}(w, x) = Q_O\!\left( \mathrm{SUM\_TP}(t_1, t_2, t_3, \dots, t_n) \right).   (6.5)

Equations (6.3) to (6.5) are equivalent to the extrinsic approach if the intermediate precisions T_U and T_V are at full precision T8,23 and the summation sequence in Equation (6.2) matches the ordering in Equation (6.4). To guarantee deterministic and bit-true equivalence between the emulation and the potential real operation on the target device, low-level implementation details such as the summation order have to be fixed, since floating-point arithmetic does not obey associativity laws due to double rounding effects. Note that, even when requiring intermediate values at full precision, all multipliers can be optimized since they operate with reduced precision inputs. In some implementations, it is affordable to assume the presence of a full precision accumulation unit. With full precision accumulation and the same sequence order as in the baseline, the intrinsic emulation approach delivers results equivalent to the extrinsic implementation. However, we favour the extrinsic over the intrinsic evaluation due to its implementation simplicity and efficiency of evaluation.

Even though the dot product kernel is rarely used on its own, we argue that the same fundamental considerations generalize to more kernels, especially dense layers and convolutional layers. First, we know that the matrix-vector or the matrix-matrix operation can be written as independent scalar products of input rows and columns. This argument allows directly applying the numerical discussion from above. The traditional dense (sometimes called linear) layer is implemented as y := xW^T + b, which has an optional additive bias term; the above discussion applies by extending the summation with the bias term. Similar considerations apply to convolutional layers. In a direct implementation, each scalar value of the output map is obtained as a summation over the products of the filter coefficients multiplied by the actual input values. Moving the filtering masks to different positions results in independent computations that follow the same pattern. Note that the typical parameters of a 2D convolutional filter, such as different strides, padding, dilation, or even grouping factors, affect how input pixels are selected, but the underlying computations remain a sum-of-products pattern. The high precision accumulator assumption adequately covers the case where vectorized instructions extend the instruction set of a general-purpose processing unit. For example, 4- or 2-way reduced precision vectorized inputs fit into a single 32-bit data word. Supporting an additional scalar full-precision accumulation is enough to efficiently implement dense and convolutional layers.
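The difference between (6.2) and (6.3) to (6.5) can be made concrete with a small sketch. The same mantissa-truncating helper as in the earlier sketches stands in for the Q operators, and the bit-widths are illustrative; with full precision intermediates the two variants coincide, as argued above.

    // Extrinsic dot product (6.2) versus fully specified intrinsic one (6.3)-(6.5).
    #include <cmath>
    #include <cstdio>
    #include <vector>

    double q(double x, int t) {                        // round to t explicit mantissa bits
        if (x == 0.0 || !std::isfinite(x)) return x;
        int e; double m = std::frexp(x, &e);
        double s = std::ldexp(1.0, t + 1);
        return std::ldexp(std::nearbyint(m * s) / s, e);
    }

    // Extrinsic: quantize the operands, sum with the native full precision kernel.
    double dot_extrinsic(const std::vector<double>& w, const std::vector<double>& x,
                         int tW, int tI, int tO) {
        double acc = 0.0;
        for (size_t i = 0; i < w.size(); ++i) acc += q(w[i], tW) * q(x[i], tI);
        return q(acc, tO);
    }

    // Intrinsic: additionally quantize every product (T_U) and every partial sum (T_V).
    double dot_intrinsic(const std::vector<double>& w, const std::vector<double>& x,
                         int tW, int tI, int tU, int tV, int tO) {
        double acc = 0.0;
        for (size_t i = 0; i < w.size(); ++i) {
            double ti = q(q(w[i], tW) * q(x[i], tI), tU);
            acc = q(acc + ti, tV);
        }
        return q(acc, tO);
    }

    int main() {
        std::vector<double> w = {0.3, -1.2, 0.7, 2.5}, x = {1.1, 0.4, -0.9, 0.05};
        std::printf("extrinsic:                     %.10f\n", dot_extrinsic(w, x, 4, 4, 10));
        // With full precision intermediates (52 mantissa bits) both variants coincide.
        std::printf("intrinsic (full-prec. accum.): %.10f\n", dot_intrinsic(w, x, 4, 4, 52, 52, 10));
        std::printf("intrinsic (narrow accum.):     %.10f\n", dot_intrinsic(w, x, 4, 4, 4, 6, 10));
        return 0;
    }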

6.1.3 The intrinsic versus extrinsic approach

Even though the intrinsic and the extrinsic approaches are fundamentally different, we argue that the extrinsic behaviour is enough to understand the fundamental numerics of deep learning models. To that end, we provide an example to understand the theoretical error bounds and the statistical influence of quantization errors.

R = x_1 + x_2 + x_3 + \dots + x_n
R_{\mathrm{extrinsic}} = Q_I(x_1) + Q_I(x_2) + \dots + Q_I(x_n)
R_{\mathrm{intrinsic}} = Q_A(\dots Q_A(Q_A(Q_I(x_1)) + Q_I(x_2)) + \dots + Q_I(x_n))   (6.6)
r = Q_O(R)
r_{\mathrm{extrinsic}} = Q_O(R_{\mathrm{extrinsic}})
r_{\mathrm{intrinsic}} = Q_O(R_{\mathrm{intrinsic}})

In the provided example, we compute the sum R over n different values x_i for i ∈ [1, n]. We assume three levels of precision, Q_I, Q_O, and Q_A, that define the data types used for the input, the output, and the accumulation, respectively. R, R_extrinsic, and R_intrinsic denote the exact result and the extrinsic and intrinsic intermediate results in full and accumulation precision. The final results r, r_extrinsic, and r_intrinsic are obtained after quantization of the intermediate results into the final output representation. We define the quantization error of a value x as δ := x − Q(x). We obtain the maximum type-dependent quantization error ∆ := max_{x∈R} |x − Q(x)| by maximizing δ over the worst occurring combination. Each quantization error has an upper bound |δ| ≤ ∆. ∆_I, ∆_O, and ∆_A state the input, output, and accumulation quantization error bounds. If we assume that all double-rounding effects occur to the worst possible extent, then we obtain trivial upper bounds on the errors:

|R - r| < \Delta_O,
|R - r_{\mathrm{extrinsic}}| < n\,\Delta_I + \Delta_O,   (6.7)
|R - r_{\mathrm{intrinsic}}| < n\,(\Delta_I + \Delta_A) + \Delta_O.

The error bounds in (6.7) demonstrate that the effects of the output representation are one complexity class lower than the effects stemming from the input or accumulation precision. Since at some point layers are evaluated in sequence, we assume the same data types for input and output, i.e., that ∆_I = ∆_O holds. We conclude that the extrinsic approach is dominated by the input quantization level and is in the order of O(n ∆_I), while the intrinsic approach is in the order of O(n (∆_I + ∆_A)). Assuming full precision accumulation as in Section 6.1.2, we have ∆_A = 0, or at least ∆_A << ∆_I, and the intrinsic and extrinsic error bounds coincide. The error formulation of the intrinsic approach demonstrates that adding guard bits to increase the precision of the accumulation helps to reduce the error to a level where it is solely determined by the input error, which is already reflected by the extrinsic approach. Since adding one additional bit to the mantissa field halves the maximal quantization error, the total error is dominated by O(n ∆_I (1 + 1/2^g)), where we express the accumulation quantization error relative to the input quantization error as ∆_A := ∆_I / 2^g, with g denoting the number of guard bits used in addition to the input data type.

More importantly, even if we do not use any additional guard bits (∆_I = ∆_A), we can upper bound the error of the intrinsic approach to be at most twice as large as the error of the extrinsic approach:

\Delta_I < \Delta_I \left( 1 + \frac{1}{2^g} \right) \le \hat{\Delta}_I := 2\,\Delta_I.   (6.8)

We realize that twice the original error is equivalent to performing the original extrinsic numerical study using one bit less in the mantissa field. We conclude that if we understand the numerical behavior for a full grid search of input quantization levels that scale exponentially in the number of mantissa bits (∆_I ∝ 2^{-t}), we encapsulate the error bounds of the intrinsic case between two anchor results obtained at the finest resolution of the extrinsic analysis. This statement motivates us and other researchers to perform the error analysis in a modular, simple, and compute-efficient way. We are fully aware of the limitations of the above considerations. First, the error bounds grow linearly in the chain length, causing the bound to grow beyond all limits of practical interest, especially if we consider tiny formats where the initial quantization error is already large. In contrast, the very same argument favors extensive empirical numeric experiments to understand the effects of the introduced quantization at the application level. Second, even if we understand the growth of the error bound, the quantization errors achieved in practice are more optimistic, since individual errors might cancel each other.

To complete the comparison between intrinsic and extrinsic quantization, we add a synthetically generated experiment that validates our reasoning. To that end, we randomly sample n = 10 values and evaluate all variants of (6.6). For each result, we explicitly evaluate the obtained absolute errors as given in (6.7). We repeat the experiment 10,000 times to estimate the probability density function of the achieved errors. As inputs, we use independent and identically distributed samples following a normal Gaussian distribution. Figure 6.1 shows the input and output distributions. In our experiments, we used the same quantization steps Q_I = Q_O = Q_A, based on the types T3,5 and T3,4, where the latter is one bit less accurate in the mantissa bit-width. Figure 6.2 shows the probability density functions of the achieved errors for both cases.
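A compact sketch of this synthetic experiment is given below; it reports mean and maximum absolute errors instead of full densities, and the mantissa-truncating helper again stands in for the exact quantizers used in Figures 6.1 to 6.3.

    // Sample n = 10 Gaussian values, evaluate the extrinsic and intrinsic
    // variants of (6.6) with Q_I = Q_O = Q_A, and report error statistics.
    #include <cmath>
    #include <cstdio>
    #include <random>

    double q(double x, int t) {
        if (x == 0.0 || !std::isfinite(x)) return x;
        int e; double m = std::frexp(x, &e);
        double s = std::ldexp(1.0, t + 1);
        return std::ldexp(std::nearbyint(m * s) / s, e);
    }

    int main() {
        const int n = 10, trials = 10000;
        std::mt19937 gen(42);
        std::normal_distribution<double> dist(0.0, 1.0);

        for (int t : {5, 4}) {                                   // T3,5 versus T3,4 mantissas
            double mean_ext = 0, mean_int = 0, max_ext = 0, max_int = 0;
            for (int k = 0; k < trials; ++k) {
                double R = 0, Rext = 0, Rint = 0;
                for (int i = 0; i < n; ++i) {
                    double x = dist(gen);
                    R += x;
                    Rext += q(x, t);                              // extrinsic: exact accumulation
                    Rint = q(Rint + q(x, t), t);                  // intrinsic: quantized accumulation
                }
                double eext = std::fabs(R - q(Rext, t));
                double eint = std::fabs(R - q(Rint, t));
                mean_ext += eext / trials; mean_int += eint / trials;
                max_ext = std::fmax(max_ext, eext); max_int = std::fmax(max_int, eint);
            }
            std::printf("t=%d  extrinsic: mean %.3e max %.3e | intrinsic: mean %.3e max %.3e\n",
                        t, mean_ext, max_ext, mean_int, max_int);
        }
        return 0;
    }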


Figure 6.1: Input and output distribution of values. The input is randomly generated by drawing 10’000 samples i.i.d. following a normal distribution.

As expected, the error behavior degrades as more double-rounding occurs, when comparing the effects of the output quantization, the extrinsic, and the intrinsic approach. Comparing the top and bottom of Figure 6.2, we observe that adding one additional bit to the representations improves the error patterns by a factor of two. As claimed in (6.8), the upper bound of the error for the intrinsic approach with zero guard bits is equivalent to the extrinsic approach with a ∆_I twice as large as its original value, in other words, using a data type with one bit less in its mantissa representation. Even though the relations in (6.8) only concern the upper bounds, the empirically obtained results depicted in Figure 6.3 demonstrate that the error patterns coincide, such that the relation is well maintained including all double-rounding effects measured on data, without relying on potentially overly pessimistic upper bounds. This insight suggests using extrinsic evaluations to understand the numerical behavior of deep learning models.


Figure 6.2: Distribution of absolute errors of the intrinsic, extrinsic, and output quantization approach using the same quantization levels QI = QO = QA, left for the data type T3,5 and right for the data type T3,4 with one bit less precision in the mantissa field width. 114 CHAPTER 6. FLOATX FOR DEEP LEARNING


Figure 6.3: The error distribution of the intrinsic approach with input quantization and accumulation quantization of data type T3,5 coincides with the extrinsic quantization approach of a one bit less precise evaluation of type T3,4.

6.2 Numerical analysis of deep learning models

This section presents the strengths of transprecision computing by demonstrating that a) the concepts are general, b) they can be easily integrated into new models, and c) the concepts are scalable. Even though we demonstrated in Section 5.2.2 that reduced precision representations achieve good results, those initial results were obtained by manually introducing the concepts into a reference C implementation of a deep learning model. Even though that intrinsic treatment allowed an in-depth study with full control of the low-level details, it still requires manual working steps. In this section, we aim to demonstrate the generality of transprecision computing by applying it to many well-established reference models.

Table 6.3: Established reference network architectures.

Family         Variant                   max batch size   Instances
ResNet         18, 34, 50, 101, 152      1024-128         5
PreActResNet   18, 34, 50, 101, 152      1024-128         5
ResNeXt29      2x64d, 4x64d, 32x4d       256-64           3
DenseNet       121, 161, 169, 201        256-128          4
LeNet          -                         1024             1
GoogLeNet      -                         256              1
MobileNet      -                         1024             1
MobileNetv2    -                         512              1
PNASNet        Type A, Type B            1024, 512        2
DPN            26, 92                    512, 128         2
SENet18        -                         1024             1
VGG            11, 13, 16, 19            1024             4
Total                                                     30

We directly operate on the computational graph to avoid having to manually write or extend crucial sections of the source code for new models. This pragmatic decision suggests using the extrinsic evaluation policy to avoid writing low-level intrinsic code. To that end, we wrote a utility that traverses any computational graph and introduces a parameterized input quantization step for each input of all kernels of the model. We decided to simultaneously study the weight and activation compression of 30 well-established reference models.
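The rewriting pass can be sketched in a language-agnostic way as follows; the real utility operates on PyTorch compute graphs, and the Node and Graph structures here are purely illustrative.

    // Walk the kernel graph M = (V, E) and insert a parameterized quantization
    // node in front of every kernel input.
    #include <cstdio>
    #include <memory>
    #include <string>
    #include <vector>

    struct Node {
        std::string op;                       // e.g. "conv", "relu", "quantize(w,t)"
        std::vector<Node*> inputs;            // edges of the DAG
    };

    struct Graph {
        std::vector<std::unique_ptr<Node>> nodes;
        Node* add(std::string op, std::vector<Node*> in = {}) {
            nodes.push_back(std::make_unique<Node>(Node{std::move(op), std::move(in)}));
            return nodes.back().get();
        }
    };

    // Insert Q_{w,t} in front of every input of every existing kernel node.
    void insert_input_quantization(Graph& g, int w, int t) {
        std::string qop = "quantize(" + std::to_string(w) + "," + std::to_string(t) + ")";
        size_t original = g.nodes.size();                  // do not re-quantize new nodes
        for (size_t i = 0; i < original; ++i)
            for (Node*& in : g.nodes[i]->inputs)
                in = g.add(qop, {in});
    }

    int main() {
        Graph g;
        Node* x = g.add("input");
        Node* wgt = g.add("weights");
        Node* conv = g.add("conv", {x, wgt});
        g.add("relu", {conv});

        insert_input_quantization(g, 5, 10);               // emulate a global T5,10

        for (const auto& n : g.nodes) {
            std::printf("%s <-", n->op.c_str());
            for (Node* in : n->inputs) std::printf(" %s", in->op.c_str());
            std::printf("\n");
        }
        return 0;
    }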

6.2.1 Reference models

We decided to use a predefined list of 30 well-established network architectures and report the achieved accuracies on CIFAR10. Table 6.3 summarizes the network models we consider as established reference models. Most of them are provided with family-specific, topology-related hyper-parameters, such as, for example, the visual geometry group (VGG) networks or the ResNets, where the parameter refers to the total number of layers and controls the overall complexity of the model. Figure 6.4 shows averaged normalized execution times for one batch of 128 images for training and testing.


Figure 6.4: Average timings per batch of size 128 for training and testing of the reference models. Timings span from 30 ms up to 480 ms per training batch.

Timings are obtained with PyTorch version 0.4.1, running on an IBM Power8 system equipped with a P100 GPU. Timings are measured in a realistic setting, i.e., including the overheads of loading and transferring data between CPU and GPU as well as kernel launch overheads. The training time of one batch includes the operations caused by backpropagation and the weight update, while the testing time refers to the elapsed time the model requires to perform a batched forward inference. The fastest model considered is LeNet, which trains within 30 ms per batch; this is 16 times faster than ResNeXt29-4x64d, which takes about 480 ms per training batch.

6.2.2 Numerical analysis

In this study, we explore the global effect of the number format, where one single data type is applied to the entire model. We evaluated a full grid search over 184 floatx configurations for all model topologies. Each configuration corresponds to a reduced precision data type of format Tw,t, consisting of w ∈ [1, 8] exponent bits and t ∈ [1, 23] mantissa bits. Globally applying the same precision configuration enables us to explore the full solution space with a brute-force approach. The knowledge of the full behavior allows us to directly answer the optimization problems optimally. Details on how to extend and accelerate the configuration search are given in Chapter 7. We evaluated all transprecision configurations, on all models, on all 10,000 validation samples of the CIFAR10 image classification dataset. We addressed three questions: Firstly, given a relative quality constraint, what is the best quantization configuration that reduces the bit-width subject to satisfying the quality constraint? Secondly, what is the obtained accuracy for a set of configurations of particular interest? Thirdly, what is the optimal split between exponent and mantissa bit-widths for fixed word sizes of 16 and 8 bits?

Figure 6.5 shows the trade-off between the number format bit-width and the obtained accuracies. Optimal configurations are selected from the full grid search based on satisfying the quality constraint q(Tw,t) ≥ Q, where the requested quality Q is defined relative to the obtained accuracy of the full precision model operating with the IEEE 754 32-bit format as Q := Q_float − ∆Q. In this section, quality is measured as Top1 accuracy, which corresponds to the total number of correctly classified images out of the 10,000 validation samples.
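The first question reduces to a simple selection over the grid-search results, as the following sketch shows; the accuracy numbers in it are made-up placeholders, and cost is taken as the total width 1 + w + t (sign, exponent, mantissa).

    // Pick the cheapest configuration T_{w,t} whose accuracy stays within dQ of
    // the float baseline.
    #include <cstdio>
    #include <vector>

    struct Config { int w, t; double accuracy; };

    Config best_config(const std::vector<Config>& grid, double q_float, double dQ) {
        Config best{8, 23, q_float};                      // fall back to float
        for (const Config& c : grid)
            if (c.accuracy >= q_float - dQ && (1 + c.w + c.t) < (1 + best.w + best.t))
                best = c;
        return best;
    }

    int main() {
        // A few fictitious grid-search entries (the real grid has 184 of them).
        std::vector<Config> grid = {
            {5, 10, 0.9331}, {6, 6, 0.9318}, {4, 8, 0.9322}, {6, 4, 0.9211}, {3, 3, 0.1012}};
        Config c = best_config(grid, 0.9330, 0.01);       // allow up to 1 percentage point loss
        std::printf("selected T%d,%d (%d bits, accuracy %.4f)\n",
                    c.w, c.t, 1 + c.w + c.t, c.accuracy);
        return 0;
    }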


Figure 6.5: Quality versus number format cost in terms of bit-width for the considered reference models. We state the relative accuracy of the quantized models compared against the floating-point 32-bit reference model. Most of the obtained results fall between 8 and 16 bits. On the one hand, without degrading accuracy, widths can be compressed by a factor of 2. On the other hand, using number formats below 8 bits causes the models to fail.

The results demonstrate that the findings generalize well among the variations of network topologies. The strictest constraint of zero quality loss, which requires classifying all of the 10,000 samples as in the full precision case, is satisfied by reduced transprecision formats. The weaker the constraints, the more aggressive the achievable reductions. Table 6.4 shows the best, average, and worst widths required for the different quality constraints over all reference models. On average, 12.4 bits are enough to obtain fully accurate results. The bit-width can be further reduced to 8.8 and 8.1 bits if one, or up to five, percentage points of quality reduction are allowed.


Figure 6.6: Data type configurations for different quality constraints of the considered reference models. Neither very low nor very high exponent field widths are required. Allowing larger quality margins against the reference allows further reducing the mantissa field.

Table 6.4: Statistics of the bit-widths required to achieve a given accuracy for the established reference models.

           ∆Q = 0%   < 0.01%   < 0.1%   < 1%    < 2%    < 5%
best       10        10        9        8       8       7
average    12.375    12.125    10.844   8.781   8.531   8.094
worst      16        16        15       12      12      12

All results are achieved with a medium-sized exponent field in the range of 4 to 6 bits. The mantissa width ranges between 5 and 10 bits to achieve results equivalent to the reference. That range is reduced to 1 to 6 bits if quality degradations of up to 5 percentage points are allowed. To complete our analysis, we evaluate accuracies for fixed formats that are of particular interest, optimizing the trade-off between exponent and mantissa bits for a fixed word width. We consider the IEEE 754 half format since it is supported on several computing systems, including recent GPUs. Additionally, we consider an alternative encoding of a 16-bit wide format that uses 8 exponent bits to encode the same dynamic range as the 32-bit standard float type, and a specific instance of a mini-float 8-bit wide format. The interest in those configurations is motivated by the fact that 8-bit and 16-bit formats directly allow for four- and two-way vectorization: data fits into 32-bit words, and the parallel ultra-low-power (PULP) platform supports those formats with hardware units and specialized instruction set extensions. Figure 6.7 shows the accuracy loss against the floating-point 32-bit reference for the three selected formats on all networks. On average, the IEEE 754 half format leads to an accuracy degradation of 0.006 percentage points, and the 16alt format lowers the accuracy by 0.028 percentage points. The half format works better in most of the cases. The reason behind that observation is that in almost all cases 4 or 5 exponent bits are enough to cover the required dynamic range of the weights and internal activations. The additional extended dynamic range of the 16alt format does not help to improve results; in contrast, the additional three mantissa bits of half do improve results.

[Panels (a) data type T5,10, (b) data type T8,7, (c) data type T5,2, (d) overall: quality loss [%] per network and per data type.]

Figure 6.7: Performance of the selected formats IEEE 754 half (T5,10), 16alt (T8,7), and the mini 8-bit format T5,2, measured against the accuracy of the float 32-bit reference. Negative numbers indicate that (due to noise) the quantized model performs slightly better. The first three figures depict results per network architecture; the last figure summarizes the results.

To study trade-offs between exponent and mantissa, we evaluated all fixed bit-width combinations up to 8 exponent bits. Figure 6.8 shows the results measured as average accuracy loss compared against the full precision reference. A reduced dynamic range causes the models to fail with highly degraded results. In contrast, a too-large exponent extends the dynamic range beyond the space that is required during computation, causing a lower quality due to the reduced mantissa field. The study reveals that the IEEE 754 half standard with 5 exponent and 10 mantissa bits is close to optimal and achieves a regret of 0.006%.

[Panels (a) 16-bit data types and (b) 8-bit data types: average quality loss [%] per format.]

Figure 6.8: Performance of (a) 16-bit and (b) 8-bit formats measured against the accuracy of the float 32-bit reference. Computations with formats of four or fewer exponent bits cause severe degradations. Larger exponent widths correspond to smaller mantissa widths, which slightly degrade quality once the format spans the required dynamic range.

That result is outperformed by T6,9, which yields a regret of 0.004%. Considering 8-bit formats, the optimal choice is T5,2, which yields a regret of 5.5%; however, that value is mostly determined by one single outlier, and the median is 2.3 percentage points. In this respect, the data type T5,2 supports the argument that mini-float 8-bit formats should have 5 exponent bits. This agrees with the design choice for the mini-float8 implementations supported on the PULP system. The layout of T5,2 is motivated by the fact that its number of exponent bits covers the same dynamic range as the IEEE 754 half type.
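To make the quantization semantics used throughout this analysis concrete, the following minimal NumPy sketch rounds values to a reduced-precision format Tw,t. It is not the floatx/PyTorch integration used in this thesis, but an illustration under simplifying assumptions (round-to-nearest mantissa, flush-to-zero underflow, saturating overflow, no subnormals or special values); the function name quantize_to_twt is hypothetical.

    import numpy as np

    def quantize_to_twt(x, w, t):
        """Round values to a float-like format with w exponent and t mantissa bits.

        Simplified semantics: round-to-nearest mantissa, flush-to-zero underflow,
        saturation on overflow, no subnormals, infinities, or NaNs."""
        x = np.asarray(x, dtype=np.float64)
        mant, exp = np.frexp(x)                                    # x = mant * 2**exp, 0.5 <= |mant| < 1
        mant = np.round(mant * 2.0 ** (t + 1)) / 2.0 ** (t + 1)    # keep t explicit mantissa bits
        bias = 2 ** (w - 1) - 1
        exp_max = bias + 1                                         # largest normal exponent (frexp convention)
        exp_min = 2 - bias                                         # smallest normal exponent (frexp convention)
        y = np.ldexp(mant, np.minimum(exp, exp_max))               # saturate large magnitudes
        y = np.where(exp < exp_min, 0.0, y)                        # flush small magnitudes to zero
        return np.where(x == 0.0, 0.0, y)

    # Example: T5,10 corresponds to the IEEE 754 half layout.
    print(quantize_to_twt([0.1, 3.14159, 1e9, 1e-9], w=5, t=10))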

6.3 Summary and conclusion

The most important findings of this chapter are the following:

• Deep learning is a fast-evolving field that spans various interesting research and engineering subfields. We decided to focus on pixel data for the problem of image and video classification. Working with pixel data provides a challenging use case with high-dimensional input spaces and large volumes of data. Over the past decade, GPU peak performance increased from 1 TFLOPs to well above 10 TFLOPs, while the accuracy on CIFAR10 moved from 80% to above 98%. Within the last three years, the average cited accuracy on CIFAR10 improved by 1.3 percentage points per year.

• In this chapter we integrated floatx into PyTorch to power transprecision computing analysis and evaluation at the model level. We discussed and related intrinsic and extrinsic emulation approaches and concluded that the intrinsic approach is much faster to compute and emulates the numerical behavior well enough with simpler source code. We evaluated a list of over 30 well-established reference models and support the finding that deep learning models are error-resilient and well-suited candidates for transprecision computing. We evaluated all configurations of floatx number formats and conclude that number representations can be compressed by 4× with less than 5 percentage points of quality degradation. Even in the strictest case of achieving a zero-loss solution, the number representations can still be compressed by 2.6× on average.

Within this work we focused on pixel data; however, the developed concepts and results are general, and we do not see conceptual limiting factors that would hinder future work from applying the developed principles to a broader context of deep learning use cases, including different data, such as audio, text, or use-case-specific data, or different tasks, such as object recognition, regression, or forecasting. We hope that the error resilience property is further exploited and integrated into more domains.

Chapter 7

Optimization for transprecision configurations

In the previous chapters, we discussed the fundamental concepts of transprecision computing, which allow trade-offs between the quality and performance of applications. Several aspects, such as benchmarking, using floatx to emulate numerical effects, and integrating floatx into PyTorch [62], were investigated. Moreover, we demonstrated the scaling of the concepts on various deep learning models. In the presented cases, the cardinality of the configuration space Θ was limited such that it was affordable to evaluate all operational points of interest. We used case-specific assumptions to simplify optimization problems and arrived at small enough, finite lists of alternatives that make the optimization problem (4.8) trivial. However, large configuration spaces are essential for the success of transprecision computing. Transprecision computing must demonstrate its usefulness across a broad spectrum of requirements, including scalability, affordability, and successful exploration of non-trivial configuration spaces. Two main aspects are essential to achieve success: (1) transprecision computing must be easy to apply and (2) it must be fast to deploy.


Transprecision-related and first-order system parameters have a joint influence on quality and performance characteristics. Understanding both simultaneously allows judging the value of opportunity costs by comparing transprecision variants with different system configurations. In that context, the key aspects of being easy to apply and fast to deploy need to demonstrate success across the broad field of alternative solutions. We discuss how to limit the combinatorial growth of the search space cardinality with effective strategies in Section 7.1. Section 7.2 proposes heuristic search techniques to quickly solve the configuration problem. We present in-depth results on a reference problem in Section 7.3 and extend the search to over 30 established models in Section 7.4.

7.1 Searching transprecision configurations

We consider the case of a traditional program that is composed of n variables x1, x2, x3, ..., xn which are represented by data types T1, T2, T3, ..., Tn. In traditional computing, where no mixed data types are used, all types are equivalent. For example, T1 = T2 = T3 = ... = Tn = T8,23 are all of type IEEE 754 32-bit floating-point. However, when using the concept of transprecision computing, various data types might be used instead. The choice of which data type should be used per variable naturally leads to a configuration search problem. Targeting current hardware, which only implements a small set of data types, simplifies the search problem. Assuming the availability of a transprecision system that supports various data types at fine granularity, the number of choices grows and the optimization becomes a challenging and time-consuming problem. The number of valid configurations of an optimization problem grows exponentially in the number of considered variables, leading to a substantial search space even when considering a reasonable number of variables. In this section we consider the transprecision search problem for floatx as introduced in Section 5. We study reduced precision types Tw,t for w ∈ [1, 8] and t ∈ [1, 23], which covers the search problem up to the standard float 32-bit representation. The bit-level granularity leads to a total of 184 (8 · 23) options as the choice of type for each of the n variables. Finally, we formulate the search problem as finding the best transprecision configuration θ∗ ∈ Θ meeting performance or quality constraints as in (4.8) and (4.9). Since there only exists a finite set of precision configurations, a brute-force approach solves the optimization problem. However, the number of configurations evaluates to 184^n and renders such an approach intractable, even for small values of n. For example, for n = 10 variables, the search space cardinality exceeds 4.5 · 10^22. In this context, we study systematic methodologies to reduce the search effort by defining more tractable problem instances that still provide application insights.
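As a rough illustration of the combinatorial growth discussed above, the following Python snippet enumerates the 184 floatx configurations and prints how the joint space grows with the number of independently typed variables; it assumes nothing beyond the grid definition given in the text.

    # Enumerate the floatx configuration grid T_{w,t} with w in [1, 8] and t in [1, 23].
    configs = [(w, t) for w in range(1, 9) for t in range(1, 24)]
    print(len(configs))                        # 184 options per variable

    # The joint space over n independently typed variables grows as 184**n.
    for n in (1, 2, 5, 10):
        print(n, f"{len(configs) ** n:.2e}")   # n = 10 already exceeds 4.5e22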

Global search

The global search imposes the assumption that all data types used in a program are equivalent to a single global type: Tglobal = T1 = T2 = T3 = ... = Tn. This approach reduces the space to the case of n = 1 and renders a full grid search manageable. Even though the global data type assumption sounds very restrictive, several arguments motivate such an approach. First, the global search provides insights into the error resilience of applications. For example, global search results can act as a lower bound for subsequent searches performed at finer granularity. Second, there is no need to perform casts between different data types. Casting between different formats might require a change to the underlying memory footprint or additional computing time to perform the required operations. In that case, potential gains due to reduced precision must be large enough to amortize the required cast overheads. Third, in FPGA or ASIC designs one global format ensures full design freedom for resource sharing.
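A minimal sketch of the global search in Python is shown below; evaluate_quality is a hypothetical placeholder for the application-specific quality measurement (e.g., Top-1 accuracy of the globally quantized model), and the cost model is the bit-width 1 + w + t.

    def global_search(evaluate_quality, q_target):
        """Exhaustive global search: one type T(w, t) is applied everywhere.

        evaluate_quality(w, t) -> application quality with the global type T(w, t).
        Returns the cheapest satisfying configuration as (bit-width, w, t, quality),
        or None if no configuration satisfies q_target."""
        best = None
        for w in range(1, 9):                  # exponent widths
            for t in range(1, 24):             # mantissa widths
                quality = evaluate_quality(w, t)
                cost = 1 + w + t               # sign + exponent + mantissa bits
                if quality >= q_target and (best is None or cost < best[0]):
                    best = (cost, w, t, quality)
        return best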

Sensitivity analysis

The sensitivity analysis formulates a search problem by elaborating the effect of reduced precision for each variable in the system independently of the others. To that end, the type Ti of the i-th variable is swept through the full grid while all other variables are kept at the full precision type, e.g., Tj = T8,23 for j ≠ i. The sensitivity analysis costs one full grid evaluation (e.g., evaluating 184 configurations) for one single variable and is repeated for each of the n variables, leading to a total of 184 · n evaluations. The linear dependence on the number of variables makes this approach scalable. The sensitivity analysis provides an easily interpretable result that can be used to identify precision bottlenecks.
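The sensitivity analysis could be sketched as follows; evaluate is a hypothetical callable that returns the application quality for a given per-variable type assignment, and full precision is assumed to be T8,23.

    def sensitivity_analysis(evaluate, n_vars, full_precision=(8, 23)):
        """Sweep the type of one variable at a time; all others stay at full precision.

        evaluate(types) -> quality, where types is a list with one (w, t) pair per
        variable.  Total cost: 184 * n_vars evaluations."""
        results = []
        for i in range(n_vars):
            per_variable = {}
            for w in range(1, 9):
                for t in range(1, 24):
                    types = [full_precision] * n_vars
                    types[i] = (w, t)          # only variable i is reduced
                    per_variable[(w, t)] = evaluate(types)
            results.append(per_variable)
        return results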

Grouping

Alternatively, grouping variables together into sets reduces the number of configurations in the solution space. One data type is applied to all variables in a group. Generally, we define g groups G1, G2, G3, ..., Gg that contain the variables belonging to the defined group, Gi := {i1, i2, i3, ..., ik}. By definition, groups are disjoint, Gi ∩ Gj = ∅ for i ≠ j, and their union covers all variables, G1 ∪ G2 ∪ ... ∪ Gg = {1, 2, 3, ..., n}. The group configuration itself is a required input and the choice might be influenced by reasonable side constraints. For example, for a program of medium size, it makes sense to group together variables that belong to the same code fragment. For each group, only one dedicated data type Ti1 = Ti2 = ... = Tik is assumed that is shared among all variables of this group. Such a consideration leads to reductions whenever g < n and reduces the search space cardinality from 184^n to 184^g.
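A small, purely illustrative helper for the grouping strategy, assuming variables are identified by indices 0..n−1 and groups are given as index lists:

    def grouped_types(groups, group_types, n_vars):
        """Expand one data type per group into a per-variable type assignment.

        groups:      disjoint lists of variable indices covering 0 .. n_vars-1
        group_types: one (w, t) pair per group"""
        types = [None] * n_vars
        for members, wt in zip(groups, group_types):
            for i in members:
                types[i] = wt
        return types

    # Example: 6 variables in g = 2 groups shrink the space from 184**6 to 184**2.
    print(grouped_types([[0, 1, 2], [3, 4, 5]], [(5, 10), (8, 23)], 6))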

7.2 Search heuristics

In this section we study the effects of using heuristics that allow further shortening of the evaluation time. A full grid search over one variable has cost O(number of TP configurations) · O(evaluation effort per configuration). We define heuristic searches for finding good, but not guaranteed to be optimal, transprecision configurations with much less effort. As introduced in Section 4.2.1, we denote the final application quality as Q. We formulate the assumption that an increase in Q stems from a wider number representation, either a larger exponent field or a more precisely represented mantissa:

Q(Tw1,t1) ≤ Q(Tw2,t2)   if w2 ≥ w1 and t2 ≥ t1.   (7.1)

If we know a priori that (7.1) holds, it directly enables efficient implementations. For example, exponent and mantissa field widths can be searched with binary search. However, we know from prior micro-benchmarks that the inequalities do not strictly hold in all cases. Depending on the application, different quantization-based noise patterns occur and lead to local anomalies. We propose several search heuristics and evaluate their performance in a controlled setting.

Smart region search pattern with early exit

Instead of computing the full grid and then selecting the best candidate, it is more efficient to evaluate configurations with small data types first. This enables an early exit: as soon as the quality constraint Q(Tw,t) ≥ Qtarget is met, we can be sure that the data type Tw,t is optimal, since by construction all smaller data types have already been evaluated and do not meet the quality constraint. The worst-case execution time is the same as performing the regular full grid search. However, finding candidates early reduces the average search time. The search sequence is simple: according to the minimization target P(Tw,t) = 1 + w + t, short bit-width formats are preferred. Among configurations with the same bit-width, we prefer the transprecision configuration with the larger exponent. This construction leads to a diagonal search pattern in the configuration matrix. The resulting search sequence is T1,1, T2,1, T1,2, T3,1, T2,2, T1,3, ..., T8,23. In addition to the diagonal search pattern, the same idea can be applied to search any region of interest efficiently. Assume that an arbitrary region of interest is given and defined as a Boolean mask over the configuration matrix. In that case, the same diagonal search pattern can be applied with the minor modification that for each configuration a check is performed to determine whether it belongs to the region of interest. Only configurations marked as valid need to be evaluated. This variant of the smart region search pattern with an early exit is relevant as a sub-step in the coarse-to-fine heuristic described in Section 7.2.
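A minimal sketch of the smart region search pattern with early exit, assuming a hypothetical evaluate_quality(w, t) routine; the ordering reproduces the diagonal sequence T1,1, T2,1, T1,2, ... described above.

    def smart_region_search(evaluate_quality, q_target, region=None):
        """Diagonal search from small to large formats with early exit.

        Configurations are visited in order of increasing bit-width 1 + w + t;
        ties prefer the larger exponent.  The first configuration that satisfies
        the constraint is optimal by construction.  region: optional set of
        (w, t) pairs acting as a Boolean mask over the configuration matrix."""
        configs = [(w, t) for w in range(1, 9) for t in range(1, 24)]
        configs.sort(key=lambda wt: (1 + wt[0] + wt[1], -wt[0]))   # T1,1, T2,1, T1,2, T3,1, ...
        for w, t in configs:
            if region is not None and (w, t) not in region:
                continue                       # skip configurations outside the region of interest
            if evaluate_quality(w, t) >= q_target:
                return (w, t)                  # early exit with the smallest satisfying format
        return None                            # no configuration meets the constraint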

Parallel exponent/mantissa binary search

Modifying the bit-widths of exponent and mantissa results in several effects. In both cases, we expect that an increase in bit-width leads to a monotonic increase in quality. Changing the mantissa is equivalent to modifying the precision of local computations. Its impact is well understood for relevant problem classes and is, for example, covered by perturbation theory for linear operations. Changing the exponent limits the range of representable values; the effect is highly non-linear and typically leads to a breakdown of the computation if internal under- or overflows occur. However, predicting the full application behaviour is challenging since the effects depend on input data and the application-specific chains along which errors propagate. Changing exponent and mantissa configurations simultaneously depends on the above as well as on cross effects of the joint interactions. To that end, we can split the quality assumptions into exponent- and mantissa-related formulations:

Q(Tw1,t) ≤ Q(Tw2,t) if w2 ≥ w1 and t constant (7.2)

Q(Tw,t1) ≤ Q(Tw,t2)   if w constant and t2 ≥ t1   (7.3)

The separated assumptions are weaker than the joint assumption since they do not require any assumptions about cross effects of simultaneous changes to exponent and mantissa widths. Both formulations lead directly to efficient implementations that we name parallel exponent binary search and parallel mantissa binary search. In the first case, we perform for each t ∈ [1, 23] an independent binary search over the exponent range. Similarly, in the second case we run independent binary searches over the mantissa for all fixed values w ∈ [1, 8] of the exponent. In terms of performance, the parallel exponent binary search is of order O(log2(8) · 23) and the parallel mantissa binary search is of order O(8 · log2(23)). Thus, the latter provides better computing performance and, if the quality behaves monotonically in the mantissa width, should also work better algorithmically than a search based on the exponent.
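A sketch of the parallel mantissa binary search under assumption (7.3); evaluate_quality(w, t) is again a hypothetical placeholder, and the result may be suboptimal if the monotonicity assumption is violated.

    def parallel_mantissa_binary_search(evaluate_quality, q_target):
        """For each exponent width w, binary-search the smallest mantissa width t
        that satisfies the quality constraint, assuming quality is monotone in t
        (assumption (7.3)).  Cost: O(8 * log2(23)) evaluations."""
        candidates = []
        for w in range(1, 9):
            lo, hi, found = 1, 23, None
            while lo <= hi:
                t = (lo + hi) // 2
                if evaluate_quality(w, t) >= q_target:
                    found, hi = t, t - 1       # feasible: try an even smaller mantissa
                else:
                    lo = t + 1
            if found is not None:
                candidates.append((1 + w + found, w, found))
        return min(candidates) if candidates else None   # (bit-width, w, t)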

Two-stage search

To further reduce the number of evaluations, we propose algorithms that reduce the evaluation effort simultaneously for exponent and mantissa. To that end, we suggest a two-stage approach where we apply two binary searches in sequence, and the latter is started over a column or row found by the first search. The two-stage search operates as follows:

• Input: pivot for the first search,

• Step 1: run a binary search at the pivot index,

• Step 2: run the second binary search, starting with results obtained from step 1.

Two-stage searches are more aggressive in reducing the number of evaluations and are hence expected to return slightly suboptimal results. However, for rapid evaluation, such an approach is useful. For example, many applications obey a typical error pattern with distinct behaviors for mantissa and exponent, where sharp transitions between high and low quality appear especially along the exponent length. Within the resulting two regions, smoother transitions or noise based on the mantissa changes might be observed. BLSTM is one benchmark presented earlier (see Section 5.2.2) that follows such an error pattern. In such cases, typical pass/fail regions form rectangular shapes. Therefore, it is enough to first search for the transition at the border of the search space and then search towards the inner region of the search space. We formulated the search algorithm such that a pivot row or column is defined where the first search is performed, and the second search follows based on the results of the first search. We consider the pivot row to be either at the border (low or high) or in the middle. Even though we are fully aware that different input configurations lead to different results, we evaluate all six configurations and discuss in the results section in which direction the results tend to be biased.
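The following sketch illustrates the two-stage idea for the mantissa direction with a high pivot (the setting that performs best in Section 7.4); evaluate_quality(w, t) is a hypothetical placeholder, and the fallback to the largest index mirrors the worst-case behavior discussed above.

    def two_stage_search(evaluate_quality, q_target, pivot_w=8):
        """Two-stage heuristic, mantissa direction with a high pivot.

        Stage 1 binary-searches the smallest mantissa width at the fixed pivot
        exponent; stage 2 then binary-searches the smallest exponent width with
        that mantissa fixed.  The result may be suboptimal."""
        def smallest_feasible(lo, hi, feasible):
            found = hi                         # worst-case fallback if nothing passes
            while lo <= hi:
                mid = (lo + hi) // 2
                if feasible(mid):
                    found, hi = mid, mid - 1
                else:
                    lo = mid + 1
            return found

        t_star = smallest_feasible(1, 23, lambda t: evaluate_quality(pivot_w, t) >= q_target)
        w_star = smallest_feasible(1, 8, lambda w: evaluate_quality(w, t_star) >= q_target)
        return (w_star, t_star)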

Coarse versus fine search with adaptive evaluation effort

In Section 7.2 we studied the effect of heuristics that reduce the number of configuration evaluations. Those considerations assume a constant time effort for evaluating one configuration. In some applications, however, an approximation of the quality can be computed faster. For example, assessing the quality of machine learning algorithms requires computing the accuracy by feeding test data through a model. In that case, using a subset of samples reduces the evaluation effort for one transprecision configuration. Evaluating models in a controlled environment includes computing quality numbers on the full set of provided samples; for that purpose, subsampling the test set is not adequate. However, in the configuration search case, we know that we potentially assess hundreds of configurations, whereas only the last configuration needs to be evaluated with the full effort. Prior decisions can be based on approximated quality estimates. Following this idea, we formulate a fast version of the evaluation routine and build the search configuration heuristic around it. For evaluating subsets of data, we define the following:

• Number of samples

• Sample selection procedure

• Quality metric computation based on splits

The evaluation effort depends linearly on the number of samples. However, we identified two limitations. First, sample sizes smaller than the typical batch size do not scale any more. Second, the number of samples in the subset should at least cover all existing classes. Next, we assume a general selection where samples are chosen uniformly at random among all samples in the test set. Since the approximated quality metric contains distortions caused by the random selection, we estimate confidence intervals. To that end, we split the subset into k equally sized splits to compute the mean and variance over the computed coarse-grained metrics. More formally, we compute the pair (µ, σ) based on the evaluation configuration (s, k), where s denotes the total number of samples in the subset and k denotes the number of splits. The coarse-to-fine search proceeds as follows:

• Step 1: perform a coarse full grid search with effort (scoarse, kcoarse).

• Step 2: filter configurations into {good, undecided, bad}.

• Step 3: perform a 1st fine search in the {good or undecided} region.

• Step 4: if more effort is desired, perform a 2nd fine search in the remaining {undecided or good} region.

The reason for using heuristics is to rule out unusable configurations with little effort while keeping the most promising configurations. In the first step, the full grid is scanned with low effort. In the second step, the algorithm decides what happens with a configuration that produced a result (µ, σ) according to the following rule:

f(µ, σ) = { good        if µ − max(αgood σ, εgood) ≥ τ,
            bad         if µ + max(αbad σ, εbad) ≤ τ,
            undecided   otherwise }                          (7.4)

The filtering easily detects good and bad configurations and enables focusing on the undecided candidates. Four parameters characterize the filtering step. The factors αgood and αbad define noise levels relative to the standard deviation. If the mean of the approximated quality level sufficiently exceeds or undershoots the limit, the configuration is considered good or bad, respectively. The parameters εgood and εbad specify minimal noise levels around the mean that have to be respected. In steps 3 and 4 the classified configurations are further inspected by using the smart region search pattern with early exit as described in Section 7.2. Two parameters define the exact behavior: one defines which of the good or undecided classified configurations should be considered in the first fine search, and the second flag defines whether the remaining class should be searched as well or skipped. The filtering parameters affect the classification, the triggered search effort, and the result quality. We observed that two configurations are appropriate. First, to optimize search performance, we directly search the good candidates and skip the undecided candidates. Since good candidates passed a coarse quality check, the chances are high that the first considered best configuration will satisfy the constraint in the fine-grained evaluation. That way, the algorithm quickly terminates after a few fine-grained evaluations. Second, for better-tuned results, undecided configurations are explored first. The heuristic terminates with success if a candidate that respects the quality constraint is found. However, since it is not known how likely it is to actually find a valid configuration, the search might be forced to evaluate all undecided configurations without success. This effect increases the average runtime of this variant of the heuristic search. To overcome a fail, it pays off to trigger the 2nd fine search over the remaining good configurations to quickly find a good alternative candidate.
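A minimal sketch of the filtering rule (7.4); the parameter defaults correspond to the default setting introduced in Section 7.3 (α = 1, ε = 1), and all names are illustrative.

    def classify(mu, sigma, tau, alpha_good=1.0, eps_good=1.0, alpha_bad=1.0, eps_bad=1.0):
        """Filter rule (7.4): classify a coarse quality estimate (mu, sigma) against
        the requested quality tau.  The defaults correspond to the 'default' setting
        (alpha = 1, eps = 1); the conservative setting uses alpha = 3, eps = 5."""
        if mu - max(alpha_good * sigma, eps_good) >= tau:
            return "good"
        if mu + max(alpha_bad * sigma, eps_bad) <= tau:
            return "bad"
        return "undecided"

    # Example: a configuration estimated at 94.1% +/- 0.5% against a 93.0% target.
    print(classify(mu=94.1, sigma=0.5, tau=93.0))   # -> 'good'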

7.3 Results on reference problem instance

We demonstrate the proposed heuristic algorithms on image classification with ResNet18 [24] on CIFAR10 [22]. We solve the global search problem, where quantization is applied to all model parameters and to all intermediate activations passed between operations. We quantize models a posteriori without retraining.


Figure 7.1: Full grid search results where all valid floatx configurations are exhaustively evaluated on the reference problem instance (floatx-quantized ResNet18 on CIFAR10). Fail/pass regions are sharply separated, and increasing the exponent or mantissa field width improves the accuracy.

Table 7.1: Selected operation points for different quality levels on the reference problem instance.

∆Q       Qtarget(1)   Optimal format   Bit-width   Accuracy
0.000%   93.58%       (4, 10)          15          93.58%
0.01%    93.57%       (4, 8)           13          93.57%
0.1%     93.48%       (4, 6)           11          93.52%
1%       92.58%       (4, 3)           8           93.04%
2%       91.58%       (4, 3)           8           93.04%
5%       88.58%       (4, 3)           8           93.04%
10%      83.58%       (4, 2)           7           84.87%

1 Qtarget is computed as Qfloat − ∆Q.

Figure 7.1 shows the obtained quality for an exhaustive grid search over the 184 configurations. Similarly to previously reported results (for example BLSTM, see Section 5.2.2), we identify clear pass/fail regions. The floating-point 32-bit accuracy of the model is Qfloat = 93.58%. If we request a quality of Q = 93.00%, the optimal number format is T4,3 with an accuracy of 93.04%. Table 7.1 shows different operation points. In the next sections, we illustrate the proposed variants of the heuristics by solving the reference problem instance with a requested quality of Q = 93.0%.

Smart region search pattern with early exit

The smart region search algorithm guarantees to yield the optimal result but may need to search the full grid in the worst case. The algorithm follows a diagonal search pattern that prefers small formats over larger formats. It terminates as soon as a configuration is found that meets the quality constraint Q(Tw,t) ≥ Qtarget. Figure 7.2 illustrates the situation for the reference problem instance. In this case, 18 evaluations were enough to find the optimal configuration T4,3.


Figure 7.2: Illustrative run of the smart search on the reference problem. The region is searched following a diagonal pattern from the bottom-left corner, and the search quickly terminates at the optimal configuration.

Parallel exponent/mantissa binary search

Figure 7.3 shows the search patterns for the parallel binary search on the reference problem. In the left subfigure the binary search is performed on the exponent, in the right subfigure on the mantissa. In both cases, the heuristic approach is able to find the globally optimal solution. Working with the binary search in the exponent direction leads to a total of 69 evaluation calls. However, applying the binary search in the mantissa direction enables higher performance gains since the logarithmic scaling is applied to the larger factor; in total, 33 evaluations are necessary to converge to the same result in this case. The parallel exponent and mantissa binary searches reduce the search effort by 2.6× and 5.5×, respectively.


(a) parallel exponent binary search (b) parallel mantissa binary search

Figure 7.3: Illustrative run of the parallel exponent/mantissa binary search. Both searches found the optimal solution.

Two-stage search

Figure 7.4 shows the search patterns for the two-stage binary search on the reference problem. In this specific example, choosing the pivot element too small is problematic since, in that case, the first search passes over large amounts of the failing region of the configurations and does not provide useful hints about where good configurations can be found. The second search is therefore triggered at the worst-case assumption of the first search (i.e., the largest exponent or mantissa). In both cases, configurations that strictly satisfy the quality constraint are still found, however with a suboptimal transprecision configuration that has a larger than optimal bit-width. Using a medium or large pivot as initialization performs equally well and converges to the same solution in this problem instance. When first searching the exponent, a too optimistic value for the exponent field (e = 3) is found first, and hence the second search leads to a suboptimal solution. When first searching the mantissa and then the exponent, the globally optimal solution is found in this problem instance. In this case, 7 configurations are triggered for evaluation, leading to an over 26× speedup in the search.

Coarse versus fine search with adaptive evaluation

We characterize the effect of reduced effort evaluations by repeating the full grid search on subsets of the test set. We selected subset sizes of 1%, 10%, and 100% of the total number of samples.


Figure 7.4: Illustrative runs of the two-stage searches. The first search is performed in the exponent (left) and mantissa (right) directions. The pivot is low, medium, and high for the three rows of figures.

Each subset is further split into k = 5 splits to evaluate k independent accuracy results. Those results are merged by computing the mean and standard deviation. Figure 7.5 shows the achieved results. Even when using a very small subset of 1% size, the overall character of the fail/pass regions is preserved. However, the accuracy on the subset might over- or underestimate the real accuracy. The standard deviations indicate how the uncertainty is reduced when enlarging the sample size: the standard deviation achieves values of about 2% and 1.3%, and converges at around 0.5 percentage points. Based on that observation, we decided to parametrize the coarse-to-fine algorithm with two configurations:

• Default: (s = 100, k = 5), αgood/bad = 1, εgood/bad = 1

• Conservative: (s = 100, k = 5), αgood/bad = 3, εgood/bad = 5

We decided to use s = 100 samples to perform the coarse search, a decision that allows evaluating CIFAR10's validation set 100× faster than the regular evaluation mode where all 10,000 samples are considered. As a default setting, we suggest filtering based on one standard deviation and enforcing at least one percentage point of error guard band. To be a bit more conservative, we suggest using a 3σ filter threshold and enforcing at least five percentage points of error guard band. Figure 7.6 shows the filtering masks for the default (left) and conservative (right) settings, where the following encoding is used: 1 good, 0 undecided, -1 bad. In both cases, the badly performing region can be safely detected. However, since the requested search quality is close to any well performing network, it is more likely that the requested quality level lies inside the error bounds of the estimated accuracy. That causes the conservative setting to classify all configurations towards the top left as undecided. In both cases, the lower two plots show aggregated results when the first and second searches are triggered on the undecided and the good candidates for the default (left) and conservative (right) choice of the search parameters. Only three and six full-effort evaluations are required to converge to the globally optimal solution in this case.


Figure 7.5: Illustrative characterization of reduced effort evaluation. From top to bottom the figures correspond to an effort of 1%, 10%, and 100%, where the left figures show the mean and the right figures show the standard deviation over accuracy estimates obtained over five splits.


Figure 7.6: Illustrative run of the coarse-to-fine heuristic. Top: filter mask for default (left) and conservative (right) settings. Bottom: triggered fine search evaluations based on the two filtering masks.

7.4 Heuristic search performance

Since the search performance depends directly on the data, we evaluate it over a representative set of problems. To that end, we used the thirty well-established reference models presented in Section 6.2.1, for which we have performed a global exhaustive search such that all optimal results are known. Since each model has a specific best performance, we define the requested search quality relative to it. We measure performance and quality by averaging results over the set of reference models.

Smart region search pattern with early exit

Table 7.2 states the achieved performance gains of the heuristic compared against the exhaustive search. The smart search guarantees to find the optimal configuration, as already presented in Table 6.4. Hence, the average obtained accuracy is greater than or equal to the requested quality. Since the search is performed from small to large formats, the average search time is longer for higher requested accuracies. On average, between 20 and 50 calls are required to find a configuration that satisfies the constraint. Hence, the smart search finishes between 3.5× and 10× faster than the full grid search without loss in quality.

Table 7.2: Performance of the smart region search pattern with early exit. The heuristic delivers optimal configurations up to 10× faster.

∆Q      Qtarget (Qfloat − ∆Q)   Avg. Q(θ∗)   Avg. P(θ∗)   Avg. search calls   Speedup
0%      92.341%                 92.355%      12.5         51.8                3.5×
0.01%   92.331%                 92.351%      12.2         49.9                3.7×
0.1%    92.241%                 92.288%      10.9         39.5                4.7×
1%      91.341%                 91.807%      8.8          23.3                7.9×
2%      90.341%                 91.351%      8.5          21.2                8.7×
5%      87.341%                 89.935%      8.1          18.3                10.1×

Parallel exponent/mantissa binary search

Table 7.3 states the results for the exponent/mantissa binary search heuristic. Sometimes, noise causes wrong decisions of the binary search that lead to suboptimal solutions. We measure the quality of the heuristic by computing the absolute quality difference ∆q = Q(θ∗) − Q(θref) and the absolute cost difference ∆p = P(θ∗) − P(θref) between the solution θ∗ obtained by the heuristic and the globally optimal solution θref. Accuracies are almost the same in all cases. For the less strict quality requirements ≥ 1%, the found results are equal to the optimal solution in all cases. For strict quality requirements, heuristically found results might differ from the optimal solution; however, in the vast majority of cases (at least 16 out of the 31 evaluated cases) the exact optimal solution is still found. For suboptimal results, the required quality constraint is still strictly met in all cases due to the formulation of the algorithm, but the average bit-width increases by up to 1.06 bits. The runtime is constant and does not depend on the requested quality. The search over the wider mantissa outperforms the alternative that searches over the exponent; the average speed-up in the former case amounts to 5.3×.

Table 7.3: Performance of the parallel binary search heuristic.

Dir.       Req. ∆Q   Avg. ∆q   Avg. ∆p    =    +1   ≥2   ≥5   Avg. calls   Speed-up
Mantissa   0%         0.000     1.06     16    2    13   0    34.6         5.3×
Mantissa   0.01%     -0.001     0.77     18    3    10   0    34.3         5.4×
Mantissa   0.1%       0.000     0.10     29    0    2    0    35.8         5.1×
Mantissa   1%         0.000     0.00     31    0    0    0    33.7         5.5×
Mantissa   2%         0.000     0.00     31    0    0    0    34.9         5.3×
Mantissa   5%         0.000     0.00     31    0    0    0    36.4         5.1×
Exponent   0%        -0.001     0.16     28    1    2    0    69.0         2.7×
Exponent   0.01%     -0.001     0.16     28    2    1    0    69.0         2.7×
Exponent   0.1%       0.000     0.00     31    0    0    0    69.0         2.7×
Exponent   1%         0.000     0.00     31    0    0    0    70.4         2.6×
Exponent   2%         0.000     0.00     31    0    0    0    72.1         2.6×
Exponent   5%         0.000     0.00     31    0    0    0    73.1         2.5×

Two-stage search

Table 7.4 states the results for all six configurations of direction and pivot. The search is defined such that it finds in all cases a solution that strictly satisfies the user-given quality constraint. Comparing the obtained accuracy with the value of the globally optimal solution shows that in all relevant cases the results are very similar and the differences are below 0.1%. In one case (Mantissa, low) the difference is more than 1.5 percentage points; however, in that case, a more accurate solution was found at the cost of an additional 19 bits on average. Choosing the pivot value as low is problematic: since in most cases results break down due to a degenerated exponent or mantissa representation, the first search hits towards the largest mantissa or exponent, and the second search converges but ends with a representation that is too large. On average, if the mantissa is searched first, the process is likely to end up with the maximum 23-bit selection for the mantissa, which is over 15 bits more than the average optimal representation. Similar findings hold for the exponent; however, since the exponent representation saturates at 8 bits, on average over 3 bits more than optimal are returned. For the remaining cases, it is more beneficial to first search the mantissa and then the exponent rather than the other way around. The best working setting is (Mantissa, high), where on average 2.1 bits more than the optimal solution are spent to guarantee the quality constraint. The runtime is constant and leads to a speedup of about 25×.

Table 7.4: Performance of the two-stage binary search heuristic.

Dir.       Piv.   Req. ∆Q   Avg. ∆q   Avg. ∆p    =    +1   ≥2   ≥5   Avg. calls   Speed-up
Mantissa   low    0%        -0.015    15.2       0    0    0    31   7.0          26.3×
Mantissa   low    0.01%     -0.011    15.5       0    0    0    31   7.0          26.3×
Mantissa   low    0.1%       0.053    16.7       0    0    0    31   7.0          26.3×
Mantissa   low    1%         0.448    18.6       0    0    0    31   7.1          25.9×
Mantissa   low    2%         0.617    18.7       0    0    0    31   7.2          25.5×
Mantissa   low    5%         1.531    19.0       0    0    0    31   7.3          25.2×
Mantissa   mid    0%        -0.001     2.5      14    1    11   5    7.5          24.6×
Mantissa   mid    0.01%     -0.002     1.5      16    3    10   2    7.4          25.0×
Mantissa   mid    0.1%       0.007     0.8      25    1    4    1    7.7          23.9×
Mantissa   mid    1%         0.000     0.0      31    0    0    0    7.1          25.8×
Mantissa   mid    2%         0.000     0.0      31    0    0    0    7.5          24.7×
Mantissa   mid    5%         0.122     0.0      30    1    0    0    7.8          23.6×
Mantissa   high   0%        -0.004     2.1      13    1    13   4    7.5          24.6×
Mantissa   high   0.01%     -0.004     1.3      14    3    13   1    7.4          24.8×
Mantissa   high   0.1%       0.002     0.4      22    3    6    0    7.7          23.9×
Mantissa   high   1%         0.045     0.1      29    2    0    0    7.2          25.6×
Mantissa   high   2%        -0.040     0.0      30    0    1    0    7.5          24.6×
Mantissa   high   5%         0.000     0.0      31    0    0    0    7.8          23.6×
Exponent   low    0%        -0.003     5.4       0    0    27   4    7.5          24.6×
Exponent   low    0.01%      0.001     4.7       0    0    30   1    7.4          24.8×
Exponent   low    0.1%       0.010     3.6       0    0    31   0    7.7          23.9×
Exponent   low    1%         0.113     3.5       0    0    31   0    7.2          25.6×
Exponent   low    2%        -0.008     3.2       0    0    31   0    7.5          24.6×
Exponent   low    5%         0.279     3.1       1    0    30   0    7.8          23.6×
Exponent   mid    0%        -0.003     2.9       8    0    19   4    7.5          24.7×
Exponent   mid    0.01%     -0.004     2.4       8    1    19   3    7.3          25.4×
Exponent   mid    0.1%       0.004     1.8       8    1    20   2    7.7          23.9×
Exponent   mid    1%         0.111     2.5       9    0    15   7    7.7          23.9×
Exponent   mid    2%         0.077     3.0       6    1    12   12   7.8          23.7×
Exponent   mid    5%         0.036     3.0       6    2    8    15   8.0          23.0×
Exponent   high   0%        -0.005     3.8       6    1    12   12   7.5          24.4×
Exponent   high   0.01%     -0.004     2.8       6    1    18   6    7.4          25.0×
Exponent   high   0.1%       0.004     1.9       8    1    19   3    7.7          23.9×
Exponent   high   1%         0.111     2.6       9    0    15   7    7.7          23.8×
Exponent   high   2%         0.031     3.2       6    1    11   13   7.8          23.6×
Exponent   high   5%         0.036     3.0       6    2    8    15   8.0          23.0×

Table 7.5: Performance of the coarse-to-fine heuristic.

Config.        Set              Req. ∆Q   Avg. ∆q   Avg. ∆p    =    +1   ≥2   ≥5   Avg. calls   Speed-up
Default        good             0%        -0.162    -0.3      22    0    6    3    13.2          14.0×
Default        good             0.01%     -0.154    -0.2      22    0    7    2    12.5          14.7×
Default        good             0.1%      -0.134     0.1      24    0    7    0     8.4          21.9×
Default        good             1%         0.049     0.1      26    5    0    0     2.0          90.5×
Default        good             2%         0.208     0.3      23    6    2    0     1.4         132.7×
Default        good             5%         0.541     0.2      24    6    1    0     1.1         167.8×
Default        undecided        0%        -1.006     1.6      17    2    7    5    15.5          11.9×
Default        undecided        0.01%     -1.006     1.6      17    4    5    5    14.2          12.9×
Default        undecided        0.1%      -0.978     2.2      15    5    7    4    11.0          16.7×
Default        undecided        1%        -1.205     3.0       9    4    15   3     7.3          25.1×
Default        undecided        2%        -2.734     2.4      11    2    15   3     5.4          34.4×
Default        undecided        5%        -3.917     2.6       8    5    13   5     4.2          43.9×
Default        undecided+good   0%         0.000     0.0      31    0    0    0    21.6           8.5×
Default        undecided+good   0.01%      0.000     0.0      31    0    0    0    20.3           9.1×
Default        undecided+good   0.1%       0.000     0.0      31    0    0    0    13.5          13.6×
Default        undecided+good   1%         0.000     0.0      31    0    0    0     4.7          39.3×
Default        undecided+good   2%         0.000     0.0      31    0    0    0     3.7          49.6×
Default        undecided+good   5%         0.000     0.0      31    0    0    0     2.6          71.3×
Conservative   good             0%        -0.081    -0.1      29    0    2    0    22.9           8.0×
Conservative   good             0.01%     -0.081    -0.1      29    0    2    0    21.5           8.6×
Conservative   good             0.1%      -0.077     0.0      29    1    1    0    15.0          12.3×
Conservative   good             1%        -0.026     0.2      27    2    2    0     5.8          31.9×
Conservative   good             2%         0.013     0.3      26    3    2    0     4.4          41.9×
Conservative   good             5%         1.045     0.5      17    10   4    0     1.8         103.7×
Conservative   undecided        0%         0.000     0.0      31    0    0    0    24.0           7.7×
Conservative   undecided        0.01%      0.000     0.0      31    0    0    0    22.5           8.2×
Conservative   undecided        0.1%       0.000     0.0      31    0    0    0    15.6          11.8×
Conservative   undecided        1%         0.000     0.0      31    0    0    0     5.9          31.3×
Conservative   undecided        2%        -0.050     0.0      30    1    0    0     4.7          38.8×
Conservative   undecided        5%         0.176     0.2      23    6    2    0     3.6          50.9×
Conservative   undecided+good   0%         0.000     0.0      31    0    0    0    24.1           7.6×
Conservative   undecided+good   0.01%      0.000     0.0      31    0    0    0    22.7           8.1×
Conservative   undecided+good   0.1%       0.000     0.0      31    0    0    0    15.8          11.7×
Conservative   undecided+good   1%         0.000     0.0      31    0    0    0     6.0          30.7×
Conservative   undecided+good   2%         0.000     0.0      31    0    0    0     4.8          38.0×
Conservative   undecided+good   5%         0.000     0.0      31    0    0    0     3.4          54.8×

Coarse versus fine search with adaptive evaluation

Table 7.5 shows the obtained results for the coarse-to-fine heuristic for the default and conservative sets of parameters. The heuristic works reasonably well with the default parameters when simply considering the good candidates for moderate quality constraints of ≥ 1%. In this case, the coarse step filters configurations such that only up to two configurations remain for the final decision. Those results all exceed the quality requirements and only cause cost overheads of up to 0.3 bits on average. For stricter quality requirements < 1%, the heuristic might pre-select small sets of configurations that unfortunately do not contain a configuration that meets the quality constraint. In this case, the best result is returned, which violates the quality constraint. Operating the algorithm in this domain is not recommended, since the result is neither guaranteed nor likely enough to contain a quality-satisfying configuration. Changing the filtering settings from default to conservative classifies more configurations as undecided candidates. This improves the results but slows the algorithm down; still, in 2 out of 31 cases the quality constraints are violated. In contrast, operating the algorithm such that all remaining undecided and good candidates are considered resolves all problems for the conservative and even for the default settings. However, that way many additional configurations have to be examined at full precision, leading to a higher average evaluation cost.

7.5 Summary and conclusion

Table 7.6: Overview of properties and key results of transprecision search heuristics.

Heuristic             Quality constraint   Optimum     Constant   Avg. bit-width   Speed-up
                      guarantee            guarantee   runtime    increase
Exhaustive            YES                  YES         YES        0 bit            ref.
Smart search          YES                  YES         NO         0 bit            > 5.3×
Parallel search (1)   YES                  NO          YES        1.06 bit         > 5.3×
Two-stage (2)         YES                  NO          YES        2.1 bit          > 24.6×
Coarse-to-fine (3)    NO                   NO          NO         0 bit            > 8.5×

(1) mantissa direction
(2) mantissa direction with high pivot
(3) default settings, with searching undecided and good classified configurations

In this chapter we demonstrated the following. We identified the importance of heuristics for quickly finding good and optimal transprecision configurations. We studied three heuristics that reduce the number of required evaluations and one heuristic, the coarse-to-fine heuristic, that reduces the effort during the evaluation of the quality. We evaluated the four heuristics on a representative set of tuning tasks of deep learning models where the optimal solution is known, as summarized in Table 7.6. All of the considered heuristics have operating points where they outperform the others in at least one aspect. Relaxing the required properties leads to much faster execution times. The coarse-to-fine search is an interesting heuristic that does not give any guarantees but still achieves optimal results in all of the considered cases at a decent speed.

Chapter 8

Optimization for IoT devices with given constraints

Many industrial workflows rely on automation and optimization. The higher the degree of automation, the simpler the elaboration and design of even more complex solutions. Fully automated pipelines that accept customized configurations easily express an optimization problem, where even better solutions are found by using better configurations. If such configuration spaces are small enough, exhaustive evaluations directly lead to the globally optimal solution. However, in relevant cases, configuration spaces might be large; they scale exponentially in the number of tunable hyper-parameters. Such optimizations turn out to be either computationally infeasible, or it is at least questionable whether the obtained improvements are worth the invested computing costs. In such settings, search heuristics that consider only the feedback from a small set of evaluated hyper-parameter configurations find good enough configurations in affordable time. The term automated machine learning, in short autoML, refers to situations where automation and optimization are applied to end-to-end machine learning workflows. Parameters of different stages are considered to optimize the solution. That includes settings from

data augmentation, model definition, and the learning algorithm itself, as outlined in Section 1.1. AutoML enhances solutions at the cost of extra computing time and succeeds when working with new data where little insight is available a priori.

Designing an economically viable artificial intelligence system has become a formidable challenge in view of the increasing number of published methods, data, models, and newly available deep-learning frameworks, as well as the hype surrounding special-purpose hardware accelerators as they become commercially available. The availability of large-scale datasets with known ground truths [22, 59, 132, 167–175] and the widespread commercial availability of computational performance, usually achieved with graphics processing units (GPUs), has driven the current growth of and strong interest in deep learning and the emergence of related new businesses. Smart homes [176], smart grids [177], and smart cities [178] trigger a natural demand for the Internet of Things (IoT), whose products are designed to be low in cost and to feature low energy consumption and fast reaction times, due to the inherent constraints of the final applications, which typically demand autonomy with long battery lifetimes or fast real-time operation. Experts estimate that there will be some 30 billion IoT devices in use by 2020 [179], many of which serve applications that benefit from artificial intelligence deployment.

In this context, we propose an automatic way to design deep-learning models that satisfy user-defined constraints specifically tailored to match typical IoT requirements, such as inference latency bounds. Additionally, our approach is designed in a modular manner that allows future adaptations and specializations towards novel network topology extensions, different IoT devices, and lower precision contexts.

This chapter is organized as follows. Section 8.1 describes related work, Section 8.2 introduces the core design procedures, Section 8.2.4 details and merges a full synthesis workflow, Section 8.3 presents and discusses the obtained results, and Section 8.4 concludes our findings.

8.1 Related work for network architecture search

Automated architecture search has the potential to discover better models [29–35, 40]. However, traditional approaches require a vast amount of computing resources or cause excessive execution times due to the full training of candidate networks [36]. Early stopping based on learning-curve predictors [37] or transferring learned weights shortens run-times [38]. A method called train-less accuracy predictor for architecture search (TAPAS) demonstrates how to generalize architecture search results to new data without having to train during the search process [39]. Architecture searches face the common challenge of defining the search space. Historically, new networks were developed independently by expert knowledge and outperformed previously found networks generated by architectural searches. In such cases, very expensive reconsiderations led to follow-up work to account correctly for a richer search space [180, 181]. Recent progress in the field, such as MnasNet [41] and FBNet [42], tailors the search by optimizing a multi-objective function including inference time on smartphones. MnasNet trains a controller that adjusts to sample models that are more optimal in terms of the multiple objectives. FBNet trains a supernet by a differentiable neural architecture search (DNAS) in a single step and claims to be 420× faster by avoiding additional model training steps. In contrast to solving a joint optimization problem in one step, our proposed union of narrow-space searches takes a modular approach that separates the search process of finding architectures that strictly satisfy constraints from the training of candidate networks. That way, we can analyze 10,000 architectures with no training cost and select only a small subset of suitable candidates for training. Compression, quantization and pruning techniques reduce heavy computational needs based on the inherent error resilience of deep neural networks [182]. Mobile nets [28] or low-rank expansions [183] change the topology into layers that require fewer weights and reduce workloads. Quantization studies the effect of using reduced-precision floating-point or fixed-point formats [166, 184], whereas compression attempts to reduce the binary footprint of activation and weight maps [185]. Pruning approaches avoid computation by enforcing sparsity [186].

[Figure 8.1 plot: diagram of the three-layer CONV/BN/Pool architecture, the legend of the restricted laws L1–L3, and histograms of the number of occurrences versus the number of parameters.]

Figure 8.1: Left: Three-layer architecture. Middle: Default configuration of the search space with restricted sampling laws. Right: Statistics of the number of parameters obtained by sampling up to one million networks from the base configuration space and 1000 networks from the restricted sampling laws.

We extensively use the integration of floatx into PyTorch developed in Section 6 to assess data-format-specific aspects of networks. The novelty of our work is that we jointly evaluate network topologies in combination with reduced precision.

8.2 Narrow-space architecture search

It is challenging to define a space S that produces enough variation and simultaneously reduces the probability of sampling suboptimal networks. We propose narrow-space architecture searches, where results are obtained by aggregating n independent searches, S = S1 ∪ ... ∪ Sn. As a good search space should satisfy Sr ⊂ S, where Sr = {M1, ..., Mn} is a set of reference models, we construct S by designing narrow spaces that obey Mi ∈ Si in order to guarantee Sr ⊂ S. Instead of considering one large space, we have specialized search spaces that produce structures ranging from simple sequences with residual bypass operations (ResNets [24]) to high fan-out and convergent structures such as those that occur in the Inception module [187] or DenseNets [188].

Table 8.1: Search spaces induced from established reference models

Space   Reference model          Params    Ops       Accuracy ours^1   Accuracy ref.^2
S1      DenseNet121 [27]         7.0M      898.1M    94.13%            95.04%
S2      MobileNetV2 [189]        2.3M      94.6M     92.94%            94.43%
S3      GoogLeNet [25]           6.2M      1.5G      93.55%            -
S4      PNASNetA [190]           135.5K    29.2M     83.85%            -
S5      ResNeXt29_32x4d [191]    4.8M      779.6M    93.46%            94.73%
^1 reproduced results with our training limited to 100 epochs
^2 reference results of the third-party implementation [192] with high-effort training of 350 epochs

Aggregation allows results to be extended easily with a tailored narrow-space search for new reference architectures. Next, we define a set of distribution-law configurations L1(Si), ..., Lk(Si) that allow samples to be drawn in a biased way such that models satisfy the properties of interest. Figure 8.1 illustrates the advantages over a uniform distribution among valid networks. Consider a space of three-layer networks with allowed variations in kernel shapes in {1, 3, 5, 7} and output channels in [1, 128], leading to |S| = 4^6 · 128^3 = 8.6 · 10^9 network configurations. Figure 8.1 shows the statistics for up to 10^6 samples compared with sampling only 1000 samples using the restricted samplers L1, L2 and L3. Restricted random laws efficiently generate networks of interest, in contrast to a uniform sampler that fails to deliver high sampling densities in certain regions. For example, only 132 out of 10^6 networks have fewer than 1000 parameters.
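To make the sampling-law idea concrete, the following sketch draws networks from the three-layer toy space described above and counts how many fall below 1000 parameters; the restricted channel bounds are illustrative values and not the exact definitions of L1–L3.

import random

KERNELS = [1, 3, 5, 7]  # allowed kernel sizes per spatial dimension

def sample_network(ch_low=1, ch_high=128):
    """Draw one three-layer configuration; a restricted sampling law simply
    narrows the output-channel interval [ch_low, ch_high]."""
    layers, c_in = [], 3  # RGB input
    for _ in range(3):
        kh, kw = random.choice(KERNELS), random.choice(KERNELS)
        c_out = random.randint(ch_low, ch_high)
        layers.append((kh, kw, c_in, c_out))
        c_in = c_out
    return layers

def num_parameters(layers):
    """Analytical weight count of the convolutional layers (biases ignored)."""
    return sum(kh * kw * c_in * c_out for kh, kw, c_in, c_out in layers)

uniform = [num_parameters(sample_network()) for _ in range(1000)]        # base law L0
tiny = [num_parameters(sample_network(ch_high=8)) for _ in range(1000)]  # restricted law
print(sum(p < 1000 for p in uniform), "vs", sum(p < 1000 for p in tiny),
      "sampled networks with fewer than 1000 parameters")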

8.2.1 Narrow-space and sampling law definition

We define each narrow-space architecture search and its sampling laws according to the following design goals: First, only valid models are generated, with a topology that resembles and includes the original model. Second, the main model-specific parameters are varied, and efficient models are obtained mainly by lowering channel widths in convolutional layers and reducing the number of topological replications. Third, all random laws are defined following a uniform distribution over the available options, where the lower and upper limits are used as a way to bias the models to span several orders of magnitude, targeting the range of parameter and FLOP counts relevant for IoT applications.

Table 8.2: Architecture search space definition S1 with different sampling laws for DenseNets.

Law   Cardinality    ni, i ∈ {1, 2, 3, 4}   g*         r**
L0    3.3 · 10^9     [1, 32]                [1, 32]    [0.0, 1.0]
L1    2.0 · 10^6     [1, 8]                 [1, 8]     [0.2, 0.8]
L2    6.3 · 10^7     [1, 16]                [1, 16]    [0.2, 0.8]
L3    1.0 · 10^9     [1, 32]                [1, 32]    [0.5, 0.8]

* g is the growth rate, ** r is the reduction rate

Table 8.1 lists the five search spaces used in this work, which are based on established models. Typical models consist of 2M up to 7M parameters, cause workloads from 94.6 million up to 1.5 billion floating-point operations (FLOPs), and are too large for fast implementations on a targeted IoT device. DenseNets [27] exist in the common variants 121, 161, 169, and 201, and we used the smallest variant (DenseNet121) as starting point. We reproduced the accuracy for all architectures by running our training procedure as detailed in Section 8.2.6, where we used an upper limit of 100 epochs, and compare it with the claimed reference accuracy from the source from which we obtained the architecture implementation in PyTorch [62]. The latter values are slightly higher, but they are obtained with a high-effort training that runs for a fixed amount of 350 epochs. Additionally, the latter source neither states the mean and variance of the training process nor makes it completely clear whether the values are obtained in a one-shot training or whether the best values have been selected after repeating the training process several times. In contrast, we decided to follow a pragmatic but efficient approach of evaluating each architecture

Table 8.3: Architecture search space definition S2 with different sampling laws for MobileNets.

Law   Cardinality     MobileNets parameters, i ∈ [1, 7]
                      fin         ei       fi         ni       si       fout
L0    5.2 · 10^34     [1, 128]    [1, 8]   [1, 256]   [1, 4]   [1, 2]   [1, 1280]
L1    3.6 · 10^33     [16, 128]   [1, 6]   [16, 256]  [1, 4]   [1, 2]   [128, 1280]
L2    2.0 · 10^28     [16, 64]    [1, 4]   [16, 128]  [1, 3]   [1, 2]   [128, 512]
L3    3.1 · 10^21     [16, 32]    [1, 2]   [16, 64]   [1, 2]   [1, 2]   [128, 256]
L4    4.0 · 10^19     [16, 32]    [1, 2]   [4, 32]    [1, 2]   [1, 2]   [64, 128]
L5    1.4 · 10^15     [16, 32]    [1, 2]   [2, 8]     [1, 2]   [1, 2]   [16, 64]
L6    4.3 · 10^13     [4, 8]      [1, 2]   [2, 8]     [1, 2]   [1, 2]   [12, 16]

only once and to limit the training effort to an affordable value of 100 epochs. This decision is motivated by the fact that we want the same training procedure to be applied to over 3,000 models. High-effort training evaluations would cause 3.5× higher computational costs, and repeating experiments to deliver statistics would require a repetition factor of at least 5×. Both aspects together cause a 17.5× increase in computation cost. In our opinion, if we were willing to pay such an increase, it would be more interesting to use the affordable approach and invest the additional budget into investigating more architectures: the increased effort would allow investigating over 50,000 network architectures. Next, we define the sampling laws and parameters used to manually enforce smaller variants of networks within the defined spaces.

DenseNets [27] consist of four stages, each repeating DenseNet-specific blocks. We identified the stage-specific number of repetitions, the growth rate and the reduction factor as the relevant hyper-parameters that we modify. Table 8.2 specifies the sampling laws. In this case, we decided to clip the repetition factor at 32, which additionally includes the configuration of DenseNet161. The normalized reduction factor is sampled with a step size of 0.01.

Table 8.4: Architecture search space definition S3 with different sampling laws for GoogLeNets (Law, Cardinality, and GoogLeNets channel parameters fi, where the fi follow a recursive definition such that the next input shape matches the previous output shape). [Table body omitted.]

Table 8.5: Architecture search space definition S4 with different sampling laws for PNASNet-A.

Law Cardinality PNASNet-A parameters, i ∈ {1, 2, 3}

Law   Cardinality    ni        f1         d2       d3
L0    3.5 · 10^6     [1, 12]   [1, 128]   [1, 4]   [1, 4]
L1    3.1 · 10^6     [1, 12]   [16, 128]  [1, 4]   [1, 4]
L2    2.3 · 10^5     [1, 8]    [16, 64]   [1, 3]   [1, 3]
L3    9.7 · 10^2     [1, 3]    [8, 16]    [1, 2]   [1, 2]

Set f2 := f1 · d2 and f3 := f2 · d3.

MobileNetV2 [189] consists of seven stages, each repeating MobileNetV2-specific blocks. Each block is configured with four parameters: the number of input channels, the number of output channels, an expansion factor, and a stride factor. To generate valid configurations, we define the first input channel number separately, since each subsequent input shape follows directly from the previous block configuration. Additionally, we restrict the stride factor per stage to be either one or two, and we further require exactly three twos and four ones to be sampled. The reason for this choice is that the stride parameter directly influences the spatial shape of the tensors, and the limitation ensures a fixed downsampling over three steps from 32 × 32 to 4 × 4. Additionally, the existing last intermediate convolutional layer is parametrized separately and is used, as in the original reference model, as a transition layer between the last block and the final linear classifier. Table 8.3 states the sampling law definitions. GoogLeNet [25] is composed of the characteristic Inception module, which is defined through seven intermediate channel depths. The full network is grouped into three stages: a convolutional pre-layer comes first, and max-pooling layers separate sequences built from two, five, and two Inception modules. Table 8.4 defines the sampling laws. We choose parameter-specific upper bounds oriented on the reference implementation. PNASNet-A [190] consists of three stages that are built by repeating cell-A type blocks. The stages are separated by downsampling

Table 8.6: Architecture search space definition S5 with different sampling laws for ResNeXt.

Law Cardinality ResNeXt parameters, i ∈ [1, 3]

Law   Cardinality     ni       fi        ci
L0    9.5 · 10^14     [1, 3]   [1, 64]   [1, 512]
L1    2.1 · 10^10     [1, 3]   [4, 64]   [1, 512/fi]
L2    2.4 · 10^8      [1, 3]   [4, 32]   [1, 128/fi]
L3    1.5 · 10^5      [1, 2]   [4, 8]    [1, 32/fi]

layers that are implemented as cell instances with a stride of two. Table 8.5 defines the sampling laws that affect the number of block repetitions and the number of channels used in the blocks, where f2 and f3 are defined relative to the output shape of the previous stages. ResNeXt [191] is an improvement over the typical ResNet [24] structure. It consists of a three-stage architecture where each stage repeats its bottleneck block ni times, for i ∈ {1, 2, 3}. The block consists of the typical residual connection and follows a bottleneck design where grouped convolutions are used to reduce the kernel size. We define the search space with the bottleneck base width fi and the cardinality ci. The total channel size invoked during the grouped convolution operation is of width fi · ci. Since we want to limit the product fi · ci but also require it to be divisible by both fi and ci, we decided to randomly sample the latter (ci) and restrict its upper bound to lhigh/fi, where lhigh denotes the upper product limit and fi depends on the current sample of the base width. Table 8.6 summarizes the defined random laws.
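The divisibility constraint on fi · ci can be enforced directly in the sampler, as in the following sketch; the bounds correspond to the L3 row of Table 8.6 with lhigh = 32.

import random

def sample_resnext_stage(f_low=4, f_high=8, l_high=32):
    """Draw the bottleneck base width f first, then the cardinality c uniformly
    in [1, l_high // f], so that the grouped-convolution width f * c never
    exceeds l_high and is divisible by both f and c by construction."""
    f = random.randint(f_low, f_high)
    c = random.randint(1, l_high // f)
    return f, c

for _ in range(3):
    f, c = sample_resnext_stage()
    assert f * c <= 32
    print(f"base width f={f}, cardinality c={c}, grouped width f*c={f*c}")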

8.2.2 Precision analysis

Precision analysis evaluates model accuracies for models having reduced-precision representations. Following our general methodology, we perform precision analyses on the backend device, which has different execution capabilities than current or future targeted IoT devices. This methodology enforces emulated computation throughout the analysis to assess accuracy independently of the target hardware. Low precision can be applied to the model parameters, to the computations performed by the models, and to the activation maps that are passed between operators. Here we follow the extrinsic quantization approach [166], where we enforce the precision caused by the reduced type Tw,t of storage width 1 + w + t on all model parameters and all activation maps that are passed between operations. Our analysis follows the IEEE 754 standard [116], which defines the storage encoding, special cases (NaN, Inf), and rounding behavior of floating-point data. A sign s, an exponent e and the significand m represent a number v = (−1)^s · 2^e · m, where the exponent field width w and the trailing significand field width t limit the dynamic range and precision. The types T5,10 and T8,23 correspond to the standard formats half and float. Our experiments are based on a PyTorch [62] integration of a GPU quantization kernel based on the high-performance floatx library [193], which implements the type Tw,t. The development effort of Section 6 enables a fast precision analysis that allows us to evaluate more than 3,000 models with a full grid search of 184 types (w ∈ [1, 8], t ∈ [1, 23]) on the entire validation data.
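The following sketch emulates the extrinsic quantization of a float32 tensor to a type Tw,t in plain PyTorch. It is a simplified stand-in for the floatx GPU kernels: subnormals and some IEEE 754 corner cases are only approximated, and the rounding behavior is that of torch.round.

import torch

def quantize_to_float(x: torch.Tensor, w: int, t: int) -> torch.Tensor:
    """Simplified emulation of a reduced float type Tw,t with w exponent bits
    and t trailing-significand bits, applied to a float32 tensor (extrinsic
    quantization of weights or activation maps)."""
    emax = 2 ** (w - 1) - 1                      # largest unbiased exponent
    emin = 1 - emax                              # smallest normal exponent
    mant, exp = torch.frexp(x)                   # x = mant * 2**exp, |mant| in [0.5, 1)
    mant_q = torch.round(mant * 2.0 ** (t + 1)) / 2.0 ** (t + 1)  # round the significand
    y = torch.ldexp(mant_q, exp)
    max_val = (2.0 - 2.0 ** (-t)) * 2.0 ** emax  # largest finite value of Tw,t
    y = torch.clamp(y, -max_val, max_val)        # saturate out-of-range values
    return torch.where(y.abs() < 2.0 ** emin, torch.zeros_like(y), y)  # flush subnormals

x = torch.randn(4, 8)
print(quantize_to_float(x, w=5, t=10))           # roughly emulates the half format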

8.2.3 Performance characterization on hardware

To evaluate model execution performance on the IoT target device, we perform a calibration to assess the execution speed of the models of interest. Despite the many choices of deep-learning frameworks, ways of optimizing code depending on compilation options or software versions, and the several hardware platforms that accelerate deep learning models, we formulate the performance characterization in a general manner and as decoupled as possible from the topology architecture search and the precision analysis to facilitate subsequent extensions. Performance measurements on the IoT device are affected by explicit and implicit settings. We demonstrate our search algorithm with performance measurements featuring the fewest assumptions and requirements regarding the runtime. To that end, we selected the Raspberry Pi 3(B+) as a representative low-cost IoT device. It features a Broadcom BCM2837B0, a quad-core ARMv8 Cortex-A53 running at 1.4 GHz. The board is equipped with 1 GB of LPDDR2 memory [194]. The Raspberry Pi 3(B+) belongs to the general-purpose device category; it is shipped with peripherals (WiFi, local area network (LAN),

Bluetooth, universal serial bus (USB), and high-definition multimedia interface (HDMI)) and a full operating system (Raspbian, a Linux distribution). It is available for about $35 [195]. In the following, we justify the choice of the Raspberry Pi 3(B+) as representative IoT device. First, it should be mentioned that there is an emerging trend in industry and research to improve hardware for artificial intelligence (AI) by either improving performance, reducing power consumption, or providing better trade-offs in terms of power/performance ratios or hardware cost versus the on-device supported features. When thinking through IoT-driven business cases, the fact that new hardware keeps appearing requires benchmarking and rethinking on which HW product a certain IoT application should be built. We are aware of the existence of tens of ASIC or FPGA solutions that might be selected, for legitimate business reasons, as a target edge inference system. We think that the crucial factors for a successful IoT deployment strategy cover the following points, with an importance that is application specific:

• reliability;
• a user/developer-friendly software ecosystem;
• modular integration or extension of different functionality;

• typical IoT support;
• a cost-efficient system.

In this work, we decided to focus on the algorithm. However, since we are aware of the many choices and good reasons for a certain HW solution, we developed our approach such that the main functionality is decoupled from the actual HW implementation. In particular, populating a database with the current results, which are obtained with expensive training to determine model accuracy, allows them to be reused later for any hardware platform by simply implementing the inference and timing measurement setup to obtain the new HW calibration information. In this work, we focus on the most general use case, which places the fewest requirements on the underlying hardware. That way, we identified the Raspberry Pi

[Figure 8.2 plot: workload in operations versus number of parameters for the search spaces S1–S5.]

Figure 8.2: High correlations between the size of networks and the computational workload required for a single inference. Both metrics are analytical properties of the network architectures and are known at design and construction time of the model.

3(B+), a general-purpose quad-core architecture, as a suitable IoT device candidate. The Raspberry Pi proves its marketability by the fact that it had been shipped over 25 million times by February 2019 [196]. Even though there are competing products that are specially tailored for AI deployment, the choice of a general-purpose platform equipped with a Linux operating system comes with obvious advantages: it enables the reuse of established software, and solutions can easily be extended to any needs. In contrast to dedicated AI accelerators that are shipped as USB dongles, potentially required features such as Ethernet, WiFi, an SD-card slot, or USB ports are already included in the Raspberry Pi 3(B+). Even though we are aware that a general-purpose architecture cannot compete in some performance metrics with a dedicated AI product, we argue that our work is especially insightful since we cover the more challenging case of optimizing for a performance-limited device. In our view, it is plausible to argue that a more performant device will automatically deliver better results. We aim to support various HW platforms with different deployment flows in future work.

[Figure 8.3 plots: latency time in ms versus number of parameters (left) and versus workload in floating-point operations (right) for S1–S5.]

Figure 8.3: Runtime-dependent latency measured as a function of network size (left) and network workload (right). The latency is best correlated with the workload when accounting for different search space-specific characteristics.

Next, we describe the deployment flow for the Raspberry Pi 3(B+). Even though our back-end algorithms, as well as our training routine, are implemented in PyTorch, we aim to remove the back-end dependency in order to remain open and to ease later migration to new target platforms, frameworks, and ecosystems. To that end, we decided to export all models in the open neural network exchange (ONNX) format [197]. We use caffe2 as the target device runtime for the exported ONNX models. We built the caffe2 framework from a full source compilation with all default parameters on the Raspberry Pi 3(B+) and ensured that the produced code uses ARM's NEON library [198] for fast computation. We wrote a light script that imports the produced ONNX models and triggers a sequence of inferences for a single image. In our work, all timing results have been obtained by averaging wall-clock times over ten repetitions. We used a batch size of one to minimize latency and internal memory requirements. The latency study covers many relevant use cases, for example the classification of sporadically arriving data within a short time to prolong battery lifetime, or the frame-wise processing of a video stream, where the classification must be completed before the next frame arrives. For each model, we consider two analytical properties, the number of trainable parameters and the workload measured as the number of

[Figure 8.4 diagram: the manual workflow (define search space, define sampling law, sample and collect statistics, perform HW calibration, with human-expert adaptions) and the automatic workflow (genetic evolution with tournament selection over analytical metrics, estimated or measured runtimes, and constraints; selected models are trained).]

Figure 8.4: Manual and automatic workflow. First, sampling laws are defined to generate models of interest. Second, models are calibrated to check latency on the IoT device, even if they are not yet trained. Third, models are trained to obtain their accuracy. As training is the most expensive task, it is essential to limit the number of trained models to candidates of interest only.

floating-point operations required for inference. The calibration step relates these analytical properties to execution performance, so that runtime metrics can either be measured or estimated. Figure 8.2 shows the high correlation between the number of parameters and the workload. Figure 8.3 shows that either analytical property of a model can be used to predict its latency on the Raspberry Pi 3(B+) device. Workload and parameters follow a similar scaling over five orders of magnitude with homogeneous variations. The dynamic range of the latency spans more than two orders of magnitude, with higher variations for larger models. However, owing to the compute-bound nature of the kernels, the workload is a better indicator of latency time than the number of parameters.
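As an illustration of this protocol, the following sketch exports a toy candidate model to ONNX and measures the average single-image latency with the Caffe2 ONNX backend; the model definition and file name are placeholders, and the exact build and import details on the device may differ.

import time
import numpy as np
import onnx
import torch
import caffe2.python.onnx.backend as backend

# Host side: export a (toy) candidate model to ONNX with batch size one.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 10))
model.eval()
torch.onnx.export(model, torch.randn(1, 3, 32, 32), "candidate.onnx")

# Device side (e.g. on the Raspberry Pi): run the exported model with the
# Caffe2 backend and average the single-image wall-clock latency over ten runs.
rep = backend.prepare(onnx.load("candidate.onnx"), device="CPU")
image = np.random.randn(1, 3, 32, 32).astype(np.float32)
rep.run(image)                                   # warm-up inference
start = time.time()
for _ in range(10):
    rep.run(image)
print(f"average latency: {(time.time() - start) / 10 * 1000:.1f} ms")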

8.2.4 Fast cognitive design algorithms

In this section, we leverage the architecture search, the precision analysis, and the hardware calibration steps to synthesize case-specific solutions that satisfy given constraints. We address two tasks: First, the constraint search solves for the model that best satisfies given constraints. Second, the Pareto front elaboration provides insights into the tradeoffs over the entire solution space. The two tasks are related: solving the first task on a grid of constraints provides solutions to the second task, whereas filtering the latter based on the given constraints yields the former. Both tasks are solved both manually and automatically by defining sampling law configurations on the same set of narrow search spaces, as shown in Figure 8.4. In the manual case, collected statistics of analytical network properties provide quick feedback to adapt the settings to cover the range of interest. For a fair comparison of the manual and automatic workflows, we assume throughout our experiments that the expert has no further feedback knowledge about model accuracy. Additionally, network runtime performance metrics can be measured on the target device or estimated from calibration measurements. Next, depending on the task type, either a few candidate networks that satisfy the constraints or a full wave of networks are selected for training. Large-scale training takes the most time, as each training job is of complexity O(ntrain · Cmodel · E), proportional to the amount of training data, the model complexity, and the number of epochs for which the model is trained.

We designed a genetic and clustering-based algorithm to automate the design of sampling laws. We define the valid space with a list of variables with absolute minimal and maximal ratings. A sampling law L(Si) is defined as an ordered set of uniform sampling laws L = (Ux[lx, hx], ...) with lower and upper limits lx and hx per variable x. The genetic algorithm automatically learns the search space-specific sampling law limits [lx, hx]. The cost function is defined in a two-step approach. First, the statistic (µm, σm) := Em(L) is estimated by computing the mean and standard deviation of the metric m extracted from the n generated topologies. Second, the cost is computed as c((µm, σm), (τ1, τ2)) := |µm − σm − τ1| + |µm + σm − τ2|, so that the high-density range of the estimated distribution coincides with a given interval (τ1, τ2). We avoided definitions based on single-sided constraints such as µ < τ because such formulations might be satisfied trivially (using the smallest network) or by undesirable laws having overly wide or narrow variations. We used the tournament selection variant of genetic algorithms [199] and defined mutations by randomly adapting the sampling law hyper-parameters lx and hx. We used an initial population of ninit = 100 and ran the algorithm for nsteps = 900 steps while using neval = 10 samples to estimate the mean and standard deviation per configuration. This way, one search considers (ninit + nsteps) · neval = 10,000 networks. As the final population might contain different sampling laws of similar quality, we performed spectral clustering [200] to find k = 10 clusters with similar sampling laws. We assembled a list of the most different top-k laws by taking the best-fit law per cluster.
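A minimal sketch of the two-step cost and of a bound mutation used inside the tournament-selection loop is shown below; sample_metric is an assumed callable that samples one topology from a law and returns the metric m (e.g., the parameter count), and the data structures are illustrative.

import random
import statistics

def law_cost(law, sample_metric, tau1, tau2, n_eval=10):
    """Two-step cost: estimate (mu, sigma) of the metric over n_eval sampled
    topologies, then measure how far the high-density interval
    [mu - sigma, mu + sigma] is from the target interval [tau1, tau2]."""
    samples = [sample_metric(law) for _ in range(n_eval)]
    mu, sigma = statistics.mean(samples), statistics.pstdev(samples)
    return abs(mu - sigma - tau1) + abs(mu + sigma - tau2)

def mutate(law, abs_low, abs_high):
    """Mutate the uniform-law bounds [l_x, h_x] of one randomly chosen variable,
    staying within its absolute ratings [abs_low[x], abs_high[x]]."""
    law = dict(law)
    x = random.choice(list(law))
    l, h = sorted(random.randint(abs_low[x], abs_high[x]) for _ in range(2))
    law[x] = (l, h)
    return law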
To elaborate the entire search space with a Pareto-optimal front, we split each decade into three intervals, [τ, 2τ, 5τ, 10τ], and define a grid for τ = 10^3, 10^4, 10^5, 10^6, spanning five orders of magnitude. We ran the genetic search algorithm several times by setting the target bounds (τ1, τ2) in a sliding-window manner over consecutive values from the defined grid. Finally, we accumulated results from twelve genetic searches, each of which found ten sampling laws, where we sampled each law nval = 100 times to obtain statistics over 12,000 network architectures per narrow-space search.
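The constraint grid and the sliding target windows can be generated as follows; the resulting twelve windows correspond to the twelve genetic searches mentioned above.

# 1-2-5 grid per decade for tau = 1e3 ... 1e6, then consecutive grid points
# form the (tau1, tau2) target windows of the individual genetic searches.
grid = sorted({tau * m for tau in (10**3, 10**4, 10**5, 10**6) for m in (1, 2, 5, 10)})
windows = list(zip(grid[:-1], grid[1:]))
print(grid)             # 1000, 2000, 5000, ..., 10000000 (13 grid points)
print(len(windows))     # 12 target windows (tau1, tau2)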

8.2.5 Statistical properties of generated networks

For the defined search spaces and sampling laws, we collected statistics over 1,000 networks, which are presented in Figure 8.6. We targeted covering the full domain in [10^3, 10^7]. Some search spaces, such as S1, S4, and S5, are quickly covered with three simple configurations. Other search spaces, such as S2 and S3, lead to narrower distributions, where we decided to add three additional sampling laws to cover the lower domain. Even though the manual search covers the region of interest nicely, human expertise is required to define the parameters of the laws L1 to L6 correctly. The right-hand side of Figure 8.6 shows the statistics obtained when networks are sampled uniformly over each parameter's full domain (according to the law L0) and when they are obtained with automatically generated sampling laws that are adjusted with our proposed genetic algorithm. The naive sampling approach over the entire search space produces a narrow distribution that is strongly skewed towards larger networks. The base law of the original definition has a high impact on where the actual mass of the distribution concentrates. The mass densities are around 10^6 parameters for S1 and S2. S3 and S4 have their center of mass above 10^7 parameters, which causes difficulties for the genetic algorithm to converge towards the low end of the domain. In contrast,

[Figure 8.6, panels (a)–(f): histograms of the number of parameters for the manual laws L1–L6 (left column) and for the automatic search versus a random configuration of the full space (right column), shown for S1, S2, and S3.]

the genetic algorithm equalizes the distribution and provides samples that cover much higher dynamic ranges, extending the scale especially towards smaller networks without manually restricting the architecture.

[Figure 8.6, continued: corresponding histograms of the number of parameters for the manual laws and the automatic search on S4 and S5.]

Figure 8.6: Statistics over manually (left) and automatically (right) generated networks for all search spaces S1 up to S5. By manually designing the sampling law, a human expert can reasonably adjust and focus the distribution on regions of interest, either close to a target constraint or in a general way to cover five orders of magnitude.

8.2.6 Training setup

We conducted all training experiments in a controlled environment where we trained each candidate architecture from scratch. We used PyTorch version 0.4.1 as the development framework and ran on IBM Power8 or Power9 nodes equipped with either P100 or V100 GPUs. We used standard on-the-fly data augmentation during training that pads images with 4 pixels and randomly crops them back to 32 × 32 pixels, applies horizontal flipping with a probability of 0.5, and

finally normalizes pixel values to zero mean and unit variance. During testing, the original 32 × 32 images are directly normalized and fed into the models. For training, we used stochastic gradient descent with a batch size of 128 samples, configured with an initial learning rate of 0.01, a momentum of 0.9, and a weight decay factor of 5 · 10^−4. We used a fixed scheduling scheme where the learning rate is divided by a factor of 10 at epochs 40 and 70, and we stop training at 100 epochs.
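The training procedure described above maps to a standard PyTorch loop such as the following sketch; the normalization statistics and the ResNet-18 stand-in for a sampled candidate architecture are illustrative assumptions.

import torch
import torchvision
import torchvision.transforms as T

# Data augmentation as described: pad by 4, random 32x32 crop, horizontal flip,
# per-channel normalization (illustrative CIFAR-10 mean/std values).
normalize = T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616))
train_tf = T.Compose([T.RandomCrop(32, padding=4), T.RandomHorizontalFlip(0.5),
                      T.ToTensor(), normalize])

train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

model = torchvision.models.resnet18(num_classes=10)   # stand-in for a sampled candidate
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 70], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(100):
    for images, labels in loader:
        optimizer.zero_grad()
        criterion(model(images), labels).backward()
        optimizer.step()
    scheduler.step()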

8.3 Results

To study our algorithm, we ran full design-space explorations on the well-established CIFAR10 [22] classification task and compared our results with those obtained with established reference models. Figure 8.7 shows the tradeoff between model size and accuracy, including manually and automatically generated results of the aggregated search spaces. The Pareto-optimal front follows a smooth curve that saturates towards the best accuracy obtainable for large models. The number of parameters is plotted on a logarithmic scale, while the accuracy scales linearly. Even very small models with fewer than 1000 parameters can achieve accuracies greater than 45%. The accuracy increase per decade of added parameters is on the order of 30%, 15%, 3%, and < 2% points, i.e., it decreases very quickly. This effect allows us to construct models having several orders of magnitude fewer parameters. It also provides economically interesting solutions for IoT devices that are powerful enough to process data in real time. We compare our results with three sources of reference models: (a) traditional reference models, (b) ProbeNets [201], which are designed to be small and fast, and (c) models designed to run on the PULP platform [202]. Traditional models include 30 reference topologies, including variants of VGG [23], ResNets [203], GoogleNet [25], MobileNets [189], dual-path nets (DPNs) [26] and DenseNets [27], where most of them (28/30) exceed 1 M parameters. ProbeNets were originally introduced to characterize classification difficulty and are considerably smaller by design [201]. They act as reference points for manually designed networks that cover the relevant lower tail in terms of parameters. In the

[Figure 8.7 plot: top-1 accuracy on CIFAR10 [%] versus model memory footprint, showing our models with 32-bit IEEE 754 float, with the model-individual optimal format Tw,t, with the 8-bit float T4,3, and with T5,10 (half), together with traditional and ProbeNet reference models (float), their Pareto-optimal fronts, and IoT reference points (original in 16-bit fixed point and cast to 32-bit IEEE 754 float).]

Figure 8.7: Results of our architecture search compared with reference models. Each dot represents a model according to its size and the obtained accuracy on the CIFAR10 validation set. Our search finds results over five orders of magnitude and, in particular, finds various models that are much smaller than out-of-the-box models. In the restricted IoT domain, our search delivers models that outperform the references by a wide margin for fixed constraints.

IoT-relevant domain (<10 M parameters), our search outperforms all the listed reference models. The top three fronts in Figure 8.7 show the results of our precision analysis. For each trained model, we evaluated the effect of running the model with all configurations of the type Tw,t and plot the Pareto-optimal front. We considered three cases: (1) running all models with half precision, (2) running all models with the type T4,3, which is the best choice among types that are 8 bits long, and (3) running each model with its individual best-tradeoff type Tw,t. We demonstrate empirically that reduced precision pushes the Pareto-optimal front. Under a given memory constraint, accuracy improves by more than 7% points for half precision and by another 1% point or more for the model-individual format.
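The model-individual points on such a Pareto front can be derived from the precision-analysis table as in the following sketch; the accuracy numbers are illustrative placeholders rather than measurements, and the footprint counts only the weight storage at 1 + w + t bits per parameter.

def best_format(num_params, acc_by_format, acc_float32, max_drop=0.0):
    """Pick the format Tw,t with the smallest weight-memory footprint whose
    accuracy stays within max_drop percentage points of the float32 baseline.
    The footprint is num_params * (1 + w + t) / 8 bytes."""
    candidates = [(num_params * (1 + w + t) / 8, (w, t))
                  for (w, t), acc in acc_by_format.items()
                  if acc >= acc_float32 - max_drop]
    # Fall back to float32 if no reduced format satisfies the accuracy bound.
    return min(candidates) if candidates else (num_params * 4, (8, 23))

acc_by_format = {(5, 10): 92.9, (4, 3): 91.8, (3, 2): 74.0}   # illustrative values
print(best_format(2.3e6, acc_by_format, acc_float32=92.94, max_drop=1.0))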

[Figure 8.8 plots: top-1 accuracy [%] versus number of parameters for the manual and automatic searches over all spaces (left and middle) and for one narrow-space search with laws L1–L6 (right).]

Figure 8.8: Left: Zoomed view of the direct comparison; manual and automatic searches perform equally well. Middle: Manual and automatic search results. In the manual case, clusters are visible, whereas the automatic search sampled in a more homogeneous manner. Right: Results for one narrow-space search with marked clusters matching Figure 8.6.

Figure 8.8 shows details of the manual and automatic searches, both of which yield very similar results. The right-hand graphs show results obtained for one narrow-space search, where manually defined sampling laws led to clusters. The automatic search covered a similar range homogeneously. Figure 8.9 shows inference times when the same set of models is executed on a Raspberry Pi 3(B+). Similarly, granting additional latency time to small models results in dominant accuracy gains, whereas large models improve accuracy only slightly, even when using more complex models that require long evaluation times. Figure 8.10 demonstrates the scalability of our approach. We applied our search for three constraints, τ = 10^3, 10^4, 10^5, on thirteen datasets [201], where we spent a training effort of ten architectures per dataset and constraint. The lines in Figure 8.10 connect the best-performing architecture per dataset and constraint.

[Figure 8.9 plot: top-1 accuracy on CIFAR10 [%] versus latency time on the Raspberry Pi 3(B+) [ms] for S1–S5.]

Figure 8.9: Final result showing the achievable tradeoffs between the model latency measured on the IoT device and the model accuracy. Our search is able to deliver models that run below 10 ms on a Raspberry Pi 3(B+), which we took as a representative low-cost IoT device.

[Figure 8.10 plot: top-1 accuracy [%] versus number of parameters for the thirteen datasets gtsrb, caltech256, flowers, textures, cifar100, food101, indoor67, mnist, stl10, cifar10, flowers102, gtsrbcrop, and fashion-mnist.]

Figure 8.10: We demonstrate the scalability of our approach by applying our search to three constraints on the thirteen datasets. The best models per dataset and constraint are connected with a line.

8.4 Summary and conclusion

In this chapter we studied the synthesis of deep neural networks that are eligible candidates to run efficiently on IoT devices. We propose a narrow-space search approach that quickly leverages knowledge from existing architectures and that is modular enough to be adapted to new design patterns. Manually and automatically designed sampling laws allow various models to be generated whose parameter counts cover multiple orders of magnitude. We demonstrate that reduced precision improves top-1 accuracy by over 8% points for constrained weight memory in the IoT-relevant domain. A strong correlation between model size and latency enables us to create models small enough to provide inference latencies below 10 ms on an edge device costing only about $35.

Chapter 9

Conclusions

In this thesis, we have proposed the concept of transprecision computing as a new computing paradigm. Driven by Moore's law, the computer industry has constantly followed technology-based improvements. However, physical limitations, such as thermal dissipation, saturate economically viable operating frequencies and hence the performance of computing systems. SIMD parallelism and multi-core systems exploit parallelism to sustain the rising quest for better performance. In contrast to those approaches, which certainly remain and will continue to evolve in future computing systems, transprecision computing aims to reduce precision without quality degeneration in order to improve performance. We observed that approximate computing techniques followed the same spirit with great success in specific domains. Transprecision computing pursues the visionary goal of being applicable in general-purpose computing systems in various situations. System-level considerations include the simultaneous co-development of algorithms, software, and hardware. Performing these at once has been successfully demonstrated for specific tasks that justify ASIC development. However, those systems do not follow general concepts that apply to general-purpose computing. To that end, the developed transprecision computing concepts disentangle the considerations in a modular way. First, target applications or systems are required

to be specified with expectations on quality and performance. Second, algorithm development and numeric evaluations lead to case-specific, domain-specific, or even general insights. To support adequate evaluations, we introduced a reduced-precision library, floatx, in Chapter 5. Improved emulation speed and the integration of floatx into PyTorch [62] open the doors for large-scale elaborations. We address related problems, such as using heuristics to efficiently find good transprecision configurations in large search spaces. Additionally, and often missing in related work, we assess the value of opportunity costs: transprecision computing needs to outperform its alternatives. In Chapter 8 of the thesis we performed a large-scale neural network architecture search for regular and transprecision models. We conclude that transprecision concepts are scalable, fast and easy to obtain, such that they outperform all regular choices.

9.1 Summary of main results

The abstraction of transprecision computing.

We defined the concept of transprecision computing in Chapter 4 as a system that is based on configurable precisions and that produces results with a measurable performance at a measurable quality. The quality characterization Q(θ) and the performance characterization P(θ) assess precision-related system settings. Understanding the user's requirements in connection with a full system characterization allows the delivery of an optimized solution. The more relaxed the quality requirements are, the larger the optimization potential; the stricter the quality requirements, the smaller the performance gains. In the worst case, one is forced to conclude that the original baseline application was already the optimal solution. However, we successfully demonstrated that transprecision computing can achieve performance gains without any quality degeneration. For example, PageRank iterations iteratively correct quantization errors introduced in early iterations, see Section 5.2.1. Deep learning models classify all samples as in the baseline, even though much smaller data types were used for all stored weight parameters, see Section 6.2.2. Even in scientific computations, such as in GLQ, knowing that the core routine operates with a limited dynamic range allows a reduction of the exponent encoding with no effect on the final quality, see Section 5.2.3.

Approximate computing techniques are case-specific.

We summarized established approximate computing techniques, including loop perforation, task skipping, memoization, and stochastic computing, in Chapter 3. We concluded that stochastic computing relies on many key building blocks that are manually designed and mapped to low-level hardware implementations. The fundamental requirement of substantial fractions of customized hardware hinders stochastic computing from becoming a major general-purpose technique. Additionally, the stochastic nature makes debugging, implementing, and integrating the technique in existing algorithms or frameworks difficult. Furthermore, emulations run slowly since they are forced to collect a large number of statistics to provide reliable conclusions. The stated reasons explain the success of the technique on specific problem instances but also highlight the challenges that one needs to overcome to make stochastic computing a serious candidate for general-purpose computing. We observed similar limiting factors that make related work interesting but restricted to the specifically considered cases. Some techniques, such as selecting among multiple inexact implementations, rely on the availability of alternative implementations for modular building blocks; they do not contribute to implementing elementary kernels. Evaluations of approximate computing techniques are often not directly comparable, since they act on different input data, or results are obtained at different operating points, for example with a quality degeneration of less than 10% points. Those settings limit the insights into how methods would perform on new data occurring in novel algorithms or applications.

Numerical emulation is the key to assess quality.

To follow a modular and general approach, transprecision requires the evaluation of the final quality depending on the transprecision settings. Emulation allows numerical results to be assessed before full hardware systems are developed. To allow for high flexibility, we defined floatx, a minimally intrusive C++ header-only library that emulates reduced precision, as explained in detail in Section 5.1. The high-performance implementation allows the scaling of numerical experiments. We applied the library directly to BLSTM in Section 5.2.2, where the reference is provided as a plain C model. Section 6 integrates floatx into PyTorch [62] to provide modular support for numerical evaluations inside a well-established deep learning framework. The integration of floatx into PyTorch powered the subsequent studies. Additionally, the modular approach allows reusing the numerical libraries on new use cases involving different data and different models.

Deep learning models are error-resilient.

In our thesis, we reproduced the evidence that deep learning models are inherently error-resilient. Prior work demonstrates the error-resilience of deep learning in various aspects. For example, Hill et al. [184] performed a design study that compares reduced fixed-point representations with reduced floating-point representations. They conclude that the extended dynamic range of reduced floating-point representations outperforms fixed-point representations of the same width. FPGA designs, such as the original implementation of BLSTM [182], include in-depth numerical results and operate with small fixed-point implementations causing only minor quality degenerations. Various numerical libraries have been proposed and applied to deep learning instances [166]. Other prior work, such as Xnor-net [204], follows the drastic approach of collapsing multiplications to binary operations. This corner case of binary number representations indicates the error-resilience of deep neural networks. We contribute to the state of the art in three ways. First, related work is limited in the settings used to evaluate reduced precision. To that end, we evaluated 30 well-established networks in Section 6.2.2 to provide general insights that are consistent among models. Second, our numerical study includes full-grid elaborations that provide insights into numerical behavior. Numerical studies at that level of detail are easy to interpret and reveal the true break-down behavior of models, as presented in Figure 5.3 for BLSTM. That information allows judging guard-bands for specific operating points and provides insights into model robustness. Third, we provide comparisons with opportunity costs. To that end, we answer the question of what happens if we instead construct smaller, and hence less accurate, models. Section 8.3 shows that, for a given quality constraint, transprecision models outperform models that are alternatively obtained by performing a NAS for the same constraints with regular 32-bit floating-point formats.

Transprecision computing for algorithms.

We introduced PageRank, BLSTM, and GLQ as representative applications of the Big-Data, deep learning, and scientific computing domains in Section 2.1. We used floatx to provide extensive numerical results in Section 5.2.1. We observed that in all three cases the external setting matters for the evaluations. All three algorithms are data-dependent, and the specific problem instances affect the metrics. Additionally, the hyper-parameters of control loops provide excellent tuning opportunities, as explained in Section 3.3. Allowing for minor quality degenerations enables higher performance improvements. For example, allowing PageRank to stop at a moderate residual error reduces the number of iterations until convergence, see Section 5.2.1. Additionally, it increases the fraction of reduced-precision iterations among the remaining iterations, which positively affects overall performance. Similarly, allowing for minor degenerations allows operating the BLSTM and GLQ computations with further reductions in the number representations. Most state-of-the-art approximate computing evaluations rely on minor quality degenerations to provide performance improvements. In contrast, we demonstrated that transprecision computing is capable of delivering zero-quality-loss results at conservative operating points of the applications. First, the iterative nature of PageRank allows recovering quantization errors introduced in previous iterations. Second, BLSTM profits from the traditional error-resilience of deep learning, see Section 5.2.2. Third, GLQ uses a limited dynamic range in its computations, which allows computing with a reduced exponent field while delivering the same quality, see Section 5.2.3.

The practical impact of transprecision computing.

Three practical aspects lead to the success of transprecision computing in general-purpose computing. First, it must be simple enough to generalize to relevant use cases. Second, it must effectively remove computational bottlenecks. Third, the process of integration, evaluation, and optimization must be fast enough.

We have demonstrated the concept on three specific applications in Section 5.2.1. For the domain of deep learning, we have extended results to 30 established reference models in Section 6.2.2. Additionally, in Section 8.3, we applied transprecision to over 3,000 models. All cases deliver consistent numerical results. We conclude that reduced precision is general enough to exploit the error-resilience of various deep learning models.

Transprecision results are effective. For example, in deep learning, even in the strictest case of achieving a zero-loss solution, the number representations can be compressed by 2.6× on average, as demonstrated in Section 6.2.2.

Transprecision results are obtained by testing various configurations and finding the optimal settings. Since configuration spaces are large, the optimization of the configuration becomes a time-critical task. We contribute on four fronts to accelerating the configuration search. First, the developed floatx library is performance-critical for emulations; it is lightweight, has a slim memory footprint, and is performant, as explained in Section 5.1.8. Second, floatx compiles on GPUs, enabling a simple and performant integration in PyTorch [62] to evaluate models. Third, on the algorithmic level, we explained why it is beneficial to use the extrinsic over the intrinsic emulation approach; we compared worst-case error bounds and provide empirical results that support our findings, as presented in Section 6.1.3. Fourth, we demonstrated in Chapter 7 that heuristics can replace full searches and still provide good results. Our contributions enable a single configuration search for one deep learning model to be performed within seconds.

9.2 Outlook and future work

We introduced the concept of transprecision computing and we see multiple extensions for future work.

Numerical libraries

In this work, we presented the common number representations. We implemented floatx, a reduced-precision library for IEEE 754 [116] conforming representations. We followed a modular approach to use floatx as a component-wise integration in PyTorch [62] and various applications. However, we see potential in alternative representations that should be explored similarly in future work. For example, the LNS introduced in Section 4.1.3 provides different representations that can cover similar dynamic ranges and precisions as traditional floating-point realizations. Hardware units that implement 32-bit-float-equivalent LNS arithmetic exist [71]. Critical components consist of look-up tables and interpolation approaches. Hardware units for those performance-critical parts scale excellently with smaller representations due to the exponential dependence of LUTs on the input address space. Additionally, having multiple numerical libraries that emulate different representations allows finding commonalities among them. We expect that the numerical behavior of LNS follows similar error patterns as measured with floatx, since reduced representations obey similar dynamic ranges and precision levels. Exploiting correlations among results obtained with different number systems allows reusing the numerical behavior observed with floatx.

Extending the concept to new domains

We posed the question of how future general-purpose computing systems could be improved. We delivered empirical evidence that transprecision computing provides a key ingredient to succeed in that task. We demonstrated the concept by studying three applications stemming from three domains. We extended our insights and generalized and scaled the statements for deep learning. Joining multiple aspects, such as the integration of floatx into PyTorch [62] and scaling the numerical studies in the number of models and the number of datasets, constitutes domain-specific efforts. Additionally, the development of a constrained NAS is an autoML-specific contribution of our work. Nevertheless, understanding research advances in a domain is important in order to understand the alternatives: they act as an opportunity-cost measure to better judge the success of transprecision computing. We think that to raise transprecision computing to the next level, the most promising strategy is to apply it to closed domains. For example, we think that weather simulations fit that case for the following reasons. First, numerical weather simulations based on the consortium for small-scale modelling (COSMO) [205, 206] follow iterative patterns where the numeric behavior can be exploited. Second, variations in measured data and the awareness of stochastic behavior in the domain allow interpreting the numerical behavior caused by transprecision computing. Due to this noise awareness, the quality of the global behavior of the model can be judged without strict requirements on the arithmetic of partial results. We expect that this design freedom reveals error-resilience at many stages. Third, weather simulations are relevant for many applications, such as pollen forecasts [207], local wind gust prediction [208], or assessing wind power predictions [209], among many more.

Software tools

In this work, we have demonstrated all the fundamental steps required to build and profit from an effective transprecision computing system. We covered numerical emulation at the low level, at the mid level, and at the application level to assess the quality of configurations. We see many opportunities for building software tools around this use case that simplify the overall workflow. To demonstrate the practical value, the integration of transprecision computing into existing applications must be easy and quick. We envision an integrated development environment (IDE) with a graphical user interface (GUI) to provide flexibility. The main features should include loading reference code, annotating variables and routines, and automatically characterizing numerical behavior. Due to the dependency on input data, managing data, such as loading, modifying, merging, and computing overall statistics, becomes a central part of simplifying numerical experiments.

The basic logic of the main functionality is already fully covered in this thesis.

Appendix A

Use case: Efficient video classification

Video classification solves the problem of mapping a video clip represented as a sequence of frame-based pixel data to a single label. Video classification poses a challenging task for two reasons: first, the amount of data involved is very large, and second, the classification system has to learn and deal with temporal and spatial aspects of the underlying data. The total workload caused by a classifier varies with the design choices made when composing methods into a full classification system. We evaluate three state-of-the-art neural-network-based approaches for large-scale video classification, where the computational efficiency of the inference step is of particular importance due to the ever-increasing amount of data throughput for video streams. Our evaluation focuses on finding good efficiency vs. accuracy tradeoffs by evaluating different network configurations and parametrizations. In particular, we investigate the use of different temporal subsampling strategies and show that they can be used to effectively trade computational workload against classification accuracy. Using a subset of the YouTube-8M dataset, we demonstrate that workload reductions in the order of 10×, 50× and 100× can be achieved with accuracy reductions of only 1.3%, 6.2% and 10.8%, respectively. Our results show that temporal subsampling is a simple

and generic approach that behaves consistently over the considered classification pipelines and which does not require retraining of the underlying networks.

A.1 Related work

Traditional video classification approaches [210–212] based on global video descriptors have demonstrated success on a variety of datasets. First, interest points are localized and visual features are locally extracted, then the information is compressed into a constant-length global descriptor. Second, a standard classifier (e.g., an SVM) predicts the final class. However, such approaches often involve the tedious design of hand-crafted features. Recent methods improve on this by adopting a data-driven approach where the features are learned as well. Naturally, video classification is the 3D extension of image classification and, henceforth, 3D extensions of traditional convolutional neural networks (CNNs) have been explored [213, 214]. However, this approach does not scale well to long videos and has thus only been applied to short video clips with a length in the order of seconds. Ng et al. [21] show that different feature pooling architectures employing LSTM [215] are better suited to video classification. However, end-to-end training approaches become infeasible for very large datasets such as YouTube-8M. It contains 50 years of video footage, and even trivial sequential processing of the individual frames becomes a demanding task. Finding good NN configurations is non-trivial due to manual tuning and many learning and validation cycles, which may take weeks or even months. For example, assuming that a single GPU allows processing at a rate of ∼4.3 fps¹, one pass through the 5.8 M training videos of YouTube-8M with an average of 230 frames/video requires around 9.8 years. Renting a cloud infrastructure with multiple GPUs allows cutting the total amount of time, but the required costs are in the order of 75 000 $². The estimated cost of complete end-to-end training with the full dataset and a typical amount of 100-1000 learning epochs is therefore around 7.5 M$-75 M$, which is infeasible for most institutions.
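As a sanity check, the following back-of-envelope calculation reproduces the order of magnitude of these estimates; it is only a sketch that uses the figures quoted above, i.e., the per-GPU throughput from footnote 1 and the rental price from footnote 2 below.

    # Back-of-envelope reproduction of the workload and cost estimates above.
    videos = 5.8e6          # training videos in YouTube-8M
    frames_per_video = 230  # average frames per video
    fps = 4.3               # assumed single-GPU throughput (footnote 1)

    seconds = videos * frames_per_video / fps
    years = seconds / (3600 * 24 * 365)
    print(f"single-GPU epoch: {years:.1f} years")      # ~9.8 years

    rent_per_gpu_month = 650  # $ per month (footnote 2)
    gpu_months = years * 12
    print(f"one epoch rented: {gpu_months * rent_per_gpu_month:,.0f} $")  # ~75 000 $

    # 100-1000 epochs scale this to roughly 7.5 M$ - 75 M$.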

¹ Based on the time required for one forward/backward pass for GoogLeNet using Nervana Systems' neon library on an Nvidia GeForce GTX Titan X (Maxwell) GPU, https://github.com/soumith/convnet-benchmarks.

To this end, the baseline methods published together with the YouTube-8M dataset split the classification problem into two steps. First, frames are mapped to feature vectors of lower dimension using a state-of-the-art image classification network (Inception-v3). Second, the classification is performed on the feature vectors. This approach has the advantage that pure data-driven learning is still possible and computational costs remain tractable. Fast inference methods, such as Low-Rank Expansions [216], reach speedups < 5× at accuracy drops < 1%. XNOR-Nets [204] gain 58× in speed at 12.5% accuracy reduction on ImageNet. Even though we solve the more complex problem of video classification, our approach is on par with Low-Rank Expansions and outperforms the XNOR-Nets trade-off.

A.2 Practical video classification systems

Our approach targets a large-scale machine learning system based on the two-step approach introduced together with the YouTube-8M dataset. For practical scalability reasons, we avoid end-to-end learning approaches, since the raw videos of that dataset amount to a data volume of ∼1 Petabyte. We evaluate two-step approaches where frame feature vectors are obtained by first extracting features using a CNN trained on an image classification task (without the final classification layer), before applying principal component analysis (PCA) to reduce the dimensionality to 1024 per frame and feature vector. These feature vectors are then used for video classification in a second step. Such a decomposition of the classification problem allows a practical intermediate representation of the video and the handling of the training of two separate subproblems. The vectors have been precomputed using the Inception-v3 CNN and are available as part of the YouTube-8M dataset, which is convenient from a practical viewpoint as it enables us to perform evaluations without having to evaluate the complete CNN. In our evaluation, we consider three classification pipelines:

² Based on a 650 $/month rent for a high-end GPU, https://aws.amazon.com/ec2/instance-types/p2/.


Figure A.1: Overview of the three video classification approaches. First, an inference with the Inception-v3 network (d) produces frame level feature vectors. Second, GVD (a), FBC (b) and LSTM (c) are applied to classify the video. Frames i are skipped if the corresponding decision vector entries xi are 0. This is illustrated with red crosses for a regular temporal subsampling of 2×.

a. Feature vector aggregation [20], which produces a global video descriptor (GVD) that is then classified using a fully-connected neural network (FCNN).

b. Frame-based classification (FBC) [20] with an FCNN, followed by an aggregation of the classification results.

c. A long short-term memory (LSTM) architecture [21] processes the complete vector sequence of the video.

The three classification system variants are illustrated in Fig. A.1 and differ only in the way the 1024-dimensional feature vectors are processed, as explained in the following.

A.2.1 Global video descriptor (GVD)

As noted by the authors of YouTube-8M [20], the GVD approach exhibits three advantages: fixed-length vectors allow applying standard learning methods, the amount of data involved in learning is reduced by the average length of the video (230×), and the approach is generic. As a baseline, we compute the video average feature vector µ as the mean over the feature sequence. Since the provided feature vectors are already normalized, no further normalization is required and the resulting GVD µ is fed directly into an FCNN for final classification. The chosen network configurations for this evaluation are listed in Table A.1. The FCNN has 1024 input dimensions, followed by two dense layers that end in H1 and H2 neurons, each using ReLU activations, which are defined element-wise as x ↦ max(0, x). The third, final layer outputs a softmax-normalized prediction for each class, obtained by x ↦ e^x / ‖e^x‖₁.
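A minimal PyTorch sketch of the GVD pipeline described above is given below. It is illustrative only: the layer sizes H1 = H2 = 64 correspond to one configuration from Table A.1, the 20-class output matches the dataset subset of Table A.2, and the class and variable names are ours.

    import torch
    import torch.nn as nn

    class GVDClassifier(nn.Module):
        """Mean-pool per-frame features into a GVD, then classify with an FCNN."""
        def __init__(self, d_in=1024, h1=64, h2=64, n_classes=20):
            super().__init__()
            self.fcnn = nn.Sequential(
                nn.Linear(d_in, h1), nn.ReLU(),
                nn.Linear(h1, h2), nn.ReLU(),
                nn.Linear(h2, n_classes),
            )

        def forward(self, features):           # features: (L, 1024) frame feature vectors
            gvd = features.mean(dim=0)         # global video descriptor mu
            return torch.softmax(self.fcnn(gvd), dim=-1)

    model = GVDClassifier()
    video_features = torch.randn(230, 1024)    # e.g., one video with 230 precomputed frame vectors
    print(model(video_features).shape)         # torch.Size([20])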

A.2.2 Frame based classification (FBC)

This approach maps each per-frame feature vector to a video class prediction, and all predictions of a video sequence are then aggregated to form the final classification. Learning of the FCNN is achieved by using the global label of a particular video sequence as the local ground-truth label for all frames in that video. The final classification is obtained either by average or max-voting aggregation of the per-frame results. We observed that averaging consistently outperforms a max-voting aggregation and thus use the averaging scheme henceforth.

A.2.3 Long short-term memory (LSTM)

LSTM cells [215] are used to implement a recurrent neural network (RNN) [217] that enables learning of high-level concepts from the temporal information of the input sequence.

Table A.1: Hyper-parameter configurations and caused workloads in terms of million MAC operations per building block. FCNN configurations are used in both the GVD and FBC pipelines.

FCNN     k=0     1      2      3     4      5      6      7
H1       4096    4096   1024   512   256    128    64     32
H2       1024    4096   1024   512   256    128    64     32
CFCNN    8.4     21     2.1    0.8   0.33   0.15   0.07   0.03

LSTM     k=0     1      2      3     4      5      6      7
H        32      64     128    512   1024   32     64     128
S        1       1      1      1     1      2      2      2
CLSTM    0.02    0.05   0.16   2.21  8.62   0.03   0.09   0.32

LSTM     k=8     9      10     11    12     13     14
H        512     1024   32     64    128    512    1024
S        2       2      5      5     5      5      5
CLSTM    4.42    17.2   0.08   0.23  0.79   11.0   43.0

Long update chains in RNN architectures may cause vanishing or exploding gradient problems [218]. LSTMs can mitigate that problem by having an internal state storing long temporal information, whereas the rest of the structure (input, output and forget gates) is modeled to capture local (short-term) effects. Table A.1 lists the hyper-parameter choices used in our evaluations. H denotes the dimensionality of the involved hidden state and S is the number of stacked LSTM-cell chains over the full sequence. The LSTM configuration k = 13 corresponds to the configuration proposed in [21] and k = 9 refers to the configuration used in [20].

A.3 Computational complexity

The number of MAC operations reliably estimates the inference time for a wide class of CNNs [219]. Hence, we state the workload for one video inference as:

C_GVD  = L/r · (C_Inception + d1) + C_FCNN,k ,
C_FBC  = L/r · (C_Inception + C_FCNN,k + d2) ,        (A.1)
C_LSTM = L/r · (C_Inception + C_LSTM,k + d2) ,

where L refers to the length of the video sequence, d1 = 1024 is the dimension of the used feature vectors, d2 = 20 the number of classes, C_Inception = 4.8 · 10^9 the workload to compute the feature vectors, C_FCNN,k and C_LSTM,k the workloads of the FCNN and LSTM configurations according to Table A.1, S the number of stacked layers, and r the temporal subsampling factor that will be introduced in more detail in the next section. Due to the involved orders of magnitude, we have that C_Inception ≫ C_FCNN,k, C_Inception ≫ C_LSTM,k and C_Inception ≫ d1, d2. Therefore, we observe that C_GVD ≈ C_FBC ≈ C_LSTM ≈ L/r · C_Inception.
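The dominance of the Inception term can be checked directly; the following sketch is a straightforward transcription of Eq. (A.1) into Python, with the per-configuration workloads taken from Table A.1 (all quantities in MAC operations; the function names are ours).

    # Workload models of Eq. (A.1); all quantities in MAC operations.
    C_INCEPTION = 4.8e9   # per-frame feature extraction
    D1, D2 = 1024, 20     # feature dimension and number of classes

    def c_gvd(L, r, c_fcnn):
        return L / r * (C_INCEPTION + D1) + c_fcnn

    def c_fbc(L, r, c_fcnn):
        return L / r * (C_INCEPTION + c_fcnn + D2)

    def c_lstm(L, r, c_lstm_cell):
        return L / r * (C_INCEPTION + c_lstm_cell + D2)

    # Example: a 230-frame video, no subsampling, FCNN config k=0 (8.4 MMAC)
    # and LSTM config k=14 (43 MMAC) from Table A.1.
    L, r = 230, 1
    print(c_gvd(L, r, 8.4e6), c_fbc(L, r, 8.4e6), c_lstm(L, r, 43e6))
    # All three are ~1.1e12 MACs, i.e., dominated by L/r * C_INCEPTION.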

A.4 Temporal subsampling

More than 99% of the inference workload is caused by the first frame-level feature extraction step, irrespective of the actual classifier chosen. It is therefore crucial to introduce optimizations leading to fewer computations in the first step. There are two common approaches to achieve this. First, a more efficient network architecture than Inception-v3 could be sought by exploring different network topologies, parametrizations and quantizations such as in [204]. Second, a more efficient CNN evaluation engine could be built by employing specialized hardware [220]. While the first approach requires several long training and validation cycles, the second approach requires custom hardware design, which is a time-consuming task. In this chapter, we explore a third approach, where temporal subsampling is used to significantly reduce the number of CNN evaluations in the first step. Note that this is orthogonal to the other two approaches and does not involve any expensive training iterations of the large CNN in the first step.


Figure A.2: Estimated probability density φ(fr) that a single frame alone is sufficient to correctly classify the complete video sequence. Prior-based subsampling improves accuracy by using φ(fr) to bias sampling points to frames that are beneficial for classification.

In this chapter, we use the full length of the videos to train the three architectures described in Sec. A.2, and introduce temporal subsampling during inference to save computational complexity. Reducing the workload of the inference step is especially critical in large-scale datacenter applications where video footage is being uploaded at an ever-increasing rate³. We define a decision function d(.) that outputs a Boolean vector x ∈ {0, 1}^L determining whether a given frame at index i in the video sequence is processed (xi = 1) or skipped (xi = 0). A skipped frame allows us to save a full CNN evaluation in the first step, and the computational workload decreases linearly with the number of skipped frames.
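A minimal sketch of such a decision function for regular subsampling follows; it assumes the frame-index convention above, and the helper name d_reg mirrors the notation dreg(L, r) used below.

    import numpy as np

    def d_reg(L, r):
        """Regular subsampling: keep every r-th frame of an L-frame video."""
        x = np.zeros(L, dtype=bool)
        x[::r] = True          # frames with x[i] == True are processed, others skipped
        return x

    x = d_reg(L=230, r=10)
    print(x.sum())             # 23 of 230 frames are evaluated, ~10x workload reduction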

Subsampling Strategies

We consider two different instances of d, termed regular subsampling and prior-based subsampling, as described below. Regular subsampling dreg(L, r) reduces the total number of frames L by a fixed factor r in a uniform manner and has the advantage that it can be applied at almost no cost. Prior-based subsampling leverages the observation that the sampling positions of the selected frames have a significant impact on the overall classification outcome.

³ For example, several hundred video hours are uploaded to YouTube every minute, https://fortunelords.com/youtube-statistics/.

Table A.2: Dataset used for video classification.

                        Original      Training set   Validation set
Number of classes       4 800         20             20
Videos per class        2229†         500            100
Total videos            8 264 650     10 000         2 000
Average video length    229.6         227.8          228.2
Number of frames        1.9 · 10^9    2 277 717      456 427
Feature data volume     ∼ 1 PB        ∼ 9 GB         ∼ 2 GB
† Average value, videos/entity ∈ [120, 539 926] [20].

We consider the conditional probability p(f sufficient | fr), given the relative frame position fr ∈ [0, 1], that a single frame f is sufficient to correctly classify the full video sequence. We empirically estimate φ(fr) = Interpolate(Histogram(f is sufficient)) as an interpolated aggregation of normalized histograms where correct classification results from Section A.2.2 are binned according to fr. Figure A.2 shows that frames taken from the middle of the video sequences are more likely to lead to correct classification outcomes. This is probably due to the fact that most videos contain lead-in and lead-out portions. The prior-based subsampling strategy dprior(L, r, φ) introduces this a priori knowledge by randomly drawing ⌊L/r⌋ frame positions from a distribution that is obtained by scaling φ(fr) to match the sequence length L.
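A minimal sketch of prior-based subsampling under these definitions is shown below. The helper name d_prior follows the notation dprior(L, r, φ) above; the tabulated φ values are an illustrative stand-in rather than the measured curve of Fig. A.2, and sampling distinct frame indices weighted by φ is one plausible reading of "scaling φ(fr) to match the sequence length L".

    import numpy as np

    def d_prior(L, r, phi, rng=np.random.default_rng(0)):
        """Prior-based subsampling: draw floor(L/r) distinct frame positions according to phi.

        phi: tabulated estimate of the density over the relative frame position fr in [0, 1],
             i.e., the interpolated histogram phi(fr) of Fig. A.2.
        """
        fr = np.linspace(0.0, 1.0, L)                    # relative position of each frame
        p = np.interp(fr, np.linspace(0.0, 1.0, len(phi)), phi)
        p = p / p.sum()                                  # scale phi to the sequence length L
        keep = rng.choice(L, size=L // r, replace=False, p=p)
        x = np.zeros(L, dtype=bool)
        x[keep] = True
        return x

    phi = np.array([0.88, 0.95, 1.02, 1.05, 1.03, 0.97, 0.90])  # coarse stand-in for Fig. A.2
    print(d_prior(L=230, r=10, phi=phi).sum())                   # 23 frames selected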

A.5 Evaluation and results

The following paragraphs give more details on the employed dataset and explain the training procedures of the FCNN and LSTM models; Section A.6 finally states the comparison of achieved accuracy versus workload tradeoffs.

Subset of YouTube-8M

To shorten the development cycle and make the evaluation practicable, we constructed a subset of the YouTube-8M dataset, a multi-labeled dataset with an average of 1.8 entity labels per video.

We filter out all multi-labeled videos to create a single-label problem. Table A.2 states our choice; we assumed as a practical resource constraint a data volume limit that fits on a single GPU (Nvidia GeForce GTX Titan X, 12 GB random access memory (RAM)). Our subset allows hyper-parameter tuning and multiple repetitions of the full training to state the variance caused by the random NN initialization in a reasonable time. Note that this runtime reduction approach is suitable in our case since we are not interested in maximizing the absolute classification accuracy, but in the relative accuracy degradation behavior due to temporal subsampling. The feature vectors provided in the YouTube-8M dataset have been sampled at a rate of 1 Hz (one vector per second of video), and the subsampling factor r used in this chapter is relative to that rate.

FCNN Training

We used the Adam optimizer [221] to train the FCNNs by minimizing the average cross-entropy with a learning rate of 10^-4. Note that both the GVD and FBC pipelines employ the same FCNN configurations listed in Table A.1. A batch size of 200 input samples is employed, and the weights are initialized with normally distributed random values with a standard deviation of 0.1. The bias values are initialized with a constant offset of 0.1 to break the symmetry. Before the start of each epoch, all input samples are shuffled with a uniform distribution over all possible permutations of the input samples. Training is run for 100 epochs.
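The training setup can be summarized in a few lines; the sketch below is schematic only, written in PyTorch for consistency with the rest of this thesis (the original experiments may have used a different framework), with the data loader left as a placeholder that yields shuffled batches of 200 samples per epoch.

    import torch
    import torch.nn as nn

    model = nn.Sequential(                       # FCNN with H1 = H2 = 64 (Table A.1, k = 6)
        nn.Linear(1024, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 20),
    )

    def init_weights(m):
        # Normally distributed weights (std 0.1) and constant bias 0.1, as described above.
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, std=0.1)
            nn.init.constant_(m.bias, 0.1)

    model.apply(init_weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    criterion = nn.CrossEntropyLoss()            # averaged cross-entropy over the batch

    def train(loader, epochs=100):
        """loader yields (features, labels) batches of 200 samples, reshuffled every epoch."""
        for _ in range(epochs):
            for x, y in loader:
                optimizer.zero_grad()
                loss = criterion(model(x), y)    # softmax is folded into CrossEntropyLoss
                loss.backward()
                optimizer.step()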

LSTM Training

Similar to [20], we unrolled the LSTM for 60 iterations and trained on sequences of 60 consecutive frames at a random offset within the video for 380 epochs. The loss function weighs the individual frame losses with increasing values, starting from 1/N for the first frame up to 1 for the last frame, in order to enhance learning and classification performance as in [20, 21]. Predictions are computed by considering the full-length video. During training, we used a dropout factor of 0.5.
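The frame-loss weighting can be written compactly; the sketch below shows one interpretation of the ramp described above (a linear schedule from 1/N to 1 over the N = 60 unrolled steps; the exact schedule used in the experiments is not specified here, and the helper name is ours).

    import torch

    def weighted_sequence_loss(frame_losses):
        """Weigh per-frame losses from 1/N (first frame) up to 1 (last frame) and average them.

        frame_losses: tensor of shape (N,) with the cross-entropy loss of each unrolled step.
        """
        N = frame_losses.shape[0]
        weights = torch.linspace(1.0 / N, 1.0, N)   # one interpretation of the ramp described above
        return (weights * frame_losses).sum() / weights.sum()

    losses = torch.rand(60)                          # e.g., 60 unrolled LSTM steps
    print(weighted_sequence_loss(losses))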


Figure A.3: Comparison of accuracy achieved with the best configurations of the GVD, FBC and LSTM approaches as a function of subsampling.

A.6 Results and discussion

In all experiments, the subsampling factor has been swept over the range 1× to 100×. The accuracy is measured as the top-k rate with k = 1, i.e., the number of correctly classified videos relative to all classified videos. Figure A.3 compares the accuracy behavior of the three approaches depending on regular subsampling. We explain the stronger decay in the case of the LSTM approach by the fact that the LSTM was explicitly trained to learn temporal information, which is partially destroyed by subsampling. Still, the best configurations remain ordered as a function of subsampling. Figure A.4 shows the obtained classification accuracy versus computational workload for the GVD (a), FBC (b) and LSTM (c) approaches using regular subsampling. Even though different NNs and different classification systems are considered, the characteristic caused by subsampling is consistent, which demonstrates the generality of the approach. In the range where r ≤ 5 the accuracy degradation is negligible. For larger r factors, subsampling gradually decreases the classification accuracy. Interestingly, FBC seems to be more robust against subsampling than the other two approaches. The best LSTM and GVD configurations lose around 8% of classification accuracy from r = 50× to r = 100×, whereas FBC only loses 5%. Figure A.4 d), e) and f) show the additional accuracy improvements (in terms of percentage differences) achieved when employing prior-based subsampling instead of regular subsampling. Positive values are in favor of prior-based subsampling. Noise levels appear more pronounced due to the zoomed-in view. Leveraging prior knowledge in aggressively subsampled regions allows improving the accuracy by around 2% for the GVD and FBC approaches, and around 4% for the LSTM, at negligible extra cost.

A.7 Summary and conclusion

We considered three video classification approaches that employ a two-step classification procedure and presented accuracy and computational complexity results for various NN configurations. Our evaluations show that, while the LSTM-based pipeline outperforms a simple GVD approach, similar classification accuracies as achieved by the LSTM approach can be reached by aggregating FBC results. The FBC method has the advantage that it is considerably easier to train than an LSTM approach. Further, we note that most of the computational burden stems from the first step involving the evaluation of a large CNN, and investigate temporal subsampling as a simple yet effective way to trade computational complexity against classification accuracy. We introduce two different subsampling strategies and show that significant workload reductions in the order of 10× can be achieved with negligible impact on classification accuracy. Larger workload reductions up to 100× are possible, leading to accuracy degradations in the order of 10% for the best NN configurations. Temporal subsampling is a simple and generic approach that does not require retraining of large CNNs, and which behaves consistently over the considered classification pipelines. It is straightforward to implement, and therefore well suited for static or dynamic load balancing and cost optimization in large-scale, industrial video classification systems.


Figure A.4: Comparison of accuracy versus workload tradeoffs achieved with regular subsampling for different configurations for the a) GVD, b) FBC and c) LSTM approach. For each approach, a selection of representative curves is shown, comprising the best, the worst, and two in-between configurations from Table A.1. Temporal subsampling allows us to effectively trade computational workload against classification accuracy, thereby generating a wide range of Pareto-optimal operating points. Improvements due to prior-based subsampling over regular subsampling are quantified in d-f) for the three approaches.

Appendix B

Notation and acronyms

Symbols

A               matrix
a               vector
a, A            scalars
N               set of natural numbers
Z               set of integer numbers
R               set of real numbers
Rn              set of real n-vectors
Rn×m            set of real n × m matrices
T               machine learning task
D               machine learning dataset
DS              source dataset
DT              target dataset
A               machine learning algorithm
M               machine learning model
Qn,f            fixed-point representation of n bits with f fractional bits
unsigned Qn,f   unsigned fixed-point representation


signed Qn,f     signed fixed-point representation
Tw,t            IEEE 754 floating-point representation with w exponent bits and t significand bits
Ln,f            LNS floating-point representation of n bits with f fractional bits
q               quality (represented as scalar)
p               performance (represented as scalar)
Θ               configuration space
θ               a specific configuration θ ∈ Θ
θopt            the optimal configuration
θ∗              a good configuration
(p, q)θ         performance p and quality q for configuration θ

Operators

ai       i-th entry of vector a
ai       i-th column of matrix A
Ai       i-th row of matrix A
Ai,j     entry of the i-th row and j-th column of matrix A
(·)T     matrix transposition
| · |    absolute value
⌈·⌉      ceil: smallest integer value equal to or larger than the argument
⌊·⌋      floor: largest integer value equal to or smaller than the argument
‖·‖1     ℓ1-norm, i.e., Σi |xi| for x ∈ Rn
‖·‖      ℓ2-norm or Euclidean norm, i.e., (Σi |xi|²)^(1/2) for x ∈ Rn
‖·‖∞     ℓ∞-norm, i.e., maxi |xi| for x ∈ Rn
<        less than
≤        less than or equal to
≪        much less than
>        greater than
≥        greater than or equal to
≫        much greater than
         Pareto dominant over (i.e., better in all aspects)
f(·)     scalar function

f′(·)    first derivative of f
f″(·)    second derivative of f
exp(·)   natural exponential function

log2     base-2 logarithm
log10    base-10 logarithm
Σ        summation
Π        product sequence
∫        integration
O(·)     big-O, asymptotic complexity bound
P(·)     characterization of performance
Q(·)     characterization of quality
RP,Q(·)  characterization of the trade-off between performance and quality
i++      single-step increment, i ← i + 1
i+=s     increment and assign, i ← i + s

Acronyms

AI       artificial intelligence
ALU      arithmetic logic unit
ASIC     application-specific integrated circuit

BLAS     basic linear algebra subprograms
BLSTM    bidirectional LSTM
BN       batch normalization
BW       backward

CNN      convolutional neural network
COSMO    consortium for small-scale modelling
CPU      central processing unit
CSR      compressed sparse row
CTC      connectionist temporal classification
CUDA     compute unified device architecture

DAG      directed acyclic graph
DL       deep learning
DNAS     differentiable neural architecture search
DNN      deep neural network
DOT      dot product
DPN      dual-path networks
DRAM     dynamic random access memory
DSP      digital signal processor
DT       data types

FBC      frame based classification
FCNN     fully-connected neural network
FEM      finite element method
FGPA     field-programmable gate array
floatx   reduced precision floating-point library
FLOP     floating-point operations
FMA      fused-multiply-add
FPGA     field programmable gate array
FPU      floating-point unit
FSM      finite state machine
FW       forward

GAN      generative adversarial network
GEMM     general matrix matrix multiplication
GLQ      Gauss-Legendre quadrature
GNU      GNU's not unix!; free software
GPU      graphical processing unit
GTX      GeForce Titan X (GPU)
GUI      graphical user interface
GVD      global video descriptor

HDL      hardware description language
HDMI     high definition multimedia interface
HPC      high performance computing
HPCG     high performance conjugate gradient
HPO      hyperparameter optimization
HW       hardware

ICDM     international conference on data mining
IDE      integrated development environment
ILSVRC   imagenet - large scale visual recognition challenge
INTLAB   INTerval LABoratory
IoT      internet of things
IPV      inexact program version

KNN      K-nearest neighbors

LAN      local area network
LDPC     low density parity check
LFSR     linear feedback shift register
LINPACK  linear algebra package
LLVM     low level virtual machine
LNS      logarithmic number system
LP       loop perforation
LSTM     long-short-term memory
LUT      look-up table

MAC      multiply-accumulate
ML       machine learning
MNIST    modified national institute of standards and technology
MPFR     multiple precision floating-point reliably

NaN      not a number
NAS      neural architecture search
NN       neural network

OCR      optical character recognition
ONNX     open neural network exchange

PCA      principal component analysis
PULP     parallel ultra-low-power

RAM      random access memory
RNN      recurrent neural network
RTL      register-transfer level

SC       stochastic computing
SFINAE   substitution failure is not an error
SGD      stochastic gradient descent
SIMD     single instruction multiple data
SIMT     single instruction multiple threads
SNG      stochastic number generator
SPEC     standard performance evaluation corporation
SVM      support vector machines
SW       software

TAPAS    train-less accuracy predictor for architecture search
TP       transprecision
TSM      task skipping and memoization

UNUM     universal number
USB      universal serial bus

VGG      visual geometry group

Bibliography

[1] M. M. Waldrop, “More than moore,” Nature, vol. 530, no. 7589, pp. 144–148, 2016. [2] S. Brin and L. Page, “Reprint of: The anatomy of a large-scale hypertextual web search engine,” Computer networks, vol. 56, no. 18, pp. 3825–3833, 2012. [3] V. Rybalkin, N. Wehn, M. R. Yousefi, and D. Stricker, “Hard- ware architecture of bidirectional long short-term memory neu- ral network for optical character recognition,” in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2017, pp. 1390–1395. [4] A. C. I. Malossi, M. Schaffner, A. Molnos, L. Gammaitoni, G. Tagliavini, A. Emerson, A. Tomás, D. S. Nikolopoulos, E. Flamand, and N. Wehn, “The transprecision computing paradigm: Concept, design, and applications,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 1105–1110. [5] S. Mittal, “A survey of techniques for approximate computing,” ACM Computing Surveys (CSUR), vol. 48, no. 4, p. 62, 2016. [6] J. Han and M. Orshansky, “Approximate computing: An emerging paradigm for energy-efficient design,” in 2013 18th IEEE European Test Symposium (ETS). IEEE, 2013, pp. 1–6. [7] Q. Xu, T. Mytkowicz, and N. S. Kim, “Approximate computing: A survey,” IEEE Design & Test, vol. 33, no. 1, pp. 8–22, 2015.


[8] D. Mladenić, J. Brank, M. Grobelnik, and N. Milic-Frayling, “Feature selection using linear classifier weights: interaction with classification models,” in Proceedings of the 27th annual international ACM SIGIR conference on Research and develop- ment in information retrieval. ACM, 2004, pp. 234–241.

[9] Z. Deng, X. Zhu, D. Cheng, M. Zong, and S. Zhang, “Efficient knn classification algorithm for big data,” Neurocomputing, vol. 195, pp. 143–148, 2016.

[10] A. Pradhan, “Support vector machine-a survey,” International Journal of Emerging Technology and Advanced Engineering, vol. 2, no. 8, pp. 82–85, 2012.

[11] V. Y. Kulkarni and P. K. Sinha, “Pruning of random forest clas- sifiers: A survey and future directions,” in 2012 International Conference on Data Science & Engineering (ICDSE). IEEE, 2012, pp. 64–68.

[12] L. Wolf, T. Hassner, and I. Maoz, “Face Recognition in Unconstrained Videos with Matched Background Similarity,” in IEEE CVPR, 2011.

[13] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikäinen, “Facial Expression Recognition From Near-Infrared Videos,” Image and Vision Computing, vol. 29, no. 9, 2011.

[14] L. Xia, C.-C. Chen, and J. Aggarwal, “View Invariant Human Action Recognition using Histograms of 3D Joints,” in IEEE CVPRW, 2012.

[15] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Car- los Niebles, “ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding,” in IEEE CVPR, June 2015.

[16] K. G. Derpanis, M. Lecce, K. Daniilidis, and R. P. Wildes, “Dynamic Scene Understanding: The Role of Orientation Features in Space and Time in Scene Classification,” in IEEE CVPR, 2012, pp. 1306–1313. BIBLIOGRAPHY 207

[17] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild,” CoRR, vol. abs/1212.0402, 2012. [18] O. Kliper-Gross, T. Hassner, and L. Wolf, “The Action Simi- larity Labeling Challenge,” IEEE TPAMI, vol. 34, no. 3, pp. 615–621, 2012. [19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale Video Classification with Convolu- tional Neural Networks,” in IEEE CVPR, 2014.

[20] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8M: A Large-Scale Video Classification Benchmark,” arXiv:1609.08675, 2016. [21] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond Short Snippets: Deep Networks for Video Classification,” in IEEE VCPR, June 2015. [22] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.

[23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. [24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. [25] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[26] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,

R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4467–4475. [Online]. Available: http://papers.nips.cc/paper/7033-dual-path-networks.pdf

[27] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Con- ference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[28] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861

[29] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., “Chapter 15 - evolving deep neural networks,” in Artificial Intelligence in the Age of Neural Networks and Brain Computing, R. Kozma, C. Alippi, Y. Choe, and F. C. Morabito, Eds. Academic Press, 2019, pp. 293 – 312. [Online]. Available: http://www.sciencedirect.com/science/article/pii/ B9780128154809000153

[30] L. Xie and A. Yuille, “Genetic cnn,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1379– 1388.

[31] Z. Zhong, J. Yan, and C. Liu, “Practical network blocks design with q-learning,” CoRR, vol. abs/1708.05552, 2017. [Online]. Available: http://arxiv.org/abs/1708.05552

[32] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” CoRR, vol. abs/1611.01578, 2016. [Online]. Available: http://arxiv.org/abs/1611.01578

[33] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[34] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang, “Efficient architecture search by network transformation,” in Thirty- Second AAAI Conference on Artificial Intelligence, 2018. [35] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing neural network architectures using reinforcement learning,” CoRR, vol. abs/1611.02167, 2016. [Online]. Available: http: //arxiv.org/abs/1611.02167 [36] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 2902– 2911. [37] T. Domhan, J. T. Springenberg, and F. Hutter, “Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves,” in Twenty-Fourth Interna- tional Joint Conference on Artificial Intelligence, 2015. [38] M. Wistuba, “Deep learning architecture search by neuro-cell- based evolution with function-preserving mutations,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2018, pp. 243–258. [39] R. Istrate, F. Scheidegger, G. Mariani, D. Nikolopoulos, C. Bekas, and A. C. I. Malossi, “Tapas: Train-less accuracy predictor for architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3927– 3934. [40] M. Wistuba, A. Rawat, and T. Pedapati, “A survey on neural architecture search,” arXiv preprint arXiv:1905.01392, 2019. [41] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” CoRR, vol. abs/1807.11626, 2018. [Online]. Available: http://arxiv.org/abs/1807.11626 [42] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware 210 BIBLIOGRAPHY

efficient convnet design via differentiable neural architecture search,” CoRR, vol. abs/1812.03443, 2018. [Online]. Available: http://arxiv.org/abs/1812.03443

[43] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, 2013, pp. 1139–1147.

[44] M. D. Zeiler, “Adadelta: an adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.

[45] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient meth- ods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.

[46] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980

[47] N. L. Roux, P. antoine Manzagol, and Y. Bengio, “Topmoumoute online natural gradient algorithm,” in Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, Eds. Curran Associates, Inc., 2008, pp. 849–856. [Online]. Available: http://papers.nips.cc/paper/3234-topmoumoute- online-natural-gradient-algorithm.pdf

[48] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.

[49] S. Ruder, “An overview of gradient descent optimization algo- rithms,” arXiv preprint arXiv:1609.04747, 2016.

[50] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.

[51] P. Singh, “Bayesian optimization for machine learning,” Tech. rep., San Diego State University, Tech. Rep., 2018. [52] G. Malkomes and R. Garnett, “Automating bayesian opti- mization with bayesian optimization,” in Advances in Neural Information Processing Systems, 2018, pp. 5984–5994. [53] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Tal- walkar, “Hyperband: A novel bandit-based approach to hy- perparameter optimization,” arXiv preprint arXiv:1603.06560, 2016. [54] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Mal- ossi, “Bagan: Data augmentation with balancing gan,” arXiv preprint arXiv:1803.09655, 2018. [55] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, Oct 2010. [56] W.-Y. Chen, Y.-C. Liu, Z. Kira, Y.-C. F. Wang, and J.-B. Huang, “A closer look at few-shot classification,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id= HkxLXnAcFQ [57] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” CoRR, vol. abs/1512.00567, 2015. [58] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014. [59] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE CVPR, 2009, pp. 248–255. [60] A. Shehabi, S. J. Smith, D. A. Sartor, R. E. Brown, M. Herrlin, J. G. Koomey, E. R. Masanet, N. Horner, I. L. Azevedo, and W. Lintner, “United states data center energy usage report,” 06/2016 2016. 212 BIBLIOGRAPHY

[61] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016. [62] “Pytorch,” https://pytorch.org/, accessed: 2019-05-22. [63] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source framework for deep learning,” in Proceedings of workshop on machine learning systems (Learn- ingSys) in the twenty-ninth annual conference on neural infor- mation processing systems (NIPS), vol. 5, 2015, pp. 1–6. [64] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015. [65] F. Scheidegger, L. Benini, C. Bekas, and C. Malossi, “Con- strained deep neural network architecture search for iot devices accounting for hardware calibration,” in Advances in Neural Information Processing Systems, 2019, to appear. [66] G. Flegar, F. Scheidegger, V. Novakovic, G. Mariani, A. Tomas, C. Malossi, and E. Quintana-Ortí, “Float x: A c++library for customized floating-point arithmetic,” ACM Trans. Math. Softw., 2019, to appear. [67] F. Scheidegger, L. Cavigelli, M. Schaffner, A. Malossi, C. Bekas, and L. Benini, “Impact of temporal subsampling on accuracy and performance in practical video classification,” in Signal Pro- cessing Conference (EUSIPCO), 2017 25th European. IEEE, 2017, pp. 996–1000. [68] F. Scheidegger, R. Istrate, G. Mariani, L. Benini, C. Bekas, and C. Malossi, “Efficient image dataset classification difficulty es- timation for predicting deep-learning accuracy,” arXiv preprint arXiv:1803.09588, 2018. [69] R. Istrate, F. Scheidegger, G. Mariani, D. Nikolopoulos, C. Bekas, and A. C. I. Malossi, “Tapas: Train-less accuracy BIBLIOGRAPHY 213

predictor for architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 3927– 3934.

[70] A. Sood, B. Elder, B. Herta, C. Xue, C. Bekas, A. C. I. Malossi, D. Saha, F. Scheidegger, G. Venkataraman, G. Thomas et al., “Neunets: An automated synthesis engine for neural network design,” arXiv preprint arXiv:1901.06261, 2019.

[71] Y. Popoff, F. Scheidegger, M. Schaffner, M. Gautschi, F. K. Gürkaynak, and L. Benini, “High-efficiency logarithmic number unit design based on an improved cotransformation scheme,” in Proceedings of the 2016 Conference on Design, Automation & Test in Europe. EDA Consortium, 2016, pp. 1387–1392.

[72] M. Schaffner, F. Scheidegger, L. Cavigelli, H. Kaeslin, L. Benini, and A. Smolic, “Towards edge-aware spatio-temporal filtering in real-time,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 265–280, 2018.

[73] M. Eggimann, C. Gloor, F. Scheidegger, L. Cavigelli, M. Schaffner, A. Smolic, and L. Benini, “Hydra: An accelerator for real-time edge-aware permeability filtering in 65nm cmos,” in Circuits and Systems (ISCAS), 2018 IEEE International Symposium on. IEEE, 2018, pp. 1–5.

[74] H. J. Curnow and B. A. Wichmann, “A synthetic benchmark,” The Computer Journal, vol. 19, no. 1, pp. 43–49, 1976.

[75] E. H. Sibley, “Dhrystone: A synthetic systems,” Communica- tions of the ACM, vol. 27, no. 10, 1984.

[76] K. M. Dixit, “The spec benchmarks,” Parallel computing, vol. 17, no. 10-11, pp. 1195–1209, 1991.

[77] J. J. Dongarra, P. Luszczek, and A. Petitet, “The linpack benchmark: past, present and future,” Concurrency and Computation: practice and experience, vol. 15, no. 9, pp. 803–820, 2003.

[78] B. Subramaniam, W. Saunders, T. Scogland, and W.-c. Feng, “Trends in energy-efficient computing: A perspective from the green500,” in 2013 International Green Computing Conference Proceedings. IEEE, 2013, pp. 1–8. [79] J. Dongarra, M. Heroux, and P. Luszczek, “Hpcg benchmark: a new metric for ranking high performance computing systems. university of tennessee,” Electrical Engineering and Computer Sciente Department, Technical Report UT-EECS-15-736, 2015. [80] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Computer networks and ISDN systems, vol. 30, no. 1-7, pp. 107–117, 1998. [81] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Mo- toda, G. J. McLachlan, A. Ng, B. Liu, S. Y. Philip et al., “Top 10 algorithms in data mining,” Knowledge and information systems, vol. 14, no. 1, pp. 1–37, 2008. [82] A. Y. Ng, A. X. Zheng, and M. I. Jordan, “Stable algorithms for link analysis,” in Proceedings of the 24th annual interna- tional ACM SIGIR conference on Research and development in information retrieval. ACM, 2001, pp. 258–266. [83] T. Haveliwala, “Efficient computation of pagerank,” Stanford, Tech. Rep., 1999. [84] A. Z. Broder, R. Lempel, F. Maghoul, and J. Pedersen, “Efficient pagerank approximation via graph aggregation,” In- formation Retrieval, vol. 9, no. 2, pp. 123–138, 2006. [85] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub, “Extrapolation methods for accelerating pagerank com- putations,” in Proceedings of the 12th international conference on World Wide Web. ACM, 2003, pp. 261–270. [86] M. R. Yousefi, M. R. Soheili, T. M. Breuel, E. Kabir, and D. Stricker, “Binarization-free ocr for historical documents using lstm networks,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 1121–1125. BIBLIOGRAPHY 215

[87] C. Patel, D. Shah, and A. Patel, “Automatic number plate recognition system (anpr): A survey,” International Journal of Computer Applications, vol. 69, no. 9, 2013.

[88] I. Z. Yalniz and R. Manmatha, “A fast alignment scheme for automatic ocr evaluation of books,” in 2011 International Conference on Document Analysis and Recognition, Sep. 2011, pp. 754–758.

[89] M. H. Dave and M. Patel, “A survey on handwritten character recognition using multilayer perceptron neural networks,” 2016.

[90] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[91] A. Graves and J. Schmidhuber, “Framewise phoneme clas- sification with bidirectional lstm networks,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 4. IEEE, 2005, pp. 2047–2052.

[92] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.

[93] R. A. Wagner and M. J. Fischer, “The string-to-string correction problem,” Journal of the ACM (JACM), vol. 21, no. 1, pp. 168– 173, 1974.

[94] M. G. Larson and F. Bengzon, “The finite element method: theory, implementation, and practice,” Texts in Computational Science and Engineering, vol. 10, 2010.

[95] K. Yamazaki and A. Abe, “Loss analysis of interior permanent magnet motors considering carrier harmonics and magnet eddy currents using 3-d fem,” in 2007 IEEE International Electric Machines & Drives Conference, vol. 2. IEEE, 2007, pp. 904–909.

[96] C. Joulin, J. Xiang, J.-P. Latham, and C. Pain, “A new finite discrete element approach for heat transfer in complex shaped multi bodied contact problems,” in International Conference on Discrete Element Methods. Springer, 2016, pp. 311–327.

[97] M. Vahab and N. Khalili, “Numerical investigation of the flow regimes through hydraulic fractures using the x-fem technique,” Engineering Fracture Mechanics, vol. 169, pp. 146–162, 2017.

[98] G. Castellazzi, A. M. D’Altri, S. de Miranda, and F. Ubertini, “An innovative numerical modeling strategy for the structural analysis of historical monumental buildings,” Engineering Struc- tures, vol. 132, pp. 229–248, 2017.

[99] F. G. Lether, “On the construction of gauss-legendre quadrature rules,” Journal of Computational and Applied Mathematics, vol. 4, no. 1, pp. 47–52, 1978.

[100] V. I. Krylov and A. H. Stroud, Approximate calculation of integrals. Courier Corporation, 2006.

[101] R. W. Freund and R. W. Hoppe, Stoer/Bulirsch: Numerische Mathematik 1. Springer-Verlag, 2007.

[102] K. Shivaram, “Generalised gaussian quadrature over a triangle,” American Journal of Engineering Research, vol. 2, no. 09, pp. 290–293, 2013.

[103] A. Genz, “A package for testing multiple integration subrou- tines,” in Numerical Integration. Springer, 1987, pp. 337–340.

[104] A. Rahimi, L. Benini, and R. K. Gupta, “Variability mitigation in nanometer cmos integrated systems: A survey of techniques from circuits to software,” Proceedings of the IEEE, vol. 104, no. 7, pp. 1410–1448, 2016.

[105] S.-L. Lu, “Speeding up processing with approximation circuits,” Computer, vol. 37, no. 3, pp. 67–73, 2004.

[106] K. Palem and A. Lingamneni, “Ten years of building broken chips: The physics and engineering of inexact

computing,” ACM Trans. Embed. Comput. Syst., vol. 12, no. 2s, pp. 87:1–87:23, May 2013. [Online]. Available: http://doi.acm.org/10.1145/2465787.2465789 [107] A. Lingamneni, K. K. Muntimadugu, C. Enz, R. M. Karp, K. V. Palem, and C. Piguet, “Algorithmic methodologies for ultra-efficient inexact architectures for sustaining technology scaling,” in Proceedings of the 9th Conference on Computing Frontiers, ser. CF ’12. New York, NY, USA: ACM, 2012, pp. 3–12. [Online]. Available: http://doi.acm.org/10.1145/2212908. 2212912 [108] A. Sampson, J. Nelson, K. Strauss, and L. Ceze, “Approximate storage in solid-state memories,” ACM Transactions on Com- puter Systems (TOCS), vol. 32, no. 3, p. 9, 2014. [109] A. Teman, G. Karakonstantis, R. Giterman, P. Meinerzhagen, and A. Burg, “Energy versus data integrity trade-offs in embedded high-density logic compatible dynamic memories,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, ser. DATE ’15. San Jose, CA, USA: EDA Consortium, 2015, pp. 489–494. [Online]. Available: http://dl.acm.org/citation.cfm?id=2755753.2755864 [110] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Ri- nard, “Managing performance vs. accuracy trade-offs with loop perforation,” in Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering. ACM, 2011, pp. 124–134. [111] C. Alvarez, J. Corbal, and M. Valero, “Fuzzy memoization for floating-point multimedia applications,” IEEE Transactions on , vol. 54, no. 7, pp. 922–927, July 2005. [112] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, “Sage: Self-tuning approximation for graphics engines,” in Proceedings of the 46th Annual IEEE/ACM International Sym- posium on Microarchitecture. ACM, 2013, pp. 13–24. [113] A. Raha and V. Raghunathan, “Towards full-system energy- accuracy tradeoffs: A case study of an approximate smart 218 BIBLIOGRAPHY

camera system,” in 2017 54th ACM/EDAC/IEEE Design Au- tomation Conference (DAC). IEEE, 2017, pp. 1–6.

[114] T. Yeh, P. Faloutsos, M. Ercegovac, S. Patel, and G. Reinman, “The art of deception: Adaptive precision reduction for area efficient physics acceleration,” in 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). IEEE, 2007, pp. 394–406.

[115] Y. Tian, Q. Zhang, T. Wang, F. Yuan, and Q. Xu, “Approxma: Approximate memory access for dynamic precision scaling,” in Proceedings of the 25th edition on Great Lakes Symposium on VLSI. ACM, 2015, pp. 337–342.

[116] D. Zuras, M. Cowlishaw, A. Aiken, M. Applegate, D. Bailey, S. Bass, D. Bhandarkar, M. Bhat, D. Bindel, S. Boldo et al., “Ieee standard for floating-point arithmetic,” IEEE Std 754- 2008, pp. 1–70, 2008.

[117] S. Misailovic, D. M. Roy, and M. C. Rinard, “Probabilisti- cally accurate program transformations,” in International Static Analysis Symposium. Springer, 2011, pp. 316–333.

[118] S. Misailovic, D. M. Roy, and M. Rinard, “Probabilistic and statistical analysis of perforated patterns,” 2011.

[119] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke, “Paraprox: Pattern-based approximation for data parallel applications,” in ACM SIGPLAN Notices, vol. 49, no. 4. ACM, 2014, pp. 35–50.

[120] W. Baek and T. M. Chilimbi, “Green: a framework for supporting energy-conscious programming using controlled ap- proximation,” in ACM Sigplan Notices, vol. 45, no. 6. ACM, 2010, pp. 198–209.

[121] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM Trans. Embed. Comput. Syst., vol. 12, no. 2s, pp. 92:1–92:19, May 2013. [Online]. Available: http://doi.acm.org/10.1145/2465787.2465794

[122] A. Klein, “Linear feedback shift registers,” in Stream Ciphers. Springer, 2013, pp. 17–58.

[123] H. Krawczyk, “Lfsr-based hashing and authentication,” in Advances in Cryptology — CRYPTO ’94, Y. G. Desmedt, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1994, pp. 129–139.

[124] T. E. Tkacik, “A hardware random number generator,” in Cryptographic Hardware and Embedded Systems - CHES 2002, B. S. Kaliski, ç. K. Koç, and C. Paar, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 450–453.

[125] G. G. Lorentz, Bernstein polynomials. American Mathematical Soc., 2013.

[126] D. Zhang and H. Li, “A stochastic-based fpga controller for an induction motor drive with integrated neural network algo- rithms,” IEEE Transactions on Industrial Electronics, vol. 55, no. 2, pp. 551–561, 2008.

[127] W. J. Gross, V. C. Gaudet, and A. Milner, “Stochastic implementation of ldpc decoders,” in Conference Record of the Thirty-Ninth Asilomar Conference onSignals, Systems and Computers, 2005. IEEE, 2005, pp. 713–717.

[128] Z. Zhang, V. Anantharam, M. J. Wainwright, and B. Nikolic, “An efficient 10gbase-t ethernet ldpc decoder design with low error floors,” IEEE Journal of Solid-State Circuits, vol. 45, no. 4, pp. 843–855, 2010.

[129] J. A. Dickson, R. D. McLeod, and H. Card, “Stochastic arith- metic implementations of neural networks with in situ learning,” in IEEE International Conference on Neural Networks. IEEE, 1993, pp. 711–716.

[130] B. D. Brown and H. C. Card, “Stochastic neural computation. i. computational elements,” IEEE Transactions on Computers, vol. 50, no. 9, pp. 891–905, 2001.

[131] B. D. Brown and H. C. Card, “Stochastic neural computation. ii. soft competitive learning,” IEEE Transactions on Computers, vol. 50, no. 9, pp. 906–920, 2001.

[132] L. Deng, “The mnist database of handwritten digit images for machine learning research [best of the web],” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.

[133] K. Kim, J. Kim, J. Yu, J. Seo, J. Lee, and K. Choi, “Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks,” in 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2016, pp. 1–6.

[134] M. L. Overton, Numerical computing with IEEE floating point arithmetic. Siam, 2001, vol. 76.

[135] J. Detrey and F. De Dinechin, “A tool for unbiased comparison between logarithmic and floating-point arithmetic,” The Jour- nal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, vol. 49, no. 1, pp. 161–175, 2007.

[136] M. Schaffner, M. Gautschi, F. K. Gürkaynak, and L. Benini, “Accuracy and performance trade-offs of logarithmic number units in multi-core clusters,” in 2016 IEEE 23nd Symposium on Computer Arithmetic (ARITH). IEEE, 2016, pp. 95–103.

[137] M. Gautschi, M. Schaffner, F. K. Gürkaynak, and L. Benini, “An extended shared logarithmic unit for nonlinear function kernel acceleration in a 65-nm cmos multicore cluster,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 98–112, 2016.

[138] T.-J. Kwon, J. Sondeen, and J. Draper, “Design trade- offs in floating-point unit implementation for embedded and processing-in-memory systems,” in 2005 IEEE International Symposium on Circuits and Systems. IEEE, 2005, pp. 3331– 3334.

[139] K. Karuri, R. Leupers, G. Ascheid, H. Meyr, and M. Kedia, “Design and implementation of a modular and portable compliant floating-point unit,” in Proceedings of the Design

Automation & Test in Europe Conference, vol. 2. IEEE, 2006, pp. 1–6. [140] S. Galal and M. Horowitz, “Energy-efficient floating-point unit design,” IEEE Transactions on computers, vol. 60, no. 7, pp. 913–922, 2010.

[141] N. G. Kingsbury and P. J. Rayner, “Digital filtering using logarithmic arithmetic,” Electronics Letters, vol. 7, no. 2, pp. 56–58, 1971. [142] E. E. Swartzlander and A. G. Alexopoulos, “The sign/logarithm number system,” IEEE Transactions on Computers, vol. 100, no. 12, pp. 1238–1242, 1975. [143] J. N. Coleman, C. I. Softley, J. Kadlec, R. Matousek, M. Tichy, Z. Pohl, A. Hermanek, and N. F. Benschop, “The european logarithmic microprocesor,” IEEE Transactions on Computers, vol. 57, no. 4, pp. 532–546, 2008. [144] R. C. Ismail and J. Coleman, “Rom-less lns,” in 2011 IEEE 20th Symposium on Computer Arithmetic. IEEE, 2011, pp. 43–51. [145] J. Coleman and R. C. Ismail, “Lns with co-transformation com- petes with floating-point,” IEEE Transactions on Computers, vol. 65, no. 1, pp. 136–146, 2015. [146] M. Gautschi, M. Schaffner, F. K. Gürkaynak, and L. Benini, “4.6 a 65nm cmos 6.4-to-29.2 pj/flop@ 0.8 v shared logarithmic floating point unit for acceleration of nonlinear function kernels in a tightly coupled processor cluster,” in 2016 IEEE Interna- tional Solid-State Circuits Conference (ISSCC). IEEE, 2016, pp. 82–83. [147] K. Deb and H. Gupta, “Searching for robust pareto-optimal solutions in multi-objective optimization,” in International Conference on Evolutionary Multi-Criterion Optimization. Springer, 2005, pp. 150–164. [148] L. Fousse, G. Hanrot, V. Lefèvre, P. Pélissier, and P. Zimmer- mann, “Mpfr: A multiple-precision binary floating-point library 222 BIBLIOGRAPHY

with correct rounding,” ACM Transactions on Mathematical Software (TOMS), vol. 33, no. 2, p. 13, 2007. [149] S. M. Rump, “Intlab-interval laboratory, developments in re- liable computing,” in the Proc. SCAN-98 conference, Kuluwer Academic Publishers, the Netherland, 1999, pp. 77–104.

[150] V. Lefèvre, “Sipe: Small integer plus exponent,” in 2013 IEEE 21st Symposium on Computer Arithmetic. IEEE, 2013, pp. 99–106. [151] C. Rubio-González, C. Nguyen, H. D. Nguyen, J. Demmel, W. Kahan, K. Sen, D. H. Bailey, C. Iancu, and D. Hough, “Precimonious: Tuning assistant for floating-point precision,” in SC’13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 2013, pp. 1–12.

[152] J. L. Gustafson and I. T. Yonemoto, “Beating floating point at its own game: Posit arithmetic,” Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, 2017. [153] J. Hauser, “Berkeley softfloat project home page,” http:// www.jhauser.us/arithmetic/SoftFloat.html, accessed: Septem- ber 2019. [154] G. Tagliavini, S. Mach, D. Rossi, A. Marongiu, and L. Benin, “A transprecision floating-point platform for ultra-low power computing,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2018, pp. 1051–1056.

[155] T. L. Veldhuizen, “C++ templates are turing complete,” Avail- able at citeseer. ist. psu. edu/581150. html, 2003. [156] S. A. Figueroa, “When is double rounding innocuous?” ACM SIGNUM Newsletter, vol. 30, no. 3, pp. 21–26, 1995.

[157] É. Martin-Dorel, G. Melquiond, and J.-M. Muller, “Some issues related to double rounding,” BIT Numerical Mathematics, vol. 53, no. 4, pp. 897–924, Dec 2013. [Online]. Available: https://doi.org/10.1007/s10543-013-0436-2 BIBLIOGRAPHY 223

[158] S. M. Rump, “Ieee754 precision-k base-β arithmetic inherited by precision-m base-β arithmetic for k< m,” ACM Transactions on Mathematical Software (TOMS), vol. 43, no. 3, p. 20, 2017. [159] A. Ziv, “Fast evaluation of elementary mathematical functions with correctly rounded last bit,” ACM Transactions on Mathe- matical Software (TOMS), vol. 17, no. 3, pp. 410–423, 1991. [160] S. Boldo and G. Melquiond, “Emulation of a fma and correctly rounded sums: Proved algorithms using rounding to odd,” IEEE Transactions on Computers, vol. 57, no. 4, pp. 462–471, 2008. [161] ISO, “ISO international standard ISO/IEC 14882:2017(E) – Programming Language C++,” https://isocpp.org/std/the- standard. Visited June 2018., 2017. [162] H. Anzt, J. Dongarra, and E. S. Quintana-Ortí, “Adaptive preci- sion solvers for sparse linear systems,” in Proceedings of the 3rd International Workshop on Energy Efficient Supercomputing. ACM, 2015, p. 2. [163] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas, “Finding authorities and hubs from link structures on the world wide web,” in Proceedings of the 10th international conference on World Wide Web. ACM, 2001, pp. 415–429. [164] P. Hill, B. Zamirai, S. Lu, Y.-W. Chao, M. Laurenzano, M. Samadi, M. Papaefthymiou, S. Mahlke, T. Wenisch, J. Deng et al., “Rethinking numerical representations for deep neural networks,” 2016. [165] G. Tagliavini, S. Mach, D. Rossi, A. Marongiu, and L. Benini, “A transprecision floating-point platform for ultra-low power computing,” arXiv preprint arXiv:1711.10374, 2017. [166] D. M. Loroch, F.-J. Pfreundt, N. Wehn, and J. Keuper, “Tensorquant: A simulation toolbox for deep neural network quantization,” in Proceedings of the Machine Learning on HPC Environments, ser. MLHPC’17. New York, NY, USA: ACM, 2017, pp. 1:1–1:8. [Online]. Available: http: //doi.acm.org/10.1145/3146347.3146348 224 BIBLIOGRAPHY

[167] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “The german traffic sign recognition benchmark: A multi-class classification competition,” in The 2011 International Joint Conference on Neural Networks, July 2011, pp. 1453–1460. [168] H. Xiao, K. Rasul, and R. Vollgraf. (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. [169] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Gordon, D. Dunson, and M. DudÃk, Eds., vol. 15. Fort Lauderdale, FL, USA: PMLR, 11–13 Apr 2011, pp. 215–223. [Online]. Available: http://proceedings.mlr.press/ v15/coates11a.html

[170] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, 2011, p. 5. [171] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 2017. [172] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, ser. CVPR ’14. Washington, DC, USA: IEEE Computer Society, 2014, pp. 3606–3613. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2014.461 [173] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101 – mining discriminative components with random forests,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 446–461. BIBLIOGRAPHY 225

[174] M. E. Nilsback and A. Zisserman, “Automated flower classi- fication over a large number of classes,” in 2008 Sixth Indian Conference on Computer Vision, Graphics Image Processing, Dec 2008, pp. 722–729. [175] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 413–420. [176] W. Li, T. Logenthiran, V.-T. Phan, and W. L. Woo, “A novel smart energy theft system (sets) for iot based smart home,” IEEE Internet of Things Journal, 2019. [177] G. Fenza, M. Gallo, and V. Loia, “Drift-aware methodology for anomaly detection in smart grid,” IEEE Access, vol. 7, pp. 9645–9657, 2019. [178] M. M. Gaber, A. Aneiba, S. Basurra, O. Batty, A. M. Elmisery, Y. Kovalchuk, and M. H. U. Rehman, “Internet of things and data mining: From applications to techniques and systems,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 9, no. 3, p. e1292, 2019. [179] A. Nordrum, “The internet of fewer things [news],” IEEE Spectrum, vol. 53, no. 10, pp. 12–13, October 2016. [180] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient neural architecture search via parameter sharing,” CoRR, vol. abs/1802.03268, 2018. [Online]. Available: http://arxiv.org/abs/1802.03268 [181] Y. Weng, T. Zhou, L. Liu, and C. Xia, “Automatic convolutional neural architecture search for image classification under different scenes,” IEEE Access, vol. 7, pp. 38 495–38 506, 2019. [182] V. Rybalkin, N. Wehn, M. R. Yousefi, and D. Stricker, “Hardware architecture of bidirectional long short-term memory neural network for optical character recognition,” in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2017, pp. 1394– 1399. 226 BIBLIOGRAPHY

[183] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” CoRR, vol. abs/1405.3866, 2014. [Online]. Available: http: //arxiv.org/abs/1405.3866

[184] P. Hill, B. Zamirai, S. Lu, Y. Chao, M. Laurenzano, M. Samadi, M. C. Papaefthymiou, S. A. Mahlke, T. F. Wenisch, J. Deng et al., “Rethinking numerical representations for deep neural networks,” CoRR, vol. abs/1808.02513, 2018. [Online]. Available: http://arxiv.org/abs/1808.02513

[185] L. Cavigelli and L. Benini, “Extended bit-plane compression for convolutional neural network accelerators,” CoRR, vol. abs/1810.03979, 2018. [Online]. Available: http://arxiv.org/ abs/1810.03979

[186] A. Ashiquzzaman, L. V. Ma, S. Kim, D. Lee, T. Um, and J. Kim, “Compacting deep neural networks for light weight iot scada based applications with node pruning,” in 2019 International Conference on Artificial Intelligence in Information and Com- munication (ICAIIC), Feb 2019, pp. 082–085.

[187] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

[188] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.

[189] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861

[190] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive BIBLIOGRAPHY 227

neural architecture search,” in The European Conference on Computer Vision (ECCV), September 2018.

[191] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[192] “Train cifar10 with pytorch,” https://github.com/kuangliu/ pytorch-cifar, accessed: 2019-05-22.

[193] G. Flegar, F. Scheidegger, V. Novakovic, G. Mariani, A. Tomas, C. Malossi, and E. Quintana-Ortí, “Float x: A c++library for customized floating-point arithmetic,” 2019, submitted.

[194] “Raspberry pi 3 model b+ product description,” https://www.raspberrypi.org/products/raspberry-pi-3-model- b-plus/, accessed: 2019-05-14.

[195] S. Mittal, “A survey on optimized implementation of deep learning models on the nvidia jetson platform,” Journal of Systems Architecture, 2019.

[196] “Raspberry pi celebrates 25 millionth sale as 7th anniversary ar- rives,” https://www.tomshardware.com/news/raspberry-pi-25- million-sold,38724.html, accessed: 2019-05-22.

[197] “Onnx: Open neural network exchange format,” https://onnx. ai/, accessed: 2019-05-22.

[198] “Arm neon,” https://www.arm.com/why-arm/technologies/ neon, accessed: 2019-05-22.

[199] D. E. Goldberg and K. Deb, “A comparative analysis of selection schemes used in genetic algorithms,” in Foundations of genetic algorithms. Elsevier, 1991, vol. 1, pp. 69–93.

[200] X. Y. Stella and J. Shi, “Multiclass spectral clustering,” in null. IEEE, 2003, p. 313. 228 BIBLIOGRAPHY

[201] F. Scheidegger, R. Istrate, G. Mariani, L. Benini, C. Bekas, and C. Malossi, “Efficient image dataset classification difficulty esti- mation for predicting deep-learning accuracy,” 2019, submitted. [202] F. Conti, D. Rossi, A. Pullini, I. Loi, and L. Benini, “Pulp: A ultra-low power parallel accelerator for energy-efficient and flexible embedded vision,” Journal of Signal Processing Systems, vol. 84, no. 3, pp. 339–354, Sep 2016. [Online]. Available: https://doi.org/10.1007/s11265-015-1070-9 [203] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. [204] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor- net: Imagenet classification using binary convolutional neu- ral networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542. [205] P. Steiner, V. Stauch, M. Gwerder, D. Gyalistras, B. Lehmann, M. Morari, and F. Schubiger, “Numerical weather prediction at meteoswiss,” in Proceedings of the 30th meeting of the European Working Group on Limited Area Modelling (EWGLAM) and 15th meeting of the Short Range Numerical Weather Prediction network (SRNWP). Madrid, Spain, 2008. [206] M. Baldauf, A. Seifert, J. Förstner, D. Majewski, M. Raschen- dorfer, and T. Reinhardt, “Operational convective-scale numer- ical weather prediction with the cosmo model: description and sensitivities,” Monthly Weather Review, vol. 139, no. 12, pp. 3887–3905, 2011. [207] K. Zink, H. Vogel, B. Vogel, D. Magyar, and C. Kottmeier, “Modeling the dispersion of ambrosia artemisiifolia l. pollen with the model system cosmo-art,” International journal of biometeorology, vol. 56, no. 4, pp. 669–680, 2012. [208] N. Helbig, R. Mott, A. van Herwijnen, A. Winstral, and T. Jonas, “Parameterizing surface wind speed over complex topography,” Journal of Geophysical Research: Atmospheres, vol. 122, no. 2, pp. 651–667, 2017. BIBLIOGRAPHY 229

[209] G. Giebel, R. Brownsword, G. Kariniotakis, M. Denhard, and C. Draxl, “The state-of-the-art in short-term prediction of wind power. a literature overview,” Jan 2011.

[210] J. Liu, J. Luo, and M. Shah, “Recognizing Realistic Actions from Videos ”In The Wild”,” in IEEE CVPR, 2009, pp. 1996–2003.

[211] J. C. Niebles, C.-W. Chen, and L. Fei-Fei, “Modeling temporal structure of decomposable motion segments for activity classifi- cation,” in ECCV. Springer, 2010, pp. 392–405.

[212] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation of Local Spatio-Temporal Features for Action Recognition,” in BMVC. BMVA Press, 2009.

[213] S. Ji, W. Xu, M. Yang, and K. Yu, “3D Convolutional Neu- ral Networks for Human Action Recognition,” IEEE TPAMI, vol. 35, no. 1, 2013.

[214] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Sequential Deep Learning for Human Action Recognition,” in HBU. Springer, 2011, pp. 29–39.

[215] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[216] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” arXiv preprint arXiv:1405.3866, 2014.

[217] T. Mikolov, M. Karafiát, L. Burget, J. Cernock`y,and S. Khu- danpur, “Recurrent neural network based language model.” in Interspeech, vol. 2, 2010, p. 3.

[218] Y. Bengio, P. Simard, and P. Frasconi, “Learning Long- Term Dependencies with Gradient Descent is Difficult,” IEEE TNNLS, 1994.

[219] A. Canziani, A. Paszke, and E. Culurciello, “An analysis of deep neural network models for practical applications,” arXiv preprint arXiv:1605.07678, 2016. 230 BIBLIOGRAPHY

[220] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accelerating deep convolutional neural networks using specialized hardware,” Microsoft Research Whitepaper, vol. 2, no. 11, 2015. [221] D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv:1412.6980, 2014. Curriculum Vitae

Florian Scheidegger was born in Bern, Switzerland, in 1990 and grew up in Matten (BE), Switzerland. He received his B.Sc. and M.Sc. degrees in electrical engineering and information technology from ETH Zürich, Switzerland, in 2014 and 2017, respectively. He worked for five months at the Integrated Systems Laboratory of ETH Zürich, first developing a spatio-temporal video pipeline and subsequently working in the domain of deep learning. In 2017, he joined IBM to pursue a PhD degree under the supervision of Prof. Dr. Luca Benini and Dr. Cristiano Malossi.
