LAFF-On Programming for High Performance

ulaff.net


Robert van de Geijn The University of Texas at Austin

Margaret Myers The University of Texas at Austin

Devangi Parikh The University of Texas at Austin

June 13, 2021 Cover: Texas Advanced Computing Center

Edition: Draft Edition 2018–2019 Website: ulaff.net ©2018–2019 Robert van de Geijn, Margaret Myers, Devangi Parikh Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the appendix entitled “GNU Free Documentation License.” All trademarks™ are the registered® marks of their respective owners.

Acknowledgements

We would like to thank the people who created PreTeXt, the authoring system used to typeset these materials. We applaud you! This work is funded in part by the National Science Foundation (Awards ACI-1550493 and CCF-1714091).

Preface

Over the years, we have noticed that many ideas that underlie programming for high performance can be illustrated through a very simple example: the computation of a matrix-matrix multiplication. It is perhaps for this reason that papers that expose the intricate details of how to optimize matrix-matrix multiplication are often used in the classroom [2] [10]. In this course, we have tried to carefully scaffold the techniques that lead to a high performance matrix-matrix multiplication implementation with the aim of making the materials of use to a broad audience. It is our experience that some really get into this material. They dive in to tinker under the hood of the matrix-matrix multiplication implementation much like some tinker under the hood of a muscle car. Others merely become aware that they should write their applications in terms of libraries that provide high performance implementations of the functionality they need. We believe that learners who belong to both extremes, or in between, benefit from the knowledge this course shares.

Robert van de Geijn Maggie Myers Devangi Parikh Austin, 2019

Contents

Acknowledgements

Preface

0 Getting Started
  0.1 Opening Remarks
  0.2 Navigating the Course
  0.3 Setting Up
  0.4 Experimental Setup
  0.5 Enrichments
  0.6 Wrap Up

1 Loops and More Loops
  1.1 Opening Remarks
  1.2 Loop Orderings
  1.3 Layering Matrix-Matrix Multiplication
  1.4 Layering Matrix-Matrix Multiplication: Alternatives
  1.5 Enrichments
  1.6 Wrap Up

2 Start Your Engines
  2.1 Opening Remarks
  2.2 Blocked Matrix-Matrix Multiplication
  2.3 Blocking for Registers
  2.4 Optimizing the Micro-kernel
  2.5 Enrichments
  2.6 Wrap Up

3 Pushing the Limits
  3.1 Opening Remarks
  3.2 Leveraging the Caches
  3.3 Packing
  3.4 Further Tricks of the Trade
  3.5 Enrichments
  3.6 Wrap Up


4 Multithreaded Parallelism
  4.1 Opening Remarks
  4.2 OpenMP
  4.3 Multithreading Matrix Multiplication
  4.4 Parallelizing More
  4.5 Enrichments
  4.6 Wrap Up

A

B GNU Free Documentation License

References

Index

Week 0

Getting Started

0.1 Opening Remarks

0.1.1 Welcome to LAFF-On Programming for High Performance

YouTube: https://www.youtube.com/watch?v=CaaGjkQF4rU

These notes were written for a Massive Open Online Course (MOOC) that is offered on the edX platform. The first offering of this course starts on June 5, 2019 at 00:00 UTC (which is late afternoon or early evening on June 4 in the US). This course can be audited for free or you can sign up for a Verified Certificate. Register at https://www.edx.org/course/laff-on-programming-for-high-performance. While these notes incorporate all the materials available as part of the course on edX, being enrolled there has the following benefits:

• It gives you access to videos with more convenient transcripts. They are placed to the side so they don’t cover the material, and you can click on a sentence in the transcript to move the video to the corresponding point in the video.

• You can interact with us and the other learners on the discussion forum.

• You can more easily track your progress through the exercises.

Most importantly: you will be part of a community of learners that is also studying this subject.

0.1.2 Outline Week 0 Each week is structured so that we give the outline for the week immediately after the "launch:"


• 0.1 Opening Remarks

◦ 0.1.1 Welcome to LAFF-On Programming for High Performance ◦ 0.1.2 Outline Week 0 ◦ 0.1.3 What you will learn

• 0.2 Navigating the Course

◦ 0.2.1 What should we know? ◦ 0.2.2 How to navigate the course ◦ 0.2.3 When to learn ◦ 0.2.4 Homework ◦ 0.2.5 Grading

• 0.3 Setting Up

◦ 0.3.1 Hardware requirements ◦ 0.3.2 Operating system ◦ 0.3.3 Installing BLIS ◦ 0.3.4 Cloning the LAFF-On-PfHP repository ◦ 0.3.5 Get a copy of the book ◦ 0.3.6 Matlab ◦ 0.3.7 Setting up MATLAB Online

• 0.4 Experimental Setup

◦ 0.4.1 Programming exercises ◦ 0.4.2 MyGemm ◦ 0.4.3 The driver ◦ 0.4.4 How we use makefiles

• 0.5 Enrichments

◦ 0.5.1 High-performance computers are for high-performance computing

• 0.6 Wrap Up

◦ 0.6.1 Additional Homework ◦ 0.6.2 Summary

0.1.3 What you will learn This unit in each week informs you of what you will learn in the current week. This describes the knowledge and skills that you can expect to acquire. By revisiting this unit upon completion of the week, you can self-assess if you have mastered the topics. Upon completion of this week, you should be able to

• Recognize the structure of a typical week.

• Navigate the different components of LAFF-On Programming for High Performance.

• Set up your computer.

• Activate MATLAB Online (if you are taking this as part of an edX MOOC).

• Better understand what we expect you to know when you start and intend for you to know when you finish.

0.2 Navigating the Course

0.2.1 What should we know? The material in this course is intended for learners who have already had an exposure to linear algebra, programming, and the operating system. We try to provide enough support so that even those with a rudimentary exposure to these topics can be successful. This also ensures that we all "speak the same language" when it comes to notation and terminology. Along the way, we try to give pointers to other materials. The language in which we will program is C. However, programming assignments start with enough of a template that even those without a previous exposure to C can be successful.

0.2.2 How to navigate the course Remark 0.2.1 Some of the comments below regard the MOOC, offered on the edX platform, based on these materials.

YouTube: https://www.youtube.com/watch?v=bYML2j3ECxQ

0.2.3 When to learn Remark 0.2.2 Some of the comments below regard the MOOC, offered on the edX platform, based on these materials. The beauty of an online course is that you get to study when you want, where you want. Still, deadlines tend to keep people moving forward. There are no intermediate due dates but only work that is completed by the closing of the course will count towards the optional Verified Certificate. Even when the course is not officially in session, you will be able to use the notes to learn.

0.2.4 Homework Remark 0.2.3 Some of the comments below regard the MOOC, offered on the edX platform, based on these materials. You will notice that homework appears both in the notes and in the corresponding units on the edX platform. Auditors can use the version in the notes. Most of the time, the questions will match exactly but sometimes they will be worded slightly differently so as to facilitate automatic grading. Realize that the edX platform is ever evolving and that at some point we had to make a decision about what features we would embrace and what features did not fit our format so well. As a result, homework problems have frequently been (re)phrased in a way that fits both the platform and our course. Some things you will notice:

• "Open" questions in the text are sometimes rephrased as multiple choice questions in the course on edX.

• Sometimes we simply have the video with the answer right after a homework because edX does not support embedding videos in answers.

Please be patient with some of these decisions. Our course and the edX platform are both evolving, and sometimes we had to improvise.

0.2.5 Grading Remark 0.2.4 Some of the comments below regard the MOOC, offered on the edX platform, based on these materials. How to grade the course was another decision that required compromise. Our fundamental assumption is that you are taking this course because you want to learn the material, and that the homework and other activities are mostly there to help you learn and self-assess. For this reason, for the homework, we

• Give you an infinite number of chances to get an answer right;

• Provide you with detailed answers;

• Allow you to view the answer any time you believe it will help you master the material efficiently;

• Include some homework that is ungraded to give those who want extra practice an opportunity to do so while allowing others to move on.

In other words, you get to use the homework in whatever way helps you learn best. Remark 0.2.5 Don’t forget to click on "Check" or you don’t get credit for the exercise! To view your progress while the course is in session, click on "Progress" in the edX navigation bar. If you find out that you missed a homework, scroll down the page and you will be able to identify where to go to fix it. Don’t be shy about correcting a missed or incorrect answer. The primary goal is to learn. Some of you will be disappointed that the course is not more rigorously graded, thereby (to you) diminishing the value of a certificate. The fact is that MOOCs are still evolving. People are experimenting with how to make them serve different audiences. In our case, we decided to focus on quality material first, with the assumption that for most participants the primary goal for taking the course is to learn. Remark 0.2.6 Let’s focus on what works, and please be patient with what doesn’t!

0.3 Setting Up

0.3.1 Hardware requirements While you may learn quite a bit by simply watching the videos and reading these materials, real comprehension comes from getting your hands dirty with the programming assignments. We assume that you have access to a computer with a specific instruction set: AVX2. In our discussions, we will assume you have access to a computer with an Intel Haswell processor or a more recent processor. More precisely, you will need access to one of the following "x86_64" processors:

• Intel architectures: Haswell, Broadwell, Skylake, Kaby Lake, Coffee Lake.

• AMD architectures: Ryzen/Epyc

How can you tell what processor is in your computer?

• On the processor. Instructions on how to interpret markings on an Intel processor can be found at https://www.intel.com/content/www/us/en/support/articles/000006059/processors.html.

• From a label. Instructions on how to interpret some labels you may find on your computer can also be found at https://www.intel.com/content/www/us/en/support/articles/000006059/processors.html.

• Windows. Instructions for computers running a Windows operating system can, once again, be found at https://www.intel.com/content/www/us/en/support/articles/000006059/processors.html.

• Apple Mac.

◦ In the terminal session type sysctl -n machdep.cpu.brand_string ◦ Cut and paste whatever is reported into a Google search.

• Linux. In a terminal window execute

more /proc/cpuinfo

Through one of these means, you will determine the model name of your processor. In our case, we determined the model name

Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz

After a bit of digging, we discovered that i7-8850H belongs to the "Coffee Lake" family of processors: https://ark.intel.com/content/www/us/en/ark/products/134899/intel-core-i7-8850h-processor-9m-cache-up-to-4-30-ghz.html. With a bit more digging, we find out that it supports AVX2 instructions.

Remark 0.3.1 As pointed out by a learner, the following commands give a list of the features of your CPU if you are running macOS:

sysctl -a | grep machdep.cpu.features
sysctl -a | grep machdep.cpu.leaf7_features

If one of these yields a list that includes AVX2, you’re set. On my mac, the second command did the trick.

0.3.2 Operating system

We assume you have access to some variant of the Linux operating system on your computer.

0.3.2.1 Linux

If you are on a computer that already uses Linux as its operating system, you are all set. (Although... It has been reported that some Linux installations don’t automatically come with the gcc compiler, which you will definitely need.)

0.3.2.2 Windows

Sorry, we don’t do Windows... But there are options.

• In theory, there is some way of installing a native version of Linux if you are using Windows 10. Instructions can be found at https://docs.microsoft.com/en-us/windows/wsl/install-win10.

• Alternatively, you can install an open-source version of Linux instead of, or in addition to, your Windows OS. Popular versions include Ubuntu, for which instructions can be found at https://www.ubuntu.com/download/desktop.

• Finally, you may want to use a Virtual Machine (see below).

0.3.2.3 Mac OS-X

Mac OS-X comes with Darwin, an open-source Unix-like operating system (https://en.wikipedia.org/wiki/Darwin_(operating_system)). A complication on the Mac OS-X operating system is that the gcc compiler defaults to their clang compiler, which does not by default include the ability to use OpenMP directives, which we will need for the material on using multiple threads and cores in Week 4. There are a few tools that you will want to install to have the complete toolchain that we will use. We will walk you through how to set this up in Unit 0.3.3. Again, alternatively you may want to use a Virtual Machine (see below).

0.3.2.4 Virtual box alternative

Remark 0.3.2 Note: if you run into problems, you will need to ask someone with more expertise with virtual machines than we have (for example, on the edX discussion board).

Non-Linux users can also install a Linux virtual machine onto both Windows and macOS using VirtualBox from Oracle, which is free to download. Essentially, you can run a Linux OS from within Windows / macOS without having to actually install it on the hard drive alongside Windows / macOS, and (for Windows users) without having to go through Microsoft Store. The basic process looks like this:

• Install VirtualBox.

• Download an .iso of an Ubuntu distribution. UbuntuMATE is a safe choice. Make sure to download the 64-bit version.

• In VirtualBox.

◦ Create a New Machine by clicking on the “New” button.

◦ Give your virtual machine (VM) a name. For the Type, select “Linux” and for the Version, select “Ubuntu (64-bit)”.

◦ On the next page, set the memory to 2048 MB.

◦ Continue through the prompts, opting for the default selections.

◦ Once your VM is created, click on Settings -> System -> Processor. Under the “Processor(s)” menu select up to 4 processors.

◦ Load the .iso you downloaded above. Under Settings -> Storage -> Controller IDE, select the “Empty” disk on the left panel. On the right panel under “Optical Drive”, click on the icon that looks like a CD. Select “Choose Virtual Optical Disk File”, and select the .iso you downloaded from the web.

◦ Start the VM.

◦ Go through the prompts of installing Ubuntu on the VM, opting for the default selections.

◦ Once Ubuntu is set up (this may take a few minutes), start a terminal and install gcc, make, and git:

sudo apt-get install gcc
sudo apt-get install make
sudo apt-get install git

Now you are set up to continue with the installation of BLIS. Note: The VM does not automatically detect the hardware of the host machine, hence when configuring BLIS, you will have to explicitly configure using the correct configuration for your architecture (e.g., Haswell). That can be done with the following configure options (in this example for the Haswell architecture):

./configure -t openmp -p ~/blis haswell

Then the user can access the virtual machine as a self-contained Linux OS.

0.3.3 Installing BLIS For our programming assignments, we will need to install the BLAS-like Library Instantiation Software (BLIS). You will learn more about this library in the enrichment in Unit 1.5.2. The following steps will install BLIS if you are using the Linux OS (on a Mac, there may be a few more steps, which are discussed later in this unit.)

• Visit the BLIS Github repository.

• Click on the button for cloning the repository

and copy https://github.com/flame/blis.git.

• In a terminal session, in your home directory, enter

git clone https://github.com/flame/blis.git 0.3. SETTING UP 9

(to make sure you get the address right, you will want to paste the address you copied in the last step.)

• Change directory to blis:

cd blis

• Indicate a specific version of BLIS so that we all are using the same release:

git checkout pfhp

• Configure, build, and install with OpenMP turned on.

./configure -t openmp -p ~/blis auto
make -j8
make check -j8
make install

The -p ~/blis installs the library in the subdirectory ~/blis of your home directory, which is where the various exercises in the course expect it to reside.

• If you run into a problem while installing BLIS, you may want to consult https://github.com/flame/blis/blob/master/docs/BuildSystem.md.

On Mac OS-X

• You may need to install Homebrew, a program that helps you install various software on your mac. Warning: you may need "root" privileges to do so.

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Keep an eye on the output to see if the “Command Line Tools” get installed. This may not be installed if you already have Xcode Command line tools installed. If this happens, post in the "Discussion" for this unit, and see if someone can help you out.

• Use Homebrew to install the gcc compiler:

$ brew install gcc

Check if gcc installation overrides clang:

$ which gcc

The output should point to /usr/local/bin (e.g., /usr/local/bin/gcc). If it isn’t, you may want to add /usr/local/bin to "the path." I did so by inserting

export PATH="/usr/local/bin:$PATH"

into the file .bash_profile in my home directory. (Notice the "period" before "bash_profile".)

• Now you can go back to the beginning of this unit, and follow the instructions to install BLIS.

0.3.4 Cloning the LAFF-On-PfHP repository

We have placed all materials on GitHub. GitHub is a development environment for software projects. In our case, we use it to disseminate the various activities associated with this course. On the computer on which you have chosen to work, "clone" the GitHub repository for this course:

• Visit https://github.com/ULAFF/LAFF-On-PfHP

• Click on the button for cloning the repository

and copy https://github.com/ULAFF/LAFF-On-PfHP.git.

• On the computer where you intend to work, in a terminal session on the command line in the directory where you would like to place the materials, execute

git clone https://github.com/ULAFF/LAFF-On-PfHP.git

This will create a local copy (clone) of the materials.

• Sometimes we will update some of the files from the repository. When this happens you will want to execute, in the cloned directory,

git stash save

which saves any local changes you have made, followed by

git pull

which updates your local copy of the repository, followed by

git stash pop

which restores local changes you made. This last step may require you to "merge" files that were changed in the repository that conflict with local changes.

Upon completion of the cloning, you will have a directory structure similar to that given in Figure 0.3.3.

Figure 0.3.3: Directory structure for your basic LAFF-On-PfHP materials. In this example, we cloned the repository in Robert’s home directory, rvdg.

Homework 0.3.4.1 We start by checking whether your environment is set up correctly for the programming exercises. In a terminal window, change to the directory Assignments/Week0/C/. In it, at the prompt, type make HelloWorld

This will compile, link, and execute the driver in file main.c.

Solution. If all went well, you will see output that looks like

gcc -O3 -I/Users/rvdg/blis/include/blis -m64 -mavx2 -std=c99 -march=native -fopenmp -D_POSIX_C_SOURCE=200809L -c -o driver.o driver.c
gcc -O3 -I/Users/rvdg/blis/include/blis -m64 -mavx2 -std=c99 -march=native -fopenmp -D_POSIX_C_SOURCE=200809L -c -o FLA_Clock.o FLA_Clock.c
gcc -O3 -I/Users/rvdg/blis/include/blis -m64 -mavx2 -std=c99 -march=native -fopenmp -D_POSIX_C_SOURCE=200809L -c -o MaxAbsDiff.o MaxAbsDiff.c
gcc -O3 -I/Users/rvdg/blis/include/blis -m64 -mavx2 -std=c99 -march=native -fopenmp -D_POSIX_C_SOURCE=200809L -c -o RandomMatrix.o RandomMatrix.c
gcc driver.o FLA_Clock.o MaxAbsDiff.o RandomMatrix.o /Users/rvdg/blis/lib/libblis.a -o driver.x -lpthread -m64 -lm -fopenmp
echo

./driver.x
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World

(The number of times "Hello World" is printed may vary depending on how many cores your processor has.)

Remark 0.3.4 • If make HelloWorld indicates libblis.a is not found, then you likely did not install that library correctly in Unit 0.3.3. In particular, you likely skipped make install.

• If you get the error message

clang: error: unsupported option '-fopenmp'

then you need to set the path to the gcc compiler. See Unit 0.3.3.

0.3.5 Get a copy of the book

Remark 0.3.5 We are still working on putting the mechanism in place by which you can obtain your own copy of the book. So, this section is merely a preview of what will become available in the future. For now, you can access a PDF of these notes at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.pdf. Expect this PDF to be frequently updated.

By now, you will have experienced the electronic book we have authored with PreTeXt for this course. We host a copy of that book at http://www.cs.utexas.edu/users/flame/laff/pfhp/LAFF-On-PfHP.html so that you can use it, free of charge, either as part of the course on edX or independent of that. We make the book available at http://ulaff.net to those who also want a local copy on their computer, in a number of formats:

• As the online document, except now downloaded to your computer.

• As a PDF that incorporates the solutions to all the homeworks embedded in the text.

• As a more concise PDF that has all the homeworks, but not the solutions.

• As a PDF that includes only the solutions.

IN THE FUTURE: By following the directions at http://ulaff.net, you can download a "ZIP" file. First clone the LAFF-On-PfHP repository as described earlier. Then unzip the "ZIP" file in the directory LAFF-On-PfHP it created, and you will end up with the materials illustrated in Figure 0.3.6.

Figure 0.3.6: Directory structure for your basic LAFF-On-PfHP materials after unzipping LAFF-On-PfHP.zip from http://ulaff.net in the directory created by cloning the LAFF-On-PfHP repository. The materials you obtained by cloning are highlighted in red.

You will now have the material self-contained on your computer.

Remark 0.3.7 For the online book to work properly, you will still have to be connected to the internet.

0.3.6 Matlab

We will use Matlab to create performance graphs for your various implementations of matrix-matrix multiplication. There are a number of ways in which you can use Matlab:

• Via MATLAB that is installed on the same computer as you will exe- cute your performance experiments. This is usually called a "desktop installation of Matlab."

• Via MATLAB Online. You will have to transfer files from the computer where you are performing your experiments to MATLAB Online. You could try to set up MATLAB Drive, which allows you to share files easily between computers and with MATLAB Online. Be warned that there may be a delay in when files show up, and as a result you may be using old data to plot if you aren’t careful!

If you are using these materials as part of an offering of the Massive Open Online Course (MOOC) titled "LAFF-On Programming for High Performance" on the edX platform, you will be given a temporary license to Matlab, courtesy of MathWorks. You need relatively little familiarity with MATLAB in order to learn what we want you to learn in this course. So, you could just skip these tutorials altogether, and come back to them if you find you want to know more about MATLAB and its programming language (M-script). Below you find a few short videos that introduce you to MATLAB. For a more comprehensive tutorial, you may want to visit MATLAB Tutorials at MathWorks and click "Launch Tutorial".

What is MATLAB?

https://www.youtube.com/watch?v=2sB-NMD9Qhk

Getting Started with MATLAB Online

https://www.youtube.com/watch?v=4shp284pGc8

MATLAB Variables

https://www.youtube.com/watch?v=gPIsIzHJA9I

MATLAB as a Calculator

https://www.youtube.com/watch?v=K9xy5kQHDBo

Managing Files with MATLAB Online

https://www.youtube.com/watch?v=mqYwMnM-x5Q

0.3.7 Setting up MATLAB Online

Once you have watched "Getting started with MATLAB Online" and set up your account, you will want to set up the directory in which you will place the materials for the course. Follow these steps:

• While logged into MATLAB Online, create a folder "LAFF-On-PfHP" (or with a name of your choosing).

• Click on the folder name to change directory to that folder.

• On your computer, download http://www.cs.utexas.edu/users/flame/laff/pfhp/Assignments.zip. It doesn’t really matter where you store it.

• In the MATLAB Online window, click on the "HOME" tab.

• Double click on "Upload" (right between "New" and "Find Files") and upload Assignments.zip from wherever you stored it.

• In the MATLAB Online window, find the "COMMAND WINDOW" (this window should be in the lower-right corner of the MATLAB Online window).

• In the "COMMAND WINDOW" execute

ls

This should print the contents of the current directory. If Assignments.zip is not listed as a file in the directory, you will need to fix this by changing to the appropriate directory.

• Execute

unzip Assignments

in the command window. This may take a while.

• In the "CURRENT FOLDER" window, you should now see a new folder "Assignments." If you double click on that folder name, you should see five folders: Week0, Week1, Week2, Week3, and Week4.

You are now set up for the course.

Remark 0.3.8 If you also intend to use a desktop copy of MATLAB, then you may want to coordinate the local content with the content on MATLAB Online with MATLAB Drive: https://www.mathworks.com/products/matlab-drive.html. However, be warned that there may be a delay in files being synchronized. So, if you perform a performance experiment on your computer and move the resulting file into the MATLAB Drive directory, the data file may not immediately appear in your MATLAB Online session. This may lead to confusion if you are visualizing performance and are looking at graphs created from stale data.

Remark 0.3.9 We only use MATLAB Online (or your desktop MATLAB) to create performance graphs. For this, we recommend you use the Live Scripts that we provide. To work with Live Scripts, you will want to make sure the "LIVE EDITOR" tab is where you are working in MATLAB Online. You can ensure this is the case by clicking on the "LIVE EDITOR" tab.

0.4 Experimental Setup

0.4.1 Programming exercises

For each week, we will explore high-performance programming through a number of programming exercises. To keep the course accessible to a wide audience, we limit the actual amount of coding so as to allow us to focus on the key issues. Typically, we will only program a few lines of code. In the remainder of this section, we discuss the setup that supports this. Most of the material in the section is here for reference. It may be best to quickly scan it and then revisit it after you have completed the first programming homework in Unit 1.1.1.

0.4.2 MyGemm

The exercises revolve around the optimization of implementations of matrix-matrix multiplication. We will write and rewrite the routine

MyGemm( m, n, k, A, ldA, B, ldB, C, ldC ) which computes C := AB + C, where C is m × n, A is m × k, and B is k × n. "Gemm" is a commonly used acronym that stands for "Ge"neral "m"atrix "m"ultiplication. More on this in the enrichment in Unit 1.5.1.
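To make the calling convention concrete, here is a minimal sketch (not part of the course materials) of how MyGemm might be invoked; the problem size n and the zero-initialized matrices are hypothetical, and the matrices are assumed to be stored in column-major order as discussed in Week 1.

#include <stdlib.h>

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC );   // provided by, e.g., Gemm_IJP.c

int main()
{
  int n = 4;   // hypothetical (small) problem size

  /* Square n x n matrices stored in column-major order, each with leading dimension n. */
  double *A = calloc( n * n, sizeof( double ) );
  double *B = calloc( n * n, sizeof( double ) );
  double *C = calloc( n * n, sizeof( double ) );

  MyGemm( n, n, n, A, n, B, n, C, n );   // C := A B + C

  free( A ); free( B ); free( C );
  return 0;
}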

0.4.3 The driver

The term "driver" is typically used for a main program that exercises (tests and/ or times) functionality implemented in a routine. In our case, the driver tests a range of problem sizes by creating random matrices C, A, and B. You will want to look through the file "driver.c" in Week 1. Hopefully the comments are self-explanatory. If not, be sure to ask questions on the discussion board. 0.4. EXPERIMENTAL SETUP 17

0.4.4 How we use makefiles

YouTube: https://www.youtube.com/watch?v=K0Ymbpvk59o

Example 0.4.1 When one wants to execute an implementation of matrix-matrix multiplication in, for example, file Gemm_IJP.c that will be discussed in Unit 1.1.1, a number of steps need to come together.

• We have already discussed the driver routine in Assignments/Week1/C/driver.c. It is compiled into an object file (an intermediate step towards creating an executable file) with

gcc <options> -c -o driver.o driver.c

where <options> indicates a bunch of options passed to the gcc compiler.

• The implementation of matrix-matrix multiplication is in file Gemm_IJP.c which is compiled into an object file with

gcc <options> -c -o Gemm_IJP.o Gemm_IJP.c

• A routine for timing the executions (borrowed from the libflame library) is in file FLA_Clock.c and is compiled into an object file with

gcc <options> -c -o FLA_Clock.o FLA_Clock.c

• A routine that compares two matrices so that correctness can be checked is in file MaxAbsDiff.c and is compiled into an object file with

gcc <options> -c -o MaxAbsDiff.o MaxAbsDiff.c

• A routine that creates a random matrix is in file RandomMatrix.c and is compiled into an object file with

gcc <options> -c -o RandomMatrix.o RandomMatrix.c

• Once these object files have been created, they are linked with

gcc driver.o Gemm_IJP.o FLA_Clock.o MaxAbsDiff.o RandomMatrix.o <libraries> -o driver_IJP.x

where <libraries> is a list of libraries to be linked.

Together, these then create the executable driver_IJP.x that can be executed with

./driver_IJP.x

This should be interpreted as "execute the file driver_IJP.x in the current directory." The driver is set up to take input:

% number of repeats:3
% 3
% enter first, last, inc:100 500 100
% 100 500 100

where "3" and "100 500 100" are inputs. 

Many decades ago, someone came up with the idea of creating a file in which the rules of how to, in this situation, create an executable from parts could be spelled out. Usually, these rules are entered in a Makefile. Below is a video that gives a quick peek into the Makefile for the exercises in Week 1.

YouTube: https://www.youtube.com/watch?v=K0Ymbpvk59o

The bottom line is: by using that Makefile, to make the executable for the implementation of matrix-matrix multiplication in the file Gemm_IJP.c all one has to do is type

make IJP

Better yet, this also then "pipes" input to be used by the driver, and redirects the output into a file for later use when we plot the results. You can read up on Makefiles on, where else, Wikipedia: https://en.wikipedia.org/wiki/Makefile.

Remark 0.4.2 Many of us have been productive software developers without ever reading a manual on how to create a Makefile. Instead, we just copy one that works, and modify it for our purposes. So, don’t be too intimidated!

0.5 Enrichments

In each week, we include "enrichments" that allow the participant to go beyond.

0.5.1 High-performance computers are for high-performance computing

Some of the fastest computers in the world are used for scientific computing. At UT-Austin, the Texas Advanced Computing Center (TACC) has some of the fastest academic machines. You may enjoy the following video on Stampede, until recently the workhorse machine at TACC.

YouTube: https://www.youtube.com/watch?v=HMEL1RVa1E8

This is actually not the most recent machine. It was succeeded by Stampede2. Now, what if these machines, which cost tens of millions of dollars, only achieved a fraction of their peak performance? As a taxpayer, you would be outraged. In this course, you find out how, for some computations, near-peak performance can be attained.

0.6 Wrap Up

0.6.1 Additional Homework For a typical week, additional assignments may be given in this unit.

0.6.2 Summary In a typical week, we provide a quick summary of the highlights in this unit.

Week 1

Loops and More Loops

1.1 Opening Remarks

1.1.1 Launch

Homework 1.1.1.1 Compute

\begin{pmatrix} 1 & -2 & 2 \\ 1 & 1 & 3 \\ -2 & 2 & 1 \end{pmatrix} \begin{pmatrix} -2 & 1 \\ 1 & 3 \\ -1 & 2 \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ -1 & 2 \\ -2 & 1 \end{pmatrix} =

Answer.

\begin{pmatrix} 1 & -2 & 2 \\ 1 & 1 & 3 \\ -2 & 2 & 1 \end{pmatrix} \begin{pmatrix} -2 & 1 \\ 1 & 3 \\ -1 & 2 \end{pmatrix} + \begin{pmatrix} 1 & 0 \\ -1 & 2 \\ -2 & 1 \end{pmatrix} = \begin{pmatrix} -5 & -1 \\ -5 & 12 \\ 3 & 7 \end{pmatrix}

YouTube: https://www.youtube.com/watch?v=knTLy1j-rco

Let us explicitly define some notation that we will use to discuss matrix-matrix multiplication more abstractly. Let A, B, and C be m × k, k × n, and m × n matrices, respectively. We can expose their individual entries as

A = \begin{pmatrix} \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,k-1} \\ \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,k-1} \\ \vdots & \vdots & & \vdots \\ \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,k-1} \end{pmatrix}, \quad B = \begin{pmatrix} \beta_{0,0} & \beta_{0,1} & \cdots & \beta_{0,n-1} \\ \beta_{1,0} & \beta_{1,1} & \cdots & \beta_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \beta_{k-1,0} & \beta_{k-1,1} & \cdots & \beta_{k-1,n-1} \end{pmatrix},

and

C = \begin{pmatrix} \gamma_{0,0} & \gamma_{0,1} & \cdots & \gamma_{0,n-1} \\ \gamma_{1,0} & \gamma_{1,1} & \cdots & \gamma_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \gamma_{m-1,0} & \gamma_{m-1,1} & \cdots & \gamma_{m-1,n-1} \end{pmatrix}.


The computation C := AB + C, which adds the result of matrix-matrix multiplication AB to a matrix C, is defined as

\gamma_{i,j} := \sum_{p=0}^{k-1} \alpha_{i,p} \beta_{p,j} + \gamma_{i,j} \quad\quad (1.1.1)

for all 0 ≤ i < m and 0 ≤ j < n. Why we add to C will become clear later. This leads to the following pseudo-code for computing C := AB + C:

for i := 0, . . . , m − 1
   for j := 0, . . . , n − 1
      for p := 0, . . . , k − 1
         γi,j := αi,pβp,j + γi,j
      end
   end
end

The outer two loops visit each element of C, and the inner loop updates γi,j with (1.1.1).

YouTube: https://www.youtube.com/watch?v=yq6-qN8QXz8

#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define beta( i,j )  B[ (j)*ldB + i ]   // map beta( i,j )  to array B
#define gamma( i,j ) C[ (j)*ldC + i ]   // map gamma( i,j ) to array C

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}

Figure 1.1.1: Implementation, in the C programming language, of the IJP ordering for computing matrix-matrix multiplication.

Homework 1.1.1.2 In the file Assignments/Week1/C/Gemm_IJP.c you will find the simple implementation given in Figure 1.1.1 that computes C := AB + C. In the directory Assignments/Week1/C execute

make IJP

to compile, link, and execute it. You can view the performance attained on your computer with the Matlab Live Script in Assignments/Week1/C/data/Plot_IJP.mlx (Alternatively, read and execute Assignments/Week1/C/data/Plot_IJP_m.m.)

YouTube: https://www.youtube.com/watch?v=3BPCfpqWWCk On Robert’s laptop, Homework 1.1.1.2 yields the graph

as the curve labeled with IJP. The time, in seconds, required to compute matrix-matrix multiplication as a function of the matrix size is plotted, where m = n = k (each matrix is square). The "dips" in the time required to complete can be attributed to a number of factors, including that other processes that are executing on the same processor may be disrupting the computation. One should not be too concerned about those.

YouTube: https://www.youtube.com/watch?v=cRyVMrNNRCk

The performance of a matrix-matrix multiplication implementation is measured in billions of floating point operations (flops) per second (GFLOPS). The idea is that we know that it takes 2mnk flops to compute C := AB + C where C is m × n, A is m × k, and B is k × n. If we measure the time it takes to complete the computation, T(m, n, k), then the rate at which we compute is given by

\frac{2mnk}{T(m, n, k)} \times 10^{-9} \mbox{ GFLOPS}.

For our implementation and the reference implementation, this yields

Again, don’t worry too much about the dips in the curves. If we controlled the environment in which we performed the experiments, (for example, by making sure few other programs are running at the time of the experiments) these would largely disappear.
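The conversion from a measured time to a rate of computation is simple arithmetic. Here is a tiny standalone example in C (the problem size and the 0.5-second timing are made-up numbers, not measurements from the course):

#include <stdio.h>

int main()
{
  /* Hypothetical example: a matrix-matrix multiplication with m = n = k = 1000
     that took 0.5 seconds to complete.                                          */
  double m = 1000, n = 1000, k = 1000, time = 0.5;

  /* rate = ( 2 m n k / T( m, n, k ) ) x 10^(-9) GFLOPS */
  double gflops = 2.0 * m * n * k / time * 1.0e-9;

  printf( "rate of computation: %5.2f GFLOPS\n", gflops );   // prints 4.00
  return 0;
}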

YouTube: https://www.youtube.com/watch?v=lNEIVq5_DW4

Remark 1.1.2 The Gemm in the name of the routine stands for General Matrix-Matrix multiplication. Gemm is an acronym that is widely used in scientific computing, with roots in the Basic Linear Algebra Subprograms (BLAS) discussed in the enrichment in Unit 1.5.1.

1.1.2 Outline Week 1

• 1.1 Opening Remarks

◦ 1.1.1 Launch ◦ 1.1.2 Outline Week 1 ◦ 1.1.3 What you will learn

• 1.2 Loop Orderings

◦ 1.2.1 Mapping matrices to memory ◦ 1.2.2 The leading dimension ◦ 1.2.3 A convention regarding the letters used for the loop index ◦ 1.2.4 Ordering the loops

• 1.3 Layering Matrix-Matrix Multiplication

◦ 1.3.1 Notation ◦ 1.3.2 The dot product (inner product) ◦ 1.3.3 Matrix-vector multiplication via dot products ◦ 1.3.4 The axpy operation ◦ 1.3.5 Matrix-vector multiplication via axpy operations ◦ 1.3.6 Matrix-matrix multiplication via matrix-vector multiplications

• 1.4 Layering Matrix-Matrix Multiplication: Alternatives

◦ 1.4.1 Rank-1 update (rank-1) ◦ 1.4.2 Matrix-matrix multiplication via rank-1 updates ◦ 1.4.3 Row-times-matrix multiplications ◦ 1.4.4 Matrix-matrix multiplication via row-times-matrix multiplications

• 1.5 Enrichments

◦ 1.5.1 The Basic Linear Algebra Subprograms ◦ 1.5.2 The BLAS-like Library Instantiation Software ◦ 1.5.3 Counting flops

• 1.6 Wrap Up

◦ 1.6.1 Additional exercises ◦ 1.6.2 Summary

1.1.3 What you will learn In this week, we not only review matrix-matrix multiplication, but we also start thinking about this operation in different ways. Upon completion of this week, we will be able to

• Map matrices to memory.

• Apply conventions to describe how to index into arrays that store matrices.

• Observe the effects of loop order on performance.

• Recognize that simple implementations may not provide the performance that can be achieved.

• Realize that compilers don’t automagically do all the optimization.

• Think about matrix-matrix multiplication in terms of matrix-vector multiplications and rank-1 updates.

• Understand how matrix-matrix multiplication can be layered in terms of simpler operations with matrices and/or vectors.

• Relate execution time of matrix-matrix multiplication to the rate of computation achieved by the implementation.

• Plot performance data with Matlab.

The enrichments introduce us to

• The Basic Linear Algebra Subprograms (BLAS), an interface widely used in computational science.

• The BLAS-like Library Instantiation Software (BLIS) framework for implementing the BLAS.

• How to count floating point operations.

1.2 Loop Orderings

1.2.1 Mapping matrices to memory

Matrices are usually visualized as two-dimensional arrays of numbers while computer memory is inherently one-dimensional in the way it is addressed. So, we need to agree on how we are going to store matrices in memory.

YouTube: https://www.youtube.com/watch?v=7dUEghwlwt0

Consider the matrix

\begin{pmatrix} 1 & -2 & 2 \\ -1 & 1 & 3 \\ -2 & 2 & -1 \end{pmatrix}

from the opener of this week. In memory, this may be stored in a one-dimensional array A by columns:

A[0] −→  1
A[1] −→ −1
A[2] −→ −2
A[3] −→ −2
A[4] −→  1
A[5] −→  2
A[6] −→  2
A[7] −→  3
A[8] −→ −1

which is known as column-major ordering.
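A quick standalone check of this mapping (a small sketch, not part of the course code): with column-major order and m rows, element (i, j) of the matrix lives at position j*m + i of the array.

#include <stdio.h>

int main()
{
  int m = 3;   // number of rows

  /* The 3 x 3 matrix above, stored by columns in a one-dimensional array. */
  double A[] = {  1, -1, -2,     // column 0
                 -2,  1,  2,     // column 1
                  2,  3, -1 };   // column 2

  int i = 2, j = 1;
  printf( "element (%d,%d) = %g\n", i, j, A[ j*m + i ] );   // prints 2
  return 0;
}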

YouTube: https://www.youtube.com/watch?v=lAxUYWCEeDM More generally, consider the matrix

\begin{pmatrix} \alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,n-1} \\ \alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,n-1} \end{pmatrix}.

Column-major ordering would store this in array A as illustrated by Figure 1.2.1.

A[0]          A[0 ∗ m + 0]             −→ α0,0
A[1]          A[0 ∗ m + 1]             −→ α1,0
   ...
A[m − 1]      A[0 ∗ m + (m − 1)]       −→ αm−1,0
A[m]          A[1 ∗ m + 0]             −→ α0,1
A[m + 1]      A[1 ∗ m + 1]             −→ α1,1
   ...
              A[1 ∗ m + (m − 1)]       −→ αm−1,1
   ...
              A[j ∗ m + i]             −→ αi,j
   ...
              A[(n − 1) ∗ m + 0]       −→ α0,n−1
              A[(n − 1) ∗ m + 1]       −→ α1,n−1
   ...
              A[(n − 1) ∗ m + (m − 1)] −→ αm−1,n−1

Figure 1.2.1: Mapping of m × n matrix A to memory with column-major order.

Obviously, one could use the alternative known as row-major ordering.

Homework 1.2.1.1 Let the following picture represent data stored in memory starting at address A:

            3
A[0] −→     1
           −1
           −2
           −2
            1
            2
            2

and let A be the 2 × 3 matrix stored there in column-major order. Then

A =

Answer.

A = \begin{pmatrix} 1 & -2 & 1 \\ -1 & -2 & 2 \end{pmatrix}

Homework 1.2.1.2 Let the following picture represent data stored in memory starting at address A:

            3
A[0] −→     1
           −1
           −2
           −2
            1
            2
            2

and let A be the 2 × 3 matrix stored there in row-major order. Then

A =

Answer.

A = \begin{pmatrix} 1 & -1 & -2 \\ -2 & 1 & 2 \end{pmatrix}

1.2.2 The leading dimension

YouTube: https://www.youtube.com/watch?v=PhjildK5oO8 Very frequently, we will work with a matrix that is a submatrix of a larger matrix. Consider Figure 1.2.2.

  1  2  3  ×
  4  5  6  ×
  7  8  9  ×
 10 11 12  ×
  ×  ×  ×  ×
  .  .  .  .
  ×  ×  ×  ×

A[0 ∗ ldA + 0] −→  1 ←− alpha(0, 0)
A[0 ∗ ldA + 1] −→  4 ←− alpha(1, 0)
A[0 ∗ ldA + 2] −→  7 ←− alpha(2, 0)
A[0 ∗ ldA + 3] −→ 10 ←− alpha(3, 0)
      (the remaining ldA − 4 entries of this column, marked ×, are not part of the submatrix)
A[1 ∗ ldA + 0] −→  2 ←− alpha(0, 1)
A[1 ∗ ldA + 1] −→  5 ←− alpha(1, 1)
A[1 ∗ ldA + 2] −→  8 ←− alpha(2, 1)
A[1 ∗ ldA + 3] −→ 11 ←− alpha(3, 1)
      ...
A[2 ∗ ldA + 0] −→  3 ←− alpha(0, 2)
A[2 ∗ ldA + 1] −→  6 ←− alpha(1, 2)
A[2 ∗ ldA + 2] −→  9 ←− alpha(2, 2)
A[2 ∗ ldA + 3] −→ 12 ←− alpha(3, 2)

Figure 1.2.2: Addressing a matrix embedded in an array with ldA rows. On the left we illustrate a 4 × 3 submatrix of an ldA × 4 matrix. In the middle, we illustrate how this is mapped into a linear array A. On the right, we show how defining the C macro #define alpha(i,j) A[ (j)∗ldA + (i)] allows us to address the matrix in a more natural way.

What we depict there is a matrix that is embedded in a larger matrix. The larger matrix consists of ldA (the leading dimension) rows and some number of columns. If column-major order is used to store the larger matrix, then addressing the elements in the submatrix requires knowledge of the leading dimension of the larger matrix. In the C programming language, if the top-left element of the submatrix is stored at address A, then one can address the (i, j) element as A[ j∗ldA + i ]. In our implementations, we define a macro that makes addressing the elements more natural:

#define alpha(i,j) A[ (j)*ldA + (i) ]

where we assume that the variable or constant ldA holds the leading dimension parameter.

Remark 1.2.3 Learn more about macro definitions in C at https://gcc.gnu.org/onlinedocs/cpp/Macro-Arguments.html. (We’d be interested in a better tutorial.)
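As a small illustration of addressing with a leading dimension (the sizes and values here are made up, chosen to mimic the matrix in the homework below), the macro lets us index a matrix that is embedded in a larger column-major array:

#include <stdio.h>

#define alpha( i,j ) A[ (j)*ldA + (i) ]   // column-major addressing with leading dimension ldA

int main()
{
  int ldA = 5;         // the larger matrix has 5 rows
  double A[ 5 * 4 ];   // storage for a 5 x 4 matrix, stored by columns

  /* Fill the 5 x 4 matrix so that element (i,j) has the value i.j . */
  for ( int j=0; j<4; j++ )
    for ( int i=0; i<5; i++ )
      alpha( i,j ) = i + j / 10.0;

  /* Element (1,1) of the matrix is entry 1*ldA + 1 = 6 of the array. */
  printf( "alpha( 1,1 ) = %3.1f lives at A[ %d ]\n", alpha( 1,1 ), 1*ldA + 1 );
  return 0;
}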

Homework 1.2.2.1 Consider the matrix

0.0 0.1 0.2 0.3
1.0 1.1 1.2 1.3
2.0 2.1 2.2 2.3
3.0 3.1 3.2 3.3
4.0 4.1 4.2 4.3

(The highlighted submatrix is the 2 × 3 block consisting of the entries 1.1, 1.2, 1.3 and 2.1, 2.2, 2.3.)

If this matrix is stored in column-major order in a linear array A, 1. The highlighted submatrix starts at A[...].

2. The number of rows (height) of the highlighted submatrix equals ....

3. The number of columns (width) of the highlighted submatrix equals ....

4. The leading dimension of the highlighted submatrix is ....

Solution.

1. The highlighted submatrix starts at A[ 6 ].

2. The number of rows of the boxed submatrix is 2.

3. The number of columns of the boxed submatrix is 3.

4. The leading dimension of the boxed submatrix is 5.

1.2.3 A convention regarding the letters used for the loop index

YouTube: https://www.youtube.com/watch?v=l8XFfkoTrHA

When we talk about loops for matrix-matrix multiplication, it helps to keep in mind the picture

which illustrates which loop index (variable name) is used for what row or column of the matrices.

• Index i is used to index the row of C and corresponding row of A.

• Index j is used to index the column of C and corresponding column of B.

• Index p is used to index the column of A and corresponding row of B.

We try to be consistent in this use, as should you.

Remark 1.2.4 In the literature, people often use i, j, and k for indexing the three loops and talk about the ijk loop ordering when we talk about the IJP ordering (later in Week 1). The problem with that convention is that k is also used for the "inner size" of the matrices (the column size of A and the row size of B). It is for this reason that we use p instead. Our convention comes with its own problems, since p is often used to indicate the number of processors involved in a computation. No convention is perfect!

1.2.4 Ordering the loops Consider again a simple algorithm for computing C := AB + C:

for i := 0, . . . , m − 1
   for j := 0, . . . , n − 1
      for p := 0, . . . , k − 1
         γi,j := αi,pβp,j + γi,j
      end
   end
end

Given that we now embrace the convention that i indexes rows of C and A, j indexes columns of C and B, and p indexes "the other dimension," we can call this the IJP ordering of the loops around the assignment statement. Different orders in which the elements of C are updated, and the order in which terms of

αi,0β0,j + ··· + αi,k−1βk−1,j

γi,j := αi,pβp,j + γi,j is performed exactly once. That is, in exact arithmetic, the result does not change. (Computation is typically performed in floating point arithmetic, in which case roundoff error may accumulate slightly differently, depending on the order of the loops. One tends to ignore this when implementing matrix-matrix multiplication.) A consequence is that the three loops can be reordered without changing the result. Homework 1.2.4.1 The IJP ordering is one possible ordering of the loops. How many distinct reorderings of those loops are there? Answer. 3! = 6.

Solution.

• There are three choices for the outer-most loop: i, j, or p.

• Once a choice is made for the outer-most loop, there are two choices left for the second loop.

• Once that choice is made, there is only one choice left for the inner-most loop.

Thus, there are 3! = 3 × 2 × 1 = 6 loop orderings.

Homework 1.2.4.2 In directory Assignments/Week1/C make copies of Assignments/Week1/C/Gemm_IJP.c into files with names that reflect the different loop orderings (Gemm_IPJ.c, etc.). Next, make the necessary changes to the loops in each file to reflect the ordering encoded in its name. Test the implementations by executing

make IPJ
make JIP
...

for each of the implementations and view the resulting performance by making the indicated changes to the Live Script in Assignments/Week1/C/data/Plot_All_Orderings.mlx (Alternatively, use the script in Assignments/Week1/C/data/Plot_All_Orderings_m.m). If you have implemented them all, you can test them all by executing

make All_Orderings

Solution. • Assignments/Week1/Answers/Gemm_IPJ.c

• Assignments/Week1/Answers/Gemm_JIP.c

• Assignments/Week1/Answers/Gemm_JPI.c

• Assignments/Week1/Answers/Gemm_PIJ.c

• Assignments/Week1/Answers/Gemm_PJI.c

YouTube: https://www.youtube.com/watch?v=13gSeXQwnrE

Figure 1.2.5: Performance comparison of all different orderings of the loops, on Robert’s laptop.

Homework 1.2.4.3 In Figure 1.2.5, the results of Homework 1.2.4.2 on Robert’s laptop are reported. What do the two loop orderings that result in the best performances have in common? You may want to use the following worksheet to answer this question:

Click here to enlarge.

Figure 1.2.6: You may want to print this figure, and mark how the algorithms access memory, as you work through the homework.

Hint. Here is how you may want to mark the answer for the first set of loops:

Solution.

Click here to enlarge.

Homework 1.2.4.4 In Homework 1.2.4.3, why do they get better performance?

Hint. Think about how matrices are mapped to memory.

Solution. Matrices are stored with column-major order, accessing contiguous data usually yields better performance, and data in columns are stored contiguously.

Homework 1.2.4.5 In Homework 1.2.4.3, why does the implementation that gets the best performance outperform the one that gets the next to best performance?

Hint. Think about how and what data get reused for each of the two implementations.

Solution. For both the JPI and PJI loop orderings, the inner loop accesses columns of C and A. However,

• Each execution of the inner loop of the JPI ordering updates the same column of C.

• Each execution of the inner loop of the PJI ordering updates a different column of C.

In Week 2 you will learn about cache memory. The JPI ordering can keep the column of C in cache memory, reducing the number of times elements of C need to be read and written. If this isn’t obvious to you, it should become clearer in the next unit and the next week.
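To make the comparison concrete, here is what the two loop orderings look like side by side (a sketch written in the style of Figure 1.1.1; the answer files linked above may differ in minor details). Only the order of the loops changes; the update in the loop body is identical.

#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define beta( i,j )  B[ (j)*ldB + i ]   // map beta( i,j )  to array B
#define gamma( i,j ) C[ (j)*ldC + i ]   // map gamma( i,j ) to array C

/* JPI ordering: for a fixed column j of C, the inner loop marches down that
   column once for every p, so the same column of C is reused k times in a row. */
void Gemm_JPI( int m, int n, int k, double *A, int ldA,
               double *B, int ldB, double *C, int ldC )
{
  for ( int j=0; j<n; j++ )
    for ( int p=0; p<k; p++ )
      for ( int i=0; i<m; i++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}

/* PJI ordering: the inner loop also marches down columns, but for each p all
   n columns of C are visited before any given column is updated again.       */
void Gemm_PJI( int m, int n, int k, double *A, int ldA,
               double *B, int ldB, double *C, int ldC )
{
  for ( int p=0; p<k; p++ )
    for ( int j=0; j<n; j++ )
      for ( int i=0; i<m; i++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}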

Remark 1.2.7 One thing that is obvious from all the exercises so far is that the gcc compiler doesn’t do a very good job of automatically reordering loops to improve performance, at least not for the way we are writing the code.

1.3 Layering Matrix-Matrix Multiplication

1.3.1 Notation

YouTube: https://www.youtube.com/watch?v=mt-kOxpinDc In our discussions, we use capital letters for matrices (A, B, C, . . .), lower case letters for vectors (a, b, c, . . .), and lower case Greek letters for scalars (α, β, γ, . . .). Exceptions are integer scalars, for which we will use k, m, n, i, j, and p. Vectors in our universe are column vectors or, equivalently, n×1 matrices if the vector has n components (size n). A row vector we view as a column vector that has been transposed. So, x is a column vector and xT is a row vector. In the subsequent discussion, we will want to expose the rows or columns of a matrix. If X is an m × n matrix, then we expose its columns as

X = \begin{pmatrix} x_0 & x_1 & \cdots & x_{n-1} \end{pmatrix}

so that xj equals the column with index j. We expose its rows as

X = \begin{pmatrix} \tilde{x}_0^T \\ \tilde{x}_1^T \\ \vdots \\ \tilde{x}_{m-1}^T \end{pmatrix}

so that x̃ᵢᵀ equals the row with index i. Here the T indicates it is a row (a column vector that has been transposed). The tilde is added for clarity since xᵢᵀ would in this setting equal the column indexed with i that has been transposed, rather than the row indexed with i. When there isn’t a cause for confusion, we will sometimes leave the tilde off. We use the lower case letter that corresponds to the upper case letter used to denote the matrix, as an added visual clue that xj is a column of X and x̃ᵢᵀ is a row of X. We have already seen that the scalars that constitute the elements of a matrix or vector are denoted with the lower case Greek letter that corresponds to the letter used for the matrix or vector:

X = \begin{pmatrix} \chi_{0,0} & \chi_{0,1} & \cdots & \chi_{0,n-1} \\ \chi_{1,0} & \chi_{1,1} & \cdots & \chi_{1,n-1} \\ \vdots & \vdots & & \vdots \\ \chi_{m-1,0} & \chi_{m-1,1} & \cdots & \chi_{m-1,n-1} \end{pmatrix} \quad \mbox{and} \quad x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{pmatrix}.

If you look carefully, you will notice the difference between x and χ. The latter is the lower case Greek letter "chi."

Remark 1.3.1 Since this course will discuss the computation C := AB + C, you will only need to remember the Greek letters α (alpha), β (beta), and γ (gamma).

Homework 1.3.1.1 Consider

A = \begin{pmatrix} -2 & 1 & -1 \\ 2 & 0 & 1 \\ 3 & -2 & 2 \end{pmatrix}.

Identify

• α1,2 =

• a0 =

• ã₂ᵀ =

Solution.

• α1,2 = 1.

• a0 = \begin{pmatrix} -2 \\ 2 \\ 3 \end{pmatrix}.

• ã₂ᵀ = \begin{pmatrix} 3 & -2 & 2 \end{pmatrix}.

1.3.2 The dot product (inner product)

YouTube: https://www.youtube.com/watch?v=62GO5uCCaGU

Given two vectors x and y of size n

x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{n-1} \end{pmatrix} \quad \mbox{and} \quad y = \begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{n-1} \end{pmatrix},

x^T y = \chi_0 \psi_0 + \chi_1 \psi_1 + \cdots + \chi_{n-1} \psi_{n-1} = \sum_{i=0}^{n-1} \chi_i \psi_i.

The notation xT y comes from the fact that the dot product also equals the result of multiplying 1 × n matrix xT times n × 1 matrix y.

Example 1.3.2

\begin{pmatrix} -1 \\ 2 \\ 3 \end{pmatrix}^T \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}
   = < view the first vector as a 3 × 1 matrix and transpose it >
\begin{pmatrix} -1 & 2 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}
   = < multiply the 1 × 3 matrix times the 3 × 1 matrix >
(−1) × (1) + (2) × (0) + (3) × (−1)
   = −4.

 Pseudo-code for computing γ := xT y + γ is given by

for i := 0, . . . , n − 1
   γ := χiψi + γ
end

YouTube: https://www.youtube.com/watch?v=f1_shSplMiY

Homework 1.3.2.1 Consider

C = \underbrace{\begin{pmatrix} 1 & -2 & 2 \\ -1 & 1 & 3 \\ -2 & 2 & -1 \end{pmatrix}}_{A} \underbrace{\begin{pmatrix} -2 & 1 \\ 1 & 3 \\ -1 & 2 \end{pmatrix}}_{B}.

Compute γ2,1. Solution.

\gamma_{2,1} = \begin{pmatrix} -2 & 2 & -1 \end{pmatrix} \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} = (-2) \times (1) + (2) \times (3) + (-1) \times (2) = 2.

The point we are trying to make here is that γ2,1 is computed as the dot product of the third row of A (the row indexed with 2) with the last column of B (the column indexed with 1):

\gamma_{2,1} = \begin{pmatrix} -2 \\ 2 \\ -1 \end{pmatrix}^T \begin{pmatrix} 1 \\ 3 \\ 2 \end{pmatrix} = (-2) \times (1) + (2) \times (3) + (-1) \times (2) = 2.

Importantly, if you think back about how matrices are stored in column-major order, marching through memory from one element to the next element in the last row of A, as you access ã₂ᵀ, requires "striding" through memory.

YouTube: https://www.youtube.com/watch?v=QCjJA5jqnr0 A routine, coded in C, that computes xT y + γ where x and y are stored at location x with stride incx and location y with stride incy, respectively, and γ is stored at location gamma, is given by

#define chi( i )  x[ (i)*incx ]   // map chi( i )  to array x
#define psi( i )  y[ (i)*incy ]   // map psi( i )  to array y

void Dots( int n, double *x, int incx, double *y, int incy, double *gamma )
{
  for ( int i=0; i<n; i++ )
    *gamma += chi( i ) * psi( i );
}

Homework 1.3.2.2 Consider the following double precision values stored in sequential memory starting at address A

            3
A[0] −→    −1
            0
            2
            3
           −1
            1
            4
           −1
            3
            5

and assume gamma initially contains the value 1. After executing

Dots( 3, &A[1], 4, &A[3], 1, &gamma )

What are the contents of variable gamma? Answer. 5 Solution. The first 3 indicates that the vectors are of size 3. The &A[1], 4 identifies the vector starting at address &A[1] with stride 4:

           −1
A[1] −→     0
            2
            3
           −1
A[1 + 4] −→ 1
            4
           −1
            3
A[1 + 8] −→ 5

and the &A[3], 1 identifies the vector starting at address &A[3] with stride 1:

           −1
            0
            2
A[3] −→     3
A[3 + 1] −→ −1
A[3 + 2] −→ 1
            4
           −1
            3
            5

Hence, upon completion, gamma contains

1 + \begin{pmatrix} 0 \\ 1 \\ 5 \end{pmatrix}^T \begin{pmatrix} 3 \\ -1 \\ 1 \end{pmatrix} = 1 + (0) \times (3) + (1) \times (-1) + (5) \times (1) = 5.

1.3.3 Matrix-vector multiplication via dot products

Homework 1.3.3.1 Compute

\begin{pmatrix} 2 & 2 & -1 & 2 \\ 2 & 1 & 0 & -2 \\ -2 & -2 & 2 & 2 \end{pmatrix} \begin{pmatrix} 2 \\ -1 \\ 0 \\ -1 \end{pmatrix} =

Solution.

\begin{pmatrix} 2 & 2 & -1 & 2 \\ 2 & 1 & 0 & -2 \\ -2 & -2 & 2 & 2 \end{pmatrix} \begin{pmatrix} 2 \\ -1 \\ 0 \\ -1 \end{pmatrix}
= \begin{pmatrix} \begin{pmatrix} 2 & 2 & -1 & 2 \end{pmatrix} \begin{pmatrix} 2 \\ -1 \\ 0 \\ -1 \end{pmatrix} \\ \begin{pmatrix} 2 & 1 & 0 & -2 \end{pmatrix} \begin{pmatrix} 2 \\ -1 \\ 0 \\ -1 \end{pmatrix} \\ \begin{pmatrix} -2 & -2 & 2 & 2 \end{pmatrix} \begin{pmatrix} 2 \\ -1 \\ 0 \\ -1 \end{pmatrix} \end{pmatrix}
= \begin{pmatrix} 0 \\ 5 \\ -4 \end{pmatrix}

(Here we really care more about exposing the dot products of rows with the vector than the final answer.)

YouTube: https://www.youtube.com/watch?v=lknmeBwMp4o Consider the matrix-vector multiplication

y := Ax + y.

The way one is usually taught to compute this operation is that each element of y, ψi, is updated with the dot product of the corresponding row of A, ãᵢᵀ, with vector x. With our notation, we can describe this as

\begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix} :=
\begin{pmatrix} \tilde{a}_0^T \\ \tilde{a}_1^T \\ \vdots \\ \tilde{a}_{m-1}^T \end{pmatrix} x +
\begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix}
=
\begin{pmatrix} \tilde{a}_0^T x \\ \tilde{a}_1^T x \\ \vdots \\ \tilde{a}_{m-1}^T x \end{pmatrix} +
\begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix}
=
\begin{pmatrix} \tilde{a}_0^T x + \psi_0 \\ \tilde{a}_1^T x + \psi_1 \\ \vdots \\ \tilde{a}_{m-1}^T x + \psi_{m-1} \end{pmatrix}.

Remark 1.3.3 The way we remember how to do the partitioned matrix-vector multiplication

\begin{pmatrix} \tilde{a}_0^T \\ \tilde{a}_1^T \\ \vdots \\ \tilde{a}_{m-1}^T \end{pmatrix} x =
\begin{pmatrix} \tilde{a}_0^T x \\ \tilde{a}_1^T x \\ \vdots \\ \tilde{a}_{m-1}^T x \end{pmatrix}

is to replace the m with a specific integer, 3, and each symbol with a number:

\begin{pmatrix} 1 \\ -2 \\ 3 \end{pmatrix} (-1).

If you view the first as a 3 × 1 matrix and the second as a 1 × 1 matrix, the result is

\begin{pmatrix} 1 \\ -2 \\ 3 \end{pmatrix} (-1) = \begin{pmatrix} (1) \times (-1) \\ (-2) \times (-1) \\ (3) \times (-1) \end{pmatrix}. \quad\quad (1.3.1)

You should then notice that multiplication with the partitioned matrix and vector works the same way:

\[
\begin{pmatrix} \widetilde a_0^T \\ \widetilde a_1^T \\ \vdots \\ \widetilde a_{m-1}^T \end{pmatrix} x
=
\begin{pmatrix} (\widetilde a_0^T)\times(x) \\ (\widetilde a_1^T)\times(x) \\ \vdots \\ (\widetilde a_{m-1}^T)\times(x) \end{pmatrix} \qquad (1.3.2)
\]
with the change that in (1.3.1) the multiplication with numbers commutes while in (1.3.2) the multiplication does not (in general) commute:

\[
(-2)\times(-1) = (-1)\times(-2)
\]
but, in general,
\[
\widetilde a_i^T x \neq x \widetilde a_i^T.
\]
Indeed, $\widetilde a_i^T x$ is a dot product (inner product) while we will later be reminded that $x \widetilde a_i^T$ is an outer product of vectors $\widetilde a_i$ and x.

If we then expose the individual elements of A and y we get

\[
\begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix}
:=
\begin{pmatrix}
\alpha_{0,0}\chi_0 + \alpha_{0,1}\chi_1 + \cdots + \alpha_{0,n-1}\chi_{n-1} + \psi_0 \\
\alpha_{1,0}\chi_0 + \alpha_{1,1}\chi_1 + \cdots + \alpha_{1,n-1}\chi_{n-1} + \psi_1 \\
\vdots \\
\alpha_{m-1,0}\chi_0 + \alpha_{m-1,1}\chi_1 + \cdots + \alpha_{m-1,n-1}\chi_{n-1} + \psi_{m-1}
\end{pmatrix}.
\]

This discussion explains the IJ loop for computing y := Ax + y:

for i := 0, . . . , m − 1
    for j := 0, . . . , n − 1
        ψi := αi,j χj + ψi
    end                              (the inner loop computes ψi := ãi^T x + ψi)
end

where we notice that again the i index ranges over the rows of the matrix and the j index ranges over the columns of the matrix.
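For concreteness, the IJ loop translates almost directly into C. The following is a minimal sketch (the routine name Gemv_IJ is hypothetical; the assignment below instead casts the inner loop as a call to Dots), using the alpha/chi/psi macros employed throughout this week:

#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define chi( i ) x[ (i)*incx ]          // map chi( i ) to array x
#define psi( i ) y[ (i)*incy ]          // map psi( i ) to array y

void Gemv_IJ( int m, int n, double *A, int ldA,
              double *x, int incx, double *y, int incy )
{
  // y := A x + y, one dot product (row of A with x) at a time
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      psi( i ) += alpha( i,j ) * chi( j );
}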

Homework 1.3.3.2 In directory Assignments/Week1/C complete the implementation of matrix-vector multiplication in terms of dot operations:

#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define chi( i ) x[ (i)*incx ]          // map chi( i ) to array x
#define psi( i ) y[ (i)*incy ]          // map psi( i ) to array y

void Dots( int, const double *, int, const double *, int, double * );

void MyGemv( int m, int n, double *A, int ldA, double *x, int incx,
             double *y, int incy )
{
  for ( int i=0; i<m; i++ )
    Dots( n, &alpha( i,0 ), ldA, x, incx, &psi( i ) );
}

Upon completion, this should print

It appears all is well

Admittedly, this is a very incomplete test of the routine. However, it doesn’t play much of a role in the course, so we choose to be sloppy. Solution. Assignments/Week1/Answers/Gemv_I_Dots.c.

YouTube: https://www.youtube.com/watch?v=fGS9P_fgo6Q Remark 1.3.4 The file name Gemv_I_Dots.c can be decoded as follows: • The Gemv stands for "GEneral Matrix-Vector multiplication."

• The I indicates a loop indexed by i (a loop over rows of the matrix).

• The Dots indicates that the operation performed in the body of the loop is a dot product (hence the Dot) with the result added to a scalar (hence the s).

1.3.4 The axpy operation Homework 1.3.4.1 Compute

 2   2      (−2)  −1  +  1  3 0

Solution.

 2   2   (−2) × (2)   2          (−2)  −1  +  1  =  (−2) × (−1)  +  1  3 0 (−2) × (3) 0  (−2) × (2) + 2   −2      =  (−2) × (−1) + 1  =  3  (−2) × (3) + 0 −6

YouTube: https://www.youtube.com/watch?v=O-D5l3OVstE
Given a scalar, α, and two vectors, x and y, of size n with elements
\[
x = \begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{n-1} \end{pmatrix}
\quad\text{and}\quad
y = \begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{n-1} \end{pmatrix},
\]
the scaled vector addition (axpy) operation is given by

y := αx + y which in terms of the elements of the vectors equals

\[
\begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{n-1} \end{pmatrix}
:= \alpha \begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{n-1} \end{pmatrix}
+ \begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{n-1} \end{pmatrix}
= \begin{pmatrix} \alpha\chi_0 \\ \alpha\chi_1 \\ \vdots \\ \alpha\chi_{n-1} \end{pmatrix}
+ \begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{n-1} \end{pmatrix}
= \begin{pmatrix} \alpha\chi_0 + \psi_0 \\ \alpha\chi_1 + \psi_1 \\ \vdots \\ \alpha\chi_{n-1} + \psi_{n-1} \end{pmatrix}.
\]

The name axpy comes from the fact that in Fortran 77 only six characters and numbers could be used to designate the names of variables and functions. The operation αx + y can be read out loud as "scalar alpha times x plus y" which yields the acronym axpy.

Homework 1.3.4.2 An outline for a routine that implements the axpy operation is given by

#define chi( i ) x[ (i)*incx ]   // map chi( i ) to array x
#define psi( i ) y[ (i)*incy ]   // map psi( i ) to array y

void Axpy( int n, double alpha, double *x, int incx, double *y, int incy )
{
  for ( int i=0; i<n; i++ )
    psi( i ) += alpha * chi( i );
}

in file Assignments/Week1/C/Axpy.c. Complete the routine and test the implementation by using it in the next unit. Solution. Assignments/Week1/Answers/Axpy.c
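As a quick sanity check, Homework 1.3.4.1 can be reproduced with a call to this routine. The driver below is hypothetical (not part of the assignments) and repeats the Axpy routine so that the example is self-contained:

#include <stdio.h>

#define chi( i ) x[ (i)*incx ]   // map chi( i ) to array x
#define psi( i ) y[ (i)*incy ]   // map psi( i ) to array y

void Axpy( int n, double alpha, double *x, int incx, double *y, int incy )
{
  for ( int i=0; i<n; i++ )
    psi( i ) += alpha * chi( i );
}

int main()
{
  double x[3] = { 2.0, -1.0, 3.0 }, y[3] = { 2.0, 1.0, 0.0 };
  Axpy( 3, -2.0, x, 1, y, 1 );               // y := (-2) x + y
  printf( "%g %g %g\n", y[0], y[1], y[2] );  // prints -2 3 -6
  return 0;
}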

1.3.5 Matrix-vector multiplication via axpy operations

Homework 1.3.5.1 Complete

Click here to enlarge. Solution.

YouTube: https://www.youtube.com/watch?v=_yx9c5OyITI We now discuss how matrix-vector multiplication can be cast in terms of axpy operations.

YouTube: https://www.youtube.com/watch?v=1uryOws5320 We capture the insights from this last exercise: Partition m × n matrix A by columns and x by individual elements.

\[
y := \begin{pmatrix} a_0 & a_1 & \cdots & a_{n-1} \end{pmatrix}
\begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{n-1} \end{pmatrix} + y
= \chi_0 a_0 + \chi_1 a_1 + \cdots + \chi_{n-1} a_{n-1} + y.
\]

If we then expose the individual elements of A and y we get
\[
\begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix}
:= \chi_0 \begin{pmatrix} \alpha_{0,0} \\ \alpha_{1,0} \\ \vdots \\ \alpha_{m-1,0} \end{pmatrix}
+ \chi_1 \begin{pmatrix} \alpha_{0,1} \\ \alpha_{1,1} \\ \vdots \\ \alpha_{m-1,1} \end{pmatrix}
+ \cdots
+ \chi_{n-1} \begin{pmatrix} \alpha_{0,n-1} \\ \alpha_{1,n-1} \\ \vdots \\ \alpha_{m-1,n-1} \end{pmatrix}
+ \begin{pmatrix} \psi_0 \\ \psi_1 \\ \vdots \\ \psi_{m-1} \end{pmatrix}
\]
\[
= \begin{pmatrix}
\chi_0\alpha_{0,0} + \chi_1\alpha_{0,1} + \cdots + \chi_{n-1}\alpha_{0,n-1} + \psi_0 \\
\chi_0\alpha_{1,0} + \chi_1\alpha_{1,1} + \cdots + \chi_{n-1}\alpha_{1,n-1} + \psi_1 \\
\vdots \\
\chi_0\alpha_{m-1,0} + \chi_1\alpha_{m-1,1} + \cdots + \chi_{n-1}\alpha_{m-1,n-1} + \psi_{m-1}
\end{pmatrix}
= \begin{pmatrix}
\alpha_{0,0}\chi_0 + \alpha_{0,1}\chi_1 + \cdots + \alpha_{0,n-1}\chi_{n-1} + \psi_0 \\
\alpha_{1,0}\chi_0 + \alpha_{1,1}\chi_1 + \cdots + \alpha_{1,n-1}\chi_{n-1} + \psi_1 \\
\vdots \\
\alpha_{m-1,0}\chi_0 + \alpha_{m-1,1}\chi_1 + \cdots + \alpha_{m-1,n-1}\chi_{n-1} + \psi_{m-1}
\end{pmatrix}.
\]
This discussion explains the JI loop for computing y := Ax + y:

for j := 0, . . . , n − 1
    for i := 0, . . . , m − 1
        ψi := αi,j χj + ψi
    end                              (the inner loop computes y := χj aj + y)
end

What it also demonstrates is how matrix-vector multiplication can be implemented as a sequence of axpy operations.

Homework 1.3.5.2 In directory Assignments/Week1/C complete the implementation of matrix-vector multiplication in terms of axpy operations:

#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define chi( i ) x[ (i)*incx ]          // map chi( i ) to array x
#define psi( i ) y[ (i)*incy ]          // map psi( i ) to array y

void Axpy( int, double, double *, int, double *, int );

void MyGemv( int m, int n, double *A, int ldA, double *x, int incx,
             double *y, int incy )
{
  for ( int j=0; j<n; j++ )
    Axpy( m, chi( j ), &alpha( 0,j ), 1, y, incy );
}

Remember: you implemented the Axpy routine in the last unit, and we did not test it... So, if you get a wrong answer, the problem may be in that implementation. Hint.

YouTube: https://www.youtube.com/watch?v=EYHkioY639c Solution. Assignments/Week1/Answers/Gemv_J_Axpy.c.

Remark 1.3.5 The file name Gemv_J_Axpy.c can be decoded as follows: • The Gemv stands for "GEneral Matrix-Vector multiplication."

• The J indicates a loop indexed by j (a loop over columns of the matrix).

• The Axpy indicates that the operation performed in the body of the loop is an axpy operation.

1.3.6 Matrix-matrix multiplication via matrix-vector multiplications

Homework 1.3.6.1 Fill in the blanks:

Click here to enlarge. Solution.

Click here to enlarge.

YouTube: https://www.youtube.com/watch?v=sXQ-QDne9uw

YouTube: https://www.youtube.com/watch?v=6-OReBSHuGc
Now that we are getting comfortable with partitioning matrices and vectors, we can view the six algorithms for C := AB + C in a more layered fashion. If we partition C and B by columns, we find that
\[
\begin{pmatrix} c_0 & c_1 & \cdots & c_{n-1} \end{pmatrix}
:= A \begin{pmatrix} b_0 & b_1 & \cdots & b_{n-1} \end{pmatrix}
+ \begin{pmatrix} c_0 & c_1 & \cdots & c_{n-1} \end{pmatrix}
= \begin{pmatrix} A b_0 + c_0 & A b_1 + c_1 & \cdots & A b_{n-1} + c_{n-1} \end{pmatrix}.
\]

A picture that captures this is given by

This illustrates how the JIP and JPI algorithms can be layered as a loop around matrix-vector multiplications, which itself can be layered as a loop around dot products or axpy operations:

for j := 0, . . . , n − 1
    for p := 0, . . . , k − 1
        for i := 0, . . . , m − 1
            γi,j := αi,p βp,j + γi,j
        end                              (axpy: cj := βp,j ap + cj)
    end                                  (mv mult: cj := A bj + cj)
end

and

for j := 0, . . . , n − 1
    for i := 0, . . . , m − 1
        for p := 0, . . . , k − 1
            γi,j := αi,p βp,j + γi,j
        end                              (dots: γi,j := ãi^T bj + γi,j)
    end                                  (mv mult: cj := A bj + cj)
end

Homework 1.3.6.2 Complete the code in Assignments/Week1/C/Gemm_J_Gemv.c. Test two versions:

make J_Gemv_I_Dots
make J_Gemv_J_Axpy

View the resulting performance by making the necessary changes to the Live Script in Assignments/Week1/C/Plot_Outer_J.mlx. (Alternatively, use the script in Assignments/Week1/C/data/Plot_Outer_J_m.m.) Solution. Assignments/Week1/Answers/Gemm_J_Gemv.c. Remark 1.3.6 The file name Gemm_J_Gemv.c can be decoded as follows: • The Gemm stands for "GEneral Matrix-Matrix multiplication."

• The J indicates a loop indexed by j (a loop over columns of matrix C).

• The Gemv indicates that the operation performed in the body of the loop is matrix-vector multiplication.

Similarly, J_Gemv_I_Dots means that the implementation of matrix-matrix multiplication being compiled and executed consists of a loop indexed by j (a loop over the columns of matrix C) which itself calls a matrix-vector multiplication that is implemented as a loop indexed by i (over the rows of the matrix with which the matrix-vector multiplication is performed), with a dot product in the body of the loop. Remark 1.3.7 Hopefully you are catching on to our naming convention. It is a bit ad hoc and we are probably not completely consistent. The names give some hint as to what the implementation looks like. Going forward, we refrain from further explaining the convention.
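For reference, here is a minimal sketch of how the loop around matrix-vector multiplications in Gemm_J_Gemv.c can be structured, using the macro conventions of this week (the actual assignment file may differ in details):

#define beta( i,j )  B[ (j)*ldB + i ]   // map beta( i,j )  to array B
#define gamma( i,j ) C[ (j)*ldC + i ]   // map gamma( i,j ) to array C

void MyGemv( int, int, double *, int, double *, int, double *, int );

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  // Loop over the columns of C: c_j := A b_j + c_j
  for ( int j=0; j<n; j++ )
    MyGemv( m, k, A, ldA, &beta( 0,j ), 1, &gamma( 0,j ), 1 );
}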

1.4 Layering Matrix-Matrix Multiplication: Alterna- tives

1.4.1 Rank-1 update (rank-1)

An operation that will become very important in future discussion and opti- mization of matrix-matrix multiplication is the rank-1 update:

A := xyT + A.

Homework 1.4.1.1 Fill in the blanks:

Click here to enlarge. Solution.

Click here to enlarge.

YouTube: https://www.youtube.com/watch?v=5YxKCoUEdyM

YouTube: https://www.youtube.com/watch?v=0Emae-CWkNo

YouTube: https://www.youtube.com/watch?v=FsdMId76ejE More generally,

A := xy^T + A is computed as
\[
\begin{pmatrix}
\alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,n-1} \\
\alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,n-1} \\
\vdots & \vdots & & \vdots \\
\alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,n-1}
\end{pmatrix}
:=
\begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{pmatrix}
\begin{pmatrix} \psi_0 & \psi_1 & \cdots & \psi_{n-1} \end{pmatrix}
+
\begin{pmatrix}
\alpha_{0,0} & \alpha_{0,1} & \cdots & \alpha_{0,n-1} \\
\alpha_{1,0} & \alpha_{1,1} & \cdots & \alpha_{1,n-1} \\
\vdots & \vdots & & \vdots \\
\alpha_{m-1,0} & \alpha_{m-1,1} & \cdots & \alpha_{m-1,n-1}
\end{pmatrix}
\]
\[
=
\begin{pmatrix}
\chi_0\psi_0 + \alpha_{0,0} & \chi_0\psi_1 + \alpha_{0,1} & \cdots & \chi_0\psi_{n-1} + \alpha_{0,n-1} \\
\chi_1\psi_0 + \alpha_{1,0} & \chi_1\psi_1 + \alpha_{1,1} & \cdots & \chi_1\psi_{n-1} + \alpha_{1,n-1} \\
\vdots & \vdots & & \vdots \\
\chi_{m-1}\psi_0 + \alpha_{m-1,0} & \chi_{m-1}\psi_1 + \alpha_{m-1,1} & \cdots & \chi_{m-1}\psi_{n-1} + \alpha_{m-1,n-1}
\end{pmatrix}
\]
so that each entry αi,j is updated by

\[
\alpha_{i,j} := \chi_i \psi_j + \alpha_{i,j}.
\]
If we now focus on the columns in this last matrix, we find that
\[
\begin{pmatrix}
\chi_0\psi_0 + \alpha_{0,0} & \cdots & \chi_0\psi_{n-1} + \alpha_{0,n-1} \\
\chi_1\psi_0 + \alpha_{1,0} & \cdots & \chi_1\psi_{n-1} + \alpha_{1,n-1} \\
\vdots & & \vdots \\
\chi_{m-1}\psi_0 + \alpha_{m-1,0} & \cdots & \chi_{m-1}\psi_{n-1} + \alpha_{m-1,n-1}
\end{pmatrix}
=
\begin{pmatrix}
\begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{pmatrix}\psi_0 + \begin{pmatrix} \alpha_{0,0} \\ \alpha_{1,0} \\ \vdots \\ \alpha_{m-1,0} \end{pmatrix}
& \cdots &
\begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{pmatrix}\psi_{n-1} + \begin{pmatrix} \alpha_{0,n-1} \\ \alpha_{1,n-1} \\ \vdots \\ \alpha_{m-1,n-1} \end{pmatrix}
\end{pmatrix}
\]
\[
=
\begin{pmatrix}
\psi_0\begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{pmatrix} + \begin{pmatrix} \alpha_{0,0} \\ \alpha_{1,0} \\ \vdots \\ \alpha_{m-1,0} \end{pmatrix}
& \cdots &
\psi_{n-1}\begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{pmatrix} + \begin{pmatrix} \alpha_{0,n-1} \\ \alpha_{1,n-1} \\ \vdots \\ \alpha_{m-1,n-1} \end{pmatrix}
\end{pmatrix}.
\]
What this illustrates is that we could have partitioned A by columns and y by elements to find that
\[
\begin{pmatrix} a_0 & a_1 & \cdots & a_{n-1} \end{pmatrix}
:= x \begin{pmatrix} \psi_0 & \psi_1 & \cdots & \psi_{n-1} \end{pmatrix} + \begin{pmatrix} a_0 & a_1 & \cdots & a_{n-1} \end{pmatrix}
= \begin{pmatrix} x\psi_0 + a_0 & x\psi_1 + a_1 & \cdots & x\psi_{n-1} + a_{n-1} \end{pmatrix}
= \begin{pmatrix} \psi_0 x + a_0 & \psi_1 x + a_1 & \cdots & \psi_{n-1} x + a_{n-1} \end{pmatrix}.
\]

This discussion explains the JI loop ordering for computing A := xy^T + A:

for j := 0, . . . , n − 1
    for i := 0, . . . , m − 1
        αi,j := χi ψj + αi,j
    end                              (the inner loop computes aj := ψj x + aj)
end

It demonstrates how the rank-1 operation can be implemented as a sequence of axpy operations.
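A minimal sketch of such a loop, using the MyGer interface and the Axpy routine from Unit 1.3.4 (the actual file Ger_J_Axpy.c may be organized differently):

#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define chi( i ) x[ (i)*incx ]          // map chi( i ) to array x
#define psi( i ) y[ (i)*incy ]          // map psi( i ) to array y

void Axpy( int, double, double *, int, double *, int );

void MyGer( int m, int n, double *x, int incx, double *y, int incy,
            double *A, int ldA )
{
  // Loop over the columns of A: a_j := psi_j x + a_j
  for ( int j=0; j<n; j++ )
    Axpy( m, psi( j ), x, incx, &alpha( 0,j ), 1 );
}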

Homework 1.4.1.2 In Assignments/Week1/C/Ger_J_Axpy.c complete the implementation of rank-1 in terms of axpy operations. You can test it by executing make Ger_J_Axpy

Solution. Assignments/Week1/Answers/Ger_J_Axpy.c

Homework 1.4.1.3 Fill in the blanks:

Click here to enlarge. Solution.

Click here to enlarge.

YouTube: https://www.youtube.com/watch?v=r_l93XMfJ1Q The last homework suggests that there is also an IJ loop ordering that can be explained by partitioning A by rows and x by elements:

\[
\begin{pmatrix} \widetilde a_0^T \\ \widetilde a_1^T \\ \vdots \\ \widetilde a_{m-1}^T \end{pmatrix}
:=
\begin{pmatrix} \chi_0 \\ \chi_1 \\ \vdots \\ \chi_{m-1} \end{pmatrix} y^T
+
\begin{pmatrix} \widetilde a_0^T \\ \widetilde a_1^T \\ \vdots \\ \widetilde a_{m-1}^T \end{pmatrix}
=
\begin{pmatrix} \chi_0 y^T + \widetilde a_0^T \\ \chi_1 y^T + \widetilde a_1^T \\ \vdots \\ \chi_{m-1} y^T + \widetilde a_{m-1}^T \end{pmatrix}
\]
leading to the algorithm

for i := 0, . . . , m − 1
    for j := 0, . . . , n − 1
        αi,j := χi ψj + αi,j
    end                              (the inner loop computes ãi^T := χi y^T + ãi^T)
end

and corresponding implementation.
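A corresponding sketch, assuming the same macros and the Axpy routine as before (the actual file Ger_I_Axpy.c may differ in details):

#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define chi( i ) x[ (i)*incx ]          // map chi( i ) to array x

void Axpy( int, double, double *, int, double *, int );

void MyGer( int m, int n, double *x, int incx, double *y, int incy,
            double *A, int ldA )
{
  // Loop over the rows of A: a~_i^T := chi_i y^T + a~_i^T
  // Row i of A starts at &alpha( i,0 ) and has stride ldA.
  for ( int i=0; i<m; i++ )
    Axpy( n, chi( i ), y, incy, &alpha( i,0 ), ldA );
}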

Homework 1.4.1.4 In Assignments/Week1/C/Ger_I_Axpy.c complete the implementation of rank-1 in terms of axpy operations (by rows). You can test it by executing make Ger_I_Axpy

Solution. Assignments/Week1/Answers/Ger_I_Axpy.c

1.4.2 Matrix-matrix multiplication via rank-1 updates

Homework 1.4.2.1 Fill in the blanks:

Click here to enlarge. Solution.

Click here to enlarge.

YouTube: https://www.youtube.com/watch?v=YxAiEueEfNk Let us partition A by columns and B by rows, so that

\[
C := \begin{pmatrix} a_0 & a_1 & \cdots & a_{k-1} \end{pmatrix}
\begin{pmatrix} \widetilde b_0^T \\ \widetilde b_1^T \\ \vdots \\ \widetilde b_{k-1}^T \end{pmatrix} + C
= a_0 \widetilde b_0^T + a_1 \widetilde b_1^T + \cdots + a_{k-1} \widetilde b_{k-1}^T + C.
\]

A picture that captures this is given by

This illustrates how the PJI and PIJ algorithms can be viewed as a loop around rank-1 updates:

for p := 0, . . . , k − 1
    for j := 0, . . . , n − 1
        for i := 0, . . . , m − 1
            γi,j := αi,p βp,j + γi,j
        end                              (axpy: cj := βp,j ap + cj)
    end                                  (rank-1 update: C := ap b̃p^T + C)
end

and

for p := 0, . . . , k − 1
    for i := 0, . . . , m − 1
        for j := 0, . . . , n − 1
            γi,j := αi,p βp,j + γi,j
        end                              (axpy: c̃i^T := αi,p b̃p^T + c̃i^T)
    end                                  (rank-1 update: C := ap b̃p^T + C)
end

Homework 1.4.2.2 Complete the code in Assignments/Week1/C/Gemm_P_Ger.c. Test two versions:

make P_Ger_J_Axpy
make P_Ger_I_Axpy

View the resulting performance by making the necessary changes to the Live Script in Assignments/Week1/C/Plot_Outer_P.mlx. (Alternatively, use the script in Assignments/Week1/C/data/Plot_Outer_P_m.m.) Solution.

• Assignments/Week1/Answers/Gemm_P_Ger.c.

1.4.3 Row-times-matrix multiplication

YouTube: https://www.youtube.com/watch?v=eonBpl9XfAc
An operation that is closely related to matrix-vector multiplication is the multiplication of a row times a matrix, which in the setting of this course updates a row vector:

yT := xT A + yT .

If we partition A by columns and yT by elements, we get

\[
\begin{pmatrix} \psi_0 & \psi_1 & \cdots & \psi_{n-1} \end{pmatrix}
:= x^T \begin{pmatrix} a_0 & a_1 & \cdots & a_{n-1} \end{pmatrix}
+ \begin{pmatrix} \psi_0 & \psi_1 & \cdots & \psi_{n-1} \end{pmatrix}
= \begin{pmatrix} x^T a_0 + \psi_0 & x^T a_1 + \psi_1 & \cdots & x^T a_{n-1} + \psi_{n-1} \end{pmatrix}.
\]

This can be implemented as a loop:

for j := 0, . . . , n − 1
    for i := 0, . . . , m − 1
        ψj := χi αi,j + ψj
    end                              (dot: ψj := x^T aj + ψj)
end

YouTube: https://www.youtube.com/watch?v=2VWSXkg1kGk Alternatively, if we partition A by rows and xT by elements, we get

\[
y^T := \begin{pmatrix} \chi_0 & \chi_1 & \cdots & \chi_{m-1} \end{pmatrix}
\begin{pmatrix} \widetilde a_0^T \\ \widetilde a_1^T \\ \vdots \\ \widetilde a_{m-1}^T \end{pmatrix} + y^T
= \chi_0 \widetilde a_0^T + \chi_1 \widetilde a_1^T + \cdots + \chi_{m-1} \widetilde a_{m-1}^T + y^T.
\]
This can be implemented as a loop:

for i := 0, . . . , m − 1
    for j := 0, . . . , n − 1
        ψj := χi αi,j + ψj
    end                              (axpy: y^T := χi ãi^T + y^T)
end

There is an alternative way of looking at this operation:

\[
y^T := x^T A + y^T
\quad\text{is equivalent to}\quad
(y^T)^T := (x^T A)^T + (y^T)^T
\quad\text{and hence}\quad
y := A^T x + y.
\]
Thus, this operation is equivalent to matrix-vector multiplication with the transpose of the matrix.
Remark 1.4.1 Since this operation does not play a role in our further discussions, we do not include exercises related to it.

1.4.4 Matrix-matrix multiplication via row-times-matrix multiplications

YouTube: https://www.youtube.com/watch?v=4YhQ42fxevY Finally, let us partition C and A by rows so that

\[
\begin{pmatrix} \widetilde c_0^T \\ \widetilde c_1^T \\ \vdots \\ \widetilde c_{m-1}^T \end{pmatrix}
:=
\begin{pmatrix} \widetilde a_0^T \\ \widetilde a_1^T \\ \vdots \\ \widetilde a_{m-1}^T \end{pmatrix} B
+
\begin{pmatrix} \widetilde c_0^T \\ \widetilde c_1^T \\ \vdots \\ \widetilde c_{m-1}^T \end{pmatrix}
=
\begin{pmatrix} \widetilde a_0^T B + \widetilde c_0^T \\ \widetilde a_1^T B + \widetilde c_1^T \\ \vdots \\ \widetilde a_{m-1}^T B + \widetilde c_{m-1}^T \end{pmatrix}.
\]

A picture that captures this is given by

This illustrates how the IJP and IPJ algorithms can be viewed as a loop around the updating of a row of C with the product of the corresponding row of A times matrix B:

for i := 0, . . . , m − 1
    for j := 0, . . . , n − 1
        for p := 0, . . . , k − 1
            γi,j := αi,p βp,j + γi,j
        end                              (dot: γi,j := ãi^T bj + γi,j)
    end                                  (row-matrix mult: c̃i^T := ãi^T B + c̃i^T)
end

and

for i := 0, . . . , m − 1
    for p := 0, . . . , k − 1
        for j := 0, . . . , n − 1
            γi,j := αi,p βp,j + γi,j
        end                              (axpy: c̃i^T := αi,p b̃p^T + c̃i^T)
    end                                  (row-matrix mult: c̃i^T := ãi^T B + c̃i^T)
end

The problem with implementing the above algorithms is that Gemv_I_Dots and Gemv_J_Axpy implement y := Ax + y rather than yT := xT A + yT . Obviously, you could create new routines for this new operation. We will get back to this in the "Additional Exercises" section of this chapter.

1.5 Enrichments

1.5.1 The Basic Linear Algebra Subprograms
Linear algebra operations are fundamental to computational science. In the 1970s, when vector supercomputers reigned supreme, it was recognized that if applications and software libraries are written in terms of a standardized interface to routines that implement operations with vectors, and vendors of computers provide high-performance instantiations for that interface, then applications would attain portable high performance across different computer platforms. This observation yielded the original Basic Linear Algebra Subprograms (BLAS) interface [17] for Fortran 77, which are now referred to as the level-1 BLAS. The interface was expanded in the 1980s to encompass matrix-vector operations (level-2 BLAS) [6] and matrix-matrix operations (level-3 BLAS) [5]. You should become familiar with the BLAS. Here are some resources: An overview of the BLAS and how they are used to achieve portable high performance is given in the article [28]:

• Robert van de Geijn and Kazushige Goto, BLAS (Basic Linear Algebra Subprograms), Encyclopedia of Parallel Computing, Part 2, pp. 157-164, 2011. If you don’t have access, you may want to read an advanced draft.

If you use the BLAS, you should cite one or more of the original papers (as well as the implementation that you use):

• C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, Vol. 5, No. 3, pp. 308-323, Sept. 1979.

• Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson, An Extended Set of FORTRAN Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol. 14, No. 1, pp. 1-17, March 1988.

• Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol. 16, No. 1, pp. 1-17, March 1990.

A handy reference guide to the BLAS:

• Basic Linear Algebra Subprograms: A Quick Reference Guide. http://www.netlib.org/blas/blasqr.pdf.

There are a number of implementations of the BLAS available for various architectures:

• A reference implementation in Fortran is available from http://www.netlib.org/blas/. This is an unoptimized implementation: It provides the functionality without high performance.

• The current recommended high-performance open-source implementation of the BLAS is provided by the BLAS-like Library Instantiation Software (BLIS) discussed in Unit 1.5.2. The techniques you learn in this course underlie the implementation in BLIS.

• Different vendors provide their own high-performance implementations:

◦ Intel provides optimized BLAS as part of their Math Kernel Library (MKL): https://software.intel.com/en-us/mkl.

◦ AMD’s open-source BLAS for their CPUs can be found at https://developer.amd.com/amd-cpu-libraries/blas-library/. Their implementation is based on BLIS. AMD also has a BLAS library for its GPUs: https://github.com/ROCmSoftwarePlatform/rocBLAS.

◦ Arm provides optimized BLAS as part of their Arm Performance Library: https://developer.arm.com/products/software-development-tools/hpc/arm-performance-libraries.

◦ IBM provides optimized BLAS as part of their Engineering and Scientific Subroutine Library (ESSL): https://www.ibm.com/support/knowledgecenter/en/SSFHY8/essl_welcome.html.

◦ Cray provides optimized BLAS as part of their Cray Scientific Libraries (LibSci): https://www.cray.com/sites/default/files/SB-Cray-Programming-Environment.pdf.

◦ For their GPU accelerators, NVIDIA provides the cuBLAS: https://developer.nvidia.com/cublas.

1.5.2 The BLAS-like Library Instantiation Software (BLIS)
BLIS provides the currently recommended open-source implementation of the BLAS for traditional CPUs. It is the version used in this course. Its extensive documentation (and source) can be found at https://github.com/flame/blis/.

The techniques you learn in this course underlie the implementation of matrix-matrix multiplication in BLIS. In addition to supporting the traditional BLAS interface, BLIS also exposes two more interfaces:

• The BLIS Typed API, which is BLAS-like, but provides more functionality and what we believe to be a more flexible interface for the C and C++ programming languages.

• The BLIS Object-Based API, which elegantly hides many of the details that are exposed in the BLAS and Typed API interfaces.

BLIS is described in the paper

• Field G. Van Zee and Robert A. van de Geijn, BLIS: A Framework for Rapidly Instantiating BLAS Functionality, ACM Journal on Mathematical Software, Vol. 41, No. 3, June 2015. You can access this article for free by visiting the Science of High-Performance Computing group publication webpage (http://shpc.ices.utexas.edu/publications.html) and clicking on the title of Journal Article 39.

Many additional related papers can be found by visiting the BLIS GitHub repository (https://github.com/flame/blis) or the Science of High-Performance Computing group publication webpage (http://shpc.ices.utexas.edu/publications.html).

1.5.3 Counting flops
Floating point multiplies or adds are examples of floating point operations (flops). What we have noticed is that for all of our computations (dots, axpy, gemv, ger, and gemm) every floating point multiply is paired with a floating point add into a fused multiply-add (FMA). Determining how many floating point operations are required for the different operations is relatively straightforward: If x and y are of size n, then

• γ := x^T y + γ = χ0ψ0 + χ1ψ1 + ··· + χn−1ψn−1 + γ requires n FMAs and hence 2n flops.

• y := αx + y requires n FMAs (one per pair of elements, one from x and one from y) and hence 2n flops.

Similarly, it is pretty easy to establish that if A is m × n, then

• y := Ax + y requires mn FMAs and hence 2mn flops. This can be viewed as n axpy operations, each of size m, for n × 2m flops, or as m dots operations, each of size n, for m × 2n flops.

• A := xy^T + A requires mn FMAs and hence 2mn flops. This can be viewed as n axpy operations, each of size m, for n × 2m flops, or as m axpy operations, each of size n, for m × 2n flops.

Finally, if C is m × n, A is m × k, and B is k × n, then C := AB + C requires 2mnk flops. We can establish this by recognizing that if C is updated one column at a time, this takes n matrix-vector multiplication operations, each with a matrix of size m × k, for a total of n × 2mk. Alternatively, if C is updated with rank-1 updates, then we get k × 2mn. When we run experiments, we tend to execute with matrices that are n × n, where n ranges from small to large. Thus, the total operations required equal

\[
\sum_{n=n_{\rm first}}^{n_{\rm last}} 2n^3 \text{ flops},
\]
where $n_{\rm first}$ is the first problem size, $n_{\rm last}$ the last, and the summation is in increments of $n_{\rm inc}$. If $n_{\rm last} = n_{\rm inc} \times N$ and $n_{\rm inc} = n_{\rm first}$, then we get

\[
\sum_{n=n_{\rm first}}^{n_{\rm last}} 2n^3 = \sum_{i=1}^{N} 2(i \times n_{\rm inc})^3 = 2 n_{\rm inc}^3 \sum_{i=1}^{N} i^3.
\]
A standard trick is to recognize that

\[
\sum_{i=1}^{N} i^3 \approx \int_0^N x^3\,dx = \frac{1}{4} N^4.
\]
So,
\[
\sum_{n=n_{\rm first}}^{n_{\rm last}} 2n^3 \approx \frac{1}{2} n_{\rm inc}^3 N^4 = \frac{1}{2}\frac{n_{\rm inc}^4 N^4}{n_{\rm inc}} = \frac{1}{2 n_{\rm inc}} n_{\rm last}^4.
\]

The important thing is that every time we double nlast, we have to wait, approximately, sixteen times as long for our experiments to finish... An interesting question is why we count flops rather than FMAs. By now, you have noticed we like to report the rate of computation in billions of floating point operations per second, GFLOPS. Now, if we counted FMAs rather than flops, the number that represents the rate of computation would be cut in half. For marketing purposes, bigger is better and hence flops are reported! Seriously!
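As a small illustration of how such a rate is computed, here is a sketch (the function name and the example numbers are hypothetical; the course's drivers do something equivalent) that converts a measured execution time into GFLOPS using the 2mnk flop count established above:

#include <stdio.h>

// Convert a measured execution time for C := A B + C into GFLOPS.
double gflops( int m, int n, int k, double time_in_seconds )
{
  return 2.0 * m * n * k / ( time_in_seconds * 1.0e9 );
}

int main()
{
  // Example: a 1000 x 1000 x 1000 multiplication that took 0.5 seconds
  printf( "%.2f GFLOPS\n", gflops( 1000, 1000, 1000, 0.5 ) );  // prints 4.00 GFLOPS
  return 0;
}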

1.6 Wrap Up

1.6.1 Additional exercises We start with a few exercises that let you experience the BLIS typed API and the traditional BLAS interface. Homework 1.6.1.1 (Optional). In Unit 1.3.3 and Unit 1.3.5, you wrote implementations of the routine MyGemv( int m, int n, double *A, int ldA, double *x, int incx, double *y, int incy )

The BLIS typed API (Unit 1.5.2) includes the routine

bli_dgemv( trans_t transa, conj_t conjx, dim_t m, dim_t n,
           double* alpha, double* a, inc_t rsa, inc_t csa,
           double* x, inc_t incx, double* beta,
           double* y, inc_t incy );

which supports the operations

y := αAx + βy
y := αA^T x + βy

and other variations on matrix-vector multiplication. To get experience with the BLIS interface, copy Assignments/Week1/C/Gemm_J_Gemv.c into Gemm_J_bli_dgemv.c and replace the call to MyGemv with a call to bli_dgemv instead. You will also want to include the header file blis.h,

#include "blis.h" replace the call to MyGemv with a call to bli_dgemv and, if you wish, comment out or delete the now unused prototype for MyGemv. Test your implementation with make J_bli_dgemv

You can then view the performance with Assignments/Week1/C/data/Plot_Outer_J.mlx. Hint. Replace

MyGemv( m, k, A, ldA, &beta( 0, j ), 1, &gamma( 0,j ), 1 );

with

double d_one=1.0;
bli_dgemv( BLIS_NO_TRANSPOSE, BLIS_NO_CONJUGATE, m, k, &d_one,
           A, 1, ldA, &beta( 0,j ), 1, &d_one, &gamma( 0,j ), 1 );

Note: the BLIS_NO_TRANSPOSE indicates we want to compute y := αAx + βy rather than y := αA^T x + βy. The BLIS_NO_CONJUGATE parameter indicates that the elements of x are not conjugated (something one may wish to do with complex-valued vectors). The ldA parameter in MyGemv now becomes two parameters: 1, ldA. The first indicates the stride in memory between elements in the same column and consecutive rows (which we refer to as the row stride) and the second refers to the stride in memory between elements in the same row and consecutive columns (which we refer to as the column stride). Solution. Assignments/Week1/Answers/Gemm_J_bli_dgemv.c
Homework 1.6.1.2 (Optional). The traditional BLAS interface (Unit 1.5.1) includes the Fortran call

DGEMV( TRANSA, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )

which also supports the operations y := αAx + βy and y := αA^T x + βy and other variations on matrix-vector multiplication. To get experience with the BLAS interface, copy Assignments/Week1/C/Gemm_J_Gemv.c (or Gemm_J_bli_dgemv) into Gemm_J_dgemv and replace the call to MyGemv with a call to dgemv. Some hints:

• In calling a Fortran routine from C, you will want to call the routine using lower case letters, and adding an underscore:

dgemv_

• You will need to place a "prototype" for dgemv_ near the beginning of Gemm_J_dgemv:

void dgemv_( char *, int *, int *, double *, double *, int *, double *, int *, double *, double *, int * );

• Fortran passes parameters by address. So,

MyGemv( m, k, A, ldA, &beta( 0, j ), 1, &gamma( 0,j ), 1 );

becomes

{
  int i_one=1;
  double d_one=1.0;
  dgemv_( "No transpose", &m, &k, &d_one, A, &ldA,
          &beta( 0, j ), &i_one, &d_one, &gamma( 0,j ), &i_one );
}

Test your implementation with make J_dgemv

You can then view the performance with Assignments/Week1/C/data/Plot_Outer_J.mlx. Solution. Assignments/Week1/Answers/Gemm_J_dgemv.c

On Robert’s laptop, the last two exercises yield

Notice

• Linking to an optimized implementation (provided by BLIS) helps.

• The calls to bli_dgemv and dgemv are wrappers to the same implementations of matrix-vector multiplication within BLIS. The difference in performance that is observed should be considered "noise."

• It pays to link to a library that someone optimized.

Homework 1.6.1.3 (Optional). In Unit 1.4.1, you wrote the routine

MyGer( int m, int n, double *x, int incx, double *y, int incy,
       double *A, int ldA )

that computes a rank-1 update. The BLIS typed API (Unit 1.5.2) includes the routine

bli_dger( conj_t conjx, conj_t conjy, dim_t m, dim_t n,
          double* alpha, double* x, inc_t incx,
          double* y, inc_t incy, double* A, inc_t rsa, inc_t csa );

which supports the operation

A := αxy^T + A

and other variations on rank-1 update. To get experience with the BLIS interface, copy Assignments/Week1/C/Gemm_P_Ger.c into Gemm_P_bli_dger and replace the call to MyGer with a call to bli_dger. Test your implementation with

make P_bli_dger

You can then view the performance with Assignments/Week1/C/data/Plot_Outer_P.mlx. Hint. Replace

MyGer( m, n, &alpha( 0, p ), 1, &beta( p,0 ), ldB, C, ldC );

with

double d_one=1.0;
bli_dger( BLIS_NO_CONJUGATE, BLIS_NO_CONJUGATE, m, n, &d_one,
          &alpha( 0, p ), 1, &beta( p,0 ), ldB, C, 1, ldC );

The BLIS_NO_CONJUGATE parameters indicate that the elements of x and y are not conjugated (something one may wish to do with complex-valued vectors). The ldC parameter in MyGer now becomes two parameters: 1, ldC. The first indicates the stride in memory between elements in the same column and consecutive rows (which we refer to as the row stride) and the second refers to the stride in memory between elements in the same row and consecutive columns (which we refer to as the column stride). Solution. Assignments/Week1/Answers/Gemm_P_bli_dger.c
Homework 1.6.1.4 (Optional). The traditional BLAS interface (Unit 1.5.1) includes the Fortran call

DGER( M, N, ALPHA, X, INCX, Y, INCY, A, LDA )

which also supports the operation

A := αxyT + A.

To get experience with the BLAS interface, copy Assignments/Week1/C/Gemm_P_Ger.c (or Gemm_P_bli_dger) into Gemm_P_dger and replace the call to MyGer with a call to dger. Test your implementation with

make P_dger

You can then view the performance with Assignments/Week1/C/data/???. Hint. Replace

MyGer( m, n, &alpha( 0,p ), 1, &beta( p,0 ), ldB, C, ldC );

with

double d_one=1.0;
int i_one=1;
dger_( &m, &n, &d_one, &alpha( 0,p ), &i_one, &beta( p,0 ), &ldB, C, &ldC );

The _ (underscore) allows C to call the Fortran routine. (This is how it works for most compilers, but not all.) Fortran passes parameters by address, which is why d_one and i_one are passed as in this call. Solution. Assignments/Week1/Answers/Gemm_P_dger.c

On Robert’s laptop, the last two exercises yield

Notice

• Linking to an optimized implementation (provided by BLIS) does not seem to help (relative to PJI).

• The calls to bli_dger and dger are wrappers to the same implementations of the rank-1 update within BLIS. The difference in performance that is observed should be considered "noise."

• It doesn't always pay to link to a library that someone optimized.

Next, we turn to computing C := AB + C via a series of row-times-matrix multiplications. Recall that the BLAS include the routine dgemv that computes y := αAx + βy or y := αA^T x + βy. What we want is a routine that computes y^T := x^T A + y^T. What we remember from linear algebra is that if A = B then A^T = B^T, that (A + B)^T = A^T + B^T, and that (AB)^T = B^T A^T. Thus, starting with the equality

y^T = x^T A + y^T,

and transposing both sides, we get that

(y^T)^T = (x^T A + y^T)^T,

which is equivalent to

y = (x^T A)^T + (y^T)^T

and finally

y = A^T x + y.

So, updating y^T := x^T A + y^T is equivalent to updating y := A^T x + y. If this all seems unfamiliar, you may want to look at Linear Algebra: Foundations to Frontiers. Now, if

• m × n matrix A is stored in array A with its leading dimension stored in variable ldA,

• m is stored in variable m and n is stored in variable n,

• vector x is stored in array x with its stride stored in variable incx,

• vector y is stored in array y with its stride stored in variable incy, and

• α and β are stored in variables alpha and beta, respectively,

then y := αA^T x + βy translates, from C, into the call

dgemv_( "Transpose", &m, &n, &alpha, A, &ldA, x, &incx, &beta, y, &incy );

Homework 1.6.1.5 In directory Assignments/Week1/C complete the code in file Gemm_I_dgemv.c that casts matrix-matrix multiplication in terms of the dgemv BLAS routine, but compute the result by rows. Compile and execute it with

make I_dgemv

View the resulting performance by making the necessary changes to the Live Script in Assignments/Week1/C/Plot_All_Outer.mlx. (Alternatively, use the script in Assignments/Week1/C/data/Plot_All_Outer_m.mlx.) Solution. Assignments/Week1/Answers/Gemm_I_dgemv.c. The BLIS native call that is similar to the BLAS dgemv routine in this setting translates to

bli_dgemv( BLIS_TRANSPOSE, BLIS_NO_CONJUGATE, m, n, &alpha,
           A, ldA, 1, x, incx, &beta, y, incy );

Because of the two parameters after A that capture the stride between elements in a column (the row stride) and elements in rows (the column stride), one can alternatively swap these parameters:

bli_dgemv( BLIS_NO_TRANSPOSE, BLIS_NO_CONJUGATE, m, n, &alpha,
           A, ldA, 1, x, incx, &beta, y, incy );
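Putting the pieces together, here is a minimal sketch of the loop one might write in Gemm_I_dgemv.c, using the macro conventions of this week (the actual assignment file may differ in details): each row of C is updated with a call to dgemv_ using the "Transpose" option, as explained above.

#define alpha( i,j ) A[ (j)*ldA + i ]   // map alpha( i,j ) to array A
#define gamma( i,j ) C[ (j)*ldC + i ]   // map gamma( i,j ) to array C

void dgemv_( char *, int *, int *, double *, double *, int *,
             double *, int *, double *, double *, int * );

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  double d_one=1.0;
  // Row i of C: c~_i^T := a~_i^T B + c~_i^T, i.e., c_i := B^T a~_i + c_i
  for ( int i=0; i<m; i++ )
    dgemv_( "Transpose", &k, &n, &d_one, B, &ldB,
            &alpha( i,0 ), &ldA, &d_one, &gamma( i,0 ), &ldC );
}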

Homework 1.6.1.6 In directory Assignments/Week1/C complete the code in file Gemm_I_bli_dgemv.c that casts matrix-matrix multiplication in terms of the bli_dgemv BLIS routine. Compile and execute it with

make I_bli_dgemv

View the resulting performance by making the necessary changes to the Live Script in Assignments/Week1/C/Plot_All_Outer.mlx. (Alternatively, use the script in Assignments/Week1/C/data/Plot_All_Outer_m.mlx.) Solution. Assignments/Week1/Answers/Gemm_I_bli_dgemv.c. On Robert’s laptop, the last two exercises yield

Notice

• Casting matrix-matrix multiplication in terms of matrix-vector multiplications attains the best performance. The reason is that as columns of C are computed, they can stay in cache, reducing the number of times elements of C have to be read and written from main memory.

• Accessing C and A by rows really gets in the way of performance.

1.6.2 Summary
The building blocks for matrix-matrix multiplication (and many matrix algorithms) are

• Vector-vector operations:

◦ Dot: x^T y.

◦ Axpy: y := αx + y.

• Matrix-vector operations:

◦ Matrix-vector multiplication: y := Ax + y and y := AT x + y. ◦ Rank-1 update: A := xyT + A.

Partitioning the matrices by rows and columns, with these operations, the matrix-matrix multiplication C := AB + C can be described as one loop around

• multiple matrix-vector multiplications:
\[
\begin{pmatrix} c_0 & \cdots & c_{n-1} \end{pmatrix} := \begin{pmatrix} A b_0 + c_0 & \cdots & A b_{n-1} + c_{n-1} \end{pmatrix}
\]

illustrated by

This hides the two inner loops of the triple nested loop in the matrix-vector multiplication:

for j := 0, . . . , n − 1
    for p := 0, . . . , k − 1
        for i := 0, . . . , m − 1
            γi,j := αi,p βp,j + γi,j
        end                              (axpy: cj := βp,j ap + cj)
    end                                  (mv mult: cj := A bj + cj)
end

and

for j := 0, . . . , n − 1
    for i := 0, . . . , m − 1
        for p := 0, . . . , k − 1
            γi,j := αi,p βp,j + γi,j
        end                              (dots: γi,j := ãi^T bj + γi,j)
    end                                  (mv mult: cj := A bj + cj)
end

• multiple rank-1 updates:
\[
C := a_0 \widetilde b_0^T + a_1 \widetilde b_1^T + \cdots + a_{k-1} \widetilde b_{k-1}^T + C
\]
illustrated by

This hides the two inner loops of the triple nested loop in the rank-1 update:

for p := 0, . . . , k − 1
    for j := 0, . . . , n − 1
        for i := 0, . . . , m − 1
            γi,j := αi,p βp,j + γi,j
        end                              (axpy: cj := βp,j ap + cj)
    end                                  (rank-1 update: C := ap b̃p^T + C)
end

and

for p := 0, . . . , k − 1
    for i := 0, . . . , m − 1
        for j := 0, . . . , n − 1
            γi,j := αi,p βp,j + γi,j
        end                              (axpy: c̃i^T := αi,p b̃p^T + c̃i^T)
    end                                  (rank-1 update: C := ap b̃p^T + C)
end

• multiple row-times-matrix multiplications:
\[
\begin{pmatrix} \widetilde c_0^T \\ \widetilde c_1^T \\ \vdots \\ \widetilde c_{m-1}^T \end{pmatrix}
:=
\begin{pmatrix} \widetilde a_0^T B + \widetilde c_0^T \\ \widetilde a_1^T B + \widetilde c_1^T \\ \vdots \\ \widetilde a_{m-1}^T B + \widetilde c_{m-1}^T \end{pmatrix}
\]

illustrated by

This hides the two inner loops of the triple nested loop in multiplications of rows with matrix B:

for i := 0, . . . , m − 1
    for j := 0, . . . , n − 1
        for p := 0, . . . , k − 1
            γi,j := αi,p βp,j + γi,j
        end                              (dot: γi,j := ãi^T bj + γi,j)
    end                                  (row-matrix mult: c̃i^T := ãi^T B + c̃i^T)
end

and

for i := 0, . . . , m − 1
    for p := 0, . . . , k − 1
        for j := 0, . . . , n − 1
            γi,j := αi,p βp,j + γi,j
        end                              (axpy: c̃i^T := αi,p b̃p^T + c̃i^T)
    end                                  (row-matrix mult: c̃i^T := ãi^T B + c̃i^T)
end

This summarizes all six loop orderings for matrix-matrix multiplication. While it would be tempting to hope that a compiler would translate any of the six loop orderings into high-performance code, this is an optimistic dream. Our experiments show that the order in which the loops are nested has a nontrivial effect on performance:

We have observed that accessing matrices by columns, so that the most frequently loaded data is contiguous in memory, yields better performance. Those who did the additional exercises in Unit 1.6.1 found out that casting computation in terms of matrix-vector or rank-1 update operations in the BLIS typed API or the BLAS interface, when linked to an optimized implementation of those interfaces, yields better performance.

Week 2

Start Your Engines

2.1 Opening Remarks

2.1.1 Launch

YouTube: https://www.youtube.com/watch?v=YJasouTMgXg
Last week, you compared the performance of a number of different implementations of matrix-matrix multiplication. At the end of Unit 1.2.4 you found that the JPI ordering did much better than the other orderings, and you probably felt pretty good with the improvement in performance you achieved:


Homework 2.1.1.1 In directory Assignments/Week2/C/ execute make JPI and view the results with the Live Script in Assignments/Week2/C/data/Plot_Opener.mlx. (This may take a little while, since the Makefile now specifies that the largest problem to be executed is m = n = k = 1500.) Next, change that Live Script to also show the performance of the reference implementation provided by BLIS: Change

% Optionally show the reference implementation performance data
if ( 0 )

to

% Optionally show the reference implementation performance data
if ( 1 )

and rerun the Live Script. This adds a plot to the graph for the reference implementation. What do you observe? Now are you happy with the improvements you made in Week 1? Solution. On Robert's laptop:

Left: Plotting only simple implementations from Week 1. Right: Adding the performance of the reference implementation provided by BLIS. Note: the performance in the graph on the left may not exactly match that in the graph earlier in this unit. My laptop does not always attain the same performance. When a processor gets hot, it "clocks down." This means the attainable performance goes down. A laptop is not easy to cool, so one would expect more fluctuation than when using, for example, a desktop or a server.

YouTube: https://www.youtube.com/watch?v=eZaq451nuaE

Remark 2.1.1 What you notice is that we have a long way to go... By the end of Week 3, you will discover how to match the performance of the "reference" implementation. It is useful to understand what the peak performance of a processor is. We always tell our students "What if you have a boss who simply keeps insisting that you improve performance? It would be nice to be able to convince such a person that you are near the limit of what can be achieved." Unless, of course, you are paid per hour and have complete job security. Then you may decide to spend as much time as your boss wants you to! Later this week, you learn that modern processors employ parallelism in the floating point unit so that multiple flops can be executed per cycle. In the case of the CPUs we target, 16 flops can be executed per cycle.

Homework 2.1.1.2 Next, you change the Live Script so that the top of the graph represents the theoretical peak of the core. Change

% Optionally change the top of the graph to capture the theoretical peak
if ( 0 )
  turbo_clock_rate = 4.3;
  flops_per_cycle = 16;

to

% Optionally change the top of the graph to capture the theoretical peak
if ( 1 )
  turbo_clock_rate = ?.?;
  flops_per_cycle = 16;

where ?.? equals the turbo clock rate (in GHz) for your processor. (See Unit 0.3.1 on how to find out information about your processor.) Rerun the Live Script. This changes the range of the y-axis so that the top of the graph represents the theoretical peak of one core of your processor. What do you observe? Now are you happy with the improvements you made in Week 1? Solution. Robert's laptop has a turbo clock rate of 4.3 GHz:

Notice that the clock rate of your processor changes with circumstances. The performance in the different graphs is not consistent and when one runs this experiment at different times on my laptop, it reports different performance. On a properly cooled Linux desktop or server, you will probably see much more consistent performance.

YouTube: https://www.youtube.com/watch?v=XfwPVkDnZF0

2.1.2 Outline Week 2

• 2.1 Opening Remarks

◦ 2.1.1 Launch ◦ 2.1.2 Outline Week 2 ◦ 2.1.3 What you will learn

• 2.2 Blocked Matrix-Matrix Multiplication

◦ 2.2.1 Basic idea ◦ 2.2.2 Haven’t we seen this before?

• 2.3 Blocking for Registers

◦ 2.3.1 A simple model of memory and registers ◦ 2.3.2 Simple blocking for registers

◦ 2.3.3 Streaming Ai,p and Bp,j ◦ 2.3.4 Combining loops ◦ 2.3.5 Alternative view

• 2.4 Optimizing the Micro-kernel

◦ 2.4.1 Vector registers and instructions ◦ 2.4.2 Implementing the micro-kernel with vector instructions ◦ 2.4.3 Details ◦ 2.4.4 More options ◦ 2.4.5 Optimally amortizing data movement

• 2.5 Enrichments

◦ 2.5.1 Lower bound on data movement

• 2.6 Wrap Up

◦ 2.6.1 Additional exercises ◦ 2.6.2 Summary

2.1.3 What you will learn In this week, we learn how to attain high performance for small matrices by exploiting instruction-level parallelism. Upon completion of this week, we will be able to

• Realize the limitations of simple implementations by comparing their per- formance to that of a high-performing reference implementation.

• Recognize that improving performance greatly doesn't necessarily yield good performance.

• Determine the theoretical peak performance of a processing core.

• Orchestrate matrix-matrix multiplication in terms of computations with submatrices.

• Block matrix-matrix multiplication for registers.

• Analyze the cost of moving data between memory and registers.

• Cast computation in terms of vector intrinsic functions that give access to vector registers and vector instructions.

• Optimize the micro-kernel that will become our unit of computation for future optimizations.

The enrichments introduce us to

• Theoretical lower bounds on how much data must be moved between memory and registers when executing a matrix-matrix multiplication.

• Strassen’s algorithm for matrix-matrix multiplication.

2.2 Blocked Matrix-Matrix Multiplication

2.2.1 Basic idea

YouTube: https://www.youtube.com/watch?v=9t_Ux_7K-8M
Remark 2.2.1 If the material in this section makes you scratch your head, you may want to go through the materials in Week 5 of our other MOOC: Linear Algebra: Foundations to Frontiers.
Key to understanding how we are going to optimize matrix-matrix multiplication is to understand blocked matrix-matrix multiplication (the multiplication of matrices that have been partitioned into submatrices). Consider C := AB + C. In terms of the elements of the matrices, this means that each γi,j is computed by

\[
\gamma_{i,j} := \sum_{p=0}^{k-1} \alpha_{i,p}\beta_{p,j} + \gamma_{i,j}.
\]

Now, partition the matrices into submatrices:
\[
C = \begin{pmatrix}
C_{0,0} & C_{0,1} & \cdots & C_{0,N-1} \\
C_{1,0} & C_{1,1} & \cdots & C_{1,N-1} \\
\vdots & \vdots & & \vdots \\
C_{M-1,0} & C_{M-1,1} & \cdots & C_{M-1,N-1}
\end{pmatrix},
\quad
A = \begin{pmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,K-1} \\
A_{1,0} & A_{1,1} & \cdots & A_{1,K-1} \\
\vdots & \vdots & & \vdots \\
A_{M-1,0} & A_{M-1,1} & \cdots & A_{M-1,K-1}
\end{pmatrix},
\]
and
\[
B = \begin{pmatrix}
B_{0,0} & B_{0,1} & \cdots & B_{0,N-1} \\
B_{1,0} & B_{1,1} & \cdots & B_{1,N-1} \\
\vdots & \vdots & & \vdots \\
B_{K-1,0} & B_{K-1,1} & \cdots & B_{K-1,N-1}
\end{pmatrix},
\]
where $C_{i,j}$ is $m_i \times n_j$, $A_{i,p}$ is $m_i \times k_p$, and $B_{p,j}$ is $k_p \times n_j$, with $\sum_{i=0}^{M-1} m_i = m$, $\sum_{j=0}^{N-1} n_j = n$, and $\sum_{p=0}^{K-1} k_p = k$. Then
\[
C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j}.
\]

Remark 2.2.2 In other words, provided the partitioning of the matrices is "conformal," matrix-matrix multiplication with partitioned matrices proceeds exactly as does matrix-matrix multiplication with the elements of the matrices except that the individual multiplications with the submatrices do not neces- sarily commute.

We illustrate how to block C into mb × nb submatrices, A into mb × kb submatrices, and B into kb × nb submatrices in Figure 2.2.3

Figure 2.2.3: An illustration of a blocked algorithm where mb = nb = kb = 4.

The blocks used when updating C1,2 := A1,3 B3,2 + C1,2 are highlighted in that figure. An implementation of one such algorithm is given in Figure 2.2.4.

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int j=0; j<n; j+=NB ){
    int jb = min( n-j, NB );          /* Size for "fringe" block */
    for ( int i=0; i<m; i+=MB ){
      int ib = min( m-i, MB );        /* Size for "fringe" block */
      for ( int p=0; p<k; p+=KB ){
        int pb = min( k-p, KB );      /* Size for "fringe" block */
        Gemm_PJI( ib, jb, pb, &alpha( i,p ), ldA,
                  &beta( p,j ), ldB, &gamma( i,j ), ldC );
      }
    }
  }
}

Figure 2.2.4: Simple blocked algorithm that calls Gemm_PJI which in this setting updates an mb × nb submatrix of C with the result of multiplying an mb × kb submatrix of A times a kb × nb submatrix of B.

Homework 2.2.1.1 In directory Assignments/Week2/C, examine the code in file Assignments/Week2/C/Gemm_JIP_PJI.c. Time it by executing make JIP_PJI

View its performance with Assignments/Week2/C/data/Plot_Blocked_MMM.mlx. Solution. When MB=4, NB=4, and KB=4 in Gemm_JIP_PJI, the performance looks like

on Robert’s laptop.

YouTube: https://www.youtube.com/watch?v=GIf4qRxYA6g

Homework 2.2.1.2 Examine the graph created by Plot_Blocked_MMM.mlx. What block sizes MB and NB would you expect to yield better performance? Change these in Gemm_JIP_PJI.c and execute make JIP_PJI

View the updated performance with Assignments/Week2/C/data/Plot_Blocked_MMM.mlx. (Don't worry about trying to pick the best block size. We'll get to that later.) Solution. When we choose MB=100, NB=100, and KB=100, better performance is maintained as the problem size gets larger:

This is because the subproblems now fit in one of the caches. More on this later in this week.

YouTube: https://www.youtube.com/watch?v=o9YEFPgDek0

2.2.2 Haven’t we seen this before?

We already employed various cases of blocked matrix-matrix multiplication in Section 1.3.

Homework 2.2.2.1 In Section 1.3, we already used the blocking of matrices to cast matrix-matrix multiplications in terms of dot products and axpy operations. This is captured in

Figure 2.2.5: Click here to enlarge.

For each of the partitionings in the right column, indicate how mb, nb, and kb are chosen.

Homework 2.2.2.2 In Section 1.3 and Section 1.4, we also already used the blocking of matrices to cast matrix-matrix multiplications in terms of matrix-vector multiplication and rank-1 updates. This is captured in

Figure 2.2.6: Click here to enlarge.

For each of the partitionings in the right column, indicate how mb, nb, and kb are chosen.

2.3 Blocking for Registers

2.3.1 A simple model of memory and registers

YouTube: https://www.youtube.com/watch?v=-DZuJfu1t1Q For computation to happen, input data must be brought from memory into registers and, sooner or later, output data must be written back to memory. For now let us make the following assumptions:

• Our processor has only one core.

• That core only has two levels of memory: registers and main memory, as illustrated in Figure 2.3.1.

Figure 2.3.1: A simple model of the memory hierarchy: slow memory and fast registers.

• Moving data between main memory and registers takes time βR↔M per double. The R ↔ M is meant to capture movement between registers (R) and memory (M).

• The registers can hold 64 doubles.

• Performing a flop with data in registers takes time γR.

• Data movement and computation cannot overlap.

Later, we will refine some of these assumptions.

2.3.2 Simple blocking for registers

YouTube: https://www.youtube.com/watch?v=9-cy_4JPLWw Let’s first consider blocking C, A, and B into 4 × 4 submatrices. If we store all three submatrices in registers, then 3 × 4 × 4 = 48 doubles are stored in registers. Remark 2.3.2 Yes, we are not using all registers (since we assume registers can hold 64 doubles). We’ll get to that later. Let us call the routine that computes with three blocks "the kernel". Now the blocked algorithm is a triple-nested loop around a call to this kernel, as captured by the following triple-nested loop

for j := 0, . . . , N − 1
    for i := 0, . . . , M − 1
        for p := 0, . . . , K − 1
            Load Ci,j, Ai,p, and Bp,j into registers
            Ci,j := Ai,p Bp,j + Ci,j
            Store Ci,j to memory
        end
    end
end

and illustrated by

where we use the subscript "R" to indicate it is a block size that targets registers.
Remark 2.3.3 For our discussion, it is important to order the p loop last. Next week, it will become even more important to order loops around the kernel in a specific order. We are picking the JIP order here in preparation for that. If you feel like it, you can play with the IJP order (around the kernel) to see how it compares.

YouTube: https://www.youtube.com/watch?v=NurvjVEo9nA The time required to execute this algorithm under our model is given by

\[
MNK\Bigl(
\underbrace{(m_R n_R + m_R k_R + k_R n_R)\beta_{R\leftrightarrow M}}_{\text{load blocks of }A,\,B,\text{ and }C}
+ \underbrace{2 m_R n_R k_R \gamma_R}_{\text{multiplication with blocks}}
+ \underbrace{m_R n_R \beta_{R\leftrightarrow M}}_{\text{write block of }C}
\Bigr)
\]
\[
\begin{array}{l}
\quad = \quad \langle \text{ rearrange } \rangle \\
2 (M m_R)(N n_R)(K k_R)\gamma_R + 2 (M m_R)(N n_R) K \beta_{R\leftrightarrow M}
+ (M m_R) N (K k_R)\beta_{R\leftrightarrow M} + M (N n_R)(K k_R)\beta_{R\leftrightarrow M} \\
\quad = \quad \langle \text{ simplify, distribute, commute, etc. } \rangle \\
\underbrace{2 m n k\, \gamma_R}_{\text{useful computation}}
+ \underbrace{m n k\Bigl(\dfrac{2}{k_R} + \dfrac{1}{n_R} + \dfrac{1}{m_R}\Bigr)\beta_{R\leftrightarrow M}}_{\text{overhead}},
\end{array}
\]

since M = m/mR , N = n/nR, and K = k/kR. (Here we assume that m, n, and k are integer multiples of mR, nR, and kR.)

Remark 2.3.4 The time spent in computation,

\[ 2 m n k\, \gamma_R, \]
is useful time and the time spent in loading and storing data,
\[ m n k\Bigl(\frac{2}{k_R} + \frac{1}{n_R} + \frac{1}{m_R}\Bigr)\beta_{R\leftrightarrow M}, \]
is overhead.
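To get a feel for the model, here is a small sketch (the function name and the values of β_{R↔M} and γ_R are hypothetical) that evaluates the overhead relative to the useful computation for a given choice of m_R, n_R, and k_R:

#include <stdio.h>

// Ratio of data-movement overhead to useful computation under the simple
// model: ( 2/kR + 1/nR + 1/mR ) * beta / ( 2 * gamma ).
double overhead_ratio( int mR, int nR, int kR, double beta, double gamma )
{
  return ( 2.0/kR + 1.0/nR + 1.0/mR ) * beta / ( 2.0 * gamma );
}

int main()
{
  // Hypothetical costs: moving a double costs 4x a flop.
  double beta = 4.0, gamma = 1.0;
  printf( "mR=nR=kR=4 : %4.2f\n", overhead_ratio( 4, 4, 4, beta, gamma ) );    // prints 2.00
  printf( "mR=nR=kR=12: %4.2f\n", overhead_ratio( 12, 12, 12, beta, gamma ) ); // prints 0.67
  return 0;
}

The larger the blocks, the smaller the fraction of time spent moving data, which is the point the analysis is making.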

Next, we recognize that since the loop indexed with p is the inner-most loop, the loading and storing of Ci,j does not need to happen before every call to the kernel, yielding the algorithm

for i := 0, . . . , M − 1
    for j := 0, . . . , N − 1
        Load Ci,j into registers
        for p := 0, . . . , K − 1
            Load Ai,p and Bp,j into registers
            Ci,j := Ai,p Bp,j + Ci,j
        end
        Store Ci,j to memory
    end
end

as illustrated by

Homework 2.3.2.1 In detail explain the estimated cost of this modified algorithm. Solution.

\[
MNK(2 m_R n_R k_R)\gamma_R + \bigl[ MN(2 m_R n_R) + MNK(m_R k_R + k_R n_R) \bigr]\beta_{R\leftrightarrow M}
= 2 m n k\, \gamma_R + \Bigl[ 2 m n + m n k\Bigl(\frac{1}{n_R} + \frac{1}{m_R}\Bigr) \Bigr]\beta_{R\leftrightarrow M}.
\]
Here

• 2mnkγR is time spent in useful computation.

• $\bigl[\, 2mn + mnk\bigl(\frac{1}{n_R} + \frac{1}{m_R}\bigr) \,\bigr]\beta_{R\leftrightarrow M}$ is time spent moving data around, which is overhead.

YouTube: https://www.youtube.com/watch?v=hV7FmIDR4r8
Homework 2.3.2.2 In directory Assignments/Week2/C copy the file Assignments/Week2/C/Gemm_JIP_PJI.c into Gemm_JI_PJI.c and combine the loops indexed by P, in the process "removing" it from MyGemm. Think carefully about how the call to Gemm_PJI needs to be changed:
• What is the matrix-matrix multiplication that will now be performed by Gemm_PJI?

• How do you express that in terms of the parameters that are passed to that routine?

(Don’t worry about trying to pick the best block size. We’ll get to that later.) You can test the result by executing make JI_PJI

View the performance by making the appropriate changes to Assignments/Week2/ C/data/Plot_Blocked_MMM.mlx. Solution. Assignments/Week2/Answers/Gemm_JI_PJI.c This is the performance on Robert’s laptop, with MB=NB=KB=100:

The performance you observe may not really be an improvement relative to Gemm_JIP_PJI.c. What we observe here may be due to variation in the conditions under which the performance was measured.

2.3.3 Streaming Ai,p and Bp,j

YouTube: https://www.youtube.com/watch?v=62WAIASy1BA

Now, if the submatrix Ci,j were larger (i.e., mR and nR were larger), then the time spent moving data around (overhead),
\[
\Bigl[\, 2mn + mnk\Bigl(\frac{1}{n_R} + \frac{1}{m_R}\Bigr) \,\Bigr]\beta_{R\leftrightarrow M},
\]
is reduced:
\[
\frac{1}{n_R} + \frac{1}{m_R}
\]
decreases as mR and nR increase. In other words, the larger the submatrix of C that is kept in registers, the lower the overhead (under our model). Before we increase mR and nR, we first show how for fixed mR and nR the number of registers that are needed for storing elements of A and B can be reduced, which then allows mR and nR to (later this week) be increased in size. Recall from Week 1 that if the p loop of matrix-matrix multiplication is the outer-most loop, then the inner two loops implement a rank-1 update. If we do this for the kernel, then the result is illustrated by

If we now consider the computation

we recognize that after each rank-1 update, the column of Ai,p and row of Bp,j that participate in that rank-1 update are not needed again in the computation of Ci,j := Ai,pBp,j + Ci,j. Thus, only room for one such column of Ai,p and one such row of Bp,j needs to be reserved in the registers, reducing the total number of doubles that need to be stored in registers to

mR × nR + mR + nR.

If mR = nR = 4, this means that 16 + 4 + 4 = 24 doubles are stored in registers. If we now view the entire computation performed by the loop indexed with p, this can be illustrated by

and implemented as

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int j=0; j<n; j+=NB ){
    int jb = min( n-j, NB );          /* Size for "fringe" block */
    for ( int i=0; i<m; i+=MB ){
      int ib = min( m-i, MB );        /* Size for "fringe" block */
      for ( int p=0; p<k; p+=KB ){
        int pb = min( k-p, KB );      /* Size for "fringe" block */
        Gemm_P_Ger( ib, jb, pb, &alpha( i,p ), ldA,
                    &beta( p,j ), ldB, &gamma( i,j ), ldC );
      }
    }
  }
}

Figure 2.3.5: Blocked matrix-matrix multiplication with kernel that casts Ci,j = Ai,pBp,j + Ci,j in terms of rank-1 updates.

Homework 2.3.3.1 In directory Assignments/Week2/C examine the file Assignments/Week2/C/Gemm_JIP_P_Ger.c. Execute it with make JIP_P_Ger and view its performance with Assignments/Week2/C/data/Plot_register_blocking.mlx. Solution. This is the performance on Robert's laptop, with MB=NB=KB=4:

The performance is pathetic. It will improve! It is tempting to start playing with the parameters MB, NB, and KB. However, the point of this exercise is to illustrate the discussion about casting the multiplication with the blocks in terms of a loop around rank-1 updates.

Remark 2.3.6 The purpose of this implementation is to emphasize, again, that the matrix-matrix multiplication with blocks can be orchestrated as a loop around rank-1 updates, as you already learned in Unit 1.4.2.

2.3.4 Combining loops

YouTube: https://www.youtube.com/watch?v=H8jvyd0NO4M
What we now recognize is that matrices A and B need not be partitioned into mR × kR and kR × nR submatrices (respectively). Instead, A can be partitioned into mR × k row panels and B into k × nR column panels:

Another way of looking at this is that the inner loop indexed with p in the blocked algorithm can be combined with the outer loop of the kernel. The entire update of the mR × nR submatrix of C can then be pictured as

This is the computation that will later become instrumental in optimizing and will be called the micro-kernel.

Homework 2.3.4.1 In directory Assignments/Week2/C copy the file Assignments/Week2/C/Gemm_JIP_P_Ger.c into Gemm_JI_P_Ger.c and remove the p loop from MyGemm. Execute it with make JI_P_Ger and view its performance with Assignments/Week2/C/data/Plot_register_blocking.mlx. Solution.

• Assignments/Week2/Answers/Gemm_JI_P_Ger.c

This is the performance on Robert’s laptop, with MB=NB=KB=4:

The performance does not really improve, and continues to be pathetic. Again, it is tempting to start playing with the parameters MB, NB, and KB. However, the point of this exercise is to illustrate the discussion about how the loops indexed with p can be combined, casting the multiplication with the blocks in terms of a loop around rank-1 updates.

Homework 2.3.4.2 In detail explain the estimated cost of the implementation in Homework 2.3.4.1.

Solution. Bringing Ai,p one column at a time into registers and Bp,j one row at a time into registers does not change the cost of reading that data. So, the cost of the resulting approach is still estimated as

\[
M N K (2 m_R n_R k_R)\,\gamma_R + \left[ M N (2 m_R n_R) + M N K ( m_R k_R + k_R n_R ) \right] \beta_{R\leftrightarrow M}
= 2mnk\,\gamma_R + \left[ 2mn + mnk\left( \frac{1}{n_R} + \frac{1}{m_R} \right) \right] \beta_{R\leftrightarrow M}.
\]
Here

• 2mnkγR is time spent in useful computation.

• $\left[ 2mn + mnk\left( \frac{1}{n_R} + \frac{1}{m_R} \right) \right] \beta_{R\leftrightarrow M}$ is time spent moving data around, which is overhead.

The important insight is that now only the mR × nR micro-tile of C, one small column of A of size mR, and one small row of B of size nR need to be stored in registers. In other words, mR × nR + mR + nR doubles are stored in registers at any one time. One can go one step further: each individual rank-1 update can be implemented as a sequence of axpy operations:

\[
\begin{pmatrix}
\gamma_{0,0} & \gamma_{0,1} & \gamma_{0,2} & \gamma_{0,3} \\
\gamma_{1,0} & \gamma_{1,1} & \gamma_{1,2} & \gamma_{1,3} \\
\gamma_{2,0} & \gamma_{2,1} & \gamma_{2,2} & \gamma_{2,3} \\
\gamma_{3,0} & \gamma_{3,1} & \gamma_{3,2} & \gamma_{3,3}
\end{pmatrix}
+:=
\begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix}
\begin{pmatrix} \beta_{p,0} & \beta_{p,1} & \beta_{p,2} & \beta_{p,3} \end{pmatrix}
=
\begin{pmatrix}
\beta_{p,0} \begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix} &
\beta_{p,1} \begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix} &
\beta_{p,2} \begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix} &
\beta_{p,3} \begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix}
\end{pmatrix}.
\]

Here we note that if the block of C is updated one column at a time, the corresponding element of B, βp,j in our discussion, is not reused and hence only one register is needed for the elements of B. Thus, when mR = nR = 4, we have gone from storing 48 doubles in registers to 24 and finally to only 21 doubles. More generally, we have gone from mRnR + mRkR + kRnR to mRnR + mR + nR to

mRnR + mR + 1 doubles. Obviously, one could have gone further yet and only store one element αi,p at a time. However, in that case one would have to reload such elements multiple times, which would increase the overhead. A scalar sketch of this column-at-a-time update follows.
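The following minimal scalar sketch (not one of the course's files; the indexing macros and the routine name are illustrative assumptions in the style of the course's code) updates a 4 × 4 micro-tile one column at a time, so that only a single value of B is live at any moment, as in the register-count argument above.

#define alpha( i,j ) A[ (j)*ldA + (i) ]   /* column-major indexing into A */
#define beta( i,j )  B[ (j)*ldB + (i) ]   /* column-major indexing into B */
#define gamma( i,j ) C[ (j)*ldC + (i) ]   /* column-major indexing into C */

/* C (4x4) +:= A (4xk) B (kx4): one rank-1 update per p, and within each
   rank-1 update one axpy per column of C.  Only one element of B, beta_pj,
   needs to be held at any given time. */
void Gemm_4x4_axpy_sketch( int k, double *A, int ldA,
                           double *B, int ldB, double *C, int ldC )
{
  for ( int p=0; p<k; p++ )
    for ( int j=0; j<4; j++ ){
      double beta_pj = beta( p,j );              /* the single "B register" */
      for ( int i=0; i<4; i++ )                  /* axpy with a column of A */
        gamma( i,j ) += alpha( i,p ) * beta_pj;
    }
}

Whether the column of A and the micro-tile of C actually stay in registers is, in this scalar version, left to the compiler; the point is only to mirror the counting argument.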

2.3.5 Alternative view

YouTube: https://www.youtube.com/watch?v=FsRIYsoqrms We can arrive at the same algorithm by partitioning

\[
C = \begin{pmatrix}
C_{0,0} & C_{0,1} & \cdots & C_{0,N-1} \\
C_{1,0} & C_{1,1} & \cdots & C_{1,N-1} \\
\vdots & \vdots & & \vdots \\
C_{M-1,0} & C_{M-1,1} & \cdots & C_{M-1,N-1}
\end{pmatrix},
\quad
A = \begin{pmatrix} A_0 \\ A_1 \\ \vdots \\ A_{M-1} \end{pmatrix},
\quad
B = \begin{pmatrix} B_0 & B_1 & \cdots & B_{N-1} \end{pmatrix},
\]
where Ci,j is mR × nR, Ai is mR × k, and Bj is k × nR. Then computing

C := AB + C means updating Ci,j := AiBj + Ci,j for all i, j:

for j := 0, . . . , N − 1
  for i := 0, . . . , M − 1
    Ci,j := AiBj + Ci,j    (computed with the micro-kernel)
  end
end

Obviously, the order of the two loops can be switched. Again, the computation Ci,j := AiBj + Ci,j where Ci,j fits in registers now becomes what we will call the micro-kernel.

2.4 Optimizing the Micro-kernel

2.4.1 Vector registers and instructions

YouTube: https://www.youtube.com/watch?v=9L_0XNGuhv4 While the last unit introduced the notion of registers, modern CPUs accelerate computation by computing with small vectors of numbers (doubles) simultaneously. As the reader should have noticed by now, in matrix-matrix multiplication, for every floating point multiplication a corresponding floating point addition is encountered to accumulate the result:

γi,j := αi,pβp,j + γi,j

For this reason, such floating point computations are usually cast in terms of fused multiply add (FMA) operations, performed by a floating point unit (FPU) of the core. What is faster than computing one FMA at a time? Computing multiple FMAs at a time! For this reason, modern cores compute with small vectors of data, performing the same FMA on corresponding elements in those vectors, which is referred to as "SIMD" computation: Single-Instruction, Multiple Data.

This exploits instruction-level parallelism. Let's revisit the computation
\[
\begin{pmatrix}
\gamma_{0,0} & \gamma_{0,1} & \gamma_{0,2} & \gamma_{0,3} \\
\gamma_{1,0} & \gamma_{1,1} & \gamma_{1,2} & \gamma_{1,3} \\
\gamma_{2,0} & \gamma_{2,1} & \gamma_{2,2} & \gamma_{2,3} \\
\gamma_{3,0} & \gamma_{3,1} & \gamma_{3,2} & \gamma_{3,3}
\end{pmatrix}
+:=
\begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix}
\begin{pmatrix} \beta_{p,0} & \beta_{p,1} & \beta_{p,2} & \beta_{p,3} \end{pmatrix}
=
\begin{pmatrix}
\beta_{p,0} \begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix} &
\beta_{p,1} \begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix} &
\beta_{p,2} \begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix} &
\beta_{p,3} \begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix}
\end{pmatrix}
\]
that is at the core of the micro-kernel. If a vector register has length four, then it can store four (double precision) numbers. Let's load one such vector register with a column of the submatrix of C, a second vector register with the vector from A, and a third with an element of B that has been duplicated:

\[
\begin{pmatrix} \gamma_{0,0} \\ \gamma_{1,0} \\ \gamma_{2,0} \\ \gamma_{3,0} \end{pmatrix},
\quad
\begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix},
\quad
\begin{pmatrix} \beta_{p,0} \\ \beta_{p,0} \\ \beta_{p,0} \\ \beta_{p,0} \end{pmatrix}.
\]

A vector instruction that simultaneously performs FMAs with each tuple (γi,0, αi,p, βp,0) can then be performed:

γ0,0 +:= α0,p × βp,0
γ1,0 +:= α1,p × βp,0
γ2,0 +:= α2,p × βp,0
γ3,0 +:= α3,p × βp,0

You may recognize that this setup is ideal for performing an axpy operation with a small vector (of size 4 in this example).
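As a self-contained toy illustration of this idea, the snippet below loads a length-4 column of C and a column of A into vector registers, broadcasts one element of B, and performs all four FMAs with a single instruction. The variable names and values are made up for illustration; the intrinsics themselves (from immintrin.h) are the ones used in the next unit, and compiling requires AVX2/FMA support (e.g., gcc -mavx2 -mfma).

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
  double c[4] = { 1, 1, 1, 1 };       /* a column of the 4x4 block of C   */
  double a[4] = { 1, 2, 3, 4 };       /* the current column of A          */
  double b    = 2.0;                  /* the element beta(p,0) of B       */

  __m256d gamma_0123_0 = _mm256_loadu_pd( c );        /* load column of C */
  __m256d alpha_0123_p = _mm256_loadu_pd( a );        /* load column of A */
  __m256d beta_p_0     = _mm256_broadcast_sd( &b );   /* duplicate beta   */

  /* gamma_i0 +:= alpha_ip * beta_p0 for i = 0, 1, 2, 3, in one instruction */
  gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_0, gamma_0123_0 );
  _mm256_storeu_pd( c, gamma_0123_0 );

  printf( "%4.1f %4.1f %4.1f %4.1f\n", c[0], c[1], c[2], c[3] );  /* 3 5 7 9 */
  return 0;
}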

2.4.2 Implementing the micro-kernel with vector instructions

YouTube: https://www.youtube.com/watch?v=VhCUf9tPteU Remark 2.4.1 This unit is one of the most important ones in the course. Take your time to understand it. The syntax of the intrinsic library is less important than the concepts: as new architectures become popular, new vector instructions will be introduced.

In this section, we use the AVX2 vector instruction set to illustrate the ideas. Vector intrinsic functions support the use of these instructions from within the C programming language. You discover how the intrinsic operations are used by incorporating them into an implementation of the micro-kernel for the case where mR × nR = 4 × 4. Thus, for the remainder of this unit, C is mR × nR = 4 × 4, A is mR × k = 4 × k, and B is k × nR = k × 4. The architectures we target have 256-bit vector registers, which means they can store four double precision floating point numbers. Vector operations with vector registers hence operate with vectors of size four.
Let's recap: Partitioning
\[
C = \begin{pmatrix} C_{0,0} & \cdots & C_{0,N-1} \\ \vdots & & \vdots \\ C_{M-1,0} & \cdots & C_{M-1,N-1} \end{pmatrix},
\quad
A = \begin{pmatrix} A_0 \\ \vdots \\ A_{M-1} \end{pmatrix},
\quad
B = \begin{pmatrix} B_0 & \cdots & B_{N-1} \end{pmatrix},
\]
the algorithm we discovered in Section 2.3 is given by

for i := 0, . . . , M − 1
  for j := 0, . . . , N − 1
    Load Ci,j into registers
    Ci,j := AiBj + Ci,j with micro-kernel
    Store Ci,j to memory
  end
end

and translates into the code in Figure 2.4.2.

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  if ( m % MR != 0 || n % NR != 0 ){
    printf( "m and n must be multiples of MR and NR, respectively \n" );
    exit( 0 );
  }

  for ( int j=0; j<n; j+=NR )     /* loop over micro-tile columns of C */
    for ( int i=0; i<m; i+=MR )   /* loop over micro-tile rows of C    */
      Gemm_MRxNRKernel( k, &alpha( i,0 ), ldA, &beta( 0,j ), ldB,
                        &gamma( i,j ), ldC );
}

Figure 2.4.2: General routine for calling a mR × nR kernel in Assignments/Week2/C/Gemm_JI_MRxNRKernel.c. The constants MR and NR are specified at compile time by passing -D'MR=??' and -D'NR=??' to the compiler, where the ??s equal the desired choices of mR and nR (see also the Makefile in Assignments/Week2/C).

Let's drop the subscripts and focus on computing C +:= AB, when C is 4 × 4, with the micro-kernel. This translates to the computation

\[
\begin{pmatrix}
\gamma_{0,0} & \gamma_{0,1} & \gamma_{0,2} & \gamma_{0,3} \\
\gamma_{1,0} & \gamma_{1,1} & \gamma_{1,2} & \gamma_{1,3} \\
\gamma_{2,0} & \gamma_{2,1} & \gamma_{2,2} & \gamma_{2,3} \\
\gamma_{3,0} & \gamma_{3,1} & \gamma_{3,2} & \gamma_{3,3}
\end{pmatrix}
+:=
\begin{pmatrix}
\alpha_{0,0} & \alpha_{0,1} & \cdots \\
\alpha_{1,0} & \alpha_{1,1} & \cdots \\
\alpha_{2,0} & \alpha_{2,1} & \cdots \\
\alpha_{3,0} & \alpha_{3,1} & \cdots
\end{pmatrix}
\begin{pmatrix}
\beta_{0,0} & \beta_{0,1} & \beta_{0,2} & \beta_{0,3} \\
\beta_{1,0} & \beta_{1,1} & \beta_{1,2} & \beta_{1,3} \\
\vdots & \vdots & \vdots & \vdots
\end{pmatrix}
\]
\[
=
\begin{pmatrix}
\alpha_{0,0}\beta_{0,0}+\alpha_{0,1}\beta_{1,0}+\cdots & \alpha_{0,0}\beta_{0,1}+\alpha_{0,1}\beta_{1,1}+\cdots & \alpha_{0,0}\beta_{0,2}+\alpha_{0,1}\beta_{1,2}+\cdots & \alpha_{0,0}\beta_{0,3}+\alpha_{0,1}\beta_{1,3}+\cdots \\
\alpha_{1,0}\beta_{0,0}+\alpha_{1,1}\beta_{1,0}+\cdots & \alpha_{1,0}\beta_{0,1}+\alpha_{1,1}\beta_{1,1}+\cdots & \alpha_{1,0}\beta_{0,2}+\alpha_{1,1}\beta_{1,2}+\cdots & \alpha_{1,0}\beta_{0,3}+\alpha_{1,1}\beta_{1,3}+\cdots \\
\alpha_{2,0}\beta_{0,0}+\alpha_{2,1}\beta_{1,0}+\cdots & \alpha_{2,0}\beta_{0,1}+\alpha_{2,1}\beta_{1,1}+\cdots & \alpha_{2,0}\beta_{0,2}+\alpha_{2,1}\beta_{1,2}+\cdots & \alpha_{2,0}\beta_{0,3}+\alpha_{2,1}\beta_{1,3}+\cdots \\
\alpha_{3,0}\beta_{0,0}+\alpha_{3,1}\beta_{1,0}+\cdots & \alpha_{3,0}\beta_{0,1}+\alpha_{3,1}\beta_{1,1}+\cdots & \alpha_{3,0}\beta_{0,2}+\alpha_{3,1}\beta_{1,2}+\cdots & \alpha_{3,0}\beta_{0,3}+\alpha_{3,1}\beta_{1,3}+\cdots
\end{pmatrix}
\]
\[
=
\begin{pmatrix} \alpha_{0,0} \\ \alpha_{1,0} \\ \alpha_{2,0} \\ \alpha_{3,0} \end{pmatrix}
\begin{pmatrix} \beta_{0,0} & \beta_{0,1} & \beta_{0,2} & \beta_{0,3} \end{pmatrix}
+
\begin{pmatrix} \alpha_{0,1} \\ \alpha_{1,1} \\ \alpha_{2,1} \\ \alpha_{3,1} \end{pmatrix}
\begin{pmatrix} \beta_{1,0} & \beta_{1,1} & \beta_{1,2} & \beta_{1,3} \end{pmatrix}
+ \cdots
\]

Thus, updating 4 × 4 matrix C can be implemented as a loop around rank-1 updates:

for p = 0, . . . , k − 1
\[
\begin{pmatrix}
\gamma_{0,0} & \gamma_{0,1} & \gamma_{0,2} & \gamma_{0,3} \\
\gamma_{1,0} & \gamma_{1,1} & \gamma_{1,2} & \gamma_{1,3} \\
\gamma_{2,0} & \gamma_{2,1} & \gamma_{2,2} & \gamma_{2,3} \\
\gamma_{3,0} & \gamma_{3,1} & \gamma_{3,2} & \gamma_{3,3}
\end{pmatrix}
+:=
\begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix}
\begin{pmatrix} \beta_{p,0} & \beta_{p,1} & \beta_{p,2} & \beta_{p,3} \end{pmatrix}
\]
end

or, equivalently, to emphasize computations with vectors,

for p = 0, . . . , k − 1
\[
\begin{pmatrix}
\gamma_{0,0} +:= \alpha_{0,p} \times \beta_{p,0} & \gamma_{0,1} +:= \alpha_{0,p} \times \beta_{p,1} & \gamma_{0,2} +:= \alpha_{0,p} \times \beta_{p,2} & \gamma_{0,3} +:= \alpha_{0,p} \times \beta_{p,3} \\
\gamma_{1,0} +:= \alpha_{1,p} \times \beta_{p,0} & \gamma_{1,1} +:= \alpha_{1,p} \times \beta_{p,1} & \gamma_{1,2} +:= \alpha_{1,p} \times \beta_{p,2} & \gamma_{1,3} +:= \alpha_{1,p} \times \beta_{p,3} \\
\gamma_{2,0} +:= \alpha_{2,p} \times \beta_{p,0} & \gamma_{2,1} +:= \alpha_{2,p} \times \beta_{p,1} & \gamma_{2,2} +:= \alpha_{2,p} \times \beta_{p,2} & \gamma_{2,3} +:= \alpha_{2,p} \times \beta_{p,3} \\
\gamma_{3,0} +:= \alpha_{3,p} \times \beta_{p,0} & \gamma_{3,1} +:= \alpha_{3,p} \times \beta_{p,1} & \gamma_{3,2} +:= \alpha_{3,p} \times \beta_{p,2} & \gamma_{3,3} +:= \alpha_{3,p} \times \beta_{p,3}
\end{pmatrix}
\]
end

This, once again, shows how matrix-matrix multiplication can be cast in terms of a sequence of rank-1 updates, and that a rank-1 update can be implemented in terms of axpy operations with columns of C. This micro-kernel translates into the code that employs vector instructions, given in Figure 2.4.3. We hope that the intrinsic function calls are relatively self-explanatory. However, you may want to consult Intel's Intrinsics Reference Guide. When doing so, you may want to search on the name of the routine, without checking the AVX2 box on the left.

#include <immintrin.h>

void Gemm_MRxNRKernel( int k, double *A, int ldA, double *B, int ldB,
                       double *C, int ldC )
{
  /* Declare vector registers to hold 4x4 C and load them */
  __m256d gamma_0123_0 = _mm256_loadu_pd( &gamma( 0,0 ) );
  __m256d gamma_0123_1 = _mm256_loadu_pd( &gamma( 0,1 ) );
  __m256d gamma_0123_2 = _mm256_loadu_pd( &gamma( 0,2 ) );
  __m256d gamma_0123_3 = _mm256_loadu_pd( &gamma( 0,3 ) );

  /* Declare a vector register for the broadcast element of B (reused for
     j = 0, 1, 2, 3, as discussed in Unit 2.4.3). */
  __m256d beta_p_j;

  for ( int p=0; p<k; p++ ){

    /* Declare a vector register to hold the current column of A and load
       it with the four elements of that column. */
    __m256d alpha_0123_p = _mm256_loadu_pd( &alpha( 0,p ) );

    /* Load/broadcast beta( p,0 ). */
    beta_p_j = _mm256_broadcast_sd( &beta( p,0 ) );

    /* Update the first column of C with the current column of A times
       beta( p,0 ). */
    gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );

    /* REPEAT for second, third, and fourth columns of C.  Notice that the
       current column of A need not be reloaded. */

}

  /* Store the updated results */
  _mm256_storeu_pd( &gamma(0,0), gamma_0123_0 );
  _mm256_storeu_pd( &gamma(0,1), gamma_0123_1 );
  _mm256_storeu_pd( &gamma(0,2), gamma_0123_2 );
  _mm256_storeu_pd( &gamma(0,3), gamma_0123_3 );
}

Figure 2.4.3: Partially instantiated mR × nR kernel for the case where mR = nR = 4 (see Assignments/Week2/C/Gemm_4x4Kernel.c).

An illustration of how Figure 2.4.2 and Figure 2.4.3 compute is found in Figure 2.4.4.

Figure 2.4.4: Illustration of how the routines in Figure 2.4.2 and Figure 2.4.3 index into the matrices.

2.4.3 Details

In this unit, we give details regarding the partial code in Figure 2.4.3 and illustrated in Figure 2.4.4. To use the intrinsic functions, we start by including the header file immintrin.h.

#include <immintrin.h>

The declaration

__m256d gamma_0123_0 = _mm256_loadu_pd( &gamma( 0,0 ) );

creates gamma_0123_0 as a variable that references a vector register with four double precision numbers and loads it with the four numbers that are stored starting at address &gamma( 0,0 ). In other words, it loads that vector register with the original values
\[
\begin{pmatrix} \gamma_{0,0} \\ \gamma_{1,0} \\ \gamma_{2,0} \\ \gamma_{3,0} \end{pmatrix}.
\]

This is repeated for the other three columns of C:

__m256d gamma_0123_1 = _mm256_loadu_pd( &gamma( 0,1 ) );
__m256d gamma_0123_2 = _mm256_loadu_pd( &gamma( 0,2 ) );
__m256d gamma_0123_3 = _mm256_loadu_pd( &gamma( 0,3 ) );

The loop in Figure 2.4.3 implements

for p = 0, . . . , k − 1
\[
\begin{pmatrix}
\gamma_{0,0} +:= \alpha_{0,p} \times \beta_{p,0} & \gamma_{0,1} +:= \alpha_{0,p} \times \beta_{p,1} & \gamma_{0,2} +:= \alpha_{0,p} \times \beta_{p,2} & \gamma_{0,3} +:= \alpha_{0,p} \times \beta_{p,3} \\
\gamma_{1,0} +:= \alpha_{1,p} \times \beta_{p,0} & \gamma_{1,1} +:= \alpha_{1,p} \times \beta_{p,1} & \gamma_{1,2} +:= \alpha_{1,p} \times \beta_{p,2} & \gamma_{1,3} +:= \alpha_{1,p} \times \beta_{p,3} \\
\gamma_{2,0} +:= \alpha_{2,p} \times \beta_{p,0} & \gamma_{2,1} +:= \alpha_{2,p} \times \beta_{p,1} & \gamma_{2,2} +:= \alpha_{2,p} \times \beta_{p,2} & \gamma_{2,3} +:= \alpha_{2,p} \times \beta_{p,3} \\
\gamma_{3,0} +:= \alpha_{3,p} \times \beta_{p,0} & \gamma_{3,1} +:= \alpha_{3,p} \times \beta_{p,1} & \gamma_{3,2} +:= \alpha_{3,p} \times \beta_{p,2} & \gamma_{3,3} +:= \alpha_{3,p} \times \beta_{p,3}
\end{pmatrix}
\]
end

leaving the result in the vector registers. Each iteration starts by declaring vector register variable alpha_0123_p and loading it with the contents of
\[
\begin{pmatrix} \alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p} \end{pmatrix}.
\]

__m256d alpha_0123_p = _mm256_loadu_pd( &alpha( 0,p ) );

Next, βp,0 is loaded into a vector register, broadcasting (duplicating) that value to each entry in that register:

beta_p_j = _mm256_broadcast_sd( &beta( p,0 ) );

This variable is declared earlier in the routine as beta_p_j because it is reused for j = 0, 1, 2, 3. The command

gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );

then performs the computation
\[
\begin{pmatrix}
\gamma_{0,0} +:= \alpha_{0,p} \times \beta_{p,0} \\
\gamma_{1,0} +:= \alpha_{1,p} \times \beta_{p,0} \\
\gamma_{2,0} +:= \alpha_{2,p} \times \beta_{p,0} \\
\gamma_{3,0} +:= \alpha_{3,p} \times \beta_{p,0}
\end{pmatrix}
\]
illustrated

in Figure 2.4.3. Notice that we use beta_p_j for βp,0 because that same vector register will be used for βp,j with j = 0, 1, 2, 3. We leave it to the reader to add the commands that compute

\[
\begin{pmatrix}
\gamma_{0,1} +:= \alpha_{0,p} \times \beta_{p,1} \\
\gamma_{1,1} +:= \alpha_{1,p} \times \beta_{p,1} \\
\gamma_{2,1} +:= \alpha_{2,p} \times \beta_{p,1} \\
\gamma_{3,1} +:= \alpha_{3,p} \times \beta_{p,1}
\end{pmatrix},
\quad
\begin{pmatrix}
\gamma_{0,2} +:= \alpha_{0,p} \times \beta_{p,2} \\
\gamma_{1,2} +:= \alpha_{1,p} \times \beta_{p,2} \\
\gamma_{2,2} +:= \alpha_{2,p} \times \beta_{p,2} \\
\gamma_{3,2} +:= \alpha_{3,p} \times \beta_{p,2}
\end{pmatrix},
\quad\text{and}\quad
\begin{pmatrix}
\gamma_{0,3} +:= \alpha_{0,p} \times \beta_{p,3} \\
\gamma_{1,3} +:= \alpha_{1,p} \times \beta_{p,3} \\
\gamma_{2,3} +:= \alpha_{2,p} \times \beta_{p,3} \\
\gamma_{3,3} +:= \alpha_{3,p} \times \beta_{p,3}
\end{pmatrix}
\]
in Assignments/Week2/C/Gemm_4x4Kernel.c. Upon completion of the loop, the results are stored back into the original arrays with the commands

_mm256_storeu_pd( &gamma(0,0), gamma_0123_0 );
_mm256_storeu_pd( &gamma(0,1), gamma_0123_1 );
_mm256_storeu_pd( &gamma(0,2), gamma_0123_2 );
_mm256_storeu_pd( &gamma(0,3), gamma_0123_3 );

Homework 2.4.3.1 Complete the code in Assignments/Week2/C/Gemm_4x4Kernel.c and execute it with make JI_4x4Kernel

View its performance with Assignments/Week2/C/data/Plot_register_blocking.mlx.
Solution. Assignments/Week2/Answers/Gemm_4x4Kernel.c
This is the performance on Robert's laptop, with MB=NB=KB=4:

YouTube: https://www.youtube.com/watch?v=hUW--tJPPcw

Remark 2.4.5 The form of parallelism illustrated here is often referred to as Single Instruction, Multiple Data (SIMD) parallelism since the same operation (an FMA) is executed with multiple data (four values of C and A and a duplicated value from B).
We are starting to see some progress towards higher performance.

2.4.4 More options

You have now experienced the fact that modern architectures have vector registers and how to (somewhat) optimize the update of a 4 × 4 submatrix of C with vector instructions.
In the implementation in Unit 2.4.2, the block of C kept in registers is 4 × 4. In general, the block is mR × nR. If we assume we will use RC registers for elements of the submatrix of C, what should mR and nR be to attain the best performance?
A big part of optimizing is to amortize the cost of moving data over useful computation. In other words, we want the ratio between the number of flops that are performed and the amount of data that is moved between, for now, registers and memory to be high.
Let's recap: Partitioning

\[
C = \begin{pmatrix} C_{0,0} & \cdots & C_{0,N-1} \\ \vdots & & \vdots \\ C_{M-1,0} & \cdots & C_{M-1,N-1} \end{pmatrix},
\quad
A = \begin{pmatrix} A_0 \\ \vdots \\ A_{M-1} \end{pmatrix},
\quad
B = \begin{pmatrix} B_0 & \cdots & B_{N-1} \end{pmatrix},
\]
the algorithm we discovered in Section 2.3 is given by

for i := 0, . . . , M − 1
  for j := 0, . . . , N − 1
    Load Ci,j into registers
    Ci,j := AiBj + Ci,j with micro-kernel
    Store Ci,j to memory
  end
end

We analyzed that its cost is given by

\[
2mnk\,\gamma_R + \left( 2mn + mnk\left( \frac{1}{n_R} + \frac{1}{m_R} \right) \right) \beta_{R\leftrightarrow M}.
\]

The ratio between flops and memory operations between the registers and memory is then

\[
\frac{2mnk}{2mn + mnk\left( \frac{1}{n_R} + \frac{1}{m_R} \right)}.
\]

If k is large, then 2mn (the cost of loading and storing the mR ×nR submatrices of C) can be ignored in the denominator, yielding, approximately,

\[
\frac{2mnk}{mnk\left( \frac{1}{n_R} + \frac{1}{m_R} \right)}
= \frac{2}{\frac{1}{n_R} + \frac{1}{m_R}}
= \frac{2}{\frac{m_R}{m_R n_R} + \frac{n_R}{m_R n_R}}
= \frac{2 m_R n_R}{m_R + n_R}.
\]

This is the ratio of floating point operations to memory operations that we want to be high.

If mR = nR = 4 then this ratio is 4. For every memory operation (read) of an element of A or B, approximately 4 floating point operations are performed with data that resides in registers.

Homework 2.4.4.1 Modify Assignments/Week2/C/Gemm_4x4Kernel.c to implement the case where mR = 8 and nR = 4, storing the result in Gemm_8x4Kernel.c. You can test the result by executing make JI_8x4Kernel in that directory. View the resulting performance by appropriately modifying Assignments/Week2/C/data/Plot_optimize_MRxNR.mlx.
Hint. A vector register can only hold four doubles... Each column now has eight doubles. How many vector registers do you need to store a column?
Solution.

YouTube: https://www.youtube.com/watch?v=aDFpX3PsfPA

Assignments/Week2/Answers/Gemm_8x4Kernel.c.

Homework 2.4.4.2 Consider the implementation in Homework 2.4.4.1 with mR × nR = 8 × 4. • How many vector registers are needed?

• Ignoring the cost of loading the registers with the 8 × 4 submatrix of C, what is the ratio of flops to loads?

Solution. Number of registers: 8 (for the 8 × 4 block of C) + 3 (two for a column of A and one for the broadcast element of B) = 11.

Ratio: 64 flops / 12 loads ≈ 5.33 flops/load.

Homework 2.4.4.3 We have considered mR × nR = 4 × 4 and mR × nR = 8 × 4, where elements of A are loaded without duplication into vector registers (and hence mR must be a multiple of 4), and elements of B are loaded/broadcast. Extending this approach to loading A and B, complete the entries in the following table:

mR × nR    # of vector regs    flops/load
4 × 1                             /    =
4 × 2                             /    =
4 × 4            6              32/  8 = 4
4 × 8                             /    =
4 × 12                            /    =
4 × 14                            /    =
8 × 1                             /    =
8 × 2                             /    =
8 × 4                             /    =
8 × 6                             /    =
8 × 8                             /    =
12 × 1                            /    =
12 × 2                            /    =
12 × 4                            /    =
12 × 6                            /    =

Solution.

mR × nR    # of vector regs    flops/load
4 × 1             3              8 /  5 = 1.60
4 × 2             4             16 /  6 = 2.66
4 × 4             6             32 /  8 = 4.00
4 × 8            10             64 / 12 = 5.33
4 × 12           14             96 / 16 = 6.00
4 × 14           16            112 / 18 = 6.22
8 × 1             5             16 /  9 = 1.78
8 × 2             7             32 / 10 = 3.20
8 × 4            11             64 / 12 = 5.33
8 × 6            15             96 / 14 = 6.86
8 × 8            19            128 / 16 = 8.00
12 × 1            7             24 / 13 = 1.85
12 × 2           10             48 / 14 = 3.43
12 × 4           16             96 / 16 = 6.00
12 × 6           22            144 / 18 = 8.00
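The pattern behind these entries can be captured in a few lines of C. The short throwaway program below (an illustration only, not part of the course materials) assumes a vector length of four doubles, that A is loaded without duplication (so mR is a multiple of 4), and that B is loaded/broadcast: a micro-tile then uses (mR/4)·nR registers, a column of A uses mR/4 registers, one register holds the broadcast element of B, and each iteration of the p loop performs 2·mR·nR flops against mR + nR loads.

#include <stdio.h>

int main(void)
{
  int mRs[] = { 4, 8, 12 };
  int nRs[] = { 1, 2, 4, 6, 8, 12, 14 };

  for ( int i=0; i<3; i++ )
    for ( int j=0; j<7; j++ ){
      int mR = mRs[ i ], nR = nRs[ j ];
      int regs  = (mR/4)*nR + mR/4 + 1;   /* micro-tile + column of A + beta */
      int flops = 2*mR*nR;                /* one FMA per element of C, per p */
      int loads = mR + nR;                /* column of A + broadcasts of B   */
      printf( "%2d x %2d : %2d vector regs, %3d/%2d = %4.2f flops/load\n",
              mR, nR, regs, flops, loads, (double) flops / loads );
    }
  return 0;
}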

Remark 2.4.6 Going forward, keep in mind that the cores of the architectures we target only have 16 vector registers that can store four doubles each.

Homework 2.4.4.4 At this point, you have already implemented the following kernels: Gemm_4x4Kernel.c and Gemm_8x4Kernel.c. Implement as many of the more promising of the kernels you analyzed in the last homework as you like. Your implementations can be executed by typing

make JI_?x?Kernel

where the ?'s are replaced with the obvious choices of mR and nR. The resulting performance can again be viewed with the Live Script in Assignments/Week2/C/data/Plot_optimize_MRxNR.mlx.
Solution.

• Assignments/Week2/Answers/Gemm_4x1Kernel.c
• Assignments/Week2/Answers/Gemm_4x2Kernel.c
• Assignments/Week2/Answers/Gemm_4x4Kernel.c
• Assignments/Week2/Answers/Gemm_4x8Kernel.c
• Assignments/Week2/Answers/Gemm_4x12Kernel.c
• Assignments/Week2/Answers/Gemm_4x14Kernel.c

• Assignments/Week2/Answers/Gemm_8x1Kernel.c
• Assignments/Week2/Answers/Gemm_8x2Kernel.c
• Assignments/Week2/Answers/Gemm_8x4Kernel.c
• Assignments/Week2/Answers/Gemm_8x6Kernel.c
• Assignments/Week2/Answers/Gemm_8x8Kernel.c

• Assignments/Week2/Answers/Gemm_12x1Kernel.c
• Assignments/Week2/Answers/Gemm_12x2Kernel.c
• Assignments/Week2/Answers/Gemm_12x4Kernel.c
• Assignments/Week2/Answers/Gemm_12x6Kernel.c

On Robert's laptop:
Performance of implementations for mR = 4 and various choices for nR:

Performance of implementations for mR = 8 and various choices for nR:

Performance of implementations for mR = 12 and various choices for nR:

2.4.5 Optimally amortizing data movement

Homework 2.4.5.1 In the discussion in the last unit, assume that a fixed number of elements in the mR × nR submatrix Ci,j can be stored in registers: mRnR = K. What choices of mR and nR maximize
\[
\frac{2 m_R n_R}{m_R + n_R},
\]
the ratio of useful operations (floating point operations) to overhead (memory operations)?
Of course, our approach so far restricts mR to be an integer multiple of the vector length, so that adds another constraint. Let's not worry about that.
Hint. You want to maximize
\[
\frac{2 m_R n_R}{m_R + n_R}
\]
under the constraint that mRnR = K. If you took a course in calculus, you may have been asked to maximize xy under the constraint that 2(x + y) is constant. It might have been phrased as "You have a fixed amount of fencing to enclose a rectangular area. How should you pick the length, x, and width, y, of the rectangle?" If you think about it, the answer to this homework is closely related to this question about fencing.
Answer. mR = nR = √K.

Solution. Let's use x for mR and y for nR. We want to find x and y that maximize
\[
\frac{2K}{x + y}
\]
under the constraint that xy = K. This is equivalent to finding x and y that minimize x + y under the constraint that xy = K. Letting y = K/x, we find that we need to minimize
\[
f(x) = x + \frac{K}{x}.
\]
By setting the derivative to zero, we find that the desired x must satisfy
\[
1 - \frac{K}{x^2} = 0.
\]
Manipulating this yields x² = K and hence x = √K and y = K/x = √K.
Strictly speaking, you should check that this is a minimum of f(x) by examining the second derivative and checking that it is positive at the extremum.

Remark 2.4.7 In practice, we have seen that the shape of the block of C kept in registers is not square, for a number of reasons:

• The formula for the number of vector registers that are used, when the vector length is four, is
\[
\frac{m_R}{4} n_R + \frac{m_R}{4} + 1.
\]
Even if the number of registers is a "nice number", it may not be optimal for mR to equal nR.

• For now, mR has to be a multiple of the vector length. Later we will see that we could instead choose nR to be a multiple of the vector length. Regardless, this limits the choices.

• The (load and) broadcast may be a more expensive operation than the load, on a per-double-loaded basis.

• There are other issues that have to do with how instructions are pipelined that are discussed in a paper mentioned in Unit 2.5.1.

2.5 Enrichments

2.5.1 Lower bound on data movement

The discussion in this enrichment was inspired by the paper
[23] Tyler Michael Smith, Bradley Lowery, Julien Langou, Robert A. van de Geijn. Tight I/O Lower Bound for Matrix Multiplication. Submitted to ACM Transactions on Mathematical Software. Draft available from arXiv.org.
For more details, we encourage you to read that paper.

2.5.1.1 Reasoning about optimality

Early in our careers, we learned that if you say that an implementation is optimal, you better prove that it is optimal.
In our empirical studies, graphing the measured performance, we can compare our achieved results to the theoretical peak. Obviously, if we achieved theoretical peak performance, then we would know that the implementation is optimal. The problem is that we rarely achieve the theoretical peak performance as computed so far (multiplying the clock rate by the number of floating point operations that can be performed per clock cycle).
In order to claim optimality, one must carefully model an architecture and compute, through analysis, the exact limit of what it theoretically can achieve. Then, one can check achieved performance against the theoretical limit, and make a claim. Usually, this is also not practical.
In practice, one creates a model of computation for a simplified architecture. With that, one then computes a theoretical limit on performance. The next step is to show that the theoretical limit can be (nearly) achieved by an algorithm that executes on that simplified architecture. This then says something about the optimality of the algorithm under idealized circumstances. By

finally comparing and contrasting the simplified architecture with an actual architecture, and the algorithm that targets the simplified architecture with an actual algorithm designed for the actual architecture, one can reason about the optimality, or lack thereof, of the practical algorithm.

2.5.1.2 A simple model

Let us give a simple model of computation that matches what we have assumed so far when programming matrix-matrix multiplication:

• We wish to compute C := AB + C, where C, A, and B are m × n, m × k, and k × n, respectively.

• The computation is cast in terms of FMAs.

• Our machine has two layers of memory: fast memory (registers) and slow memory (main memory).

• Initially, data reside in main memory.

• To compute a FMA, all three operands must be in fast memory.

• Fast memory can hold at most S floats.

• Slow memory is large enough that its size is not relevant to this analysis.

• Computation cannot be overlapped with data movement.

Notice that this model matches pretty well how we have viewed our processor so far.

2.5.1.3 Minimizing data movement

We have seen that matrix-matrix multiplication requires m × n × k FMA operations, or 2mnk flops. Executing floating point operations constitutes useful computation. Moving data between slow memory and fast memory is overhead since we assume it cannot be overlapped. Hence, under our simplified model, if an algorithm only performs the minimum number of flops (namely 2mnk), minimizes the time spent moving data between memory layers, and we at any given time are either performing useful computation (flops) or moving data, then we can argue that (under our model) the algorithm is optimal.
We now focus the argument by reasoning about a lower bound on the number of data that must be moved from fast memory to slow memory. We will build the argument with a sequence of observations.
Consider the loop

for p := 0, . . . , k − 1
  for j := 0, . . . , n − 1
    for i := 0, . . . , m − 1
      γi,j := αi,pβp,j + γi,j
    end
  end
end

One can view the computations

γi,j := αi,pβp,j + γi,j
as a cube of points in 3D, (i, j, p) for 0 ≤ i < m, 0 ≤ j < n, 0 ≤ p < k. The set of all possible algorithms that execute each such update only once can be viewed as an arbitrary ordering on that set of points. We can think of this as indexing the set of all such triples with ir, jr, pr, 0 ≤ r < m × n × k:

(ir, jr, pr),

so that the algorithm that computes C := AB + C can then be written as

for r := 0, . . . , mnk − 1
  γir,jr := αir,pr βpr,jr + γir,jr
end

Obviously, this puts certain restrictions on ir, jr, and pr. Articulating those exactly is not important right now.
We now partition the ordered set 0, . . . , mnk − 1 into ordered contiguous subranges (phases), each of which requires S + M distinct elements from A, B, and C (e.g., S elements from A, M/3 elements of B, and 2M/3 elements of C), except for the last phase, which will contain fewer. (Strictly speaking, it is a bit more complicated than splitting the range of the iterations, since the last FMA may require anywhere from 0 to 3 new elements to be loaded from slow memory. Fixing this is a matter of thinking of the loads that are required as separate from the computation (as our model does) and then splitting the operations - loads, FMAs, and stores - into phases rather than the range. This does not change our analysis.) Recall that S equals the number of floats that fit in fast memory. A typical phase will start with S elements in fast memory, and will in addition read M elements from slow memory (except for the final phase). We will call such an ordered subset a phase of triples. Such a typical phase will start at r = R and consist of F triples, (iR, jR, pR) through (iR+F−1, jR+F−1, pR+F−1). Let us denote the set of these triples by D. These represent F FMAs being performed in our algorithm. Even the first phase needs to read at least M elements since it will require S + M elements to be read.
The key question now becomes what the upper bound on the number of FMAs is that can be performed with S + M elements. Let us denote this bound by Fmax. If we know Fmax as a function of S + M, then we know that at least mnk/Fmax − 1 phases of FMAs need to be executed, where each of those phases requires at least M reads from slow memory. The total number of reads required for any algorithm is thus at least
\[
\left( \frac{mnk}{F_{\max}} - 1 \right) M. \qquad (2.5.1)
\]

To find Fmax we make a few observations:

• A typical triple (ir, jr, pr) ∈ D represents an FMA that requires one element from each of the matrices C, A, and B: γir,jr, αir,pr, and βpr,jr.

• The set of all elements from C, γir,jr, that are needed for the computations represented by the triples in D are indexed with the tuples CD = {(ir, jr) | R ≤ r < R + F}. If we think of the triples in D as points in 3D, then CD is the projection of those points onto the i, j plane. Its size, |CD|, tells us how many elements of C must at some point be in fast memory during that phase of computations.

• The set of all elements from A, αir,pr, that are needed for the computations represented by the triples in D are indexed with the tuples AD = {(ir, pr) | R ≤ r < R + F}. If we think of the triples (ir, jr, pr) as points in 3D, then AD is the projection of those points onto the i, p plane. Its size, |AD|, tells us how many elements of A must at some point be in fast memory during that phase of computations.

• The set of all elements from B, βpr,jr, that are needed for the computations represented by the triples in D are indexed with the tuples BD = {(pr, jr) | R ≤ r < R + F}. If we think of the triples (ir, jr, pr) as points in 3D, then BD is the projection of those points onto the p, j plane. Its size, |BD|, tells us how many elements of B must at some point be in fast memory during that phase of computations.

Now, there is a result known as the discrete Loomis-Whitney inequality that tells us that in our situation $|D| \le \sqrt{|C_D||A_D||B_D|}$. In other words, $F_{\max} \le \sqrt{|C_D||A_D||B_D|}$. The name of the game now becomes to find the largest value Fmax that satisfies
\[
\text{maximize } F_{\max} \text{ such that }
\begin{cases}
F_{\max} \le \sqrt{|C_D||A_D||B_D|} \\
|C_D| > 0,\ |A_D| > 0,\ |B_D| > 0 \\
|C_D| + |A_D| + |B_D| = S + M.
\end{cases}
\]

An application of the technique known as Lagrange multipliers yields the solution
\[
|C_D| = |A_D| = |B_D| = \frac{S + M}{3}
\quad\text{and}\quad
F_{\max} = \frac{(S + M)\sqrt{S + M}}{3\sqrt{3}}.
\]

With that largest Fmax we can then establish a lower bound on the number of memory reads given by (2.5.1):

\[
\left( \frac{mnk}{F_{\max}} - 1 \right) M = \left( 3\sqrt{3}\,\frac{mnk}{(S + M)\sqrt{S + M}} - 1 \right) M.
\]

Now, M is a free variable. To come up with the sharpest (best) lower bound, we want the largest lower bound. It turns out that, using techniques from calculus, one can show that M = 2S maximizes the lower bound. Thus, the best lower bound our analysis yields is given by
\[
\left( 3\sqrt{3}\,\frac{mnk}{(3S)\sqrt{3S}} - 1 \right)(2S) = 2\,\frac{mnk}{\sqrt{S}} - 2S.
\]

2.5.1.4 A nearly optimal algorithm

We now discuss a (nearly) optimal algorithm for our simplified architecture. Recall that we assume fast memory can hold S elements. For simplicity, assume S is a perfect square. Partition
\[
C = \begin{pmatrix} C_{0,0} & C_{0,1} & \cdots \\ C_{1,0} & C_{1,1} & \cdots \\ \vdots & \vdots & \end{pmatrix},
\quad
A = \begin{pmatrix} A_0 \\ A_1 \\ \vdots \end{pmatrix},
\quad\text{and}\quad
B = \begin{pmatrix} B_0 & B_1 & \cdots \end{pmatrix},
\]
where Ci,j is mR × nR, Ai is mR × k, and Bj is k × nR. Here we choose mR × nR = (√S − 1) × √S so that fast memory can hold one submatrix Ci,j, one column of Ai, and one element of Bj:
\[
m_R \times n_R + m_R + 1 = (\sqrt{S} - 1)\sqrt{S} + \sqrt{S} - 1 + 1 = S.
\]
When computing C := AB + C, we recognize that Ci,j := AiBj + Ci,j. Now, let's further partition

\[
A_i = \begin{pmatrix} a_{i,0} & a_{i,1} & \cdots \end{pmatrix}
\quad\text{and}\quad
B_j = \begin{pmatrix} b_{0,j}^T \\ b_{1,j}^T \\ \vdots \end{pmatrix}.
\]

We now recognize that Ci,j := AiBj + Ci,j can be computed as

\[
C_{i,j} := a_{i,0} b_{0,j}^T + a_{i,1} b_{1,j}^T + \cdots,
\]
the by now very familiar sequence of rank-1 updates that makes up the micro-kernel discussed in Unit 2.4.1. The following loop exposes the computation C := AB + C, including the loads and stores from and to slow memory:

for j := 0, . . . , N − 1
  for i := 0, . . . , M − 1
    Load Ci,j into fast memory
    for p := 0, . . . , k − 1
      Load ai,p and b^T_{p,j} into fast memory
      Ci,j := ai,p b^T_{p,j} + Ci,j
    end
    Store Ci,j to slow memory
  end
end

For simplicity, here M = m/mR and N = n/nR.
On the surface, this seems to require Ci,j, ai,p, and b^T_{p,j} to be in fast memory at the same time, placing
\[
m_R \times n_R + m_R + n_R = S + \sqrt{S} - 1
\]
floats in fast memory. However, we have seen before that the rank-1 update Ci,j := ai,p b^T_{p,j} + Ci,j can be implemented as a loop around axpy operations, so that the elements of b^T_{p,j} only need to be in fast memory one at a time.
Let us now analyze the number of memory operations incurred by this algorithm:

• Loading and storing all Ci,j incurs mn loads and mn stores, for a total of 2mn memory operations. (Each such block is loaded once and stored once, meaning every element of C is loaded once and stored once.)

• Loading all ai,p requires

\[
M N k\, m_R = (M m_R) N k = m\,\frac{n}{n_R}\,k = \frac{mnk}{\sqrt{S}}
\]
memory operations.

• Loading all b^T_{p,j} requires

\[
M N k\, n_R = M (N n_R) k = \frac{m}{m_R}\,n k = \frac{mnk}{\sqrt{S} - 1}
\]
memory operations.

The total number of memory operations is hence

\[
2mn + \frac{mnk}{\sqrt{S}} + \frac{mnk}{\sqrt{S} - 1} = 2\,\frac{mnk}{\sqrt{S}} + 2mn + \frac{mnk}{S - \sqrt{S}}.
\]
We can now compare this to the lower bound from the last unit:

\[
2\,\frac{mnk}{\sqrt{S}} - 2S.
\]
The cost of reading and writing elements of C, 2mn, contributes a lower order term, as does $\frac{mnk}{S - \sqrt{S}}$ if S (the size of fast memory) is reasonably large. Thus, the proposed algorithm is nearly optimal with regards to the amount of data that is moved between slow memory and fast memory.
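To get a feel for how close "nearly optimal" is, the short throwaway program below (an illustration only, not something from the course materials) evaluates the algorithm's memory-operation count, 2mn + mnk/√S + mnk/(√S − 1), and the lower bound, 2mnk/√S − 2S, for square problems and a fast memory of S doubles.

#include <stdio.h>
#include <math.h>

int main(void)
{
  double S = 64.0;               /* e.g., 16 vector registers of 4 doubles */
  for ( int n=48; n<=1536; n*=2 ){
    double m = n, k = n;
    double algorithm = 2.0*m*n + m*n*k/sqrt( S ) + m*n*k/( sqrt( S ) - 1.0 );
    double bound     = 2.0*m*n*k/sqrt( S ) - 2.0*S;
    printf( "n = %4d  algorithm = %12.0f memops  lower bound = %12.0f memops  ratio = %4.2f\n",
            n, algorithm, bound, algorithm/bound );
  }
  return 0;
}

As n grows, the ratio approaches the factor √S/(√S − 1) that separates the two expressions, plus lower order terms.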

2.5.1.5 Discussion

What we notice is that the algorithm presented in the last unit is quite similar to the algorithm that in the end delivered good performance in [23]. It utilizes most of fast memory (registers in [23]) with a submatrix of C. Both organize the computation in terms of a kernel that performs rank-1 updates of that submatrix of C.
The theory suggests that the number of memory operations is minimized if the block of C is chosen to be (roughly) square. In Unit 2.4.1, the best performance was observed with a kernel that chose the submatrix of C in registers to be 8×6, so for this architecture, the theory is supported by practice. For other architectures, there may be issues that skew the aspect ratio to be less square.

2.5.2 Strassen's algorithm

So far, we have discussed algorithms for computing C := AB + C via a triple-nested loop that then perform 2mnk flops. A question is whether this is the best we can do.
A classic result regarding this is Strassen's algorithm [26]. It shows that, for problems of size m = n = k = 2^r for some integer r, matrix-matrix multiplication can be computed in time $O(n^{\log_2 7}) \approx O(n^{2.807})$. Since Strassen proposed this, a succession of results further improved upon the exponent. In this discussion, we stick to the original result.
How can this be? For simplicity, assume m = n = k are all even. Partition
\[
C = \begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix},
\quad
A = \begin{pmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{pmatrix},
\quad
B = \begin{pmatrix} B_{00} & B_{01} \\ B_{10} & B_{11} \end{pmatrix},
\]
where Xij are n/2 × n/2 for X ∈ {C, A, B} and ij ∈ {00, 01, 10, 11}. Now that you understand how partitioned matrix-matrix multiplication works, you know that the following computations compute C := AB + C:

\[
\begin{array}{rcl}
C_{00} &=& (A_{00} B_{00} + A_{01} B_{10}) + C_{00} \\
C_{01} &=& (A_{00} B_{01} + A_{01} B_{11}) + C_{01} \\
C_{10} &=& (A_{10} B_{00} + A_{11} B_{10}) + C_{10} \\
C_{11} &=& (A_{10} B_{01} + A_{11} B_{11}) + C_{11}.
\end{array}
\]
Each of the eight matrix-matrix multiplications requires 2(n/2)³ = (1/4)n³ flops and hence the total cost is our usual 2n³ flops. Surprisingly, the following computations also compute C := AB + C:

\[
\begin{array}{lll}
M_0 = (A_{00} + A_{11})(B_{00} + B_{11}); & C_{00} \mathrel{+}= M_0; & C_{11} \mathrel{+}= M_0; \\
M_1 = (A_{10} + A_{11}) B_{00};           & C_{10} \mathrel{+}= M_1; & C_{11} \mathrel{-}= M_1; \\
M_2 = A_{00}(B_{01} - B_{11});            & C_{01} \mathrel{+}= M_2; & C_{11} \mathrel{+}= M_2; \\
M_3 = A_{11}(B_{10} - B_{00});            & C_{00} \mathrel{+}= M_3; & C_{10} \mathrel{+}= M_3; \\
M_4 = (A_{00} + A_{01}) B_{11};           & C_{01} \mathrel{+}= M_4; & C_{00} \mathrel{-}= M_4; \\
M_5 = (A_{10} - A_{00})(B_{00} + B_{01}); & C_{11} \mathrel{+}= M_5; & \\
M_6 = (A_{01} - A_{11})(B_{10} + B_{11}); & C_{00} \mathrel{+}= M_6. &
\end{array}
\]
If you count carefully, this requires 22 additions of n/2 × n/2 matrices and 7 multiplications with n/2 × n/2 matrices. Adding matrices together requires O(n²) flops, which are insignificant if n is large. Ignoring this, the cost now becomes
\[
7 \times 2 (n/2)^3 = \frac{7}{8}\, 2 n^3
\]
flops. The cost of the matrix-matrix multiplication is now 7/8 of what it was before!
But it gets better! Each of the matrix-matrix multiplications can themselves be computed via this scheme. If you do that, applying the idea at two levels, the cost is reduced to
\[
7 \times \left( 7 \times 2 (n/4)^3 \right) = \left( \frac{7}{8} \right)^2 2 n^3
\]
flops. How many times can we do that? If n = 2^r, we can halve the size of the matrices r = log2(n) times. If you do that, the cost becomes
\[
\left( \frac{7}{8} \right)^r 2 n^3 \text{ flops.}
\]
Now,
\[
\left( \frac{7}{8} \right)^{\log_2(n)} 2 n^3 = n^{\log_2(7/8)}\, 2 n^3 = 2\, n^{\log_2 7} \approx 2\, n^{2.807}.
\]
Here we ignored the cost of the additions. However, it can be analyzed that this recursive approach requires O(n^{2.807}) flops.
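For concreteness, here is a minimal sketch of one level of this scheme in C, for C := AB + C with n even and column-major storage. Everything in it (the function names, the plain triple loop standing in for the seven half-size products, the workspace handling) is an illustrative assumption rather than the practical implementation discussed later in Unit 3.5.4; its only purpose is to mirror the seven products M0, . . . , M6 and their accumulations listed above.

#include <stdlib.h>
#include <string.h>

#define EL( M, ld, i, j ) (M)[ (size_t)(j)*(ld) + (i) ]

/* C := A B + C with a plain triple loop (stand-in for a faster multiply). */
static void gemm_ref( int n, const double *A, int ldA,
                      const double *B, int ldB, double *C, int ldC )
{
  for ( int j=0; j<n; j++ )
    for ( int p=0; p<n; p++ )
      for ( int i=0; i<n; i++ )
        EL( C, ldC, i, j ) += EL( A, ldA, i, p ) * EL( B, ldB, p, j );
}

/* T := X + s*Y for h x h operands (Y may be NULL, in which case T := X). */
static void addsub( int h, const double *X, int ldX, double s,
                    const double *Y, int ldY, double *T )
{
  for ( int j=0; j<h; j++ )
    for ( int i=0; i<h; i++ )
      EL( T, h, i, j ) = EL( X, ldX, i, j ) + ( Y ? s * EL( Y, ldY, i, j ) : 0.0 );
}

/* Z := Z + s*M for an h x h quadrant Z of C. */
static void accum( int h, double s, const double *M, double *Z, int ldC )
{
  for ( int j=0; j<h; j++ )
    for ( int i=0; i<h; i++ )
      EL( Z, ldC, i, j ) += s * EL( M, h, i, j );
}

void Strassen_one_level( int n, const double *A, int ldA,
                         const double *B, int ldB, double *C, int ldC )
{
  int h = n / 2;                                     /* assumes n is even */
  const double *A00 = A,     *A01 = A + (size_t)h*ldA,
               *A10 = A + h, *A11 = A + (size_t)h*ldA + h;
  const double *B00 = B,     *B01 = B + (size_t)h*ldB,
               *B10 = B + h, *B11 = B + (size_t)h*ldB + h;
  double *C00 = C,     *C01 = C + (size_t)h*ldC,
         *C10 = C + h, *C11 = C + (size_t)h*ldC + h;

  double *S = malloc( (size_t)h*h*sizeof( double ) ),  /* A-operand  */
         *T = malloc( (size_t)h*h*sizeof( double ) ),  /* B-operand  */
         *M = malloc( (size_t)h*h*sizeof( double ) );  /* product    */

  /* One row per product, mirroring M0 through M6 in the text:
     S = Ax + sa*Ay, T = Bx + sb*By, M = S*T, then M is accumulated into
     one or two quadrants of C with signs sc0 and sc1. */
  struct { const double *Ax, *Ay; double sa;
           const double *Bx, *By; double sb;
           double *Cz0; double sc0; double *Cz1; double sc1; } op[7] = {
    { A00, A11, +1,  B00, B11, +1,  C00, +1, C11, +1 },   /* M0 */
    { A10, A11, +1,  B00, NULL, 0,  C10, +1, C11, -1 },   /* M1 */
    { A00, NULL, 0,  B01, B11, -1,  C01, +1, C11, +1 },   /* M2 */
    { A11, NULL, 0,  B10, B00, -1,  C00, +1, C10, +1 },   /* M3 */
    { A00, A01, +1,  B11, NULL, 0,  C01, +1, C00, -1 },   /* M4 */
    { A10, A00, -1,  B00, B01, +1,  C11, +1, NULL, 0 },   /* M5 */
    { A01, A11, -1,  B10, B11, +1,  C00, +1, NULL, 0 },   /* M6 */
  };

  for ( int r=0; r<7; r++ ){
    addsub( h, op[ r ].Ax, ldA, op[ r ].sa, op[ r ].Ay, ldA, S );
    addsub( h, op[ r ].Bx, ldB, op[ r ].sb, op[ r ].By, ldB, T );
    memset( M, 0, (size_t)h*h*sizeof( double ) );
    gemm_ref( h, S, h, T, h, M, h );
    accum( h, op[ r ].sc0, M, op[ r ].Cz0, ldC );
    if ( op[ r ].Cz1 ) accum( h, op[ r ].sc1, M, op[ r ].Cz1, ldC );
  }
  free( S ); free( T ); free( M );
}

Applying the scheme recursively (by calling a Strassen routine instead of gemm_ref for the half-size products) gives the O(n^{2.807}) cost discussed above.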

Remark 2.5.1 Learn more from the Strassen's algorithm entry on Wikipedia. In Unit 3.5.4 we will discuss how our insights support a practical implementation of Strassen's algorithm.

2.6 Wrap Up

2.6.1 Additional exercises

If you want to get more practice, you may want to repeat the exercises in this week with single precision arithmetic. Note: this sounds like a simple suggestion. However, it requires changes to essentially all programming assignments you have encountered so far. This is a lot of work, to be pursued by only those who are really interested in the details of this course.

2.6.2 Summary

2.6.2.1 The week in pictures

Figure 2.6.1: A simple model of the memory hierarchy, with registers and main memory.

Figure 2.6.2: A simple blocking for registers, where micro-tiles of C are loaded into registers.

Figure 2.6.3: The update of a micro-tile with a sequence of rank-1 updates.

Figure 2.6.4: Mapping the micro-kernel to registers.

2.6.2.2 Useful intrinsic functions

From Intel's Intrinsics Reference Guide:

• __m256d _mm256_loadu_pd (double const * mem_addr)

Description Load 256-bits (composed of 4 packed double-precision (64-bit) floating-point elements) from memory into dst (output). mem_addr does not need to be aligned on any particular boundary.

• __m256d _mm256_broadcast_sd (double const * mem_addr)

Description Broadcast a double-precision (64-bit) floating-point element from memory to all elements of dst (output).

• __m256d _mm256_fmadd_pd (__m256d a, __m256d b, __m256d c)

Description Multiply packed double-precision (64-bit) floating-point elements in a and b, add the intermediate result to packed elements in c, and store the results in dst (output).

Week 3

Pushing the Limits

3.1 Opening Remarks

3.1.1 Launch

Remark 3.1.1 As you may have noticed, some of the programming assignments are still in flux. This means

• You will want to do

git stash save
git pull
git stash pop

in your LAFF-On-PfHP directory.

• You will want to upload the .mlx files from LAFF-On-PfHP/Assignments/Week3/C/data/ to the corresponding folder of Matlab Online.

YouTube: https://www.youtube.com/watch?v=kOBCe3-B1BI The inconvenient truth is that floating point computations can be performed very fast while bringing data in from main memory is relatively slow. How slow? On a typical architecture it takes two orders of magnitude more time to bring a floating point number in from main memory than it takes to compute with it. The reason why main memory is slow is relatively simple: there is not enough room on a chip for the large memories that we are accustomed to and hence they are off chip. The mere distance creates a latency for retrieving the data. This could then be offset by retrieving a lot of data simultaneously, increasing bandwidth. Unfortunately there are inherent bandwidth limitations:

there are only so many pins that can connect the central processing unit (CPU) with main memory.

Figure 3.1.2: Illustration of the memory hierarchy.

To overcome this limitation, a modern processor has a hierarchy of memories. We have already encountered the two extremes: registers and main memory. In between, there are smaller but faster cache memories. These cache memories are on-chip and hence do not carry the same latency as does main memory and also can achieve greater bandwidth. The hierarchical nature of these memories is often depicted as a pyramid as illustrated in Figure 3.1.2.

YouTube: https://www.youtube.com/watch?v=8u2wzTPZLtI To put things in perspective: We have discussed that a core on the kind of modern CPU we target in this course has sixteen vector registers that can store 4 double precision floating point numbers (doubles) each, for a total of 64 doubles. Built into the CPU it has a 32Kbytes level-1 cache (L1 cache) that can thus store 4,096 doubles. Somewhat further it has a 256Kbytes level-2 cache (L2 cache) that can store 32,768 doubles. Further away yet, but still on chip, it has an 8Mbytes level-3 cache (L3 cache) that can hold 1,048,576 doubles.

Remark 3.1.3 In further discussion, we will pretend that one can place data in a specific cache and keep it there for the duration of computations. In fact, caches retain data using some cache replacement policy that evicts data that has not been recently used. By carefully ordering computations, we can encourage data to remain in cache, which is what happens in practice. You may want to read up on cache replacement policies on Wikipedia: https://en.wikipedia.org/wiki/Cache_replacement_policies.

Homework 3.1.1.1 Since we compute with (sub)matrices, it is useful to have some idea of how big of a matrix one can fit in each of the layers of cache. Assuming each element in the matrix is a double precision number, which requires 8 bytes, complete the following table:

Layer          Size              Largest n × n matrix
Registers      16 × 4 doubles    n =
L1 cache       32 Kbytes         n =
L2 cache       256 Kbytes        n =
L3 cache       8 Mbytes          n =
Main Memory    16 Gbytes         n =

Solution.

Layer          Size              Largest n × n matrix
Registers      16 × 4 doubles    n = √64 = 8
L1 cache       32 Kbytes         n = √(32 × 1024/8) = 64
L2 cache       256 Kbytes        n = √(256 × 1024/8) ≈ 181
L3 cache       8 Mbytes          n = √(8 × 1024 × 1024/8) = 1024
Main memory    16 Gbytes         n = √(16 × 1024 × 1024 × 1024/8) ≈ 46,341
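A quick sanity check of these numbers can be scripted; the little program below (an illustration only, with the layer names and sizes taken from the table above) computes the approximate largest n such that an n × n matrix of doubles fits in each layer.

#include <stdio.h>
#include <math.h>

int main(void)
{
  /* Layer names and sizes follow the table above. */
  struct { const char *layer; double bytes; } layers[] = {
    { "Registers (16 x 4 doubles)", 16.0 * 4 * 8 },
    { "L1 cache (32 KBytes)",       32.0 * 1024 },
    { "L2 cache (256 KBytes)",      256.0 * 1024 },
    { "L3 cache (8 MBytes)",        8.0 * 1024 * 1024 },
    { "Main memory (16 GBytes)",    16.0 * 1024 * 1024 * 1024 },
  };

  for ( int i=0; i<5; i++ )
    /* n is (approximately) the square root of the number of doubles. */
    printf( "%-28s n = %.0f\n", layers[ i ].layer,
            sqrt( layers[ i ].bytes / 8.0 ) );
  return 0;
}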

YouTube: https://www.youtube.com/watch?v=-nPXW8NiPmM

3.1.2 Outline Week 3

• 3.1 Opening Remarks

  ◦ 3.1.1 Launch
  ◦ 3.1.2 Outline Week 3
  ◦ 3.1.3 What you will learn

• 3.2 Leveraging the Caches

  ◦ 3.2.1 Adding cache memory into the mix
  ◦ 3.2.2 Streaming submatrices of C and B
  ◦ 3.2.3 Which cache to target?
  ◦ 3.2.4 Blocking for the L1 and L2 caches
  ◦ 3.2.5 Blocking for the L1, L2, and L3 caches
  ◦ 3.2.6 Translating into code

• 3.3 Packing

  ◦ 3.3.1 Stride matters
  ◦ 3.3.2 Packing blocks of A and panels of B
  ◦ 3.3.3 Implementation: packing row panel Bp,j
  ◦ 3.3.4 Implementation: packing block Ai,p
  ◦ 3.3.5 Implementation: five loops around the micro-kernel, with packing
  ◦ 3.3.6 Micro-kernel with packed data

• 3.4 Further Tricks of the Trade

  ◦ 3.4.1 Alignment
  ◦ 3.4.2 Avoiding repeated memory allocations
  ◦ 3.4.3 Play with the block sizes
  ◦ 3.4.4 Broadcasting elements of A and loading elements of B
  ◦ 3.4.5 Loop unrolling
  ◦ 3.4.6 Prefetching
  ◦ 3.4.7 Using in-lined assembly code

• 3.5 Enrichments

  ◦ 3.5.1 Goto's algorithm and BLIS
  ◦ 3.5.2 How to choose the blocking parameters
  ◦ 3.5.3 Alternatives to Goto's algorithm
  ◦ 3.5.4 Practical implementation of Strassen's algorithm

• 3.6 Wrap Up

  ◦ 3.6.1 Additional exercises
  ◦ 3.6.2 Summary

3.1.3 What you will learn

In this week, we discover the importance of amortizing the cost of moving data between memory layers. Upon completion of this week, we will be able to

• Identify layers in the memory hierarchy.

• Orchestrate matrix-matrix multiplication in terms of computation with submatrices (blocks) so as to improve the ratio of computation to memory operations.

• Rearrange data to improve how memory is contiguously accessed.

• Organize loops so data is effectively reused at multiple levels of the memory hierarchy.

• Analyze the cost of moving data between memory layers.

• Improve performance by experimenting with different blocking parameters.

The enrichments introduce us to

• The origins of the algorithm that you developed and how it is incorporated into a widely used open source software library, BLIS.

• Analytical methods for determining optimal blocking sizes.

• Alternative algorithms that may become important when the ratio between the rate at which a processor computes and the bandwidth to memory further deteriorates.

• Practical techniques, based on the approach exposed in this course, for implementing Strassen's algorithm.

3.2 Leveraging the Caches

3.2.1 Adding cache memory into the mix

YouTube: https://www.youtube.com/watch?v=Kaq2WgLChv4 We now refine our model of the processor slightly, adding one layer of cache memory into the mix.

• Our processor has only one core.

• That core has three levels of memory: registers, a cache memory, and main memory.

• Moving data between the cache and registers takes time βC↔R per double, while moving it between main memory and the cache takes time βM↔C.

• The registers can hold 64 doubles.

• The cache memory can hold one or more smallish matrices.

• Performing a floating point operation (multiply or add) with data in cache takes time γC .

• Data movement and computation cannot overlap. (In practice, it can. For now, we keep things simple.)

The idea now is to figure out how to block the matrices into submatrices and then compute while these submatrices are in cache to avoid having to access memory more than necessary.
A naive approach partitions C, A, and B into (roughly) square blocks:
\[
C = \begin{pmatrix}
C_{0,0} & C_{0,1} & \cdots & C_{0,N-1} \\
C_{1,0} & C_{1,1} & \cdots & C_{1,N-1} \\
\vdots & \vdots & & \vdots \\
C_{M-1,0} & C_{M-1,1} & \cdots & C_{M-1,N-1}
\end{pmatrix},
\]
\[
A = \begin{pmatrix}
A_{0,0} & A_{0,1} & \cdots & A_{0,K-1} \\
A_{1,0} & A_{1,1} & \cdots & A_{1,K-1} \\
\vdots & \vdots & & \vdots \\
A_{M-1,0} & A_{M-1,1} & \cdots & A_{M-1,K-1}
\end{pmatrix},
\]
and
\[
B = \begin{pmatrix}
B_{0,0} & B_{0,1} & \cdots & B_{0,N-1} \\
B_{1,0} & B_{1,1} & \cdots & B_{1,N-1} \\
\vdots & \vdots & & \vdots \\
B_{K-1,0} & B_{K-1,1} & \cdots & B_{K-1,N-1}
\end{pmatrix},
\]
where Ci,j is mC × nC, Ai,p is mC × kC, and Bp,j is kC × nC. Then

\[
C_{i,j} := \sum_{p=0}^{K-1} A_{i,p} B_{p,j} + C_{i,j},
\]
which can be written as the triple-nested loop

for i := 0, . . . , M − 1
  for j := 0, . . . , N − 1
    for p := 0, . . . , K − 1
      Ci,j := Ai,pBp,j + Ci,j
    end
  end
end

This is one of 3! = 6 possible loop orderings. If we choose mC, nC, and kC such that Ci,j, Ai,p, and Bp,j all fit in the cache, then we meet our conditions. We can then compute Ci,j := Ai,pBp,j + Ci,j by "bringing these blocks into cache" and computing with them before writing out the result, as before. The difference here is that while one can explicitly load registers, the movement of data into caches is merely encouraged by careful ordering of the computation, since replacement of data in cache is handled by the hardware, which has some cache replacement policy under which, roughly, "least recently used" data gets evicted.

Homework 3.2.1.1 For reasons that will become clearer later, we are going to assume that "the cache" in our discussion is the L2 cache, which for the CPUs we currently target is of size 256Kbytes. If we assume all three (square) matrices fit in that cache during the computation, what size should they be?
In our discussions, 12 becomes a magic number because it is a multiple of

2, 3, 4, and 6, which allows us to flexibly play with a convenient range of block sizes. Therefore we should pick the size of the block to be the largest multiple of 12 less than that number. What is it? Later we will further tune this parameter. Solution.

• The number of doubles that can be stored in 256KBytes is

256 × 1024/8 = 32 × 1024

• Each of the three equally sized square matrices can contain (32/3) × 1024 doubles.

• Each square matrix can hence be at most
\[
\sqrt{\tfrac{32}{3} \times 1024} \times \sqrt{\tfrac{32}{3} \times 1024} \approx 104 \times 104.
\]

• The largest multiple of 12 smaller than 104 is 96.

#define MC 96
#define NC 96
#define KC 96

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  if ( m % MR != 0 || MC % MR != 0 ){
    printf( "m and MC must be multiples of MR\n" );
    exit( 0 );
  }
  if ( n % NR != 0 || NC % NR != 0 ){
    printf( "n and NC must be multiples of NR\n" );
    exit( 0 );
  }

  for ( int i=0; i<m; i+=MC )
    /* ... the j and p loops over blocks and the call to the Week 2 kernel
       follow; see the sketch after Figure 3.2.1 and the file named in
       Homework 3.2.1.2. */

Figure 3.2.1: Triple loop around Ci,j := Ai,pBp,j + Ci,j.
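To make the structure concrete, here is a minimal sketch of such a triple loop around the Week 2 kernel. The indexing macros, the kernel prototype, and the assumption that m, n, and k are multiples of MC, NC, and KC are illustrative simplifications; the actual file, Assignments/Week3/C/Gemm_IJP_JI_MRxNRKernel.c, handles the details.

#define MC 96
#define NC 96
#define KC 96

#define alpha( i,j ) A[ (j)*ldA + (i) ]   /* column-major indexing into A */
#define beta( i,j )  B[ (j)*ldB + (i) ]   /* column-major indexing into B */
#define gamma( i,j ) C[ (j)*ldC + (i) ]   /* column-major indexing into C */

/* The Week 2 routine: JI double loop around the mR x nR micro-kernel. */
void Gemm_JI_MRxNRKernel( int m, int n, int k, double *A, int ldA,
                          double *B, int ldB, double *C, int ldC );

/* IJP ordering of loops over mC x nC blocks of C, mC x kC blocks of A, and
   kC x nC blocks of B.  Assumes m, n, k are multiples of MC, NC, KC. */
void MyGemm_IJP( int m, int n, int k, double *A, int ldA,
                 double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i+=MC )
    for ( int j=0; j<n; j+=NC )
      for ( int p=0; p<k; p+=KC )
        /* C_{i,j} := A_{i,p} B_{p,j} + C_{i,j}, with blocks sized so that
           they can remain in cache during the kernel call. */
        Gemm_JI_MRxNRKernel( MC, NC, KC, &alpha( i,p ), ldA,
                             &beta( p,j ), ldB, &gamma( i,j ), ldC );
}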

Homework 3.2.1.2 In Figure 3.2.1, we give an IJP loop ordering around the computation of Ci,j := Ai,pBp,j + Ci,j, which itself is implemented by Gemm_JI_4x4Kernel. It can be found in Assignments/Week3/C/Gemm_IJP_JI_MRxNRKernel.c. Copy Gemm_4x4Kernel.c from Week 2 into Assignments/Week3/C/ and then execute

make IJP_JI_4x4Kernel

in that directory. The resulting performance can be viewed with Live Script Assignments/Week3/C/data/Plot_XYZ_JI_MRxNRKernel.mlx.

Solution. On Robert’s laptop:

Notice that now performance is (mostly) maintained as the problem size increases.

Homework 3.2.1.3 You can similarly link Gemm_IJP_JI_MRxNRKernel.c with your favorite micro-kernel by copying it from Week 2 into this directory and executing

make IJP_JI_??x??Kernel

where ??x?? reflects the mR × nR you choose. The resulting performance can be viewed with Live Script Assignments/Week3/C/data/Plot_XYZ_JI_MRxNRKernel.mlx, by making appropriate changes to that Live Script. The Live Script assumes ??x?? to be 12x4. If you choose (for example) 8x6 instead, you may want to do a global "search and replace." You may then want to save the result into another file.

Solution. On Robert’s laptop, for mR × nR = 12 × 4:

Remark 3.2.2 Throughout the remainder of this course, we often choose to proceed with mR × nR = 12 × 4, as reflected in the Makefile and the Live Script. If you pick mR and nR differently, then you will have to make the appropriate changes to the Makefile and the Live Script.

Homework 3.2.1.4 Create all six loop orderings by copying Assignments/Week3/C/Gemm_IJP_JI_MRxNRKernel.c into

Gemm_JIP_JI_MRxNRKernel.c
Gemm_IPJ_JI_MRxNRKernel.c
Gemm_PIJ_JI_MRxNRKernel.c
Gemm_PJI_JI_MRxNRKernel.c
Gemm_JPI_JI_MRxNRKernel.c

(or your choice of mR and nR), reordering the loops as indicated by the XYZ. Collect performance data by executing make XYZ_JI_12x4Kernel for XYZ ∈ {IPJ, JIP, JPI, PIJ, PJI}. (If you don't want to do them all, implement at least Gemm_PIJ_JI_MRxNRKernel.c.) The resulting performance can be viewed with Live Script Assignments/Week3/C/data/Plot_XYZ_JI_MRxNRKernel.mlx.
Solution.

• Assignments/Week3/Answers/Gemm_IPJ_JI_MRxNRKernel.c

• Assignments/Week3/Answers/Gemm_JIP_JI_MRxNRKernel.c

• Assignments/Week3/Answers/Gemm_JPI_JI_MRxNRKernel.c

• Assignments/Week3/Answers/Gemm_PIJ_JI_MRxNRKernel.c

• Assignments/Week3/Answers/Gemm_PJI_JI_MRxNRKernel.c

On Robert’s laptop, for mR × nR = 12 × 4:

Things are looking up!

YouTube: https://www.youtube.com/watch?v=SwCYsVlo0lo We can analyze the cost of moving submatrices Ci,j, Ai,p, and Bp,j into the cache and computing with them:

\[
\underbrace{m_C n_C \,\beta_{C\leftrightarrow M}}_{\text{load } C_{i,j}}
+ \underbrace{m_C k_C \,\beta_{C\leftrightarrow M}}_{\text{load } A_{i,p}}
+ \underbrace{k_C n_C \,\beta_{C\leftrightarrow M}}_{\text{load } B_{p,j}}
+ \underbrace{2 m_C n_C k_C \,\gamma_C}_{\text{update } C_{i,j} \mathrel{+}:= A_{i,p} B_{p,j}}
+ \underbrace{m_C n_C \,\beta_{C\leftrightarrow M}}_{\text{store } C_{i,j}}
\]
which equals

\[
\underbrace{(2 m_C n_C + m_C k_C + k_C n_C)\,\beta_{C\leftrightarrow M}}_{\text{data movement overhead}}
+ \underbrace{2 m_C n_C k_C \,\gamma_C}_{\text{useful computation}}.
\]

Hence, the ratio of time spent in useful computation and time spent in moving data between main memory and cache is given by

\[
\frac{2 m_C n_C k_C \,\gamma_C}{(2 m_C n_C + m_C k_C + k_C n_C)\,\beta_{C\leftrightarrow M}}.
\]

Another way of viewing this is that the ratio between the number of floating point operations and number of doubles loaded/stored is given by

\[
\frac{2 m_C n_C k_C}{2 m_C n_C + m_C k_C + k_C n_C}.
\]

If, for simplicity, we take mC = nC = kC = b then we get that

\[
\frac{2 m_C n_C k_C}{2 m_C n_C + m_C k_C + k_C n_C} = \frac{2 b^3}{4 b^2} = \frac{b}{2}.
\]
Thus, the larger b, the better the cost of moving data between main memory and cache is amortized over useful computation.
Above, we analyzed the cost of computing with and moving the data back and forth between the main memory and the cache for computation with three blocks (a mC × nC block of C, a mC × kC block of A, and a kC × nC block of B). If we analyze the cost of naively doing so with all the blocks, we get
\[
\frac{m}{m_C}\,\frac{n}{n_C}\,\frac{k}{k_C}
\left(
\underbrace{(2 m_C n_C + m_C k_C + k_C n_C)\,\beta_{C\leftrightarrow M}}_{\text{data movement overhead}}
+ \underbrace{2 m_C n_C k_C \,\gamma_C}_{\text{useful computation}}
\right).
\]

What this means is that the effective ratio is the same if we consider the entire computation C := AB + C:

\[
\frac{\frac{m}{m_C}\frac{n}{n_C}\frac{k}{k_C}\,(2 m_C n_C + m_C k_C + k_C n_C)\,\beta_{C\leftrightarrow M}}
     {\frac{m}{m_C}\frac{n}{n_C}\frac{k}{k_C}\, 2 m_C n_C k_C \,\gamma_C}
= \frac{(2 m_C n_C + m_C k_C + k_C n_C)\,\beta_{C\leftrightarrow M}}{2 m_C n_C k_C \,\gamma_C}.
\]
Remark 3.2.3
• Retrieving a double from main memory into cache can easily take 100 times more time than performing a floating point operation does.

• In practice the movement of the data can often be overlapped with computation (this is known as prefetching). However, clearly only so much can be overlapped, so it is still important to make the ratio favorable.

3.2.2 Streaming submatrices of C and B

YouTube: https://www.youtube.com/watch?v=W-oDAQYUExE We illustrate the execution of Gemm_JI_4x4Kernel with one set of three submatrices Ci,j, Ai,p, and Bp,j in Figure 3.2.4 for the case where mR = nR = 4 and mC = nC = kC = 4 × mR.

[Illustration panels for J = 0, J = 1, J = 2, and J = 3.]

Download PowerPoint source for illustration. PDF version. Figure 3.2.4: Illustration of computation in the cache with one block from each of A, B, and C.

Notice that

• The mR × nR micro-tile of Ci,j is not reused again once it has been updated. This means that at any given time only one such matrix needs to be in the cache memory (as well as registers).

• Similarly, a micro-panel of Bp,j is not reused and hence only one such

panel needs to be in cache. It is submatrix Ai,p that must remain in cache during the entire computation.

By not keeping the entire submatrices Ci,j and Bp,j in the cache memory, the size of the submatrix Ai,p can be increased, which can be expected to improve performance. We will refer to this as "streaming" micro-tiles of Ci,j and micro-panels of Bp,j, while keeping all of block Ai,p in cache.

Remark 3.2.5 We could have chosen a different order of the loops, and then we may have concluded that submatrix Bp,j needs to stay in cache while streaming a micro-tile of C and small panels of A. Obviously, there is a symmetry here and hence we will just focus on the case where Ai,p is kept in cache.

1 → 5 → 9 → 13 → 2 → 6 → 10 → 14 → 3 → 7 → 11 → 15 → 4 → 8 → 12 → 16

Solution. Here is the PowerPoint source for one solution: Download Power- Point source.

YouTube: https://www.youtube.com/watch?v=CAqIC2ixDrM The second observation is that now that Ci,j and Bp,j are being streamed, there is no need for those submatrices to be square. Furthermore, for

Gemm_PIJ_JI_12x4Kernel.c and

Gemm_IPJ_JI_12x4Kernel.c, the larger nC, the more effectively the cost of bringing Ai,p into cache is amortized over computation: we should pick nC = n. (Later, we will reintroduce nC.) Another way of thinking about this is that if we choose the PIJ loop around the JI ordering around the micro-kernel (in other words, if we consider the Gemm_PIJ_JI_12x4Kernel implementation), then we notice that the inner loop of PIJ (indexed with J) matches the outer loop of the JI double loop, and hence those two loops can be combined, leaving us with an implementation that is a double loop (PI) around the double loop JI.

Homework 3.2.2.2 Copy Assignments/Week3/C/Gemm_PIJ_JI_MRxNRKernel.c into Gemm_PI_JI_MRxNRKernel.c and remove the loop indexed with j from MyGemm, making the appropriate changes to the call to the routine that implements the micro-kernel. Collect performance data by executing make PI_JI_12x4Kernel in that directory. The resulting performance can be viewed with Live Script Assignments/Week3/C/data/Plot_XY_JI_MRxNRKernel.mlx.
Solution. Assignments/Week3/Answers/Gemm_PI_JI_MRxNRKernel.c. On Robert's laptop, for mR × nR = 12 × 4:

By removing that loop, we are simplifying the code without it taking a performance hit. By combining the two loops we reduce the number of times that the blocks Ai,p need to be read from memory:

\[
\underbrace{m_C n \,\beta_{C\leftrightarrow M}}_{\text{load } C_{i,j}}
+ \underbrace{m_C k_C \,\beta_{C\leftrightarrow M}}_{\text{load } A_{i,p}}
+ \underbrace{k_C n \,\beta_{C\leftrightarrow M}}_{\text{load } B_{p,j}}
+ \underbrace{2 m_C n k_C \,\gamma_C}_{\text{update } C_{i} \mathrel{+}:= A_{i,p} B_{p}}
+ \underbrace{m_C n \,\beta_{C\leftrightarrow M}}_{\text{store } C_{i,j}}
\]
which equals

\[
\underbrace{(2 m_C n + m_C k_C + k_C n)\,\beta_{C\leftrightarrow M}}_{\text{data movement overhead}}
+ \underbrace{2 m_C n k_C \,\gamma_C}_{\text{useful computation}}.
\]
Hence, the ratio of time spent in useful computation and time spent in moving data between main memory and cache is given by
\[
\frac{2 m_C n k_C \,\gamma_C}{(2 m_C n + m_C k_C + k_C n)\,\beta_{C\leftrightarrow M}}.
\]

Another way of viewing this is that the ratio between the number of floating point operations and the number of doubles loaded and stored is given by
\[
\frac{2 m_C n k_C}{2 m_C n + m_C k_C + k_C n}.
\]

If, for simplicity, we take mC = kC = b then we get that
\[
\frac{2 m_C n k_C}{2 m_C n + m_C k_C + k_C n} = \frac{2 b^2 n}{3 b n + b^2} \approx \frac{2 b^2 n}{3 b n} \approx \frac{2 b}{3},
\]
if b is much smaller than n. This is a slight improvement over the analysis from the last unit.
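To see how the two estimates compare numerically, the throwaway program below (an illustration, not part of the course materials) tabulates the flops-per-memop ratios derived above, b/2 for three square b × b blocks in cache and roughly 2b/3 when micro-tiles of C and micro-panels of B are streamed, for a few block sizes b and an n much larger than b.

#include <stdio.h>

int main(void)
{
  double n = 10000.0;      /* n >> b, so the streaming estimate applies */
  for ( double b=12; b<=96; b+=12 ){
    double square = ( 2.0*b*b*b ) / ( 2.0*b*b + b*b + b*b );     /* = b/2   */
    double stream = ( 2.0*b*n*b ) / ( 2.0*b*n + b*b + b*n );     /* ~ 2b/3  */
    printf( "b = %3.0f   square blocks: %6.2f   streaming: %6.2f\n",
            b, square, stream );
  }
  return 0;
}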

• An mR × nR submatrix of C that is being updated we will call a micro-tile.

• The mR × kC submatrix of A and kC × nR submatrix of B we will call micro-panels.

• The routine that updates a micro-tile by multiplying two micro-panels we will call the micro-kernel. Remark 3.2.7 The various block sizes we will further explore are

• mR × nR: the sizes of the micro-tile of C that is kept in the registers.

• mC × kC : the sizes of the block of A that is kept in the L2 cache.

3.2.3 Which cache to target?

YouTube: https://www.youtube.com/watch?v=lm8hG3LpQmo

Homework 3.2.3.1 In Assignments/Week3/C, execute

make PI_JI_12x4_MCxKC

and view the performance results with Live Script Assignments/Week3/C/data/Plot_MC_KC_Performance.mlx. This experiment tries many different choices for mC and kC, and presents them as a scatter graph so that the optimal choice can be determined and visualized. It will take a long time to execute. Go take a nap, catch a movie, and have some dinner! If you have a relatively old processor, you may even want to run it overnight. Try not to do anything else on your computer, since that may affect the performance of an experiment. We went as far as to place something under the laptop to lift it off the table, to improve air circulation. You will notice that the computer will start making a lot of noise, as fans kick in to try to cool the core.

Solution. The last homework on Robert's laptop yields the performance in Figure 3.2.8.

Figure 3.2.8: Performance when mR = 12 and nR = 4 and mC and kC are varied. The best performance is empirically determined by searching the results. We notice that the submatrix Ai,p is sized to fit in the L2 cache. As we apply additional optimizations, the optimal choice for mC and kC will change. Notice that on Robert’s laptop, the optimal choice varies greatly from one experiment to the next. You may observe the same level of variability on your computer.

YouTube: https://www.youtube.com/watch?v=aw7OqLVGSUA To summarize, our insights suggest

1. Bring an mC × kC submatrix of A into the cache, at a cost of mC × kC βM↔C, or mC × kC memory operations (memops). (It is the order of the loops and instructions that encourages the submatrix of A to stay in cache.) This cost is amortized over 2 mC n kC flops, for a ratio of 2 mC n kC / (mC kC) = 2n. The larger n, the better.

2. The cost of reading a kC × nR submatrix of B is amortized over 2 mC nR kC flops, for a ratio of 2 mC nR kC / (kC nR) = 2 mC. Obviously, the greater mC, the better.

3. The cost of reading and writing an mR × nR submatrix of C is now amortized over 2mRnRkC flops, for a ratio of 2mRnRkC /(2mRnR) = kC . Obviously, the greater kC , the better.

Item 2 and Item 3 suggest that the mC × kC submatrix Ai,p be squarish. If we revisit the performance data plotted in Figure 3.2.8, we notice that matrix Ai,p fits in part of the L2 cache, but is too big for the L1 cache. What this means is that the cost of bringing elements of A into registers from the L2 cache is low enough that computation can offset that cost. Since the L2 cache is larger than the L1 cache, this benefits performance, since mC and kC can be larger.
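As a rough, hypothetical illustration of how such a choice might be sized (the 256 KB L2 cache and the decision to devote half of it to Ai,p are assumptions made only for this sketch, not values from the course):

\[
m_C\, k_C \cdot 8~\text{bytes} \;\le\; \tfrac{1}{2}\cdot 256~\text{KB}
\quad\Rightarrow\quad
m_C\, k_C \le 16384,
\]

which, for a squarish block, suggests something in the neighborhood of mC = kC = 128, to be refined empirically by experiments like the one in Figure 3.2.8.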

3.2.4 Blocking for the L1 and L2 caches

YouTube: https://www.youtube.com/watch?v=pBtwu3E9lCQ

3.2.5 Blocking for the L1, L2, and L3 caches

YouTube: https://www.youtube.com/watch?v=U-tWXMAkmIs

The blocking for the various memory layers is captured in the following figure:

Figure 3.2.9: Illustration of the five loops around the micro-kernel. PowerPoint source for figure.

YouTube: https://www.youtube.com/watch?v=wfaXtM376iQ PowerPoint source used in video.

Homework 3.2.5.1 Using our prior naming convention, which of the imple- mentations • Gemm_IJP_JI_MRxNRKernel.c

• Gemm_JPI_JI_MRxNRKernel.c

• Gemm_PJI_JI_MRxNRKernel.c

best captures the loop structure illustrated in Figure 3.2.9?

Answer. Gemm_JPI_JI_MRxNRKernel.c

3.2.6 Translating into code

We always advocate, when it does not substantially impede performance, instantiating an algorithm in code in a way that closely resembles how one explains it. In this spirit, the algorithm described in Figure 3.2.9 can be coded by making each loop (box) in the figure a separate routine. An outline of how this might be accomplished is illustrated next, as well as in file Assignments/Week3/C/Gemm_Five_Loops_MRxNRKernel.c. You are asked to complete this code in the next homework.

void LoopFive( int m, int n, int k, double *A, int ldA,
               double *B, int ldB, double *C, int ldC )
{
  for ( int j=0; j<n; j+=NC ) {
    int jb = min( NC, n-j );    /* Last loop may not involve a full block */

    LoopFour( m, jb, k, A, ldA, &beta( ,  ), ldB, &gamma( ,  ), ldC );
  }
}

void LoopFour( int m, int n, int k, double *A, int ldA,
               double *B, int ldB, double *C, int ldC )
{
  for ( int p=0; p<k; p+=KC ) {
    int pb = min( KC, k-p );    /* Last loop may not involve a full block */

    LoopThree( m, n, pb, &alpha( ,  ), ldA, &beta( ,  ), ldB, C, ldC );
  }
}

void LoopThree( int m, int n, int k, double *A, int ldA,
                double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i+=MC ) {
    int ib = min( MC, m-i );    /* Last loop may not involve a full block */

    LoopTwo( ib, n, k, &alpha( ,  ), ldA, B, ldB, &gamma( ,  ), ldC );
  }
}

void LoopTwo( int m, int n, int k, double *A, int ldA,
              double *B, int ldB, double *C, int ldC )
{
  for ( int j=0; j<n; j+=NR ) {
    int jb = min( NR, n-j );

    LoopOne( m, jb, k, A, ldA, &beta( ,  ), ldB, &gamma( ,  ), ldC );
  }
}

void LoopOne( int m, int n, int k, double *A, int ldA,
              double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i+=MR ) {
    int ib = min( MR, m-i );

    Gemm_MRxNRKernel( k, &alpha( ,  ), ldA, B, ldB, &gamma( ,  ), ldC );
  }
}

Homework 3.2.6.1 Complete the code in Assignments/Week3/C/Gemm_Five_Loops_MRxNRKernel.c and execute it with make Five_Loops_4x4Kernel make Five_Loops_12x4Kernel

View the performance with Live Script Assignments/Week3/C/data/Plot_Five_Loops.mlx. Notice that these link the routines Gemm_4x4Kernel.c and Gemm_12x4Kernel.c that you completed in Homework 2.4.3.1 and Homework 2.4.4.4. You may want to play with the different cache block sizes (MB, NB, and KB) in the Makefile. In particular, you may draw inspiration from Figure 3.2.8. Pick them to be multiples of 12 to keep things simple. However, we are still converging to a high-performance implementation, and it is still a bit premature to try to pick the optimal parameters (which will change as we optimize further).

Solution. Assignments/Week3/Answers/Gemm_Five_Loops_MRxNRKernel.c. On Robert's laptop:

YouTube: https://www.youtube.com/watch?v=HUbbR2HTrHY

3.3 Packing

3.3.1 Stride matters

YouTube: https://www.youtube.com/watch?v=cEDMRH928CA

Homework 3.3.1.1 One would think that once one introduces blocking, the higher performance attained for small matrix sizes will be maintained as the problem size increases. This is not what you observed: there is still a decrease in performance. To investigate what is going on, copy the driver routine in Assignments/Week3/C/driver.c into driver_ldim.c and in that new file change the line ldA = ldB = ldC = size; to ldA = ldB = ldC = last;

This sets the leading dimension of matrices A, B, and C to the largest problem size to be tested, for all experiments. Collect performance data by executing

make Five_Loops_12x4Kernel_ldim

in that directory. The resulting performance can be viewed with Live Script Assignments/Week3/C/data/Plot_Five_Loops.mlx.

Solution. Assignments/Week3/Answers/driver_ldim.c. On Robert's laptop:

YouTube: https://www.youtube.com/watch?v=zBKo7AvF7-c

What is going on here? Modern processors incorporate a mechanism known as "hardware prefetching" that speculatively preloads data into caches based on observed access patterns, so that this loading is hopefully overlapped with computation. Processors also organize memory in terms of pages, for reasons that have to do with virtual memory, details of which go beyond the scope of this course. For our purposes, a page is a contiguous block of memory of 4 Kbytes (4096 bytes or 512 doubles). Prefetching only occurs within a page. Since the leading dimension is now large, when the computation moves from column to column in a matrix, data is not prefetched.

The bottom line: early in the course we discussed that marching contiguously through memory is a good thing. While we claim that we do block for the caches, in practice the data is typically not contiguously stored and hence we can't really control how the data is brought into caches, where it exists at a particular time in the computation, and whether prefetching is activated. Next, we fix this through packing of data at strategic places in the nested loops.

Learn more:

• Memory page: https://en.wikipedia.org/wiki/Page_(computer_memory).

• Cache prefetching: https://en.wikipedia.org/wiki/Cache_prefetching.

3.3.2 Packing blocks of A and panels of B

YouTube: https://www.youtube.com/watch?v=UdBix7fYVQg

The next step towards optimizing the micro-kernel exploits that computing with contiguous data (accessing data with a "stride one" access pattern) improves performance. The fact that the mR × nR micro-tile of C is not in contiguous memory is not particularly important. The cost of bringing it into the vector registers from some layer in the memory is mostly inconsequential because a lot of computation is performed before it is written back out. It is the repeated accessing of the elements of A and B that can benefit from stride one access.

Two successive rank-1 updates of the micro-tile can be given by

\[
\begin{pmatrix}
\gamma_{0,0} & \gamma_{0,1} & \gamma_{0,2} & \gamma_{0,3} \\
\gamma_{1,0} & \gamma_{1,1} & \gamma_{1,2} & \gamma_{1,3} \\
\gamma_{2,0} & \gamma_{2,1} & \gamma_{2,2} & \gamma_{2,3} \\
\gamma_{3,0} & \gamma_{3,1} & \gamma_{3,2} & \gamma_{3,3}
\end{pmatrix}
+:=
\begin{pmatrix}
\alpha_{0,p} \\ \alpha_{1,p} \\ \alpha_{2,p} \\ \alpha_{3,p}
\end{pmatrix}
\begin{pmatrix}
\beta_{p,0} & \beta_{p,1} & \beta_{p,2} & \beta_{p,3}
\end{pmatrix}
+
\begin{pmatrix}
\alpha_{0,p+1} \\ \alpha_{1,p+1} \\ \alpha_{2,p+1} \\ \alpha_{3,p+1}
\end{pmatrix}
\begin{pmatrix}
\beta_{p+1,0} & \beta_{p+1,1} & \beta_{p+1,2} & \beta_{p+1,3}
\end{pmatrix}.
\]

Since A and B are stored with column-major order, the four elements of α[0:3],p are contiguous in memory, but they are (generally) not contiguously stored with α[0:3],p+1. Elements βp,[0:3] are (generally) also not contiguous. The access pattern during the computation by the micro-kernel would be much more favorable if the A involved in the micro-kernel was packed in column-major order with leading dimension mR:

and B was packed in row-major order with leading dimension nR:

If this packing were performed at a strategic point in the computation, so that the packing is amortized over many computations, then a benefit might result. These observations are captured in Figure 3.3.1.

Picture adapted from [27]. PowerPoint source (step-by-step).

Figure 3.3.1: Blocking for multiple levels of cache, with packing.

In the next units you will discover how to implement packing and how to then integrate it into the five loops around the micro-kernel.
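Under the layout just described, the address of an element of the packed block can be written down explicitly. The following small helper is hypothetical and is not one of the course's packing routines; the value of MR is an assumption (12 × 4 is the register blocking used in the course's examples). It is consistent with the loops shown later, which pass &Atilde[ i*k ] to the micro-kernel and advance through the buffer by MR doubles per rank-1 update.

#define MR 12   /* assumed register blocking in the m dimension */

/* Index of element ( i, p ) of Atilde, which stores an mC x kC block of A
   as a sequence of MR x kC micro-panels, each in column-major order with
   leading dimension MR.                                                   */
int Atilde_index( int i, int p, int kC )
{
  int panel  = i / MR;            /* which micro-panel row i falls in */
  int within = i % MR;            /* row within that micro-panel      */
  return panel * MR * kC          /* skip the preceding micro-panels  */
         + p * MR                 /* skip the preceding columns       */
         + within;
}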

3.3.3 Implementation: packing row panel Bp,j

YouTube: https://www.youtube.com/watch?v=0ovhed0gzr4

We briefly discuss the packing of the row panel Bp,j into B̃p,j:

We break the implementation, in Assignments/Week3/C/PackB.c, down into two routines. The first loops over all the panels that need to be packed

as illustrated in Figure 3.3.2.

void PackPanelB_KCxNC( int k, int n, double *B, int ldB, double *Btilde )
/* Pack a k x n panel of B into a KC x NC buffer.
   The block is copied into Btilde a micro-panel at a time. */
{
  for ( int j=0; j<n; j+=NR ){
    int jb = min( NR, n-j );

    PackMicroPanelB_KCxNR( k, jb, &beta( 0, j ), ldB, Btilde );
    Btilde += k * jb;
  }
}

Figure 3.3.2: A reference implementation for packing Bp,j.

That routine then calls a routine that packs a single micro-panel, given in Figure 3.3.3.

void PackMicroPanelB_KCxNR( int k, int n, double *B, int ldB, double *Btilde )
/* Pack a micro-panel of B into the buffer pointed to by Btilde.
   This is an unoptimized implementation for general KC and NR.
   k is assumed to be less than or equal to KC.
   n is assumed to be less than or equal to NR. */
{
  /* March through B in row-major order, packing into Btilde. */
  if ( n == NR ) {
    /* Full column width micro-panel. */
    for ( int p=0; p<k; p++ )
      for ( int j=0; j<NR; j++ )
        *Btilde++ = beta( p, j );
  }
  else {
    /* Not a full micro-panel.  Pad the remaining columns with zeroes. */
    for ( int p=0; p<k; p++ ) {
      for ( int j=0; j<n; j++ )
        *Btilde++ = beta( p, j );
      for ( int j=n; j<NR; j++ )
        *Btilde++ = 0.0;
    }
  }
}

Figure 3.3.3: A reference implementation for packing a micro-panel of Bp,j.

Remark 3.3.4 We emphasize that this is a "quick and dirty" implementation. It is meant to be used for matrices that are sized to be nice multiples of the various blocking parameters. The goal of this course is to study how to implement matrix-matrix multiplication for matrices that are nice multiples of these blocking sizes. Once one fully understands how to optimize that case, one can start from scratch and design an implementation that works for all matrix sizes. Alternatively, upon completing this week one understands the issues well enough to be able to study a high-quality, open source implementation like the one in our BLIS framework [3].

3.3.4 Implementation: packing block Ai,p

YouTube: https://www.youtube.com/watch?v=GDYA478W-f0

We next discuss the packing of the block Ai,p into Ãi,p:

We break the implementation, in Assignments/Week3/C/PackA.c, down into two routines. The first loops over all the rows that need to be packed

as illustrated in Figure 3.3.5.

void PackBlockA_MCxKC( int m, int k, double *A, int ldA, double *Atilde )
/* Pack an m x k block of A into an MC x KC buffer.  MC is assumed to be a
   multiple of MR.  The block is packed into Atilde a micro-panel at a time.
   If necessary, the last micro-panel is padded with rows of zeroes. */
{
  for ( int i=0; i<m; i+=MR ){
    int ib = min( MR, m-i );

    PackMicroPanelA_MRxKC( ib, k, &alpha( i, 0 ), ldA, Atilde );
    Atilde += ib * k;
  }
}

Figure 3.3.5: A reference implementation for packing Ai,p.

That routine then calls a routine that packs a single micro-panel, given in Figure 3.3.6.

void PackMicroPanelA_MRxKC( int m, int k, double *A, int ldA, double *Atilde )
/* Pack a micro-panel of A into the buffer pointed to by Atilde.
   This is an unoptimized implementation for general MR and KC. */
{
  /* March through A in column-major order, packing into Atilde as we go. */
  if ( m == MR ) {
    /* Full row size micro-panel. */
    for ( int p=0; p<k; p++ )
      for ( int i=0; i<MR; i++ )
        *Atilde++ = alpha( i, p );
  }
  else {
    /* Not a full micro-panel.  Pad the remaining rows with zeroes. */
    for ( int p=0; p<k; p++ ) {
      for ( int i=0; i<m; i++ )
        *Atilde++ = alpha( i, p );
      for ( int i=m; i<MR; i++ )
        *Atilde++ = 0.0;
    }
  }
}

Figure 3.3.6: A reference implementation for packing a micro-panel of Ai,p.

Remark 3.3.7 Again, these routines only work when the sizes are "nice". We leave it as a challenge to generalize all implementations so that matrix-matrix multiplication with arbitrary problem sizes works. To manage the complexity of this, we recommend "padding" the matrices with zeroes as they are being packed. This then keeps the micro-kernel simple.

3.3.5 Implementation: five loops around the micro-kernel, with packing

YouTube: https://www.youtube.com/watch?v=Fye-IwGDTYA Below we illustrate how packing is incorporated into the "Five loops around the micro-kernel".

void LoopFive( int m, int n, int k, double *A, int ldA,
               double *B, int ldB, double *C, int ldC )
{
  for ( int j=0; j<n; j+=NC ) {
    int jb = min( NC, n-j );    /* Last loop may not involve a full block */

    LoopFour( m, jb, k, A, ldA, &beta( 0,j ), ldB, &gamma( 0,j ), ldC );
  }
}

void LoopFour( int m, int n, int k, double *A, int ldA,
               double *B, int ldB, double *C, int ldC )
{
  double *Btilde = ( double * ) malloc( KC * NC * sizeof( double ) );

  for ( int p=0; p<k; p+=KC ) {
    int pb = min( KC, k-p );    /* Last loop may not involve a full block */

    PackPanelB_KCxNC( pb, n, &beta( p, 0 ), ldB, Btilde );
    LoopThree( m, n, pb, &alpha( 0, p ), ldA, Btilde, C, ldC );
  }
  free( Btilde );
}

void LoopThree( int m, int n, int k, double *A, int ldA,
                double *Btilde, double *C, int ldC )
{
  double *Atilde = ( double * ) malloc( MC * KC * sizeof( double ) );

  for ( int i=0; i<m; i+=MC ) {
    int ib = min( MC, m-i );    /* Last loop may not involve a full block */

    PackBlockA_MCxKC( ib, k, &alpha( i, 0 ), ldA, Atilde );
    LoopTwo( ib, n, k, Atilde, Btilde, &gamma( i,0 ), ldC );
  }
  free( Atilde );
}

void LoopTwo( int m, int n, int k, double *Atilde,
              double *Btilde, double *C, int ldC )
{
  for ( int j=0; j<n; j+=NR ) {
    int jb = min( NR, n-j );

    LoopOne( m, jb, k, Atilde, &Btilde[ j*k ], &gamma( 0,j ), ldC );
  }
}

void LoopOne( int m, int n, int k, double *Atilde,
              double *MicroPanelB, double *C, int ldC )
{
  for ( int i=0; i<m; i+=MR ) {
    int ib = min( MR, m-i );

    Gemm_MRxNRKernel_Packed( k, &Atilde[ i*k ], MicroPanelB, &gamma( i,0 ), ldC );
  }
}

3.3.6 Micro-kernel with packed data

YouTube: https://www.youtube.com/watch?v=4hd0QnQ6JBo

void Gemm_MRxNRKernel_Packed( int k,
                              double *MP_A, double *MP_B, double *C, int ldC )
{
  __m256d gamma_0123_0 = _mm256_loadu_pd( &gamma( 0,0 ) );
  __m256d gamma_0123_1 = _mm256_loadu_pd( &gamma( 0,1 ) );
  __m256d gamma_0123_2 = _mm256_loadu_pd( &gamma( 0,2 ) );
  __m256d gamma_0123_3 = _mm256_loadu_pd( &gamma( 0,3 ) );

  __m256d beta_p_j;

  for ( int p=0; p<k; p++ ){
    /* load alpha( 0:3, p ) from the packed micro-panel of A */
    __m256d alpha_0123_p = _mm256_loadu_pd( MP_A );

    /* load beta( p, 0 ); update gamma( 0:3, 0 ) */
    beta_p_j = _mm256_broadcast_sd( MP_B );
    gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );

    /* load beta( p, 1 ); update gamma( 0:3, 1 ) */
    beta_p_j = _mm256_broadcast_sd( MP_B+1 );
    gamma_0123_1 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_1 );

    /* load beta( p, 2 ); update gamma( 0:3, 2 ) */
    beta_p_j = _mm256_broadcast_sd( MP_B+2 );
    gamma_0123_2 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_2 );

    /* load beta( p, 3 ); update gamma( 0:3, 3 ) */
    beta_p_j = _mm256_broadcast_sd( MP_B+3 );
    gamma_0123_3 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_3 );

    MP_A += MR;
    MP_B += NR;
  }

  /* Store the updated results.  This should be done more carefully since
     there may be an incomplete micro-tile. */
  _mm256_storeu_pd( &gamma(0,0), gamma_0123_0 );
  _mm256_storeu_pd( &gamma(0,1), gamma_0123_1 );
  _mm256_storeu_pd( &gamma(0,2), gamma_0123_2 );
  _mm256_storeu_pd( &gamma(0,3), gamma_0123_3 );
}

Figure 3.3.8: A micro-kernel that computes with packed data when mR × nR = 4 × 4.

How to modify the five loops to incorporate packing was discussed in Unit 3.3.5. A micro-kernel to compute with the packed data when mR × nR = 4 × 4 is now illustrated in Figure 3.3.8.

Homework 3.3.6.1 Examine the files Assignments/Week3/C/Gemm_Five_Loops_Packed_MRxNRKernel.c and Assignments/Week3/C/Gemm_4x4Kernel_Packed.c. Collect performance data with

make Five_Loops_Packed_4x4Kernel

and view the resulting performance with Live Script Plot_Five_Loops.mlx.

Solution. On Robert's laptop:

Homework 3.3.6.2 Copy the file Gemm_4x4Kernel_Packed.c into file Gemm_12x4Kernel_Packed.c. Modify that file so that it uses mR × nR = 12 × 4. Test the result with

make Five_Loops_Packed_12x4Kernel

and view the resulting performance with Live Script Plot_Five_Loops.mlx.

Solution. Assignments/Week3/Answers/Gemm_12x4Kernel_Packed.c On Robert's laptop:

Now we are getting somewhere!

Homework 3.3.6.3 In Homework 3.2.3.1, you determined the best block sizes MC and KC. Now that you have added packing to the implementation of the five loops around the micro-kernel, these parameters need to be revisited. You can collect data for a range of choices by executing make Five_Loops_Packed_?x?Kernel_MCxKC where ?x? is your favorite choice for register blocking. View the result with data/Plot_Five_loops.mlx. Solution.

• Assignments/Week4/Answers/Gemm_Five_Loops_Packed_MRxNRKernel.c

3.4 Further Tricks of the Trade

The course so far has exposed you to the "big ideas" on how to highly optimize matrix-matrix multiplication. You may be able to apply some of these ideas (how to extract parallelism via vector instructions, the importance of reusing data, and the benefits of accessing data with stride one) to your own computations. For some computations, there is no opportunity to reuse data to that same extent, which means they are inherently bandwidth bound.

By now, you may also have discovered that this kind of programming is just not your cup of soup. What we hope you will then have realized is that there are high-performance libraries out there, and that it is important to cast your computations in terms of calls to those libraries so that you benefit from the expertise of others.

Some of you now feel "the need for speed." You enjoy understanding how software and hardware interact and you have found a passion for optimizing computation. In this section, we give a laundry list of further optimizations that can be tried by you and others like you. We merely mention the possibilities and leave it to you to research the details and translate these into even faster code. If you are taking this course as an edX MOOC, you can discuss the result with others on the discussion forum.

3.4.1 Alignment (easy and worthwhile)

The vector intrinsic routines that load and/or broadcast vector registers are faster when the data from which one loads is aligned. Conveniently, loads of elements of A and B are from buffers into which the data was packed. By creating those buffers to be aligned, we can ensure that all these loads are aligned. Intel's intrinsic library has special memory allocation and deallocation routines specifically for this purpose: _mm_malloc and _mm_free. (align should be chosen to equal the length of a cache line, in bytes: 64.)
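As a minimal sketch of how the packing buffers might be allocated with the requested alignment: the 64-byte alignment matches the cache-line length mentioned above; the values of MC and NC echo those used in this course's later experiments, while KC = 256 is only a placeholder, not a recommended choice.

#include <immintrin.h>   /* declares _mm_malloc and _mm_free */

#define MC   72
#define KC  256
#define NC 2016

/* Allocate and free packing buffers aligned to a 64-byte cache line. */
void example_aligned_buffers( void )
{
  double *Atilde = ( double * ) _mm_malloc( MC * KC * sizeof( double ), 64 );
  double *Btilde = ( double * ) _mm_malloc( KC * NC * sizeof( double ), 64 );

  /* ... pack into and compute with Atilde and Btilde ... */

  _mm_free( Atilde );
  _mm_free( Btilde );
}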

3.4.2 Avoiding repeated memory allocations (may become important in Week 4)

In the final implementation that incorporates packing, workspace is allocated and deallocated multiple times at various levels. This can be avoided by allocating workspace immediately upon entering LoopFive (or in MyGemm). Alternatively, the space could be "statically allocated" as a global array of appropriate size. In "the real world" this is a bit dangerous: If an application calls multiple instances of matrix-matrix multiplication in parallel (e.g., using OpenMP, about which we will learn next week), then these different instances may end up overwriting each other's data as they pack, leading to something called a "race condition." But for the sake of this course, it is a reasonable first step.
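A minimal sketch of the "statically allocated" alternative follows; the specific buffer sizes are placeholders chosen for illustration (they would come from the blocking parameters in the Makefile), and, as noted above, this approach is not safe if several matrix-matrix multiplications execute concurrently.

#define MC   72
#define KC  256
#define NC 2016

/* Workspace allocated once, at file scope, instead of with malloc/free
   inside LoopFour and LoopThree.                                       */
static double Atilde[ MC * KC ];
static double Btilde[ KC * NC ];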

3.4.3 Play with the block sizes (highly recommended)

As mentioned, there are many choices for mR × nR. In our discussions, we focused on 12 × 4. Different choices for mR × nR may yield better performance now that packing has been added to the implementation. Once you have determined the best mR × nR, you may want to go back and redo Homework 3.3.6.3 to determine the best mC and kC. Then, collect final performance data once you have updated the Makefile with the best choices.

3.4.4 Broadcasting elements of A and loading elements of B (tricky and maybe not that worthwhile)

So far, we have focused on choices for mR × nR where mR is a multiple of four. Before packing was added, this was because one loads into registers from contiguous memory; loading from A four elements at a time, while broadcasting elements from B, was natural because of how data was mapped to memory using column-major order. After packing was introduced, one could also contemplate loading from B and broadcasting elements of A, which then also makes kernels like 6 × 8 possible. The problem is that this would mean performing a vector instruction with elements that are in rows of C, and hence when loading elements of C one would need to perform a transpose operation, and also when storing the elements back to memory. There are reasons why a 6 × 8 kernel edges out even an 8 × 6 kernel.

On the surface, it may seem like this would require a careful reengineering of the five loops as well as the micro-kernel. However, there are tricks that can be played to make it much simpler than it seems. Intuitively, if you store matrices by columns, leave the data as it is stored, but then work with the data as if it is stored by rows, this is like computing with the transposes of the matrices. Now, mathematically, if

\[
C := A B + C
\]

then

\[
C^T := ( A B + C )^T = ( A B )^T + C^T = B^T A^T + C^T .
\]

If you think about it: transposing a matrix is equivalent to viewing the matrix, when stored in column-major order, as being stored in row-major order. Consider the simplest implementation for computing C := AB + C from Week 1:

#define alpha( i,j ) A[ (j)*ldA + i ]
#define beta( i,j )  B[ (j)*ldB + i ]
#define gamma( i,j ) C[ (j)*ldC + i ]

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<m; i++ )
    for ( int j=0; j<n; j++ )
      for ( int p=0; p<k; p++ )
        gamma( i,j ) += alpha( i,p ) * beta( p,j );
}

From this, a code that instead computes CT := BT AT + CT , with matrices still stored in column-major order, is given by

#define alpha( i,j ) A[ (j)*ldA + i ]
#define beta( i,j )  B[ (j)*ldB + i ]
#define gamma( i,j ) C[ (j)*ldC + i ]

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<n; i++ )
    for ( int j=0; j<m; j++ )
      for ( int p=0; p<k; p++ )
        gamma( j,i ) += alpha( j,p ) * beta( p,i );
}

Notice how the roles of m and n are swapped and how working with the transpose replaces i,j with j,i, etc. However, we could have been clever: Instead of replacing i,j with j,i, etc., we could have instead swapped the order of these in the macro definitions: #define alpha( i,j ) becomes #define alpha( j,i ):

#define alpha( j,i ) A[ (j)*ldA + i ]
#define beta( j,i )  B[ (j)*ldB + i ]
#define gamma( j,i ) C[ (j)*ldC + i ]

void MyGemm( int m, int n, int k, double *A, int ldA,
             double *B, int ldB, double *C, int ldC )
{
  for ( int i=0; i<n; i++ )
    for ( int j=0; j<m; j++ )
      for ( int p=0; p<k; p++ )
        gamma( i,j ) += beta( i,p ) * alpha( p,j );
}

3.4.5 Loop unrolling (easy enough; see what happens)

There is some overhead that comes from having a loop in the micro-kernel. That overhead is in part due to the cost of updating the loop index p and the pointers to the buffers, MP_A and MP_B. It is also possible that "unrolling" the loop will allow the compiler to rearrange more instructions in the loop and/or the hardware to better perform out-of-order computation, because each iteration of the loop involves a branch instruction, where the branch depends on whether the loop is done. Branches get in the way of compiler optimizations and out-of-order computation by the CPU.

What is loop unrolling? In the case of the micro-kernel, unrolling the loop indexed by p by a factor of two means that each iteration of that loop updates the micro-tile of C twice instead of once. In other words, each iteration of the unrolled loop performs two iterations of the original loop, and updates p+=2 instead of p++. Unrolling by a larger factor is the natural extension of this idea. Obviously, a loop that is unrolled requires some care in case the number of iterations is not a nice multiple of the unrolling factor. You may have noticed that our performance experiments always use problem sizes that are multiples of 48. This means that if you use a reasonable unrolling factor that divides into 48, you are probably going to be OK.

Similarly, one can unroll the loops in the packing routines. See what happens!

Homework 3.4.5.1 Modify your favorite implementation so that it unrolls the loop in the micro-kernel with different unrolling factors. (Note: I have not implemented this myself yet...)
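To make the transformation concrete, here is a minimal sketch of the p loop of the 4 × 4 packed micro-kernel from Figure 3.3.8, unrolled by a factor of two. It assumes k is a multiple of two and reuses the variables declared in that figure; it illustrates the idea rather than providing a tuned implementation.

  for ( int p=0; p<k; p+=2 ){   /* unrolled by two: p advances by two */
    /* first iteration of the original loop */
    __m256d alpha_0123_p = _mm256_loadu_pd( MP_A );
    beta_p_j     = _mm256_broadcast_sd( MP_B );
    gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );
    beta_p_j     = _mm256_broadcast_sd( MP_B+1 );
    gamma_0123_1 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_1 );
    beta_p_j     = _mm256_broadcast_sd( MP_B+2 );
    gamma_0123_2 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_2 );
    beta_p_j     = _mm256_broadcast_sd( MP_B+3 );
    gamma_0123_3 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_3 );

    /* second iteration of the original loop, with the pointers offset */
    alpha_0123_p = _mm256_loadu_pd( MP_A+MR );
    beta_p_j     = _mm256_broadcast_sd( MP_B+NR );
    gamma_0123_0 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_0 );
    beta_p_j     = _mm256_broadcast_sd( MP_B+NR+1 );
    gamma_0123_1 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_1 );
    beta_p_j     = _mm256_broadcast_sd( MP_B+NR+2 );
    gamma_0123_2 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_2 );
    beta_p_j     = _mm256_broadcast_sd( MP_B+NR+3 );
    gamma_0123_3 = _mm256_fmadd_pd( alpha_0123_p, beta_p_j, gamma_0123_3 );

    MP_A += 2*MR;
    MP_B += 2*NR;
  }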

3.4.6 Prefetching (tricky; seems to confuse the compiler...)

Elements of Ã in theory remain in the L2 cache. If you implemented the 6 × 8 kernel, you are using 15 out of 16 vector registers and the hardware almost surely prefetches the next element of Ã into the L1 cache while computing with the current one. One can go one step further by using the last unused vector register as well, so that you use two vector registers to load/broadcast two elements of Ã. The CPUs of most modern cores execute "out of order," which means that at the hardware level instructions are rescheduled so that a "stall" (a hiccup in the execution because, for example, data is not available) is covered by other instructions. It is highly likely that this will result in the current computation overlapping with the next element from Ã being prefetched (load/broadcast) into a vector register.

A second reason to prefetch is to overcome the latency to main memory for bringing in the next micro-tile of C. The Intel instruction set (and vector intrinsic library) includes instructions to prefetch data into an indicated level of cache. This is accomplished by calling the instruction

void _mm_prefetch( char *p, int i )

• The argument *p gives the address of the byte (and corresponding cache line) to be prefetched.

• The value i is the hint that indicates which level to prefetch to.

◦ _MM_HINT_T0: prefetch data into all levels of the cache hierarchy.

◦ _MM_HINT_T1: prefetch data into level 2 cache and higher. ◦ _MM_HINT_T2: prefetch data into level 3 cache and higher, or an implementation-specific choice.

These are actually hints that a compiler can choose to ignore. In theory, with this a next micro-tile of C can be moved into a more favorable memory layer while a previous micro-tile is being updated. In theory, the same mechanism can be used to bring a next micro-panel of B̃, which itself is meant to reside in the L3 cache, into the L1 cache during computation with a previous such micro-panel. In practice, we did not manage to make this work. It may work better to use equivalent instructions with in-lined assembly code. For additional information, you may want to consult Intel's Intrinsics Guide.
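As a small sketch of what such a hint might look like, the helper below issues prefetches for the four columns of a next 4 × 4 micro-tile of C; the choice to prefetch one column at a time and the use of _MM_HINT_T0 are illustrative assumptions, and, as noted above, we did not find this profitable in practice.

#include <xmmintrin.h>   /* _mm_prefetch and the _MM_HINT_* constants */

/* Hint that the four columns of the next 4x4 micro-tile of C, starting at
   next_C, should be brought into the caches while the current micro-tile
   is being updated.                                                       */
void prefetch_next_micro_tile( double *next_C, int ldC )
{
  for ( int j=0; j<4; j++ )
    _mm_prefetch( ( char * )( next_C + j*ldC ), _MM_HINT_T0 );
}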

3.4.7 Using in-lined assembly code

Even more control over what happens where can be attained by using in-lined assembly code. This allows prefetching and the way registers are used to be made more explicit. (The compiler often changes the order of the operations even when intrinsics are used in C.) Details go beyond this course. You may want to look at the BLIS micro-kernel for the Haswell architecture, implemented with in-lined assembly code.

3.5 Enrichments

3.5.1 Goto’s algorithm and BLIS

The overall approach described in Week 2 and Week 3 was first suggested by Kazushige Goto and published in

• [9] Kazushige Goto and Robert van de Geijn, On reducing TLB misses in matrix multiplication, Technical Report TR02-55, Department of Computer Sciences, UT-Austin, Nov. 2002.

• [10] Kazushige Goto and Robert van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, Vol. 34, No. 3: Article 12, May 2008.

We refer to it as Goto's algorithm or the GotoBLAS algorithm (since he incorporated it in his implementation of the BLAS by the name GotoBLAS). His innovations included staging the loops so that a block of A remains in the L2 cache. He assembly-coded all the computation performed by the two loops around the micro-kernel

which he called the "inner-kernel". His accomplishments at the time were impressive enough that the New York Times did an article on him in 2005: "Writing the Fastest Code, by Hand, for Fun: A Human Computer Keeps Speeding Up Chips."

The very specific packing by micro-panels that we discuss in this week was also incorporated in Goto's implementations, but the idea should be attributed to Greg Henry.

The BLIS library refactored Goto's algorithm into the five loops around the micro-kernel with which you are now very familiar, as described in a sequence of papers:

• [32] Field G. Van Zee and Robert A. van de Geijn, BLIS: A Framework for Rapidly Instantiating BLAS Functionality, ACM Journal on Mathematical Software, Vol. 41, No. 3, June 2015.

• [31] Field G. Van Zee, Tyler Smith, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John Gunnels, Tze Meng Low, Bryan Marker, Lee Killough, and Robert A. van de Geijn, The BLIS Framework: Experiments in Portability, ACM Journal on Mathematical Software, Vol. 42, No. 2, June 2016.

• [27] Field G. Van Zee and Tyler M. Smith, Implementing High-performance Complex Matrix Multiplication via the 3M and 4M Methods, ACM Transactions on Mathematical Software, Vol. 44, No. 1, pp. 7:1-7:36, July 2017.

• [30] Field G. Van Zee, Implementing High-Performance Complex Matrix Multiplication via the 1m Method, ACM Journal on Mathematical Software, in review.

While Goto already generalized his approach to accommodate a large set of special cases of matrix-matrix multiplication, known as the level-3 BLAS [5], as described in

• [11] Kazushige Goto and Robert van de Geijn, High-performance implementation of the level-3 BLAS, ACM Transactions on Mathematical Software, Vol. 35, No. 1: Article 4, July 2008.

His implementations required a considerable amount of assembly code to be written for each of these special cases. The refactoring of matrix-matrix multiplication in BLIS allows the micro-kernel to be reused across these special cases [32], thus greatly reducing the effort required to implement all of this functionality, at the expense of a moderate reduction in performance.

3.5.2 How to choose the blocking parameters

As you optimized your various implementations of matrix-matrix multiplication, you may have wondered how to optimally pick the various block sizes (MR, NR, MC, NC, and KC). Some believe that the solution is to explore the space of possibilities, timing each choice and picking the one that attains the best performance [33]. This is known as autotuning. It has been shown that the block sizes for BLIS can be determined through analysis instead, as discussed in

• [18] Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Orti, Analytical Modeling Is Enough for High-Performance BLIS, ACM Journal on Mathematical Software, Vol. 43, No. 2, Aug. 2016.

3.5.3 Alternatives to Goto's algorithm

In Unit 2.5.1, theoretical lower bounds on data movement incurred by matrix-matrix multiplication were discussed. You may want to go back and read [23], mentioned in that unit, in which it is discussed how Goto's algorithm fits into the picture. In a more recent paper,

• [24] Tyler M. Smith and Robert A. van de Geijn, The MOMMS Family of Matrix Multiplication Algorithms, arXiv, 2019.

a family of practical matrix-matrix multiplication algorithms is discussed, of which Goto's algorithm is but one member. Analytical and empirical results suggest that for future CPUs, for which the ratio between the cost of a flop and the cost of moving data between the memory layers becomes even worse, other algorithms may become superior.

3.5.4 Practical implementation of Strassen's algorithm

Achieving a high-performing, practical implementation of Strassen's algorithm, which was discussed in Unit 2.5.2, is nontrivial:

• As the number of levels of Strassen increases, the extra overhead associated with the additions becomes nontrivial.

• Now that you understand the cost of moving data between memory layers, you can also appreciate that there is cost related to moving data associated with the extra additions, if one is not careful.

• Accommodating matrices with sizes that are not a power of two (or a multiple of a power of two) becomes a can of worms.

Imagine trying to deal with the complexity of this recursive algorithm as you try to apply what you have learned so far in the course... Until recently, the approach was to only perform a few levels of Strassen's algorithm, and to call a high-performance conventional matrix-matrix multiplication for the multiplications at the leaves of the recursion.

• If one performs one level of Strassen, this would reduce the number of flops required by 1 − 7/8 = 1/8 = 0.125.

• If one performs two levels of Strassen, this would reduce the number of flops required by 1 − (7/8)² ≈ 0.234.

This yields some potential for improved performance. It comes at the expense of additional workspace due to the many intermediate matrices that need to be computed, which reduces the largest problem that can fit in memory. In the recent paper

• [14] Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn, Strassen's Algorithm Reloaded, International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), 2016.

it was shown that the additions and subtractions that appear in Unit 2.5.2 can be incorporated into the packing that is inherent in the optimizations of matrix-matrix multiplication that you mastered so far in the course! This observation makes Strassen's algorithm much more practical than previously thought. Those interested in the details can read that paper. If that paper interests you, you may also want to read

• [12] Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn, Generating Families of Practical Fast Matrix Multiplication Algorithms, in Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS17), Orlando, FL, May 29-June 2, 2017, which discusses a family of Strassen-like algorithms.

The code discussed in that paper can be found at https://github.com/flame/fmm-gen.

3.6 Wrap Up

3.6.1 Additional exercises

The following are pretty challenging exercises. They stretch you to really think through the details. Or you may want to move on to Week 4, and return to these if you have extra energy in the end.

Homework 3.6.1.1 You will have noticed that we only time certain problem sizes when we execute our performance experiments. The reason is that the implementations of MyGemm do not handle "edge cases" well: when the problem size is not a nice multiple of the blocking sizes that appear in the micro-kernel: mR, nR, and kC.

Since our final, highest-performing implementations pack panels of B and blocks of A, many of the edge case problems can be handled by "padding" micro-panels with zeroes when the remaining micro-panel is not a "full" one. Also, one can compute a full micro-tile of C and then add it to a partial such submatrix of C, if a micro-tile that is encountered is not a full mR × nR submatrix.

Reimplement MyGemm using these tricks so that it computes correctly for all sizes m, n, and k.

Homework 3.6.1.2 Implement the practical Strassen algorithm discussed in Unit 3.5.4.

3.6.2 Summary

3.6.2.1 The week in pictures

Figure 3.6.1: A typical memory hierarchy, with registers, caches, and main memory.

Figure 3.6.2: The five loops, with packing.

Figure 3.6.3: Performance on Robert's laptop at the end of Week 3.

Week 4

Multithreaded Parallelism

Only Weeks 0-2 have been released on edX. In order for our beta testers to comment on the materials for Week 4, the notes for this week are available to all. These notes are very much still under construction... Proceed at your own risk!

4.1 Opening Remarks

4.1.1 Launch

In this launch, you get a first exposure to parallelizing implementations of matrix-matrix multiplication with OpenMP. By examining the resulting performance data in different ways, you gain a glimpse into how to interpret that data.

Recall how the IJP ordering can be explained by partitioning matrices: Starting with C := AB + C, partitioning C and A by rows yields

\[
\begin{pmatrix} \widetilde{c}_0^T \\ \vdots \\ \widetilde{c}_{m-1}^T \end{pmatrix}
:=
\begin{pmatrix} \widetilde{a}_0^T \\ \vdots \\ \widetilde{a}_{m-1}^T \end{pmatrix} B
+
\begin{pmatrix} \widetilde{c}_0^T \\ \vdots \\ \widetilde{c}_{m-1}^T \end{pmatrix}
=
\begin{pmatrix} \widetilde{a}_0^T B + \widetilde{c}_0^T \\ \vdots \\ \widetilde{a}_{m-1}^T B + \widetilde{c}_{m-1}^T \end{pmatrix}.
\]

What we recognize is that the update of each row of C is completely independent and hence their computation can happen in parallel. Now, it would be nice to simply tell the compiler "parallelize this loop, assigning the iterations to cores in some balanced fashion." For this, we use a standardized interface, OpenMP, to which you will be introduced in this week. For now, all you need to know is:

• You need to include the header file omp.h

#include "omp.h"

at the top of a routine that uses OpenMP compiler directives and/or calls to OpenMP routines and functions.

• You need to insert the line


#pragma omp parallel for

before the loop the iterations of which can be executed in parallel.

• You need to set the environment variable OMP_NUM_THREADS to the number of threads to be used in parallel:

export OMP_NUM_THREADS=4

before executing the program.

The compiler blissfully does the rest.

Remark 4.1.1 In this week, we get serious about collecting performance data. You will want to

• Make sure the laptop you are using is plugged into the wall, since running on battery can do funny things to your performance.

• Close down browsers and other processes that are running on your machine. I noticed that Apple's Activity Monitor can degrade performance a few percent.

• Learners running virtual machines will want to assign enough virtual CPUs.

Homework 4.1.1.1 In Assignments/Week4/C/ execute

export OMP_NUM_THREADS=1
make IJP

to collect fresh performance data for the IJP loop ordering. Next, copy Gemm_IJP.c to Gemm_MT_IJP.c. (MT for Multithreaded). Include the header file omp.h at the top and insert the directive discussed above before the loop indexed with i. Compile and execute with the commands

export OMP_NUM_THREADS=4
make MT_IJP

ensuring only up to four cores are used by setting OMP_NUM_THREADS=4. Examine the resulting performance with the Live Script data/Plot_MT_Launch.mlx. You may want to vary the number of threads you use depending on how many cores your processor has. You will then want to modify that number in the Live Script as well.

Solution.

YouTube: https://www.youtube.com/watch?v=1M76YuKorlI

• Gemm_MT_IJP.c

YouTube: https://www.youtube.com/watch?v=DXYDhNBGUjA On Robert’s laptop (using 4 threads):

Time:

Speedup

Total GFLOPS

GFLOPS per thread

Homework 4.1.1.2 In Week 1, you discovered that the JPI ordering of loops yields the best performance of all (simple) loop orderings. Can the outer-most loop be parallelized much like the outer-most loop of the IJP ordering? Justify your answer. If you believe it can be parallelized, in Assignments/Week4/C/ execute

make JPI

to collect fresh performance data for the JPI loop ordering. Next, copy Gemm_JPI.c to Gemm_MT_JPI.c. Include the header file omp.h at the top and insert the directive discussed above before the loop indexed with j. Compile and execute with the commands

export OMP_NUM_THREADS=4
make MT_JPI

Examine the resulting performance with the Live Script data/Plot_MT_Launch.mlx by changing ( 0 ) to ( 1 ) (in four places).

Solution.

YouTube: https://www.youtube.com/watch?v=415ZdTOt8bs

The loop ordering JPI also allows for convenient parallelization of the outer-most loop. Partition C and B by columns. Then

\[
\begin{pmatrix} c_0 & \cdots & c_{n-1} \end{pmatrix}
:=
A \begin{pmatrix} b_0 & \cdots & b_{n-1} \end{pmatrix}
+
\begin{pmatrix} c_0 & \cdots & c_{n-1} \end{pmatrix}
=
\begin{pmatrix} A b_0 + c_0 & \cdots & A b_{n-1} + c_{n-1} \end{pmatrix}.
\]

This illustrates that the columns of C can be updated in parallel.

• Gemm_MT_JPI.c

On Robert’s laptop (using 4 threads):

Time:

Speedup

Total GFLOPS

GFLOPS per thread

Homework 4.1.1.3 Modify the Live Script data/Plot_MT_Launch.mlx, by changing ( 0 ) to ( 1 ) (in four places), so that the results for the reference implementation are also displayed. Which multithreaded implementation yields the best speedup? Which multithreaded implementation yields the best performance?

Solution.

YouTube: https://www.youtube.com/watch?v=rV8AKjvx4fg On Robert’s laptop (using 4 threads):

Time:

Speedup

Total GFLOPS

GFLOPS per thread

Remark 4.1.2 In later graphs, you will notice a higher peak GFLOPS rate for the reference implementation. Up to this point in the course, I had not fully realized the considerable impact having other applications open on my laptop has on the performance experiments. We get very careful with this starting in Section 4.3.

Learn more:

• For a general discussion of pragma directives, you may want to visit https://www.cprogramming.com/reference/preprocessor/pragma.html

4.1.2 Outline Week 4

• 4.1 Opening Remarks

◦ 4.1.1 Launch ◦ 4.1.2 Outline Week 4

◦ 4.1.3 What you will learn

• 4.2 OpenMP

◦ 4.2.1 Of cores and threads ◦ 4.2.2 Basics ◦ 4.2.3 Hello World!

• 4.3 Multithreading Matrix Multiplication

◦ 4.3.1 Lots of loops to parallelize ◦ 4.3.2 Parallelizing the first loop around the micro-kernel ◦ 4.3.3 Parallelizing the second loop around the micro-kernel ◦ 4.3.4 Parallelizing the third loop around the micro-kernel ◦ 4.3.5 Parallelizing the fourth loop around the micro-kernel ◦ 4.3.6 Parallelizing the fifth loop around the micro-kernel ◦ 4.3.7 Discussion

• 4.4 Parallelizing More

◦ 4.4.1 Speedup, efficiency, etc. ◦ 4.4.2 Amdahl's law ◦ 4.4.3 Parallelizing the packing ◦ 4.4.4 Parallelizing multiple loops

• 4.5 Enrichments

◦ 4.5.1 Casting computation in terms of matrix-matrix multiplication ◦ 4.5.2 Family values ◦ 4.5.3 Matrix-matrix multiplication for Machine Learning ◦ 4.5.4 Matrix-matrix multiplication on GPUs ◦ 4.5.5 Matrix-matrix multiplication on distributed memory architectures

• 4.6 Wrap Up

◦ 4.6.1 Additional exercises ◦ 4.6.2 Summary

4.1.3 What you will learn

In this week, we discover how to parallelize matrix-matrix multiplication among multiple cores of a processor. Upon completion of this week, we will be able to

• Exploit multiple cores by multithreading your implementation.

• Direct the compiler to parallelize code sections with OpenMP.

• Parallelize the different loops and interpret the resulting performance.

• Experience when loops can be more easily parallelized and when more care must be taken.

• Apply the concepts of speedup and efficiency to implementations of matrix-matrix multiplication.

• Analyze limitations on parallel efficiency due to Amdahl's law.

The enrichments introduce us to

• The casting of other linear algebra operations in terms of matrix-matrix multiplication.

• The benefits of having a family of algorithms for a specific linear algebra operation and where to learn how to systematically derive such a family.

• Operations that resemble matrix-matrix multiplication that are encountered in Machine Learning, allowing the techniques to be extended.

• Parallelizing matrix-matrix multiplication for distributed memory architectures.

• Applying the learned techniques to the implementation of matrix-matrix multiplication on GPUs.

4.2 OpenMP

4.2.1 Of cores and threads

YouTube: https://www.youtube.com/watch?v=2V4HCXCPm0s From Wikipedia: “A multi-core processor is a single computing component with two or more independent processing units called cores, which read and execute program instructions.” A thread is a thread of execution.

• It is a stream of program instructions and associated data. All these instructions may execute on a single core, multiple cores, or they may move from core to core.

• Multiple threads can execute on different cores or on the same core.

In other words:

• Cores are hardware components that can compute simultaneously.

• Threads are streams of execution that can compute simultaneously.

What will give you your solution faster than using one core? Using multiple cores!

4.2.2 Basics

OpenMP is a standardized API (Application Programming Interface) for cre- ating multiple threads of execution from a single program. For our purposes, you need to know very little:

• There is a header file to include in the file(s):

#include <omp.h>

• Directives to the compiler (pragmas):

#pragma omp parallel
#pragma omp parallel for

• A library of routines that can, for example, be used to inquire about the execution environment:

omp_get_max_threads()
omp_get_num_threads()
omp_get_thread_num()

• Environment parameters that can be set before executing the program:

export OMP_NUM_THREADS=4

If you want to see what the value of this environment parameter is, execute

echo $OMP_NUM_THREADS

Learn more:

• Search for "OpenMP" on Wikipedia.

• Tim Mattson’s (Intel) “Introduction to OpenMP” (2013) on YouTube.

◦ Slides: Intro_To_OpenMP_Mattson.pdf ◦ Exercise files: Mattson_OMP_exercises.zip

We suggest you investigate OpenMP more after completing this week.

4.2.3 Hello World!

YouTube: https://www.youtube.com/watch?v=AEIYiu0wO5M We will illustrate some of the basics of OpenMP via the old standby, the "Hello World!" program:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  printf( "Hello World!\n" );
}

Homework 4.2.3.1 In Week4/C/ compile HelloWorld.c with the command

gcc -o HelloWorld.x HelloWorld.c

and execute the resulting executable with

export OMP_NUM_THREADS=4
./HelloWorld.x

Solution. The output is

Hello World!

even though we indicated four threads are available for the execution of the program.

Homework 4.2.3.2 Copy the file HelloWorld.c to HelloWorld1.c. Modify it to add the OpenMP header file:

#include "omp.h"

at the top of the file. Compile it with the command

gcc -o HelloWorld1.x HelloWorld1.c

and execute it with

export OMP_NUM_THREADS=4
./HelloWorld1.x

Next, recompile and execute with

gcc -o HelloWorld1.x HelloWorld1.c -fopenmp
export OMP_NUM_THREADS=4
./HelloWorld1.x

Pay attention to the -fopenmp, which links the OpenMP library. What do you notice?

(You don't need to export OMP_NUM_THREADS=4 every time you execute. We expose it so that you know exactly how many threads are available.)

YouTube: https://www.youtube.com/watch?v=EcaLVnK_aws

• Assignments/Week4/Answers/HelloWorld1.c

In all cases, the output is

Hello World!

None of what you have tried so far resulted in any parallel execution because an OpenMP program uses a "fork and join" model: Initially, there is just one thread of execution. Multiple threads are deployed when the program reaches a parallel region, initiated by

#pragma omp parallel
{

}

At that point, multiple threads are "forked" (initiated), each of which then performs the command given in the parallel section. The parallel section here is the section of the code immediately after the #pragma directive, bracketed by the "{" and "}", which C views as a single command (that may be composed of multiple commands within that region). Notice that the "{" and "}" are not necessary if the parallel region consists of a single command. At the end of the region, the threads are synchronized and "join" back into a single thread.

Homework 4.2.3.3 Copy the file HelloWorld1.c to HelloWorld2.c. Before the printf statement, insert

#pragma omp parallel

Compile and execute:

gcc -o HelloWorld2.x HelloWorld2.c -fopenmp
export OMP_NUM_THREADS=4
./HelloWorld2.x

What do you notice? Solution.

YouTube: https://www.youtube.com/watch?v=7Sr8jfzGHtw

• Assignments/Week4/Answers/HelloWorld2.c

The output now should be

Hello World!
Hello World!
Hello World!
Hello World!

You are now running identical programs on four threads, each of which is printing out an identical message. Execution starts with a single thread that forks four threads, each of which printed a copy of the message. Obviously, this isn't very interesting since they don't collaborate to make computation that was previously performed by one thread faster.

Next, we introduce three routines with which we can extract information about the environment in which the program executes and information about a specific thread of execution:

• omp_get_max_threads() returns the maximum number of threads that are available for computation. It equals the number assigned to OMP_NUM_THREADS before executing a program.

• omp_get_num_threads() equals the number of threads in the current team: The total number of threads that are available may be broken up into teams that perform separate tasks.

• omp_get_thread_num() returns the index that uniquely identifies the thread that calls this function, among the threads in the current team. This index ranges from 0 to (omp_get_num_threads()-1). In other words, the indexing of the threads starts at zero.

In all our examples, omp_get_num_threads() equals omp_get_max_threads() in a parallel section.

Homework 4.2.3.4 Copy the file HelloWorld2.c to HelloWorld3.c. Modify the body of the main routine to

int maxthreads = omp_get_max_threads();

#pragma omp parallel
{
  int nthreads = omp_get_num_threads();
  int tid      = omp_get_thread_num();

  printf( "Hello World! from %d of %d max_threads = %d \n\n",
          tid, nthreads, maxthreads );
}

Compile it and execute it:

export OMP_NUM_THREADS=4
gcc -o HelloWorld3.x HelloWorld3.c -fopenmp
./HelloWorld3.x

What do you notice? Solution.

YouTube: https://www.youtube.com/watch?v=RsASfLwn4KM

• Assignments/Week4/Answers/HelloWorld3.c

The output:

Hello World! from 0 of 4 max_threads = 4

Hello World! from 1 of 4 max_threads = 4

Hello World! from 3 of 4 max_threads = 4

Hello World! from 2 of 4 max_threads = 4

In the last exercise, there are four threads available for execution (since OMP_NUM_THREADS equals 4). In the parallel section, each thread assigns its index (rank) to its private variable tid. Each thread then prints out its copy of Hello World! Notice that these printfs may not appear in sequential order. This very simple example demonstrates how the work performed by a specific thread is determined by its index within a team.
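To make that last point concrete, here is a minimal sketch, not taken from the course materials, of how a thread's index can be used to divide the iterations of a loop among the members of a team; the array size and the cyclic assignment of iterations are arbitrary choices made only for illustration.

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(int argc, char *argv[])
{
  double x[ N ];

  #pragma omp parallel
  {
    int nthreads = omp_get_num_threads();
    int tid      = omp_get_thread_num();

    /* Thread tid handles the iterations i with i % nthreads == tid. */
    for ( int i=tid; i<N; i+=nthreads )
      x[ i ] = 2.0 * i;
  }

  printf( "x[ %d ] = %f\n", N-1, x[ N-1 ] );
}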

4.3 Multithreading Matrix Multiplication

4.3.1 Lots of loops to parallelize

YouTube: https://www.youtube.com/watch?v=2lLDVS7fGTA Let’s again consider the five loops around the micro-kernel:

What we notice is that there are lots of loops that can be parallelized. In this section we examine how parallelizing each individually impacts parallel performance.

Homework 4.3.1.1 In directory Assignments/Week4/C, execute

export OMP_NUM_THREADS=1
make Five_Loops_Packed_8x6Kernel

to collect fresh performance data. In the Makefile, MC and KC have been set to values that yield reasonable performance in our experience. Transfer the resulting output file (in subdirectory data) to Matlab Online and test it by running

Plot_MT_Performance_8x6.mlx

If you decide also to test a 12x4 kernel or others, upload their data and run the corresponding Plot_MT_Performance_??x??.mlx to see the results. (If the .mlx file for your choice of mR × nR does not exist, you may want to copy one of the existing .mlx files and do a global search and replace.)

Solution.

YouTube: https://www.youtube.com/watch?v=W7l7mxkEgg0 On Robert’s laptop:

This reminds us: we are doing really well on a single core!

Remark 4.3.1 Here and in future homework, you can substitute 12x4 for 8x6 if you prefer that size of micro-tile. Remark 4.3.2 It is at this point in my in-class course that I emphasize that one should be careful not to pronounce "parallelize" as "paralyze." You will notice that sometimes if you naively parallelize your code, you actually end up paralyzing it...

4.3.2 Parallelizing the first loop around the micro-kernel

Let us start by considering how to parallelize the first loop around the micro-kernel:

The situation here is very similar to that considered in Unit 4.1.1 when parallelizing the IJP loop ordering. There, we observed that if C and A are partitioned by rows, the matrix-matrix multiplication can be described by

\[
\begin{pmatrix} \widetilde{c}_0^T \\ \vdots \\ \widetilde{c}_{m-1}^T \end{pmatrix}
:=
\begin{pmatrix} \widetilde{a}_0^T \\ \vdots \\ \widetilde{a}_{m-1}^T \end{pmatrix} B
+
\begin{pmatrix} \widetilde{c}_0^T \\ \vdots \\ \widetilde{c}_{m-1}^T \end{pmatrix}
=
\begin{pmatrix} \widetilde{a}_0^T B + \widetilde{c}_0^T \\ \vdots \\ \widetilde{a}_{m-1}^T B + \widetilde{c}_{m-1}^T \end{pmatrix}.
\]

We then observed that the update of each row of C could proceed in parallel. The matrix-matrix multiplication with the block of A and micro-panels of C and B performed by the first loop around the micro-kernel instead partitions the block of A into row (micro-)panels and the micro-panel of C into micro-tiles:

\[
\begin{pmatrix} C_0 \\ \vdots \\ C_{M-1} \end{pmatrix}
:=
\begin{pmatrix} A_0 \\ \vdots \\ A_{M-1} \end{pmatrix} B
+
\begin{pmatrix} C_0 \\ \vdots \\ C_{M-1} \end{pmatrix}
=
\begin{pmatrix} A_0 B + C_0 \\ \vdots \\ A_{M-1} B + C_{M-1} \end{pmatrix}.
\]

The bottom line: the updates of these micro-tiles can happen in parallel.

Homework 4.3.2.1 In directory Assignments/Week4/C,

• Copy Gemm_Five_Loops_Packed_MRxNRKernel.c into Gemm_MT_Loop1_MRxNRKernel.c.

• Modify it so that the first loop around the micro-kernel is parallelized.

• Execute it with

export OMP_NUM_THREADS=4 make MT_Loop1_8x6Kernel

• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_performance_8x6.mlx, uncommenting the appropriate sections.

Solution.

YouTube: https://www.youtube.com/watch?v=WQEumfrQ6j4

• Assignments/Week4/Answers/Gemm_MT_Loop1_MRxNRKernel.c

On Robert’s laptop (using 4 threads):

Parallelizing the first loop around the micro-kernel yields poor performance. A number of issues get in the way:

• Each task (the execution of a micro-kernel) that is performed by a thread is relatively small and, as a result, the overhead of starting a parallel section (spawning threads and synchronizing upon their completion) is nontrivial.

• In our experiments, we chose MC=72 and MR=8 and hence there are at most 72/8 = 9 iterations that can be performed by the threads. This leads to load imbalance unless the number of threads being used divides 9 evenly. It also means that no more than 9 threads can be usefully employed. The bottom line: there is limited parallelism to be found in the loop. This has a secondary effect: each micro-panel of B is reused relatively few times by a thread, and hence the cost of bringing it into a core's L1 cache is not amortized well.

• Each core only uses a few of the micro-panels of A and hence only a fraction of the core’s L2 cache is used (if each core has its own L2 cache). The bottom line: the first loop around the micro-kernel is not a good candidate for parallelization.

4.3.3 Parallelizing the second loop around the micro-kernel Let us next consider how to parallelize the second loop around the micro-kernel:

This time, the situation is very similar to that considered in Unit 4.1.1 when parallelizing the JPI loop ordering. There, in Homework 4.1.1.2 we observed that if C and B are partitioned by columns, the matrix-matrix multiplication can be described by
$$
\left(\begin{array}{c|c|c} c_0 & \cdots & c_{n-1} \end{array}\right) :=
A \left(\begin{array}{c|c|c} b_0 & \cdots & b_{n-1} \end{array}\right) +
\left(\begin{array}{c|c|c} c_0 & \cdots & c_{n-1} \end{array}\right) =
\left(\begin{array}{c|c|c} A b_0 + c_0 & \cdots & A b_{n-1} + c_{n-1} \end{array}\right).
$$

We then observed that the update of each column of C can proceed in parallel. The matrix-matrix multiplication

performed by the second loop around the micro-kernel instead partitions the row panel of C and the row panel of B into micro-panels:
$$
\left(\begin{array}{c|c|c} C_0 & \cdots & C_{N-1} \end{array}\right) :=
A \left(\begin{array}{c|c|c} B_0 & \cdots & B_{N-1} \end{array}\right) +
\left(\begin{array}{c|c|c} C_0 & \cdots & C_{N-1} \end{array}\right) =
\left(\begin{array}{c|c|c} A B_0 + C_0 & \cdots & A B_{N-1} + C_{N-1} \end{array}\right).
$$

The bottom line: the updates of the micro-panels of C can happen in parallel.

Homework 4.3.3.1 In directory Assignments/Week4/C,
• Copy Gemm_MT_Loop1_MRxNRKernel.c into Gemm_MT_Loop2_MRxNRKernel.c.

• Modify it so that only the second loop around the micro-kernel is parallelized.

• Execute it with

export OMP_NUM_THREADS=4
make MT_Loop2_8x6Kernel

• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_performance_8x6.mlx, changing 0 to 1 for the appropriate section.

Solution.

YouTube: https://www.youtube.com/watch?v=_v6OY4GLkHY

• Assignments/Week4/Answers/Gemm_MT_Loop2_MRxNRKernel.c

On Robert’s laptop (using 4 threads):

Parallelizing the second loop appears to work very well. The granularity of the computation in each iteration is larger. The ratio between NC and NR is typically large: in the makefile I set NC=2016 and NR=6, so that there are 2016/6 = 336 tasks that can be scheduled to the threads. The only issue is that all cores load their L2 cache with the same block of A, which is a waste of a valuable resource (the aggregate size of the L2 caches).
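To see the pattern in isolation, here is a small standalone program, not the course's code: all names, sizes, and the unoptimized inner loops are made up for illustration. It updates C := A B + C by looping over column panels of width NR and placing the OpenMP directive on that loop. Each iteration touches a disjoint set of columns of C, so the iterations are independent, just like the micro-panel updates in the second loop around the micro-kernel.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NR 6                            /* width of a column micro-panel; made-up value */
#define gamma( i,j ) C[ (j)*m + (i) ]   /* column-major C */
#define alpha( i,j ) A[ (j)*m + (i) ]   /* column-major A */
#define beta( i,j )  B[ (j)*k + (i) ]   /* column-major B */

int main( void )
{
  int m = 100, n = 96, k = 80;
  double *A = malloc( m * k * sizeof( double ) );
  double *B = malloc( k * n * sizeof( double ) );
  double *C = malloc( m * n * sizeof( double ) );

  for ( int p=0; p<m*k; p++ ) A[ p ] = (double) rand() / RAND_MAX;
  for ( int p=0; p<k*n; p++ ) B[ p ] = (double) rand() / RAND_MAX;
  for ( int p=0; p<m*n; p++ ) C[ p ] = 0.0;

  /* Loop over column panels of C and B, NR columns at a time.  Each iteration
     updates a disjoint set of columns of C, so the iterations are independent
     and the loop can safely be given to OpenMP. */
  #pragma omp parallel for
  for ( int j=0; j<n; j+=NR ) {
    int jb = ( NR < n-j ? NR : n-j );   /* the last panel may be narrower */
    for ( int jj=j; jj<j+jb; jj++ )
      for ( int i=0; i<m; i++ )
        for ( int p=0; p<k; p++ )
          gamma( i,jj ) += alpha( i,p ) * beta( p,jj );
  }

  printf( "C(0,0) = %f, computed with up to %d threads\n",
          gamma( 0,0 ), omp_get_max_threads() );
  free( A ); free( B ); free( C );
  return 0;
}

Compile with, for example, gcc -O2 -fopenmp. In the homework, the analogous change should amount to placing the same directive on the second loop around the micro-kernel in your Gemm_MT_Loop2_MRxNRKernel.c.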

4.3.4 Parallelizing the third loop around the micro-kernel Moving right along, consider how to parallelize the third loop around the micro-kernel:

This reminds us of Unit 4.1.1 (parallelizing the IJP loop) and Unit 4.3.2 (parallelizing the first loop around the micro-kernel),

$$
\left(\begin{array}{c} C_0 \\ \vdots \\ C_{M-1} \end{array}\right) :=
\left(\begin{array}{c} A_0 \\ \vdots \\ A_{M-1} \end{array}\right) B +
\left(\begin{array}{c} C_0 \\ \vdots \\ C_{M-1} \end{array}\right) =
\left(\begin{array}{c} A_0 B + C_0 \\ \vdots \\ A_{M-1} B + C_{M-1} \end{array}\right),
$$

except that now the sizes of the row panels of C and blocks of A are larger. The bottom line: the updates of the row panels of C can happen in parallel.

Homework 4.3.4.1 In directory Assignments/Week4/C,
• Copy Gemm_MT_Loop2_MRxNRKernel.c into Gemm_MT_Loop3_MRxNRKernel.c.

• Modify it so that only the third loop around the micro-kernel is parallelized.

• Execute it with

export OMP_NUM_THREADS=4
make MT_Loop3_8x6Kernel

• Be sure to check if you got the right answer! Parallelizing this loop is a bit trickier... When you get frustrated, look at the hint. And when you get really frustrated, watch the video in the solution.

• View the resulting performance with data/Plot_MT_performance_8x6.mlx, uncommenting the appropriate sections.

Hint. Parallelizing the third loop is a bit trickier. Likely, you started by inserting the #pragma statement and got the wrong answer. The problem lies with the fact that all threads end up packing a different block of A into the same buffer Ã. How do you overcome this? Solution.

YouTube: https://www.youtube.com/watch?v=ijN0ORHroRo

• Assignments/Week4/Answers/Gemm_MT_Loop3_MRxNRKernel_simple.c

On Robert’s laptop (using 4 threads):

YouTube: https://www.youtube.com/watch?v=Mwv1RpEoU4Q
So what is the zigzagging in the performance graph all about? This has to do with the fact that m/MC is relatively small. If m = 650, MC = 72 (which is what it is set to in the makefile), and we use 4 threads, then m/MC ≈ 9.03 and hence two threads are assigned the computation related to two blocks of A, one thread is assigned three such blocks, and one thread is assigned a little more than two blocks. This causes load imbalance, which is the reason for the zigzagging in the curve.
So, you need to come up with a way so that most computation is performed using full blocks of A (with MC rows each) and the remainder is evenly distributed amongst the threads.
Homework 4.3.4.2
• Copy Gemm_MT_Loop3_MRxNRKernel.c into Gemm_MT_Loop3_MRxNRKernel_simple.c to back up the simple implementation. Now go back to Gemm_MT_Loop3_MRxNRKernel.c and modify it so as to smooth the performance, inspired by the last video.

• Execute it with

export OMP_NUM_THREADS=4

make MT_Loop3_8x6Kernel

• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_performance_8x6.mlx.

• You may want to store the new version, once it works, in Gemm_MT_Loop3_MRxNRKernel_smooth.c.

Solution.

• Assignments/Week4/Answers/Gemm_MT_Loop3_MRxNRKernel.c

On Robert’s laptop (using 4 threads):

What you notice is that the performance is much smoother. Each thread now fills most of its own L2 cache with a block of A. They share the same row panel of B (packed into B̃). Notice that the packing of that panel of B is performed by a single thread. We’ll get back to that in Section 4.4.
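The key idea the hint pointed at, giving each thread its own workspace rather than sharing one packing buffer, can be illustrated outside the course's code. In this standalone sketch every name, size, and the trivial "computation" are made up; it is not the course's solution file. Each thread packs its own block into a private buffer selected by its thread number before working with it:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define MB 72      /* rows per block; made-up value */

int main( void )
{
  int m = 650, k = 80;
  double *A    = malloc( (size_t) m * k * sizeof( double ) );
  double *sums = calloc( ( m + MB - 1 ) / MB, sizeof( double ) );

  for ( int p=0; p<m*k; p++ ) A[ p ] = 1.0;

  /* One private packing buffer per thread, so threads do not overwrite
     each other's packed data (the bug the hint warns about). */
  int nthreads = omp_get_max_threads();
  double *buffers = malloc( (size_t) nthreads * MB * k * sizeof( double ) );

  #pragma omp parallel for
  for ( int i=0; i<m; i+=MB ) {
    int ib = ( MB < m-i ? MB : m-i );
    double *Apacked = &buffers[ (size_t) omp_get_thread_num() * MB * k ];

    /* "Pack": copy rows i,...,i+ib-1 of column-major A contiguously. */
    for ( int p=0; p<k; p++ )
      for ( int ii=0; ii<ib; ii++ )
        Apacked[ p*ib + ii ] = A[ p*m + i + ii ];

    /* "Compute" with the packed block (here: just sum its entries). */
    double s = 0.0;
    for ( int q=0; q<ib*k; q++ ) s += Apacked[ q ];
    sums[ i/MB ] = s;
  }

  printf( "block 0 sum = %f\n", sums[ 0 ] );
  free( A ); free( sums ); free( buffers );
  return 0;
}

In the course's code the same effect can be achieved by, for example, sizing Ã to hold one block per thread and having each thread index into its own portion, or by allocating the block inside the parallel region; the video walks through the actual solution.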

4.3.5 Parallelizing the fourth loop around the micro-kernel

Next, consider how to parallelize the fourth loop around the micro-kernel:

We can describe this picture by partitioning A into column panels and B into row panels:
$$
C := \left(\begin{array}{c|c|c} A_0 & \cdots & A_{K-1} \end{array}\right)
\left(\begin{array}{c} B_0 \\ \vdots \\ B_{K-1} \end{array}\right) + C
= A_0 B_0 + \cdots + A_{K-1} B_{K-1} + C.
$$
Given, for example, four threads, the thread with id 0 can compute
$$
A_0 B_0 + A_4 B_4 + \cdots + C
$$
while the thread with id 1 computes
$$
A_1 B_1 + A_5 B_5 + \cdots
$$
and so forth.
Homework 4.3.5.1 In directory Assignments/Week4/C,
• Copy Gemm_MT_Loop3_MRxNRKernel.c into Gemm_MT_Loop4_MRxNRKernel.c.

• Modify it so that only the fourth loop around the micro-kernel is parallelized.

• Execute it with

export OMP_NUM_THREADS=4
make MT_Loop4_8x6Kernel

• Be sure to check if you got the right answer! Parallelizing this loop is a bit trickier... Even trickier than parallelizing the third loop around the micro-kernel... Don’t spend too much time on it before looking at the hint.

• View the resulting performance with data/Plot_MT_performance_8x6.mlx, uncommenting the appropriate sections.

Hint. Initially you may think that parallelizing this is a matter of changing how the temporary space B̃ is allocated and used. But then you find that even that doesn’t give you the right answer... The problem is that all iterations update the same part of C and this creates something called a "race" condition.

To update C, you read a micro-tile of C into registers, you update it, and you write it out. What if, in the meantime, another thread reads the same micro-tile of C? It would be reading the old data... The fact is that Robert has yet to find the courage to do this homework himself... Perhaps you too should skip it.
Solution. I have no solution to share at this moment... The problem is that in the end all these contributions need to be added together for the final result. Since all threads update the same array in which C is stored, you never know whether a given thread starts updating an old copy of a micro-tile of C while another thread is still updating it. This is known as a race condition.
This doesn’t mean it can’t be done, nor that it isn’t worth pursuing. Think of a matrix-matrix multiplication where m and n are relatively small and k is large. Then this fourth loop around the micro-kernel has more potential for parallelism than the other loops. It is just more complicated and hence should wait until you learn more about OpenMP (in some other course).
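The race condition being described is easy to reproduce in a few lines. In the following standalone example (unrelated to the course's code), many iterations add to the same shared accumulators. Without synchronization, the read-modify-write sequences of different threads interleave and the result is typically wrong; marking the update atomic fixes it.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main( void )
{
  double unsafe = 0.0, safe = 0.0;

  /* Each iteration adds 1.0 to two shared accumulators. */
  #pragma omp parallel for
  for ( int i=0; i<N; i++ ) {
    unsafe = unsafe + 1.0;          /* race: concurrent read-modify-write */

    #pragma omp atomic
    safe += 1.0;                    /* atomic update: no race */
  }

  printf( "expected %d:  unsafe = %.0f   safe = %.0f\n", N, unsafe, safe );
  return 0;
}

With more than one thread, "unsafe" usually comes out smaller than N. In the fourth loop around the micro-kernel, the analogous cure would presumably be to have each thread accumulate its contribution into a private copy of (its part of) C and add the copies together at the end, or to protect each micro-tile update; both add complexity and overhead, which is why the text suggests deferring this.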

4.3.6 Parallelizing the fifth loop around the micro-kernel Finally, let us consider how to parallelize the fifth loop around the micro-kernel:

This time, the situation is very similar to those considered in Unit 4.1.1 (parallelizing the JPI loop ordering) and Unit 4.3.3 (parallelizing the second loop around the micro-kernel). Here we again partition
$$
\left(\begin{array}{c|c|c} C_0 & \cdots & C_{N-1} \end{array}\right) :=
A \left(\begin{array}{c|c|c} B_0 & \cdots & B_{N-1} \end{array}\right) +
\left(\begin{array}{c|c|c} C_0 & \cdots & C_{N-1} \end{array}\right) =
\left(\begin{array}{c|c|c} A B_0 + C_0 & \cdots & A B_{N-1} + C_{N-1} \end{array}\right)
$$
and observe that the updates of the submatrices of C can happen in parallel.
Homework 4.3.6.1 In directory Assignments/Week4/C,
• Copy Gemm_MT_Loop2_MRxNRKernel.c into Gemm_MT_Loop5_MRxNRKernel.c.

• Modify it so that only the fifth loop around the micro-kernel is parallelized.

• Execute it with

export OMP_NUM_THREADS=4
make MT_Loop5_8x6Kernel

• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_performance_8x6.mlx, uncommenting the appropriate sections.

Solution.

YouTube: https://www.youtube.com/watch?v=U9GqqOInHSA

• Assignments/Week4/Answers/Gemm_MT_Loop5x_MRxNRKernel.c

On Robert’s laptop (using 4 threads):

YouTube: https://www.youtube.com/watch?v=Pbt3525NixI
The zigzagging in the performance graph is now so severe that you can’t even see it yet by the time you test matrices with m = n = k = 2000. This has to do with the fact that n/NC is now very small. If n = 2000, NC = 2016 (which is what it is set to in the makefile), and we use 4 threads, then n/NC ≈ 1 and hence only one thread is assigned any computation.
So, you need to come up with a way so that most computation is performed using full blocks of C (with NC columns each) and the remainder is evenly distributed amongst the threads.
Homework 4.3.6.2
• Copy Gemm_MT_Loop5_MRxNRKernel.c into Gemm_MT_Loop5_MRxNRKernel_simple.c to back up the simple implementation. Now go back to Gemm_MT_Loop5_MRxNRKernel.c and modify it so as to smooth the performance, inspired by the last video.

• Execute it with

export OMP_NUM_THREADS=4
make MT_Loop5_8x6Kernel

• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_performance_8x6.mlx.

Solution.

• Assignments/Week4/Answers/Gemm_MT_Loop5_MRxNRKernel_smooth.c

On Robert’s laptop (using 4 threads):

What we notice is that parallelizing the fifth loop gives good performance when using four threads. However, once one uses a lot of threads, the part of the row panel of B assigned to each thread becomes small, which means that any overhead associated with packing and/or bringing a block of A into the L2 cache may not be amortized over enough computation.

4.3.7 Discussion

If we also show the performance on Robert’s laptop of the reference implementation when using multiple threads (4 in this graph), we get

The difference in performance between the multithreaded version of the reference implementation and the performance of, for example, the version of our implementation that parallelizes the fifth loop around the micro-kernel may be due to the fact that our implementation runs "hotter" in the sense that it exercises the cores to the point where the architecture "clocks down" (reduces the clock speed to reduce power consumption). It will be interesting to see what happens on your computer. It requires further investigation that goes beyond the scope of this course. For now, let’s be satisfied with the excellent performance improvements we have achieved.
Here is a graph that shows GFLOPS along the y-axis instead of GFLOPS/thread. Remember: we started by only getting a few GFLOPS back in Week 1, so we have improved performance by two orders of magnitude!

You can generate a similar graph for your own implementations with the Live Script in Plot_MT_Aggregate_GFLOPS_8x6.mlx.

4.4 Parallelizing More

4.4.1 Speedup, efficiency, etc.

4.4.1.1 Sequential time We denote the sequential time for an operation by

T(n), where n is a parameter that captures the size of the problem being solved. For example, if we only time problems with square matrices, then n would be the size of the matrices. T(n) is equal to the time required to compute the operation on a single core. If we consider all three sizes of the matrices, then n might be a vector with three entries.
A question is "Which sequential time?" Should we use the time by a specific implementation? Should we use the time for the fastest implementation? That’s a good question! Usually it is obvious which to use from context.

4.4.1.2 Parallel time The parallel time for an operation and a given implementation using t threads is given by T_t(n).

This allows us to define the sequential time for a specific implementation by T_1(n). It is the sequential time of the parallel implementation when using only one thread.

4.4.1.3 Speedup In the launch we already encountered the concept of speedup. There, we used it more informally. The speedup for an operation and a given implementation using t threads is given by S_t(n) = T(n)/T_t(n). It measures how much speedup is attained by using t threads.
Often, the ratio T_1(n)/T_t(n) is used to report speedup (as we did in Unit 4.1.1). We now understand that this may give an optimistic result: if the sequential execution of the given implementation yields poor performance, it is still possible to attain very good speedup. The fair thing is to use T(n) of the best implementation when computing speedup.
Remark 4.4.1 Intuitively, it would seem that

S_t(n) = T(n)/T_t(n) ≤ t
when you have at least t cores. In other words, that speedup is less than or equal to the number of threads/cores at your disposal. After all, it would seem that T(n) ≤ t T_t(n), since one could create a sequential implementation from the parallel implementation by mapping all computations performed by the threads to one thread, and this would remove the overhead of coordinating between the threads.
Let’s use a simple example to show that speedup can be greater than t: generally speaking, when you make a bed with two people, it takes less than half the time than when you do it by yourself. Why? Because usually two people have two sets of hands and the "workers" can therefore be on both sides of the bed simultaneously. Overhead incurred when you make the bed by yourself (e.g., walking from one side to the other) is avoided. The same is true when using t threads on t cores: the task to be completed now has t sets of registers, t L1 caches, etc.
Attaining better than a factor t speedup with t threads and cores is called super-linear speedup. It should be viewed as the exception rather than the rule.

4.4.1.4 Efficiency The efficiency for an operation and a given implementation using t threads is given by E_t(n) = S_t(n)/t. It measures how efficiently the t threads are being used.
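As a small worked example with made-up timings: if the best implementation takes T(n) = 8 seconds sequentially and a parallel implementation takes T_4(n) = 2.5 seconds with t = 4 threads, then S_4(n) = 8/2.5 = 3.2 and E_4(n) = 3.2/4 = 0.8; in other words, the four threads are used at 80% efficiency.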

4.4.1.5 GFLOPS/thread We like to report GFLOPS (billions of floating point operations per second). Notice that this gives us an easy way of judging efficiency: If we know what the theoretical peak of a core is, in terms of GFLOPS, then we can easily judge how efficiently we are using the processor relative to the theoretical peak. How do we compute the theoretical peak? We can look up the number of cycles a core can perform per second and the number of floating point operations it can perform per cycle. This can then be converted to a peak rate of computation for a single core (in billions of floating point operations per second). If there are multiple cores, we just multiply this peak by the number of cores. Remark 4.4.2 For current processors, it is almost always the case that as more threads are used, the frequency (cycles/second) of the cores is "clocked down" to keep the processor from overheating. This, obviously, complicates matters...
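As a made-up example of this arithmetic: a core that runs at 2.5 GHz (2.5 × 10^9 cycles per second) and can perform 16 double-precision floating point operations per cycle has a theoretical peak of 2.5 × 16 = 40 GFLOPS, and four such cores have an aggregate peak of 4 × 40 = 160 GFLOPS (ignoring the clocking down mentioned in the remark). An implementation that achieves 32 GFLOPS per core on such a machine is thus running at 80% of peak.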

4.4.2 Amdahl’s law

YouTube: https://www.youtube.com/watch?v=pT_f4ngiAl0
Amdahl’s law is a simple observation about the limits on parallel speedup and efficiency. Consider an operation that requires sequential time

T for computation. What if fraction f of this time is not parallelized (or cannot be parallelized), and the other fraction (1 − f) is parallelized with t threads? We can then write

$$
T = \underbrace{(1-f)T}_{\mbox{part that is parallelized}} + \underbrace{fT}_{\mbox{part that is not parallelized}}
$$
and, assuming super-linear speedup is not expected, the total execution time with t threads, T_t, obeys the inequality
$$
T_t \geq \frac{(1-f)T}{t} + fT \geq fT.
$$
This, in turn, means that the speedup is bounded by
$$
S_t = \frac{T}{T_t} \leq \frac{T}{fT} = \frac{1}{f}.
$$
What we notice is that the maximal speedup is bounded by the inverse of the fraction of the execution time that is not parallelized. If, for example, 1/5 of the code is not optimized, then no matter how many threads you use, you can at best compute the result 1/(1/5) = 5 times faster. This means that one must think about parallelizing all parts of a code.
Remark 4.4.3 Amdahl’s law says that if one does not parallelize a part of the sequential code in which fraction f of the time is spent, then the speedup attained, regardless of how many threads are used, is bounded by 1/f.
The point is: so far, we have focused on parallelizing the computational part of matrix-matrix multiplication. We should also parallelize the packing, since it requires a nontrivial part of the total execution time.
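To make the bound concrete with made-up numbers: if f = 0.1 (10% of the sequential time is not parallelized) and t = 16, then T_16 ≥ 0.9T/16 + 0.1T ≈ 0.156T, so S_16 ≤ 1/0.156 ≈ 6.4, well below both 16 and the asymptotic bound 1/f = 10 that holds as t grows without bound.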

YouTube: https://www.youtube.com/watch?v=EKnJEoGEo6g
For parallelizing matrix-matrix multiplication, there is some hope. In our situation, the execution time is a function of the problem size, which we can model by the computation time plus overhead. A high-performance implementation may have a cost function that looks something like

$$
T(n) = 2 n^3 \gamma + C n^2 \beta,
$$
where γ is the time required for executing a floating point operation, β is some constant related to the time required to move a floating point number, and C is some other constant. (Notice that we have not done a thorough analysis of the cost of the final implementation in Week 3, so this formula is for illustrative purposes.) If now the computation is parallelized perfectly among the threads and, in a pessimistic situation, the time related to moving data is not parallelized at all, then we get that
$$
T_t(n) = \frac{2 n^3 \gamma}{t} + C n^2 \beta,
$$
so that the speedup is given by

$$
S_t(n) = \frac{2 n^3 \gamma + C n^2 \beta}{\frac{2 n^3 \gamma}{t} + C n^2 \beta}
$$
and the efficiency is given by

$$
E_t(n) = \frac{S_t(n)}{t} = \frac{2 n^3 \gamma + C n^2 \beta}{2 n^3 \gamma + t C n^2 \beta}
= \frac{2 n^3 \gamma}{2 n^3 \gamma + t C n^2 \beta} + \frac{C n^2 \beta}{2 n^3 \gamma + t C n^2 \beta}
$$

$$
= \frac{1}{1 + \frac{t C n^2 \beta}{2 n^3 \gamma}} + \frac{1}{\frac{2 n^3 \gamma}{C n^2 \beta} + t}
= \frac{1}{1 + \left(\frac{t C \beta}{2 \gamma}\right) \frac{1}{n}} + \frac{1}{\left(\frac{2 \gamma}{C \beta}\right) n + t}.
$$

Now, if t is also fixed, then the expressions in parentheses are constants and, as n gets large, the first term converges to 1 while the second term converges to 0. So, as the problem size gets large, the efficiency starts approaching 100%. What this means is that whether all or part of the overhead can also be parallelized affects how fast we start seeing high efficiency rather than whether we eventually attain high efficiency for a large enough problem. Of course, there is a limit to how large of a problem we can store and/or how large of a problem we actually want to compute.
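To illustrate with constants made up purely for the sake of arithmetic: suppose Cβ/(2γ) = 50 and t = 4 in the expression above. For n = 1000 the efficiency is approximately 1/(1 + 0.2) + 1/(20 + 4) ≈ 0.83 + 0.04 = 0.87, while for n = 10000 it is approximately 1/(1 + 0.02) + 1/(200 + 4) ≈ 0.98 + 0.005 ≈ 0.99: the efficiency does approach 100% as n grows, but how quickly it does so depends on the constants.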

4.4.3 Parallelizing the packing Since not parallelizing part of the computation can translate into a slow ramping up of performance, it is worthwhile to consider what else perhaps needs to be parallelized. Packing can contribute significantly to the overhead (although we have not analyzed this, nor measured it), and hence is worth a look. We are going to look at each loop in our "five loops around the micro-kernel" and reason through whether parallelizing the packing of the block of A and/or the packing of the row panel of B should be considered.

4.4.3.1 Loop two and parallelizing the packing Let’s start by considering the case where the second loop around the micro-kernel has been parallelized. Notice that all packing happens before this loop is reached. This means that, unless one explicitly parallelizes the packing of the block of A and/or the packing of the row panel of B, these components of the computation are performed by a single thread.
Homework 4.4.3.1
• Copy PackA.c into MT_PackA.c. Change this file so that packing is parallelized.
• Execute it with
export OMP_NUM_THREADS=4
make MT_Loop2_MT_PackA_8x6Kernel
• Be sure to check if you got the right answer! Parallelizing the packing, the way PackA.c is written, is a bit tricky.
• View the resulting performance with data/Plot_MT_Loop2_MT_Pack_8x6.mlx.
Solution.

YouTube: https://www.youtube.com/watch?v=xAopdjhGXQA

• Assignments/Week4/Answers/MT_PackA.c

On Robert’s laptop (using 4 threads), the performance is not noticeably changed:

Homework 4.4.3.2 • Copy PackB.c into MT_PackB.c. Change this file so that packing is parallelized.

• Execute it with

export OMP_NUM_THREADS=4
make MT_Loop2_MT_PackB_8x6Kernel

• Be sure to check if you got the right answer! Again, parallelizing the packing, the way PackB.c is written, is a bit tricky.

• View the resulting performance with data/Plot_MT_Loop2_MT_Pack_8x6.mlx.

Solution.

• Assignments/Week4/Answers/MT_PackB.c

On Robert’s laptop (using 4 threads), the performance is again not noticeably changed:

Homework 4.4.3.3 Now that you have parallelized both the packing of the block of A and the packing of the row panel of B, you are set to check if doing both shows a benefit. • Execute

export OMP_NUM_THREADS=4
make MT_Loop2_MT_PackAB_8x6Kernel

• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_Loop2_MT_Pack_8x6.mlx.

Solution. On Robert’s laptop (using 4 threads), the performance is still not noticeably changed:

Why don’t we see an improvement? Packing is a memory intensive task. Depending on how much bandwidth there is between cores and memory, packing with a single core (thread) may already saturate that bandwidth. In that case, parallelizing the operation so that multiple cores are employed does not actually speed up the process. It appears that on Robert’s laptop, the bandwidth is indeed saturated. Those with access to beefier processors with more bandwidth to memory may see some benefit from parallelizing the packing, especially when utilizing more cores.
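As a standalone illustration of what "parallelizing the packing" means: one common reason a packing loop resists a naive #pragma is that the destination pointer is advanced incrementally inside the loop (whether that is exactly what makes the course's PackA.c tricky, we leave to the video). If instead the destination offset is computed from the loop index, the loop over micro-panels can be handed to OpenMP directly. Everything below is made up (names, sizes, and the simplifying assumption that m is a multiple of MR; handling of a partial micro-panel is omitted), and the exact packed format in the course's routine may differ in details.

#include <stdio.h>
#include <stdlib.h>

#define MR 8    /* rows per micro-panel; made-up value */

/* Pack an m x k block of A (column-major, leading dimension ldA) into
   micro-panels of MR rows each.  The destination offset i*k is computed
   from the loop index, so the iterations are independent and the loop
   can be parallelized. */
void PackBlockA( int m, int k, const double *A, int ldA, double *Apacked )
{
  #pragma omp parallel for
  for ( int i=0; i<m; i+=MR )
    for ( int p=0; p<k; p++ )
      for ( int ii=0; ii<MR; ii++ )
        Apacked[ i*k + p*MR + ii ] = A[ p*ldA + i + ii ];
}

int main( void )
{
  int m = 72, k = 32, ldA = 100;
  double *A       = malloc( (size_t) ldA * k * sizeof( double ) );
  double *Apacked = malloc( (size_t) m * k * sizeof( double ) );

  for ( int p=0; p<ldA*k; p++ ) A[ p ] = (double) p;

  PackBlockA( m, k, A, ldA, Apacked );
  printf( "first packed entry: %f\n", Apacked[ 0 ] );

  free( A ); free( Apacked );
  return 0;
}

Compile with -fopenmp to enable the directive; without it, the pragma is simply ignored and the code runs sequentially.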

4.4.3.2 Loop three and parallelizing the packing Next, consider the case where the third loop around the micro-kernel has been parallelized. Now, each thread packs a different block of A and hence there is no point in parallelizing the packing of that block. The packing of the row panel of B happens before the third loop around the micro-kernel is reached, and hence one can consider parallelizing that packing.
Homework 4.4.3.4 You already parallelized the packing of the row panel of B in Homework 4.4.3.2.
• Execute

export OMP_NUM_THREADS=4
make MT_Loop3_MT_PackB_8x6Kernel

• Be sure to check if you got the right answer!

• View the resulting performance with data/Plot_MT_Loop3_MT_Pack_8x6.mlx.

Solution. On Robert’s laptop (using 4 threads), the performance is again not noticeably changed:

4.4.3.3 Loop five and parallelizing the packing Next, consider the case where the fifth loop around the micro-kernel has been parallelized. Now, each thread packs a different row panel of B and hence there is no point in parallelizing the packing of that panel. As each thread executes subsequent loops (loops four through one around the micro-kernel), they pack blocks of A redundantly, since each allocates its own space for the packed block. It should be possible to have them collaborate on packing a block, but that would require considerable synchronization between loops... Details are beyond this course. If you know a bit about OpenMP, you may want to try this idea.

4.4.4 Parallelizing multiple loops In Section 4.3, we only considered parallelizing one loop at a time. When one has a processor with many cores, and hence has to use many threads, it may become beneficial to parallelize multiple loops. The reason is that there is only so much parallelism to be had in any one of the m, n, or k sizes. At some point, computing with matrices that are small in some dimension starts affecting the ability to amortize the cost of moving data. OpenMP allows for "nested parallelism." Thus, one possibility would be to read up on how OpenMP allows one to control how many threads to use in a parallel section, and to then use that to achieve parallelism from multiple loops. Another possibility is to create a parallel section (with #pragma omp parallel rather than #pragma omp parallel for) when one first enters the "five loops around the micro-kernel," and to then explicitly control which thread does what work at appropriate points in the nested loops. In [25]

• Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee, Anatomy of High-Performance Many-Threaded Matrix Multiplication, in the proceedings of the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014), 2014.
the parallelization of various loops is discussed, as is the need to parallelize multiple loops when targeting "many-core" architectures like the Intel Xeon Phi (KNC) and IBM PowerPC A2 processors. The Intel Xeon Phi has 60 cores, each of which has to execute four "hyperthreads" in order to achieve the best performance. Thus, the implementation has to exploit 240 threads...
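A minimal standalone sketch of the second possibility, one parallel region in which each thread computes, from its thread number, which part of a conceptual two-dimensional partition of the work it owns, is given below. It is only an illustration of the mechanism; it is not taken from the course's code or from [25], and the M x N grid of "tasks" is made up.

#include <stdio.h>
#include <omp.h>

int main( void )
{
  int M = 4, N = 6;   /* a conceptual M x N grid of independent tasks */

  #pragma omp parallel
  {
    int nthreads = omp_get_num_threads();
    int tid      = omp_get_thread_num();

    /* Arrange the threads in a rows x cols grid with rows*cols == nthreads. */
    int rows = ( nthreads % 2 == 0 ? 2 : 1 );
    int cols = nthreads / rows;
    int tr = tid / cols;          /* this thread's row in the thread grid    */
    int tc = tid % cols;          /* this thread's column in the thread grid */

    /* Round-robin split of both task dimensions: each (i,j) is handled by
       exactly one thread. */
    for ( int i=tr; i<M; i+=rows )
      for ( int j=tc; j<N; j+=cols )
        printf( "thread %d (grid %d,%d) handles task (%d,%d)\n",
                tid, tr, tc, i, j );
  }
  return 0;
}

In an actual implementation the (i,j) tasks might correspond to combinations of iterations of, say, the fifth and third loops around the micro-kernel, so that both the row panels of B and the blocks of A are split among the threads.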

4.5 Enrichments

4.5.1 Casting computation in terms of matrix-matrix multiplication
A key characteristic of matrix-matrix multiplication that allows it to achieve high performance is that it performs O(n^3) computations with O(n^2) data (if m = n = k). This means that if n is large, it may be possible to amortize the cost of moving a data item between memory and faster memory over O(n) computations.
There are many computations that also have this property. Some of these are minor variations on matrix-matrix multiplications, like those that are part of the level-3 BLAS mentioned in Unit 1.5.1. Others are higher level operations with matrices, like the LU factorization of an n × n matrix we discuss later in this unit. For each of these, one could in principle apply techniques you experienced in this course. But you can imagine that this could get really tedious. In the 1980s, it was observed that many such operations can be formulated so that most computations are cast in terms of a matrix-matrix multiplication [7] [15] [1]. This meant that software libraries with broad functionality could be created without the burden of going through all the steps to optimize each routine at a low level.
Let us illustrate how this works for the LU factorization, which, given an n × n matrix A, computes a unit lower triangular matrix L and an upper triangular matrix U such that A = LU. If we partition
$$
A \rightarrow \left(\begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22} \end{array}\right),
\quad
L \rightarrow \left(\begin{array}{cc} L_{11} & 0 \\ L_{21} & L_{22} \end{array}\right),
\quad \mbox{and} \quad
U \rightarrow \left(\begin{array}{cc} U_{11} & U_{12} \\ 0 & U_{22} \end{array}\right),
$$
where A11, L11, and U11 are b × b, then A = LU means that
$$
\left(\begin{array}{cc} A_{11} & A_{12} \\ A_{21} & A_{22} \end{array}\right)
=
\left(\begin{array}{cc} L_{11} & 0 \\ L_{21} & L_{22} \end{array}\right)
\left(\begin{array}{cc} U_{11} & U_{12} \\ 0 & U_{22} \end{array}\right)
=
\left(\begin{array}{cc} L_{11} U_{11} & L_{11} U_{12} \\ L_{21} U_{11} & L_{21} U_{12} + L_{22} U_{22} \end{array}\right).
$$

If one manipulates this, one finds that

$$
A_{11} = L_{11} U_{11}, \quad A_{12} = L_{11} U_{12}, \quad A_{21} = L_{21} U_{11}, \quad A_{22} - L_{21} U_{12} = L_{22} U_{22},
$$
which then leads to the steps

• Compute the LU factorization of the smaller matrix A11 → L11U11.

Upon completion, we know unit lower triangular L11 and upper triangular matrix U11.

• Solve L11U12 = A12 for U12. This is known as a "triangular solve with multiple right-hand sides," which is an operation supported by the level-3 BLAS.

Upon completion, we know U12.

• Solve L21U11 = A21 for L21. This is a different case of "triangular solve with multiple right-hand sides," also supported by the level-3 BLAS.

Upon completion, we know L21.

• Update A22 := A22 − L21U12.

• Since before this update A22 − L21U12 = L22U22, this process can now continue with A = A22.

These steps are presented as an algorithm in Figure 4.5.1, using the "FLAME" notation developed as part of our research and used in our other two MOOCs (Linear Algebra: Foundations to Frontiers [21] and LAFF-On Programming for Correctness [20]). This leaves matrix A overwritten with the unit lower triangular matrix L below the diagonal (with the ones on the diagonal implicit) and U on and above the diagonal.
The important insight is that if A is n × n, then factoring A11 requires approximately (2/3) b^3 flops, computing U12 and L21 each require approximately b^2 (n − b) flops, and updating A22 := A22 − L21U12 requires approximately 2 b (n − b)^2 flops. If b ≪ n, then most computation is in the update A22 := A22 − L21U12, which we recognize as a matrix-matrix multiplication (except that the result of the matrix-matrix multiplication is subtracted rather than added to the matrix). Thus, most of the computation is cast in terms of a matrix-matrix multiplication. If it is fast, then the overall computation is fast once n is large.
If you don’t know what an LU factorization is, you will want to have a look at Week 6 of our MOOC titled "Linear Algebra: Foundations to Frontiers" [21]. There, it is shown that (Gaussian) elimination, which allows one to solve systems of linear equations, is equivalent to LU factorization.

A = LU_blk_var5(A)
Partition A → ( ATL ATR / ABL ABR ) where ATL is 0 × 0
while n(ABR) > 0
   Repartition ( ATL ATR / ABL ABR ) → ( A00 A01 A02 / A10 A11 A12 / A20 A21 A22 )
   A11 → L11U11, overwriting A11 with L11 and U11
   Solve L11U12 = A12, overwriting A12 with U12
   Solve L21U11 = A21, overwriting A21 with L21
   A22 := A22 − L21U12 (matrix-matrix multiplication)
   Continue with ( ATL ATR / ABL ABR ) ← ( A00 A01 A02 / A10 A11 A12 / A20 A21 A22 )
endwhile

Figure 4.5.1: Blocked LU factorization.

Algorithms like the one illustrated here for LU factorization are referred to as blocked algorithms. Unblocked algorithms are blocked algorithms for which the block size has been chosen to equal 1. Typically, for the smaller subproblem (the factorization of A11 in our example), an unblocked algorithm is used. A large number of unblocked and blocked algorithms for different linear algebra operations are discussed in the notes that we are preparing for our graduate level course "Advanced Linear Algebra: Foundations to Frontiers" [19], which will be offered as part of the UT-Austin Online Masters in Computer Science. We intend to offer it as an auditable course on edX as well.

A = LU_unb_var5(A)
Partition A → ( ATL ATR / ABL ABR ) where ATL is 0 × 0
while n(ABR) > 0
   Repartition ( ATL ATR / ABL ABR ) → ( A00 a01 A02 / a10^T α11 a12^T / A20 a21 A22 )
   α11 := υ11 = α11 (no-op)
   a12^T := u12^T = a12^T (no-op)
   l21 := a21/α11 = a21/υ11, overwriting a21 with l21
   A22 := A22 − a21 a12^T = A22 − l21 u12^T (rank-1 update)
   Continue with ( ATL ATR / ABL ABR ) ← ( A00 a01 A02 / a10^T α11 a12^T / A20 a21 A22 )
endwhile

Figure 4.5.2: Unblocked LU factorization.

A question you may ask yourself is "What block size, b, should be used in the algorithm (when n is large)?"
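To make the algorithms in Figure 4.5.1 and Figure 4.5.2 concrete, here is a minimal, self-contained sketch in C: column-major storage, no pivoting, no error checking, and no attempt at performance. The routine names and the alpha macro are ours, not the course's or a library's. The triangular solves and the update are written as plain loops where a real library would call optimized (level-3 BLAS) routines.

#include <stdio.h>
#include <stdlib.h>

#define alpha( i,j ) A[ (j)*ldA + (i) ]   /* map A(i,j) to column-major storage */

/* Unblocked right-looking LU factorization (variant 5, no pivoting). */
void LU_unb( int n, double *A, int ldA )
{
  for ( int j=0; j<n; j++ )
    for ( int i=j+1; i<n; i++ ) {
      alpha( i,j ) /= alpha( j,j );                     /* l21 := a21 / alpha11   */
      for ( int p=j+1; p<n; p++ )
        alpha( i,p ) -= alpha( i,j ) * alpha( j,p );    /* rank-1 update of A22   */
    }
}

/* Blocked LU (variant 5): factor A11, solve for U12 and L21, update A22. */
void LU_blk( int n, int b, double *A, int ldA )
{
  for ( int j=0; j<n; j+=b ) {
    int jb = ( b < n-j ? b : n-j );

    LU_unb( jb, &alpha( j,j ), ldA );                   /* A11 -> L11 U11         */

    /* Solve L11 U12 = A12 (L11 unit lower triangular), overwriting A12 with U12. */
    for ( int p=j+jb; p<n; p++ )
      for ( int i=j+1; i<j+jb; i++ )
        for ( int q=j; q<i; q++ )
          alpha( i,p ) -= alpha( i,q ) * alpha( q,p );

    /* Solve L21 U11 = A21 (U11 upper triangular), overwriting A21 with L21. */
    for ( int i=j+jb; i<n; i++ )
      for ( int p=j; p<j+jb; p++ ) {
        for ( int q=j; q<p; q++ )
          alpha( i,p ) -= alpha( i,q ) * alpha( q,p );
        alpha( i,p ) /= alpha( p,p );
      }

    /* A22 := A22 - L21 U12: a matrix-matrix multiplication (the bulk of the flops). */
    for ( int p=j+jb; p<n; p++ )
      for ( int i=j+jb; i<n; i++ )
        for ( int q=j; q<j+jb; q++ )
          alpha( i,p ) -= alpha( i,q ) * alpha( q,p );
  }
}

int main( void )
{
  int n = 200, b = 32, ldA = n;
  double *A     = malloc( (size_t) n * n * sizeof( double ) );
  double *Aorig = malloc( (size_t) n * n * sizeof( double ) );

  for ( int p=0; p<n*n; p++ ) A[ p ] = (double) rand() / RAND_MAX;
  for ( int i=0; i<n; i++ ) A[ i*ldA + i ] += n;   /* diagonally dominant: safe without pivoting */
  for ( int p=0; p<n*n; p++ ) Aorig[ p ] = A[ p ];

  LU_blk( n, b, A, ldA );

  /* Check: rebuild L*U from the overwritten A and compare with the original. */
  double maxdiff = 0.0;
  for ( int j=0; j<n; j++ )
    for ( int i=0; i<n; i++ ) {
      double lu = 0.0;
      for ( int p=0; p<n; p++ ) {
        double l = ( p <  i ? A[ p*ldA + i ] : ( p == i ? 1.0 : 0.0 ) );
        double u = ( p <= j ? A[ j*ldA + p ] : 0.0 );
        lu += l * u;
      }
      double d = lu - Aorig[ j*ldA + i ];
      if ( d < 0 ) d = -d;
      if ( d > maxdiff ) maxdiff = d;
    }
  printf( "max | A - L U | = %e\n", maxdiff );

  free( A ); free( Aorig );
  return 0;
}

Notice that for large n almost all of the flops are in the last triple loop, the update A22 := A22 − L21 U12, which is exactly the point of the blocked formulation.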

4.5.2 Family values

For most dense linear algebra operations, like LU factorization, there are multiple unblocked and corresponding blocked algorithms. Usually, each of the blocked algorithms casts most computation in terms of an operation that is a matrix-matrix multiplication, or a variant thereof. For example, there are five blocked algorithms (variants) for computing the LU factorization, captured in Figure 4.5.3.

A = LU_blk(A)
Partition A → ( ATL ATR / ABL ABR ) where ATL is 0 × 0
while n(ABR) > 0
   Repartition ( ATL ATR / ABL ABR ) → ( A00 A01 A02 / A10 A11 A12 / A20 A21 A22 )
   Variant 1:
      A01 := L00^{-1} A01
      A10 := A10 U00^{-1}
      A11 := A11 − A10A01
      A11 := LU(A11)
   Variant 2:
      A10 := A10 U00^{-1}
      A11 := A11 − A10A01
      A11 := LU(A11)
      A12 := A12 − A10A02
      A12 := L11^{-1} A12
   Variant 3:
      A10 := A10 U00^{-1}
      A11 := A11 − A10A01
      A11 := LU(A11)
      A21 := A21 − A20A01
      A21 := A21 U11^{-1}
   Variant 4:
      A11 := A11 − A10A01
      A11 := LU(A11)
      A12 := A12 − A10A02
      A12 := L11^{-1} A12
      A21 := A21 − A20A01
      A21 := A21 U11^{-1}
   Variant 5:
      A11 := LU(A11)
      A12 := L11^{-1} A12
      A21 := A21 U11^{-1}
      A22 := A22 − A21A12
   Continue with ( ATL ATR / ABL ABR ) ← ( A00 A01 A02 / A10 A11 A12 / A20 A21 A22 )
endwhile

Figure 4.5.3: Five blocked LU factorizations. Lii and Uii denote the unit lower triangular matrix and upper triangular matrix stored in Aii, respectively. Lii^{-1} Aij is shorthand for solving the triangular system with right-hand sides LiiUij = Aij, and Aij Ujj^{-1} is shorthand for solving the triangular system with right-hand sides LijUjj = Aij.

It turns out that different variants are best used under different circumstances.

• Variant 5, also known as Classical Gaussian Elimination and the algorithm in Figure 4.5.1, fits well with the matrix-matrix multiplication that we developed in this course, because it casts most computation in terms of a matrix-matrix multiplication with large matrix sizes m and n, and small matrix size k (which equals b): A22 := A22 − A21A12.
• Variant 3, also known as the left-looking variant, can be easily checkpointed (a technique that allows the computation to be restarted if, for example, a computer crashes partially into a long calculation).

• Variants 3, 4, and 5 can all be modified to incorporate "partial pivoting," which improves numerical stability (a topic covered in courses on numerical linear algebra).

Further details go beyond the scope of this course. But we will soon have a MOOC for that: Advanced Linear Algebra: Foundations to Frontiers! [19]. What we are trying to give you a flavor of is the fact that it pays to have a family of algorithms at one’s disposal so that the best algorithm for a situation can be chosen. The question then becomes "How do we find all algorithms in the family?" We have a MOOC for that too: LAFF-On Programming for Correctness [20].

4.5.3 Matrix-matrix multiplication for Machine Learning
Matrix-matrix multiplication turns out to be an operation that is frequently employed by algorithms in machine learning. In this unit, we discuss the all-nearest-neighbor problem. Knowing how to optimize matrix-matrix multiplication allows one to optimize a practical algorithm for computing it.
The k-nearest-neighbor problem (KNN) takes as input m points in R^n, {x_j}_{j=0}^{m-1}, and a reference point, x, and computes the k nearest neighbors of x among the m points. The all-nearest-neighbor (ANN) problem computes the k nearest neighbors of each of the points x_j.
The trick to computing ANN is to observe that we need to compute the distances between all points x_i and x_j, given by ‖x_i − x_j‖_2. But,

$$
\| x_i - x_j \|_2^2 = (x_i - x_j)^T (x_i - x_j) = x_i^T x_i - 2 x_i^T x_j + x_j^T x_j.
$$

So, if one creates the matrix
$$
X = \left(\begin{array}{cccc} x_0 & x_1 & \cdots & x_{m-1} \end{array}\right)
$$
and computes

$$
C = X^T X =
\left(\begin{array}{cccc}
x_0^T x_0 & ? & \cdots & ? \\
x_1^T x_0 & x_1^T x_1 & \cdots & ? \\
\vdots & \vdots & \ddots & \vdots \\
x_{m-1}^T x_0 & x_{m-1}^T x_1 & \cdots & x_{m-1}^T x_{m-1}
\end{array}\right).
$$

Hence, if the lower triangular part of C is computed, then

$$
\| x_i - x_j \|_2^2 = \gamma_{i,i} - 2 \gamma_{i,j} + \gamma_{j,j}.
$$

By sorting this information, the nearest neighbors for each x_i can be found. There are three problems with this:

• Only the lower triangular part of C needs to be computed. This operation is known as a symmetric rank-k update. (Here the k refers to the number of rows in X rather than the k in k-nearest-neighbors.) How Goto’s algorithm can be modified to compute the symmetric rank-k update is discussed in [11].

• Typically m ≫ n and hence the intermediate matrix C takes a very large amount of space to store an intermediate result.

• Sorting the resulting information in order to determine the k nearest neighbors for each point means reading and writing data multiple times from and to memory.

You can read up on how to achieve a high performance implementation for solving the all-nearest-neighbor problem by exploiting what we know about implementing matrix-matrix multiplication in [34]:

• Chenhan D. Yu, Jianyu Huang, Woody Austin, Bo Xiao, George Biros. Performance Optimization for the K-Nearest Neighbors Kernel on x86 Architectures, proceedings of SC’15, 2015.

4.5.4 Matrix-matrix multiplication on GPUs Matrix-matrix multiplications are often offloaded to GPU accelerators due to their cost-effective performance achieved with low power consumption. While we don’t target GPUs with the implementations explored in this course, the basic principles that you now know how to apply when programming CPUs also apply to the implementation of matrix-matrix multiplication on GPUs.
The recent CUDA Templates for Linear Algebra Subroutines (CUTLASS) [16] from NVIDIA expose how to achieve high-performance matrix-matrix multiplication on NVIDIA GPUs, coded in C++.

Figure 4.5.4: Left: picture that captures the nested computation as implemented in CUTLASS. Right: the picture that captures the BLIS "Five Loops around the micro-kernel, with packing."

The explanation in [16] is captured in Figure 4.5.4, taken from [13]

• Jianyu Huang, Chenhan D. Yu, and Robert A. van de Geijn, Strassen’s Algorithm Reloaded on GPUs, ACM Transactions on Mathematical Software, in review.
where it is compared to the by now familiar picture of the five loops around the micro-kernel. The main focus of the paper is the practical implementation on GPUs of Strassen’s algorithm from [14], discussed in Unit 3.5.4.

4.5.5 Matrix-matrix multiplication on distributed memory architectures We had intended to also cover the implementation of matrix-matrix multiplication on distributed memory architectures. We struggled with how to give learners access to a distributed memory computer and for this reason decided to not yet tackle this topic. Here we point you to some papers that will fill the void for now. These papers were written to target a broad audience, much like the materials in this course. Thus, you should already be well-equipped to study the distributed memory implementation of matrix-matrix multiplication on your own.
Think of distributed memory architectures as a collection of processors, each with their own cores, caches, and main memory, that can collaboratively solve problems by sharing data via the explicit sending and receiving of messages via a communication network. Such computers used to be called "multi-computers" to capture that each "node" in the system has its own processor and memory hierarchy.
Practical implementations of matrix-matrix multiplication on multi-computers are variations on the Scalable Universal Matrix-Multiplication Algorithm (SUMMA) [29]. While that paper should be easily understood upon completing this course, a systematic treatment of the subject that yields a large family of algorithms is given in [22]:

• Martin D. Schatz, Robert A. van de Geijn, and Jack Poulson, Parallel Matrix Multiplication: A Systematic Journey, SIAM Journal on Scientific Computing, Volume 38, Issue 6, 2016.
which should also be easily understood by learners in this course.
Much like it is important to understand how data moves between the layers in the memory hierarchy of a single processor, it is important to understand how to share data between the memories of a multi-computer. Although the above-mentioned papers give a brief overview of the data movements that are encountered when implementing matrix-matrix multiplication, orchestrated as "collective communication," it helps to look at this topic in depth. For this we recommend the paper [4]

• Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn, Collective communication: theory, practice, and experience, Concurrency and Computation: Practice and Experience, Volume 19, Number 13, 2007.

In the future, we may turn these materials into yet another Massive Open Online Course. Until then, enjoy the reading.

4.5.6 High-Performance Computing Beyond Matrix-Matrix Multiplication In this course, we took one example, and used that example to illustrate various issues encountered when trying to achieve high performance. The course is an inch wide and a mile deep! For materials that treat the subject of high performance computing more broadly, you may find [8] interesting:
• Victor Eijkhout, Introduction to High-Performance Scientific Computing.

4.6 Wrap Up

4.6.1 Additional exercises You may want to dive deeper into our BLAS-like Library Instantiation Software (BLIS) [3], discussed in Unit 1.5.2, which instantiates many of the fundamental insights you have encountered in this course in a high quality, high performing library. That library implements all of the BLAS functionality discussed in Unit 1.5.1, and more.

4.6.2 Summary
4.6.2.1 The week in pictures

4.6.2.2 OpenMP basics OpenMP is a standardized API (Application Programming Interface) for creating multiple threads of execution from a single program. For our purposes, you need to know very little:

• There is a header file to include in the file(s):

#include <omp.h>

• Directives to the compiler (pragmas):

#pragma omp parallel
#pragma omp parallel for

• A library of routines that can, for example, be used to inquire about the execution environment:

omp_get_max_threads()
omp_get_num_threads()
omp_get_thread_num()

• Environment parameters that can be set before executing the program:

export OMP_NUM_THREADS=4

If you want to see what the value of this environment parameter is, exe- cute

echo $OMP_NUM_THREADS
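Putting these pieces together, a minimal standalone program (not part of the course's code) that uses all of the above might look like:

#include <stdio.h>
#include <omp.h>

int main( void )
{
  /* Query the maximum number of threads (as set via OMP_NUM_THREADS). */
  printf( "max threads: %d\n", omp_get_max_threads() );

  /* Create a team of threads; each executes the body of the parallel region. */
  #pragma omp parallel
  {
    printf( "hello from thread %d of %d\n",
            omp_get_thread_num(), omp_get_num_threads() );
  }

  /* Distribute the iterations of a loop among the threads. */
  double x[ 8 ];
  #pragma omp parallel for
  for ( int i=0; i<8; i++ )
    x[ i ] = 2.0 * i;
  printf( "x[7] = %f\n", x[ 7 ] );

  return 0;
}

Compile with -fopenmp (gcc/clang) so that the pragmas are honored and the OpenMP runtime is linked.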

Appendix A

Appendix B

GNU Free Documentation License

Version 1.3, 3 November 2008 Copyright © 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

0. PREAMBLE. The purpose of this License is to make a manual, text- book, or other functional and useful document “free” in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others. This License is a kind of “copyleft”, which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software. We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.

1. APPLICABILITY AND DEFINITIONS. This License applies to any manual or other work, in any medium, that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. Such a notice grants a world-wide, royalty-free license, unlimited in duration, to use that work under the conditions stated herein. The “Document”, below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as “you”. You accept the license if you copy, modify or distribute

223 224 APPENDIX B. GNU FREE DOCUMENTATION LICENSE the work in a way requiring permission under copyright law. A “Modified Version” of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language. A “Secondary Section” is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document’s overall subject (or to related mat- ters) and contains nothing that could fall directly within that overall subject. (Thus, if the Document is in part a textbook of mathematics, a Secondary Sec- tion may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them. The “Invariant Sections” are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License. If a section does not fit the above definition of Secondary then it is not allowed to be designated as Invariant. The Document may contain zero Invariant Sections. If the Document does not identify any Invariant Sections then there are none. The “Cover Texts” are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Doc- ument is released under this License. A Front-Cover Text may be at most 5 words, and a Back-Cover Text may be at most 25 words. A “Transparent” copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, that is suitable for revising the document straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for draw- ings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup, or absence of markup, has been arranged to thwart or discourage subsequent modification by readers is not Transparent. An image format is not Transparent if used for any substantial amount of text. A copy that is not “Transparent” is called “Opaque”. Examples of suitable formats for Transparent copies include plain ASCII without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML, PostScript or PDF designed for human modification. Examples of transparent image for- mats include PNG, XCF and JPG. Opaque formats include proprietary for- mats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML, PostScript or PDF produced by some word processors for output purposes only. The “Title Page” means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, “Title Page” means the text near the most prominent appearance of the work’s title, preceding the beginning of the body of the text. The “publisher” means any person or entity that distributes copies of the 225

Document to the public. A section “Entitled XYZ” means a named subunit of the Document whose title either is precisely XYZ or contains XYZ in parentheses following text that translates XYZ in another language. (Here XYZ stands for a specific section name mentioned below, such as “Acknowledgements”, “Dedications”, “Endorsements”, or “History”.) To “Preserve the Title” of such a section when you modify the Document means that it remains a section “Entitled XYZ” according to this definition. The Document may include Warranty Disclaimers next to the notice which states that this License applies to the Document. These Warranty Disclaimers are considered to be included by reference in this License, but only as regards disclaiming warranties: any other implication that these Warranty Disclaimers may have is void and has no effect on the meaning of this License.

2. VERBATIM COPYING. You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3. You may also lend copies, under the same conditions stated above, and you may publicly display copies.

3. COPYING IN QUANTITY. If you publish printed copies (or copies in media that commonly have printed covers) of the Document, numbering more than 100, and the Document’s license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects. If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages. If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a computer- network location from which the general network-using public has access to download using public-standard network protocols a complete Transparent copy of the Document, free of added material. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque 226 APPENDIX B. GNU FREE DOCUMENTATION LICENSE copies in quantity, to ensure that this Transparent copy will remain thus ac- cessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public. It is requested, but not required, that you contact the authors of the Doc- ument well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.

4. MODIFICATIONS. You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:

A. Use in the Title Page (and on the covers, if any) a title distinct from that of the Document, and from those of previous versions (which should, if there were any, be listed in the History section of the Document). You may use the same title as a previous version if the original publisher of that version gives permission.

B. List on the Title Page, as authors, one or more persons or entities re- sponsible for authorship of the modifications in the Modified Version, together with at least five of the principal authors of the Document (all of its principal authors, if it has fewer than five), unless they release you from this requirement.

C. State on the Title page the name of the publisher of the Modified Version, as the publisher.

D. Preserve all the copyright notices of the Document.

E. Add an appropriate copyright notice for your modifications adjacent to the other copyright notices.

F. Include, immediately after the copyright notices, a license notice giving the public permission to use the Modified Version under the terms of this License, in the form shown in the Addendum below.

G. Preserve in that license notice the full lists of Invariant Sections and required Cover Texts given in the Document’s license notice.

H. Include an unaltered copy of this License.

I. Preserve the section Entitled “History”, Preserve its Title, and add to it an item stating at least the title, year, new authors, and publisher of the Modified Version as given on the Title Page. If there is no section Entitled “History” in the Document, create one stating the title, year, authors, and publisher of the Document as given on its Title Page, then add an item describing the Modified Version as stated in the previous sentence. 227

J. Preserve the network location, if any, given in the Document for public access to a Transparent copy of the Document, and likewise the network locations given in the Document for previous versions it was based on. These may be placed in the “History” section. You may omit a network location for a work that was published at least four years before the Document itself, or if the original publisher of the version it refers to gives permission.

K. For any section Entitled “Acknowledgements” or “Dedications”, Preserve the Title of the section, and preserve in the section all the substance and tone of each of the contributor acknowledgements and/or dedications given therein.

L. Preserve all the Invariant Sections of the Document, unaltered in their text and in their titles. Section numbers or the equivalent are not con- sidered part of the section titles.

M. Delete any section Entitled “Endorsements”. Such a section may not be included in the Modified Version.

N. Do not retitle any existing section to be Entitled “Endorsements” or to conflict in title with any Invariant Section.

O. Preserve any Warranty Disclaimers.

If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version’s license notice. These titles must be distinct from any other section titles. You may add a section Entitled “Endorsements”, provided it contains noth- ing but endorsements of your Modified Version by various parties — for ex- ample, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard. You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one. The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.

5. COMBINING DOCUMENTS. You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all 228 APPENDIX B. GNU FREE DOCUMENTATION LICENSE of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice, and that you preserve all their Warranty Disclaimers. The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work. In the combination, you must combine any sections Entitled “History” in the various original documents, forming one section Entitled “History”; likewise combine any sections Entitled “Acknowledgements”, and any sections Entitled “Dedications”. You must delete all sections Entitled “Endorsements”.

6. COLLECTIONS OF DOCUMENTS. You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.

You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.

7. AGGREGATION WITH INDEPENDENT WORKS. A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, is called an “aggregate” if the copyright resulting from the compilation is not used to limit the legal rights of the compilation’s users beyond what the individual works permit. When the Document is included in an aggregate, this License does not apply to the other works in the aggregate which are not themselves derivative works of the Document.

If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one half of the entire aggregate, the Document’s Cover Texts may be placed on covers that bracket the Document within the aggregate, or the electronic equivalent of covers if the Document is in electronic form. Otherwise they must appear on printed covers that bracket the whole aggregate.

8. TRANSLATION. Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License, and all the license notices in the Document, and any Warranty Disclaimers, provided that you also include the original English version of this License and the original versions of those notices and disclaimers. In case of a disagreement between the translation and the original version of this License or a notice or disclaimer, the original version will prevail.

If a section in the Document is Entitled “Acknowledgements”, “Dedications”, or “History”, the requirement (section 4) to Preserve its Title (section 1) will typically require changing the actual title.

9. TERMINATION. You may not copy, modify, sublicense, or distribute the Document except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, or distribute it is void, and will automatically terminate your rights under this License.

However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.

Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.

Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, receipt of a copy of some or all of the same material does not give you any rights to use it.

10. FUTURE REVISIONS OF THIS LICENSE. The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.

Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License “or any later version” applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation. If the Document specifies that a proxy can decide which future versions of this License can be used, that proxy’s public statement of acceptance of a version permanently authorizes you to choose that version for the Document.

11. RELICENSING. “Massive Multiauthor Collaboration Site” (or “MMC Site”) means any World Wide Web server that publishes copyrightable works and also provides prominent facilities for anybody to edit those works. A public wiki that anybody can edit is an example of such a server. A “Massive Multiauthor Collaboration” (or “MMC”) contained in the site means any set of copyrightable works thus published on the MMC site.

“CC-BY-SA” means the Creative Commons Attribution-Share Alike 3.0 license published by Creative Commons Corporation, a not-for-profit corporation with a principal place of business in San Francisco, California, as well as future copyleft versions of that license published by that same organization.

“Incorporate” means to publish or republish a Document, in whole or in part, as part of another Document.

An MMC is “eligible for relicensing” if it is licensed under this License, and if all works that were first published under this License somewhere other than this MMC, and subsequently incorporated in whole or in part into the MMC, (1) had no cover texts or invariant sections, and (2) were thus incorporated prior to November 1, 2008.

The operator of an MMC Site may republish an MMC contained in the site under CC-BY-SA on the same site at any time before August 1, 2009, provided the MMC is eligible for relicensing.

ADDENDUM: How to use this License for your documents. To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:

Copyright (C) YEAR YOUR NAME. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled "GNU Free Documentation License".

If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts, replace the “with...Texts.” line with this: with the Invariant Sections being LIST THEIR TITLES, with the Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST.

If you have Invariant Sections without Cover Texts, or some other combination of the three, merge those two alternatives to suit the situation.

If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.

References

[1] Ed Anderson, Zhaojun Bai, James Demmel, Jack J. Dongarra, Jeremy Du Croz, Anne Greenbaum, Sven Hammarling, Alan E. McKenney, Susan Ostrouchov, and Danny Sorensen, LAPACK Users’ Guide, SIAM, Philadelphia, 1992.
[2] Jeff Bilmes, Krste Asanović, Chee-whye Chin, and Jim Demmel, Optimizing Matrix Multiply using PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology, International Conference on Supercomputing, July 1997.
[3] BLAS-like Library Instantiation Software Framework, GitHub repository.
[4] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert van de Geijn, Collective communication: theory, practice, and experience, Concurrency and Computation: Practice and Experience, Volume 19, Number 13, 2007. If you don’t have access, you may want to read the advanced draft: Ernie Chan, Marcel Heimlich, Avijit Purkayastha, and Robert van de Geijn, Collective Communication: Theory, Practice, and Experience, FLAME Working Note #22, The University of Texas at Austin, Department of Computer Sciences, Technical Report TR-06-44, September 26, 2006.
[5] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain Duff, A Set of Level 3 Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol. 16, No. 1, pp. 1-17, March 1990.
[6] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson, An Extended Set of FORTRAN Basic Linear Algebra Subprograms, ACM Transactions on Mathematical Software, Vol. 14, No. 1, pp. 1-17, March 1988.
[7] Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia, PA, 1991.
[8] Victor Eijkhout, Introduction to High-Performance Scientific Computing, lulu.com. http://pages.tacc.utexas.edu/~eijkhout/istc/istc.html
[9] Kazushige Goto and Robert van de Geijn, On reducing TLB misses in matrix multiplication, Technical Report TR02-55, Department of Computer Sciences, UT-Austin, Nov. 2002.


[10] Kazushige Goto and Robert van de Geijn, Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software, Vol. 34, No. 3: Article 12, May 2008.
[11] Kazushige Goto and Robert van de Geijn, High-performance implementation of the level-3 BLAS, ACM Transactions on Mathematical Software, Vol. 35, No. 1: Article 4, July 2008.
[12] Jianyu Huang, Leslie Rice, Devin A. Matthews, and Robert A. van de Geijn, Generating Families of Practical Fast Matrix Multiplication Algorithms, in Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS17), Orlando, FL, May 29-June 2, 2017.
[13] Jianyu Huang, Chenhan D. Yu, and Robert A. van de Geijn, Strassen’s Algorithm Reloaded on GPUs, ACM Transactions on Mathematical Software, in review.
[14] Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn, Strassen’s Algorithm Reloaded, International Conference for High Performance Computing, Networking, Storage and Analysis (SC’16), 2016.
[15] Bo Kågström, Per Ling, and Charles Van Loan, GEMM-based Level 3 BLAS: High Performance Model Implementations and Performance Evaluation Benchmark, ACM Transactions on Mathematical Software, Vol. 24, No. 3: pp. 268-302, 1998.
[16] Andrew Kerr, Duane Merrill, Julien Demouth, and John Tran, CUDA Templates for Linear Algebra Subroutines (CUTLASS), NVIDIA Developer Blog, 2017. https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/.
[17] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Transactions on Mathematical Software, Vol. 5, No. 3, pp. 308-323, Sept. 1979.
[18] Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Orti, Analytical Modeling Is Enough for High-Performance BLIS, ACM Journal on Mathematical Software, Vol. 43, No. 2, Aug. 2016. PDF of draft available.
[19] Margaret E. Myers and Robert A. van de Geijn, Advanced Linear Algebra: Foundations to Frontiers, ulaff.net, in preparation.
[20] Margaret E. Myers and Robert A. van de Geijn, LAFF-On Programming for Correctness, ulaff.net, 2017.
[21] Margaret E. Myers and Robert A. van de Geijn, Linear Algebra: Foundations to Frontiers - Notes to LAFF With, ulaff.net, 2014.
[22] Martin D. Schatz, Robert A. van de Geijn, and Jack Poulson, Parallel Matrix Multiplication: A Systematic Journey, SIAM Journal on Scientific Computing, Volume 38, Issue 6, 2016.
[23] Tyler Michael Smith, Bradley Lowery, Julien Langou, and Robert A. van de Geijn, A Tight I/O Lower Bound for Matrix Multiplication, arxiv.org:1702.02017v2, 2019. (Submitted to ACM Transactions on Mathematical Software.)
[24] Tyler Smith and Robert van de Geijn, The MOMMS Family of Matrix Multiplication Algorithms, arxiv.org:1904.05717, 2019.
[25] Tyler M. Smith, Robert A. van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee, Anatomy of High-Performance Many-Threaded Matrix Multiplication, proceedings of the 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2014), 2014.
[26] Volker Strassen, Gaussian Elimination is not Optimal, Numer. Math. 13, pp. 354-356, 1969.
[27] Field G. Van Zee and Tyler M. Smith, Implementing High-performance Complex Matrix Multiplication via the 3M and 4M Methods, ACM Transactions on Mathematical Software, Vol. 44, No. 1, pp. 7:1-7:36, July 2017.
[28] Robert van de Geijn and Kazushige Goto, BLAS (Basic Linear Algebra Subprograms), Encyclopedia of Parallel Computing, Part 2, pp. 157-164, 2011. If you don’t have access, you may want to read an advanced draft.
[29] Robert van de Geijn and Jerrell Watts, SUMMA: Scalable Universal Matrix Multiplication Algorithm, Concurrency: Practice and Experience, Volume 9, Number 4, 1997.
[30] Field G. Van Zee, Implementing High-Performance Complex Matrix Multiplication via the 1m Method, ACM Journal on Mathematical Software, in review.
[31] Field G. Van Zee, Tyler Smith, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John Gunnels, Tze Meng Low, Bryan Marker, Lee Killough, and Robert A. van de Geijn, The BLIS Framework: Experiments in Portability, ACM Journal on Mathematical Software, Vol. 42, No. 2, June 2016. You can access this article for free by visiting the Science of High-Performance Computing group webpage and clicking on the title of Journal Article 39.
[32] Field G. Van Zee and Robert A. van de Geijn, BLIS: A Framework for Rapidly Instantiating BLAS Functionality, ACM Journal on Mathematical Software, Vol. 41, No. 3, June 2015. You can access this article for free by visiting the Science of High-Performance Computing group webpage and clicking on the title of Journal Article 39.
[33] Richard C. Whaley, Antoine Petitet, and Jack J. Dongarra, Automated Empirical Optimization of Software and the ATLAS Project, Parallel Computing, 27 (1–2): 3–35, 2001.
[34] Chenhan D. Yu, Jianyu Huang, Woody Austin, Bo Xiao, and George Biros, Performance Optimization for the K-Nearest Neighbors Kernel on x86 Architectures, proceedings of SC’15, 2015.

Index

kC, 141
mC, 141
mR, 141
nR, 141

autotuning, 167
AVX2 vector instructions, 101
Basic Linear Algebra Subprograms, 24, 61
BLAS, 24, 61
BLAS-like Library Instantiation Software, 62
BLIS, 62
BLIS Object-Based API, 63
BLIS Typed API, 63
blocked algorithm, 213
cache memory, 126
cache replacement policy, 127
conformal partitioning, 83
floating point unit, 99
FMA, 99
FPU, 99
fused multiply add, 99
Gemm, 16
Goto’s algorithm, 165
GotoBLAS algorithm, 165
hardware prefetching, 149
immintrin.h, 104
instruction-level parallelism, 100
Kazushige Goto, 165
kernel, 89
memops, 142
micro-kernel, 96
micro-panel, 138, 141
micro-tile, 138, 141
omp.h, 173
OpenMP, 173
replacement policy, 127
SIMD, 99, 107
Single Instruction, Multiple Data, 107
Single-Instruction, Multiple Data, 99
vector intrinsic functions, 101

Colophon

This book was authored in MathBook XML.