Accelerating the “Motifs” in Machine Learning on Modern Processors

Submitted in partial fulfillment of the requirements for

the degree of

Doctor of Philosophy

in

Electrical and Computer Engineering

Jiyuan Zhang

B.S., Electrical Engineering, Harbin Institute of Technology

Carnegie Mellon University

Pittsburgh, PA

December 2020

© Jiyuan Zhang, 2020

All Rights Reserved

Acknowledgments

First and foremost, I would like to thank my advisor, Franz. Thank you for being a supportive and understanding advisor. You encouraged me to pursue projects I enjoyed and guided me to become the researcher that I am today. When I was stuck on a difficult problem, you were always there to offer help and stood side by side with me to figure out solutions. No matter the difficulty, from research to personal life, you always put yourself in our shoes and offered your unreserved support. Your encouragement and sense of humor have been a great help in guiding me through the difficult times. You were the key enabler of my PhD journey, and you made that journey less arduous and more joyful.

Next, I would like to thank the members of my thesis committee, Dr. Michael Garland, Dr. Phil Gibbons, and Dr. Tze Meng Low. Thank you for taking the time to serve on my thesis committee. Your questions and suggestions have greatly helped shape this work, and your insights have helped me understand limitations that I would have overlooked in the original draft. I am grateful for the advice and suggestions I have received from you to improve this dissertation. This thesis would not have been possible without your guidance and patience.

In addition, I am extremely grateful for the endless support from the Spiral group and colleagues at A-Level. Thank you, Professor James Hoe. You are such a wonderful professor to have at A-Level. Your encouragement for us to think critically and to express our thoughts, no matter how immature they are, has been a great source of motivation for me. I am also extremely thankful for Professor Tze Meng Low. I was so lucky to have you in the group when I had just started graduate school. You were a great help in my research journey. Those discussions with you were very inspirational for an early-year PhD student. They definitely shaped my understanding of research and taught me how to be a responsible researcher. I would also like to thank Dr. Scott McMillan for the helpful discussions and advice on my research and for guiding me into the world of graph-related motifs. I cannot adequately express how thankful I am. Additionally, thank you to my wonderful colleagues at A-Level: Richard, Thom, Daniele, Fazle, Marie, Joe, Guanglin, Zhipeng, Ke, Senbo, and Anuva. My experience in graduate school was greatly enhanced because I was surrounded by so many wonderful friends and colleagues. I will never forget those research-related and non-research-related discussions with you. They have been so much fun and thought-provoking. Also, thank you to my friends Yi Lu, Zhuo Chen, Yang Gao, Qichen Huang, Chenglei Fang, and Yang Li for being great friends and a source of great emotional support.

I would also like to thank my funding sources. The research in this thesis is based upon work funded and supported by DARPA under agreements HR0011-20-9-0018, DE-AC02-05CH11231, HR00111320007, and FA87501220291. This thesis is also based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center.

Last but not least, I owe an especially deep gratitude to my parents for their never-ending love, patience, and encouragement.

Jiyuan Zhang

Carnegie Mellon University December 2020

Abstract

The booming growth in AI and machine learning is drastically reshaping the landscape of high performance computing. Traditional HPC addresses scientific problems that are driven by simulation, modeling, and analysis in science domains. Algorithms like linear algebra and methods like differential equations are at the core of solutions to such problems. However, emerging machine learning tasks rest on a new set of algorithms and models. The computation and data movement patterns inherent in learning algorithms cannot be directly mapped to the computation motifs of physics, chemistry, or biology simulation problems. As a result, the high performance libraries that originate in the traditional scientific domains cannot be straightforwardly applied to emerging ML tasks to deliver the required performance.

This thesis focuses on performance optimizations of computation kernels in emerging machine learning applications, spanning a diverse range from dense, regular to sparse, irregular kernels. In this work, we demonstrate how code specialization and generation, together with expert-built performance models and learned dispatch strategies, can enable ML motifs to achieve better performance on modern processors.

First, we investigate the performance optimization of dense kernels with a focus on convolutional neural networks (CNNs). The computation of convolution layers in deep neural networks typically relies on high performance matrix-multiplication routines to improve performance. However, these routines are not optimized for performing convolution. Extra memory overhead is incurred due to data transformation, and the performance obtained is also less than conventionally expected. We demonstrate that direct convolution, when implemented correctly, eliminates all memory overhead and yields performance that is between 10% and 4x better than existing high performance implementations on conventional and embedded CPU architectures. We present a model-guided optimization approach which utilizes the characteristics of system architectures to guide the optimization choices of loop ordering, blocking, and memory layout transformation. We show that a high performance direct convolution exhibits better performance scaling than expert-tuned matrix implementations, i.e., it suffers less of a performance drop when increasing the number of threads.

Sparse kernels are an equally important class of computation kernels, appearing in many machine learning applications such as graph analytics and genetic sequencing. One factor that prevents sparse kernels from achieving high performance on modern processors is the prohibitively large number of different implementations and data structures for sparse problems. We start with the observation that complicated sparse computations can be distilled into a primitive set of operators such as join, merge, and difference. To accelerate those operators on modern processors with data parallelism, we propose a vectorization and code specialization approach which can eliminate the control divergences of these operators. Next, we explore the design space for vectorization on CPUs with various vector widths, based on which we present a code generation algorithm that takes the data width and operations as input and generates various implementations. We then demonstrate the acceleration of General Sparse Matrix-Matrix Multiplication (SpGEMM) on GPUs. We show how the SpGEMM implementation can leverage join/merge operators to compose a variety of implementations. Another challenge when optimizing sparse kernels is that their performance behavior is data dependent, while the input characteristics may change online during iterative updates. To leverage the different implementations offered by the code generator, we propose a low-overhead mechanism that collects data characteristic information to learn online dispatch decisions over iterations.

Overall, in this thesis, we demonstrate that the interplay of code specialization and generation, together with performance modeling and learned dispatch, can enable high performance kernels for emerging machine learning applications.

Contents

Acknowledgments
Abstract
List of Figures
List of Tables

Chapter 1  Introduction
  1.1 Motivation
  1.2 Preliminaries
    1.2.1 Domains and Computation Kernels
    1.2.2 Frameworks and Libraries
    1.2.3 Code Generation
    1.2.4 Performance Optimization
    1.2.5 Hardware Architectures
  1.3 Dense Computational Kernels in Neural Networks
  1.4 Sparse Computational Kernels in Data Analytics
  1.5 Thesis Overview

Chapter 2  Accelerating Dense Convolutions
  2.1 Overview
  2.2 Introduction
  2.3 Background
  2.4 High Performance Direct Convolution
    2.4.1 Strategy for Mapping Loops to Architecture
    2.4.2 Parallelism
  2.5 Convolution-Friendly Data Layout
    2.5.1 Input/Output Layout
    2.5.2 Kernel Layout
    2.5.3 Backward Compatibility
  2.6 Experiments
    2.6.1 Experimental Setup
    2.6.2 Performance
    2.6.3 Parallel Performance
  2.7 Chapter Summary

Chapter 3  Sparse Computational Kernels: A Preliminary Study
  3.1 Expressing Computation Kernels in Sparse Applications
  3.2 Accelerating Sparse Computation: A Case Study with Triangle Counting
    3.2.1 Introduction
    3.2.2 Algorithm
    3.2.3 Parallel Triangle Counting
    3.2.4 Preliminary Exploration on Optimizations
    3.2.5 Profiling and Analysis
    3.2.6 Understanding the Optimizations
    3.2.7 Performance Results

Chapter 4  Accelerating Sparse Computational Kernels on CPUs: A Fast SIMD Set Intersection Approach
  4.1 Abstract
  4.2 Introduction
  4.3 Background and Related Work
    4.3.1 Scalar Set Intersection Approaches
    4.3.2 Accelerating Set Intersections
  4.4 The FESIA Approach
    4.4.1 Overview
    4.4.2 Data Structure
    4.4.3 Intersection Algorithm
    4.4.4 Theoretical Analysis
  4.5 Bitmap-Level Intersection
  4.6 Segment-Level Intersection
    4.6.1 Runtime Dispatch
    4.6.2 Specialized vs. General Set Intersection Kernels
    4.6.3 Implementing Specialized Intersection Kernels with SIMD
  4.7 Discussion
  4.8 Experiments
    4.8.1 Experimental Setup
    4.8.2 Result of Specialized Intersection Kernels
    4.8.3 Effect of Varying the Input Size
    4.8.4 Effect of Varying the Selectivity
    4.8.5 Performance of Two Sets with Different Sizes
    4.8.6 Performance on Real-World Datasets
  4.9 Conclusion

Chapter 5  Accelerating Sparse Computational Kernels on GPUs
  5.1 Abstract
  5.2 Introduction
  5.3 Background and Related Work
    5.3.1 GPUs
    5.3.2 SpGEMM and Masked-SpGEMM Problem
    5.3.3 Sparse Data Structure
    5.3.4 SpGEMM Computations
  5.4 The Family of SpGEMM Algorithms
    5.4.1 Matrix-Vector SpGEMM
    5.4.2 Inner-Product-Based SpGEMM
    5.4.3 Outer-Product-Based SpGEMM
  5.5 Join-Based SpGEMM
    5.5.1 The Baseline Join-Based SpGEMM
    5.5.2 The Data-Parallel Join-Based SpGEMM
    5.5.3 Join Using Wider Data Types
  5.6 Join-Based SpGEMM Using Hash Methods
  5.7 Online Scheduling
  5.8 Experiments
    5.8.1 Experimental Setups
    5.8.2 Performance of GPU Joins with Very Sparse Matrices
    5.8.3 Performance of Join-Based SpGEMM
    5.8.4 Comparison on Different Densities
    5.8.5 Performance on the Hash-Based Join Kernels
    5.8.6 Performance on Real-World Dataset
    5.8.7 Performance on Real-World Graph Datasets with Skewed Distribution

Chapter 6  Summary and Future Directions
  6.1 Summary
  6.2 Takeaways and Reflections
  6.3 Future Directions

List of Figures

1.1 The structure of a convolutional layer.
1.2 Sparse data underlies many data analytics tasks such as social network analytics, recommendation systems, and sparse machine learning models.

2.1 High performance direct convolution implementation achieves higher performance than a high performance matrix multiplication routine, whereas matrix-multiplication based convolution implementations suffer from packing overheads and are limited by the performance of the matrix multiplication routine.
2.2 The 5 × 5 input image with 3 different channels (denoted with different colors) is convolved with two separate kernels to obtain a 3 × 3 output with two output channels. Packing is performed to turn three dimensional input images (left) into a two dimensional matrix (right) in order to utilize a high performance matrix multiplication routine. As Co and/or (Ho × Wo) are often less than Hf × Wf × Ci, the performance of standard matrix-matrix multiplication in many BLAS libraries is often sub-optimal.
2.3 Convolution-friendly layout for input/output (left) and kernel weights (right). The output data is organized into sequential blocks of Ho × Wo × Co,b, where in each block the fastest dimension is the channel dimension, followed by the column and row dimensions of the output. The kernel weights are organized into blocks of Ho × Wo × Co,b × Ci,b. The fastest dimension is the blocked output channel, followed by the blocked input channels, kernel width and height, input channels, and then the output channels.
2.4 Performance of direct convolution against existing high performance FFT-based and SGEMM-based convolution implementations. Performance of all implementations is normalized to the performance of the SGEMM + im2col routine. Direct convolution is highly competitive against all other implementations, achieving between 10% and 400% improvement in performance even against a BLAS library (Intel MKL) that optimizes for matrix shapes arising from convolution layers.
2.5 Scaling behavior with increasing number of threads. Our direct convolution implementation retains high GFLOPs-per-core performance as we increase the number of threads from 1 to the number of available cores. This is indicative of an efficiently parallelized algorithm. When the number of threads exceeds the number of cores, excessive contention results in a significant drop in performance per core. In contrast, SGEMM has poor scalability even when the number of threads is low (e.g. 2).

3.1 SIMD algorithm assuming width is 4 elements.
3.2 SIMD algorithm illustration.
3.3 Degree distribution of cit-Patents.
3.4 Degree distribution of friendster.
3.5 Degree distribution of graph500-scale23.
3.6 Theoretical computation distribution and non-zero element distribution on graphs with different skewness.
3.7 Number of non-zero element distribution and computation time distribution across rows.

4.1 Illustrating the data structure and the set intersection algorithm in FESIA. There are two steps in the set intersection: (1) the bitmaps are used to filter out unmatched elements, and (2) a segment-by-segment comparison is conducted to compute the final set intersection using specialized SIMD kernels.
4.2 Illustrating the difference between a general and a specialized SIMD set intersection kernel. A general kernel is shown in the upper part of the figure, which is implemented with SSE-128 instructions and can be used for any intersection with input size less than 4-by-4. A specialized 2-by-4 kernel is shown at the bottom, which reduces unnecessary computation and memory accesses (highlighted in purple).
4.3 Illustrating the specialized kernels: (1) a 2-by-7 intersection kernel (small-by-large), (2) a 4-by-5 intersection kernel (small-by-large and Sb is slightly larger than V), and (3) a 6-by-6 intersection kernel (large-by-large).
4.4 Performance of SSE kernels.
4.5 Performance of AVX kernels.
4.6 Performance comparison with a varying input size.
4.7 Performance on different selectivity.
4.8 Performance on different selectivity (AVX-512).
4.9 Performance of three-way intersection.
4.10 Performance comparison on varying skew.
4.11 Results on the database query task.
4.12 Results on the triangle counting task.

5.1 The family of SpGEMM algorithms.
5.2 Complexity comparison on different sparse matrices. N denotes the maximum dimension of the sparse matrix (i.e., maximum nonzero value). k denotes the average nonzeros per row/column: for a tall-skinny matrix, k is the average nonzeros per row; if the matrix is short and fat, k is the average nonzeros per column.
5.3 All-pair comparison based join implementation: a 32-by-32 implementation.
5.4 All-pair comparison based join implementation: an 8-by-8 implementation.
5.5 Hash-based SpGEMM implementation: parallel across columns.
5.6 Hash-based SpGEMM implementation: parallel across rows.
5.7 Online scheduling strategy.
5.8 Comparison of join implementations on very sparse matrices.
5.9 Comparison of join implementations on relatively dense synthetic data in which the densities are 0.3.
5.10 Comparison of join implementations on relatively sparse synthetic data in which the densities are 0.01.
5.11 Comparison of join implementations on varying sparsities.
5.12 Comparison of hash-based join implementations on datasets of density 0.05. The hash table size is 1KB per row. tblsz = 1K/2K/4K refers to building one hashtable for 1/2/3 rows respectively.
5.13 Comparison of hash-based join implementations on datasets of density 0.05. The hash table size is 2KB per row. We build a hashtable for 1/2/3 rows simultaneously. tblsz = 1K/2K/4K refers to building one hashtable for 1/2/3 rows respectively.
5.14 Comparison of hash-based join implementations on relatively sparse synthetic data in which the densities are 0.3.
5.15 Comparison of hash-based join implementations on relatively sparse synthetic data in which the densities are 0.01.
5.16 Comparison of different thread assignment strategies on real-world graph datasets.

6.1 Overview of this dissertation.

List of Tables

2.1 Details of specific architecture used.

3.1 Compression savings on the Friendster dataset. Its original edge list size is 1,806,067,135.
3.2 Single-core performance: scalar vs. SIMD.
3.3 Small 28-core system performance.
3.4 Performance on an HP Superdome X system with 16 sockets, 352 cores.

4.1 The summary of our approach vs. state-of-the-art set intersection approaches.
4.2 Hardware specifications.
4.3 The details of each graph dataset.

5.1 Performance comparison on iterative SpGEMM computations with real-world datasets.
5.2 Graph dataset characteristics.

Chapter 1

Introduction

1.1 Motivation

Performance has always been of utmost importance in many computing tasks, and it has been the driving force behind research in high performance computing over the decades. Traditional HPC mainly focuses on scientific computing tasks, which usually arise from simulation or modeling processes for biology, physics, or chemistry problems. Typical scientific computing tasks include weather prediction, quantum mechanics simulation, and cosmological analysis. Solutions to these scientific tasks are usually based on mathematical methods in linear algebra or differential equations. Thus the computational kernels of these methods are centered around dense linear algebra or structured meshes. Over the years, tremendous effort has been spent on researching and designing high performance implementations for those computing kernels in scientific applications. This has led to a handful of libraries and frameworks. For example, the BLAS/LAPACK libraries were developed for the computing kernels in linear algebra, the FFTX library was proposed targeting spectral methods, and OSKI was proposed for the sparse linear algebra kernels in differential equations.

In recent years, machine learning has made significant breakthroughs that have brought transformations across many aspects of research and industry. In many real-world problems, conventional methodologies and algorithms are now replaced by ML-based models. For example, methods such as deep learning and graph analytics are now at the core of tasks such as image classification, voice recognition, and object tracking. Despite their wide adoption, however, how to make emerging ML models run fast and efficiently on various platforms is still an open research question. The computation kernels of ML models are completely different from the conventional scientific kernels, and thus high performance libraries developed for scientific computing cannot be directly applied to the machine learning field.

Achieving high performance for emerging machine learning tasks is challenging for two reasons. First, computation kernels in machine learning tasks span a broad spectrum in the characteristics of their computation patterns and data movements. At one end of the spectrum are dense applications, whose computation kernels have regular and structured patterns; representative dense computation tasks include convolutional neural networks. At the other end of the spectrum are sparse applications, which usually work with irregular and unstructured data; representative sparse applications include analytics tasks in social networks or recommendation systems. Dense and sparse computation kernels present different computational characteristics, which result in different sets of challenges for software optimizations. The second performance challenge results from the complexity of the underlying hardware. In particular, with the growing computational demand of machine learning tasks, the underlying computer systems are evolving rapidly. Architectural features such as cache size, number of registers, and memory bandwidth vary across platforms, which makes obtaining high performance consistently across platforms a non-trivial task. With the growing prevalence of machine learning techniques and their widespread deployment onto a broad spectrum of systems, there is a pressing need for high performance machine learning libraries and frameworks on existing and future platforms.

The goal of this thesis is to provide an understanding of the optimizations of the computation kernels of emerging ML tasks. The computation kernels lay the backbone of understanding the interplay between software optimizations and hardware architecture, which is a fundamental step towards building a portable and efficient end-to-end ML system. In particular, in this work, we demonstrate how an analytical model, together with code specialization and learned dispatch strategies, can enable high performance computational kernels for machine learning tasks. In the first part of this work, we demonstrate the optimization of dense kernels in machine learning using convolutional neural networks as an example. We adopt a model-based analysis to provide a systematic understanding of how software optimizations such as loop ordering, blocking, and instruction scheduling should be applied based on the hardware architecture. In the second part of this work, we investigate the optimizations of sparse kernels. Sparse kernels present a different set of challenges compared to dense computations. We present techniques that can improve the data parallelism of the irregular computations of sparse kernels, and we investigate the effects of algorithm and data structure choices on performance.

1.2 Preliminaries

In this section, we provide a brief overview of the computational domains and the class of operations in these domains that we target.

1.2.1 Domains and Computation Kernels

Researchers from the University of California, Berkeley performed an investigation across different domains and distilled the computational kernels to characterize and represent the domain-specific computations [1]. We explain some kernels that are relevant to this thesis in the following.

Dense linear algebra. These computations typically arise in the applications of scientific simulations. Dense linear algebra computation involves data represented as dense matrices or vectors. The computation pattern is regular, with mostly strided memory access.

Structured Mesh (Stencils). These computations typically arise in the computation of PDEs using finite-difference methods. The computation of a structured mesh usually involves data represented by a regular n-dimensional grid. The connections in the grid represent the relationships between a data point and its neighbors.
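For instance, a single Jacobi-style sweep of a two-dimensional five-point stencil (a generic sketch, not tied to any particular solver in this thesis) makes the regular neighbor-access pattern explicit:

/* One sweep of a five-point stencil on an n x n grid (row-major).
 * Each interior point is updated from its four neighbors; the
 * boundary is left untouched.                                      */
void stencil5(int n, const float *in, float *out) {
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j)
            out[i * n + j] = 0.25f * (in[(i - 1) * n + j] + in[(i + 1) * n + j] +
                                      in[i * n + j - 1]  + in[i * n + j + 1]);
}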

Sparse linear algebra. Sparse linear algebra can appear in scientific computing as well as in machine learning applications. Many real-world datasets contain a significant number of zero values. Therefore such data is usually stored in compressed formats to reduce the storage and bandwidth requirements of accessing the elements. Examples of such formats include Compressed-Sparse-Row (CSR), Compressed-Sparse-Column (CSC), and the Coordinate (COO) format. Because of the compressed formats, data is generally accessed with indexed loads and stores.
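To make the CSR layout concrete, the following is a minimal sketch (not taken from the thesis; the struct fields and function name are illustrative only) of a CSR container and a sparse matrix-vector multiply over it, showing the indexed loads mentioned above:

/* A minimal CSR (Compressed-Sparse-Row) container: the nonzeros of each row
 * are stored contiguously; row_ptr[i] .. row_ptr[i+1] delimits row i.       */
typedef struct {
    int     n_rows;
    int    *row_ptr;   /* length n_rows + 1                         */
    int    *col_idx;   /* length nnz: column index of each nonzero  */
    double *val;       /* length nnz: value of each nonzero         */
} csr_matrix;

/* y = A * x, illustrating the indexed (gather) loads typical of sparse kernels. */
void csr_spmv(const csr_matrix *A, const double *x, double *y) {
    for (int i = 0; i < A->n_rows; ++i) {
        double acc = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            acc += A->val[k] * x[A->col_idx[k]];   /* indexed load of x */
        y[i] = acc;
    }
}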

1.2.2 Frameworks and Libraries

One approach to shifting the burden from the user is to implement libraries that contain commonly used functionality. This approach has been very successfully applied to dense computations, and libraries like BLAS and LAPACK are frequently used in scientific codes.

Dense Linear Algebra

Dense linear algebra in scientific computing. Basic Linear Algebra Subprograms (BLAS) specifies a set of standards for common linear algebra operations. The operations are categorized into Level-1, Level-2, and Level-3 routines in the BLAS specification: Level-1 refers to vector operations such as vector addition and dot products, Level-2 refers to general matrix-vector multiplication, and Level-3 refers to general matrix-matrix multiplication. A number of libraries have been implemented conforming to the BLAS specification and are tuned for specific hardware. For example, the AMD Core Math Library (ACML) provides high-performance BLAS routine implementations for AMD processors, the Intel Math Kernel Library (MKL) is the BLAS implementation for Intel processors, and OpenBLAS provides high-performance BLAS implementations for a wide range of x86 architectures.

Dense linear algebra in machine learning. Dense linear algebra kernels also appear in machine learning applications. Some computations in deep learning models correspond directly to BLAS operations. As a matter of fact, many machine learning frameworks heavily depend on high-performance BLAS routines. For example, the TensorFlow framework transforms high-level machine learning computations into primitive matrix operators such as matrix-matrix multiplication (convolutional neural networks) and matrix-vector multiplication (recurrent neural networks).
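As a concrete illustration of a Level-3 routine, the sketch below calls sgemm through the CBLAS interface (assumed here; the exact header and linking depend on the BLAS implementation, e.g. OpenBLAS or MKL), computing C := alpha*A*B + beta*C:

#include <cblas.h>   /* CBLAS interface, provided by OpenBLAS, MKL, etc. */

/* C (m x n) := 1.0 * A (m x k) * B (k x n) + 0.0 * C, all row-major. */
void dense_mm(int m, int n, int k,
              const float *A, const float *B, float *C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A, k,   /* lda = k for row-major A */
                      B, n,   /* ldb = n                 */
                0.0f, C, n);  /* ldc = n                 */
}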

Sparse Linear Algebra

Sparse linear algebra in scientific computing. Sparse linear algebra kernels appear in both traditional scientific computing and machine learning applications. For example, the finite discretization in fluid flow simulations can be formulated using a sparse grid. Libraries have been proposed to accelerate sparse linear algebra kernels in scientific problems. For example, the Optimized Sparse Kernel Interface (OSKI) [48] exposes an autotuning API along with memory-hierarchy-aware kernels that match modern architectures, which allows library users to pass domain knowledge to these building blocks and reap the performance benefit while minimizing the amount of expensive autotuning.

Sparse linear algebra in machine learning. In recent years, a growing body of research has focused on sparse problems arising from the social sciences. As a result, we are seeing the development of sparse libraries and frameworks designed to accelerate graph analytics, the task that underlies many social analytics problems. There is a rich set of graph frameworks that target a general class of graph applications [2, 3, 4, 5, 6, 7]. Those frameworks have different frontend designs: for example, [2] is based on a vertex-iterator programming model where a user-supplied utility function is executed on each vertex, [3] uses a linear algebra language to abstract away the graph algorithms, and [4] uses a DSL to describe and optimize graph operations.

1.2.3 Code Generation

Code generation can overcome the drawbacks of general-purpose compilers. A general compiler is capable of performing some optimizations, such as auto-vectorization and peephole optimizations. However, its optimization capabilities are limited. First, the effectiveness of these optimizations is highly dependent on accurate modeling of the underlying hardware, whereas it is usually hard to establish an exact model of the hardware architecture. More importantly, general compilers are unable to capture high-level algorithmic optimizations. For example, a compiler is capable of performing auto-vectorization, but only when the code is inherently expressed as regular, iterative loops. Code generation offers an alternative approach that is capable of taking high-level algorithmic transformations into account while generating different implementations as specified by the programmer. The code generation approach can therefore ease the burden on programmers when exploring different implementations, especially in the face of fast-changing and increasingly complex hardware architectures.
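The following toy generator (purely illustrative, not the code generation infrastructure developed later in this thesis) shows the basic idea: a program that emits a family of specialized, fully unrolled C kernels, one per problem size, rather than relying on the compiler to unroll a generic loop:

#include <stdio.h>

/* Toy code generator: emit a specialized, fully unrolled dot-product kernel
 * for a generation-time-known length n.                                     */
void emit_dot_kernel(FILE *out, int n) {
    fprintf(out, "static inline float dot_%d(const float *a, const float *b) {\n", n);
    fprintf(out, "    float acc = 0.0f;\n");
    for (int i = 0; i < n; ++i)                 /* unroll the loop at generation time */
        fprintf(out, "    acc += a[%d] * b[%d];\n", i, i);
    fprintf(out, "    return acc;\n}\n");
}

int main(void) {
    /* Generate specialized kernels for a few sizes of interest. */
    for (int n = 2; n <= 8; n *= 2)
        emit_dot_kernel(stdout, n);
    return 0;
}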

1.2.4 Performance Optimization

Implementing software that performs satisfactorily across platforms and hardware has become an ever more challenging task with the evolving complexity of today's computer systems. In conventional high performance computing, there are two approaches to achieving high performance: analytic modeling and auto-tuning. Analytic modeling requires accurately modeling the underlying computer architecture and analytically deriving the best optimization strategies, such as loop ordering, blocking strategies, and parallelism. Analytic modeling has been demonstrated to successfully optimize the implementation of BLAS libraries [8, 9]. Auto-tuning is another way to achieve high performance: especially when modeling the machine's behavior accurately enough is impossible on today's computers, auto-tuning can use the actual run time as feedback. By means of automatic empirical performance tuning, search techniques are applied to find the best implementation for a given target machine. Auto-tuning is applied in libraries such as Spiral [10], FFTW [11], and ATLAS [12].
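A minimal empirical auto-tuning loop, in the spirit of ATLAS-style search, might look as follows (a sketch only; the candidate block sizes, the blocked kernel, and the timing method are placeholders):

#include <stdio.h>
#include <time.h>

#define N 512
static float A[N][N], B[N][N], C[N][N];

/* Candidate kernel: blocked matrix multiplication with block size bs
 * (bs is assumed to divide N).                                        */
static void mm_blocked(int bs) {
    for (int ii = 0; ii < N; ii += bs)
        for (int jj = 0; jj < N; jj += bs)
            for (int kk = 0; kk < N; kk += bs)
                for (int i = ii; i < ii + bs; ++i)
                    for (int k = kk; k < kk + bs; ++k)
                        for (int j = jj; j < jj + bs; ++j)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void) {
    int candidates[] = {8, 16, 32, 64, 128};
    int best = candidates[0];
    double best_t = 1e30;
    /* Empirical search: time each candidate and keep the fastest.
     * (C accumulates across runs; only the timing matters here.)   */
    for (unsigned c = 0; c < sizeof candidates / sizeof *candidates; ++c) {
        clock_t t0 = clock();
        mm_blocked(candidates[c]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = candidates[c]; }
    }
    printf("best block size: %d (%.3f s)\n", best, best_t);
    return 0;
}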

1.2.5 Hardware Architectures

Vector instructions. Major vendors of general-purpose microprocessors have included short vector SIMD (single instruction, multiple data) extensions in their instruction set architectures (ISAs), primarily to improve the performance of multimedia applications. Examples of SIMD extensions supporting both integer and floating-point operations include vectorized loads/stores, arithmetic and logic operations, and bit permutations. SIMD instructions have now become the status quo on modern CPUs for exploiting data parallelism. For example, Intel CPUs have SSE/AVX2 instructions to support 128-bit and 256-bit vector operations. Similarly, ARM processors have NEON instructions, and IBM Power processors have AltiVec instructions. The prevailing SIMD widths on modern processors are 128-bit and 256-bit. More recently, the Intel Skylake architecture introduced AVX-512 instructions.
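As a small illustration of such extensions (assuming an x86 CPU with AVX2 and FMA support, compiled with the corresponding flags; this sketch is not taken from the thesis), the kernel below updates eight single-precision elements per instruction:

#include <immintrin.h>   /* AVX2 + FMA intrinsics */

/* y[i] += a * x[i] for n a multiple of 8, using 256-bit vectors. */
void saxpy_avx2(int n, float a, const float *x, float *y) {
    __m256 va = _mm256_set1_ps(a);               /* broadcast the scalar */
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);      /* 8-wide vector load   */
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);        /* fused multiply-add   */
        _mm256_storeu_ps(y + i, vy);             /* 8-wide vector store  */
    }
}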

Hierarchical memory system. The modern processor is typically built with hierarchical memory structures. Closest to the processor is the register file, which holds the temporary elements that are used for arithmetic tasks. The next level is the L1 cache (built into the processor), followed by the L2 cache. L1 caches are usually fast but small; they directly access the L2 cache, which is usually larger but slower. The L1 and L2 caches are usually private to each core. At the next level is the L3 cache, which is usually shared among all the cores on the chip. The L3 cache accesses main memory, which, on architectures with virtual memory, exchanges data with disk storage.

1.3 Dense Computational Kernels in Neural Networks

Convolutional neural networks have received significant attention in recent years, as they have become one of the most popular models for many deep learning tasks. They are widely deployed for tasks such as image classification and segmentation, object detection, and video processing. With CNNs' wide adoption across varying deep learning tasks, as well as the ever-growing set of applications and scenarios, new challenges and problems arise for computation and system design. For example, there is a growing need to run machine learning tasks on devices such as smartphones, cars, and drones. On such devices, the available compute capability and memory capacity are often limited. This naturally limits the size of the deep neural nets that can be placed on the system.

[Figure: a convolutional layer, showing a filter applied to an input image to produce an output image; a convolutional network consists of a sequence of such layers.]

Figure 1.1: The structure of a convolutional layer.

The computation of convolution layers in deep neural networks typically relies on high performance routines that trade space for time by using additional memory (either for packing purposes or required as part of the algorithm) to improve performance. The problems with such an approach are twofold. First, these routines incur additional memory overhead, which reduces the overall size of the network that can fit on embedded devices with limited memory capacity. Second, these high performance routines were not optimized for performing convolution, which means that the performance obtained is usually less than conventionally expected. In Chapter 2, we demonstrate that direct convolution, when implemented correctly, eliminates all memory overhead and yields performance that is between 10% and 400% better than existing high performance implementations of convolution layers on conventional and embedded CPU architectures. We also show that a high performance direct convolution exhibits better scaling performance, i.e., it suffers less of a performance drop when increasing the number of threads.

1.4 Sparse Computational Kernels in Data Analytics

[Figure: sparse data underlying three example applications: social network analytics, recommendation systems, and sparse CNNs.]

Figure 1.2: Sparse data underlies many data analytics tasks such as social network analytics, recommendation systems, and sparse machine learning models.

Sparse computations are at the core of machine learning and data science applications. They have been employed in a broad range of tasks, such as graph analytics [13], neural network compression [14], genome sequencing [15], and recommendation systems [16]. The sparse matrices in these problems are usually represented and recorded using the indices of their nonzero elements, and the percentage of non-zero elements in these applications can vary drastically, ranging from 10⁻⁶% to 50% depending on the problem domain.

It is hard for sparse kernels to attain high performance on modern processors. Unlike their counterparts in dense and regular applications, sparse applications have locally varying non-zero (NZ) patterns, leading to unpredictable control flow and unbalanced workloads, which are detrimental to performance. In addition, sparse applications have irregular compute patterns that are inherently sequential, together with random and uncoalesced memory accesses. These issues can result in low occupancy as well as low resource utilization on modern processors.

On the other hand, there is a large number of different implementations and data structures for sparse problems. However, there is also a lack of understanding of the optimization techniques for sparse problems. It remains an open question which algorithm or data structure best fits sparse data. One reason for this is that the problem domains are vast and the data characteristics (density, distribution) can vary drastically across domains. Additionally, there is a large number of different methods to process and deal with sparse data, including different algorithms, parallelism schemes, etc.

In this dissertation, we start with the observation that complicated sparse computations can be distilled into a primitive set of operators such as join, merge, and difference. To accelerate those operators on modern processors with data parallelism, we first propose a vectorization and code specialization approach that can eliminate the control divergences of these operators. Next, we explore the design space for vectorization on CPUs with various vector widths, based on which we present a code generation algorithm that takes the data width and operations as input and generates various implementations. We then demonstrate the acceleration of General Sparse Matrix-Matrix Multiplication (SpGEMM) on GPUs.
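To make the primitive operators concrete, the sketch below shows the scalar, merge-based intersection of two sorted index lists (a sketch only, not the specialized FESIA kernels of Chapter 4). Its data-dependent branches are exactly the control divergences that the vectorized, specialized kernels in later chapters aim to eliminate:

#include <stddef.h>

/* Merge-based intersection of two sorted arrays of element ids.
 * Returns the number of common elements and writes them to out.  */
size_t intersect_sorted(const int *a, size_t na,
                        const int *b, size_t nb, int *out) {
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if      (a[i] < b[j]) ++i;                /* advance the smaller side */
        else if (a[i] > b[j]) ++j;
        else { out[k++] = a[i]; ++i; ++j; }       /* match found              */
    }
    return k;
}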

1.5 Thesis Overview

The goal of this thesis is to provide an understanding of, and optimizations for, the computation kernels of emerging ML tasks, including neural networks, social networks, and data analytics applications. The computation kernels lay the backbone of understanding the interplay between software optimizations and hardware architecture, which is a fundamental step towards building a portable and efficient end-to-end ML system. In particular, this work focuses on two sets of computational kernels: the dense computational kernels in convolutional neural networks, and the sparse kernels in data analysis tasks such as social networks. These kernels are fundamental to a majority of machine learning tasks and have received tremendous attention recently in both academia and industry.

The motivating idea behind this thesis is that optimizing for high performance is a codesign across combinatorial factors such as algorithm, implementation, and platform. By correctly modeling the platform architecture, we are capable of deriving the "right" implementation that can achieve peak FLOPs on a given platform. Furthermore, different platforms can also exhibit drastic differences that go beyond parameter levels. Under such circumstances, we need to specialize the implementation as well as the algorithm choices to adapt to hardware attributes. Therefore, we demonstrate how an analytical model, together with code specialization and learned dispatch strategies, can enable high performance for these computational kernels on modern hardware.

In Chapter 2, we demonstrate the optimization of the dense kernels in machine learning using convolutional neural networks as an example. We propose a model-based analysis to provide a systematic understanding of how software optimizations such as loop ordering, blocking, and instruction scheduling should be applied based on the hardware architecture. The work in this chapter was published in ICML 2018 [7].

In Chapter 3, we investigate the optimizations of sparse kernels. Sparse kernels present a different set of challenges compared to dense computations. We present a technique that can improve the data parallelism of the irregular computations of sparse kernels, and we investigate the effects of algorithm and data structure choices on performance, using the triangle counting application as an example. The work in this chapter was published in HPEC 2018 [17].

In Chapter 4, we investigate the vectorization and code specialization approach for set intersections on CPUs. This chapter takes a step forward from Chapter 3 by focusing on intersections whose result sizes are much smaller than the input sizes; this property has been observed in many real-world intersection scenarios. In this chapter, we propose a specialized algorithm for the small-size intersection problem. We demonstrate that the specialized algorithm has lower complexity than generalized intersection methods. Next, we explore vectorized acceleration of the specialized intersection algorithm on modern CPUs. We explore the design space for fast vectorized intersection on CPUs with various vector widths, based on which we present a code generation algorithm that takes the data width and operations as input and generates various implementations. The work in this chapter was published in ICDE 2019 [18].

In Chapter 5, we investigate the optimization of Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) on GPUs. We demonstrate how to break down the sparse matrix computation into a set of primitive operations of join and union. We further propose several optimizations for join-based SpGEMM GPU implementations. We perform experimental and theoretical analysis on various implementations with both synthetic and real-world datasets. Finally, we propose an online scheduling algorithm that is trained with a neural network.

In Chapter 6, we present concluding remarks and discuss possible directions for future work.

Chapter 2

Accelerating Dense Convolutions

Contents

2.1 Overview
2.2 Introduction
2.3 Background
2.4 High Performance Direct Convolution
  2.4.1 Strategy for Mapping Loops to Architecture
  2.4.2 Parallelism
2.5 Convolution-Friendly Data Layout
  2.5.1 Input/Output Layout
  2.5.2 Kernel Layout
  2.5.3 Backward Compatibility
2.6 Experiments
  2.6.1 Experimental Setup
  2.6.2 Performance
  2.6.3 Parallel Performance
2.7 Chapter Summary

2.1 Overview

The computation of convolution layers in deep neural networks typically relies on high performance routines that trade space for time by using additional memory (either for packing purposes or required as part of the algorithm) to improve performance. The problems with such an approach are twofold. First, these routines incur additional memory overhead, which reduces the overall size of the network that can fit on embedded devices with limited memory capacity. Second, these high performance routines were not optimized for performing convolution, which means that the performance obtained is usually less than conventionally expected. In this chapter, we demonstrate that direct convolution, when implemented correctly, eliminates all memory overhead and yields performance that is between 10% and 400% better than existing high performance implementations of convolution layers on conventional and embedded CPU architectures. We also show that a high performance direct convolution exhibits better scaling performance, i.e., it suffers less of a performance drop when increasing the number of threads.

[Figure: bar chart of performance normalized to OpenBLAS GEMM on an AMD Piledriver at 4.0 GHz (4 cores / 4 threads), y-axis: normalized performance (Gflop/s), comparing OpenBLAS+Packing against Direct CNN across AlexNet layers Conv1-Conv5; the packing overheads of the OpenBLAS+Packing variant are visible.]

Figure 2.1: The high performance direct convolution implementation achieves higher performance than a high performance matrix multiplication routine, whereas matrix-multiplication based convolution implementations suffer from packing overheads and are limited by the performance of the matrix multiplication routine.

2.2 Introduction

Conventional wisdom suggests that computing convolution layers found in deep neural nets via direct convolution is not efficient. As such, many existing methods for computing convolution layers [19, 20] in deep neural networks are based on highly optimized routines (e.g. matrix-matrix multiplication) found in computational libraries such as the Basic Linear Algebra Subprograms (BLAS) [8]. In order to utilize the matrix-matrix multiplication routine, these frameworks reshape and selectively duplicate parts of the original input data (collectively known as packing), thereby incurring additional memory space for performance. There are two problems with this approach. First, the additional work of reshaping and duplicating elements of the input data is a bandwidth-bound operation that incurs an additional, and non-trivial, time penalty on the overall system performance. Second, and more importantly, matrices arising from convolution layers often have dimensions that are dissimilar from matrices arising from traditional high performance computing (HPC) applications. As such, the matrix-matrix multiplication routine typically does not achieve as good a performance on convolution matrices as it does on HPC matrices.

To illustrate these drawbacks of existing methods, consider the 4-thread performance attained on various convolution layers of AlexNet on an AMD Piledriver architecture, shown in Figure 2.1. In this plot, we present the performance of 1) a traditional matrix-multiply based convolution implementation linked to OpenBLAS [21] (blue) and 2) our proposed high performance direct convolution implementation (yellow). The performance of both implementations is normalized to the performance of only the matrix-matrix multiplication routine (dashed line). This dashed line is the performance attained by matrix-matrix multiplication if packing were free. Notice that OpenBLAS + Packing achieves less than 80% of the performance of the matrix multiplication itself. This implies that the packing routine degrades the overall performance by more than 20%. In contrast, our custom direct convolution implementation yields performance that exceeds the expert-implemented matrix-matrix multiplication routine, even if packing were free. In addition, we attained this performance without any additional memory overhead.

It is timely to revisit how convolution layers are computed, as machine learning tasks based on deep neural networks are increasingly being placed on edge devices [22, 23]. These devices are often limited in terms of compute capability and memory capacity [24, 25]. This means that existing methods that trade memory capacity for performance are no longer viable solutions for these devices. Improving performance and reducing memory overheads also bring about better energy efficiency [26]. While many works have focused on reducing the memory footprint of the convolution layer through approximation [27], quantization [28], or sparsification of the weights [29], few works tackle the additional memory requirements needed in order to use high performance routines.

Contributions. In this chapter, we make the following contributions:

• High performance direct convolution. We show that a high performance implementation of direct convolution can out-perform an expert-implemented matrix-matrix multiplication based convolution in terms of actual performance, parallelism, and reduced memory overhead. This demonstrates that direct convolution is a viable means of computing convolution layers.

• Data layouts for input/output feature maps and kernel weights. We propose new data layouts for storing the input, output, and kernel weights required for computing a convolution layer using our direct convolution algorithm. The space required for these new data layouts is identical to that of the existing data storage scheme for storing the input, output, and kernel weights prior to any packing or duplication of elements.

2.3 Background

In this section, we highlight the inefficiency of computing convolution with existing methods used in many deep learning frameworks.

Fast Fourier Transform-based Implementations

Fast Fourier Transform (FFT)-based implementations [30, 31] of convolution were proposed as a means of reducing the number of floating point operations performed when computing convolution in the frequency domain. However, in order for the computation to proceed, the kernel weights have to be padded to the size of the input image, incurring significantly more memory than necessary, especially when the kernels themselves are small (e.g. 3 × 3).

Alternative approaches have been proposed to subdivide the image into smaller blocks or tiles [32]. However, such approaches also require additional padding of the kernel weights to a convenient size (usually a power of two) in order to attain performance. Even padding the kernel weights to small multiples of the architecture register size (e.g. 8 or 16) will result in a factor of 7 to 28 increase in memory requirements. This additional padding and transforming of the kernel to the frequency domain can be minimized by performing the FFT on-the-fly as part of the computation of the convolution layer. This, however, incurs significant performance overhead, especially on embedded devices, as we will show in the performance section (Section 2.6).

Matrix Multiplication-based Implementations

Another common approach is to cast the inputs (both the image and kernel weights) into matrices and leverage the high performance matrix-matrix multiplication routine found in the Level 3 Basic Linear Algebra Subprograms (BLAS) [8] for computation. There are two major inefficiencies with this approach:

• Additional memory requirements. In order to cast the image into a matrix, a lowering operation is performed to cast the three dimensional image into a two dimensional matrix. Typically, this is performed via an operation conventionally called im2col that copies the Wi × Hi × Ci image into a (Hf × Wf × Ci) × (Ho × Wo) matrix, which is then used as an input to the matrix-matrix multiplication call. During this lowering process, appropriate elements are also duplicated. The additional memory required grows quadratically with the problem size [20]. (A minimal sketch of this lowering is given at the end of this section.)

Cho and Brand [20] proposed an alternative lowering mechanism that is more memory efficient by reducing the amount of duplication required during the packing process. In their lowering routine, the memory footprint is reduced by an average factor of 3.2 times over im2col. This is achieved by reducing the amount of duplication required at the expense of additional matrix-matrix multiplication calls. Nonetheless, this is still an additional memory requirement, and their computation still relies on a matrix-matrix multiplication that is often sub-optimal for matrices arising from convolution.

Figure 2.2: The 5 × 5 input image with 3 different channels (denoted with different colors) is convolved with two separate kernels to obtain a 3 × 3 output with two output channels. Packing (im2col) is performed to turn the three dimensional input image (left) into a two dimensional matrix (right) in order to utilize a high performance matrix multiplication routine. As Co and/or (Ho × Wo) are often less than Hf × Wf × Ci, the performance of standard matrix-matrix multiplication in many BLAS libraries is often sub-optimal.

• Sub-optimal matrix-matrix multiplication. In most BLAS libraries (e.g. GotoBLAS [33], OpenBLAS [21], BLIS [34]), the matrix-matrix multiplication routine achieves the best performance when the inner dimension, i.e. the dimension that is common between the two input matrices, is small compared to the overall dimensions of the output matrix. This particular set of matrix shapes is commonly found in scientific and engineering codes, for which these libraries are optimized. However, this particular set of shapes exercises only one out of six possible algorithms for matrix-matrix multiplication [33].

Recall that im2col reshapes the input into a (Hf × Wf × Ci) × (Ho × Wo) matrix. This means that the inner dimension of the input matrices is often the larger of the two dimensions (see Figure 2.2). As such, the performance of matrix-matrix multiplication on this particular set of input shapes is often significantly below the best achievable performance. It has been shown that alternative algorithms for computing matrix multiplication should be pursued for shapes similar to those arising from convolution layers [35].

Another reason that matrix-matrix multiplication is inefficient for convolution layers is that parallelism in existing BLAS libraries is obtained by partitioning the rows and columns of the input matrices [36]. This partitioning of the matrices skews the matrix shapes even farther away from the shapes expected by the matrix-matrix multiplication routine. As such, the efficiency of the routine suffers as the number of threads increases.
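To make the lowering discussed above concrete, the following is a minimal im2col sketch (the layouts, argument names, and ordering conventions are illustrative; actual frameworks differ in their conventions):

/* Lower a Ci x Hi x Wi input (CHW order, flattened) into an
 * (Hf*Wf*Ci) x (Ho*Wo) column matrix so that convolution with
 * Co filters becomes a single matrix-matrix multiplication.
 * Note how input pixels under overlapping filter windows are
 * duplicated into several columns.                              */
void im2col(const float *in, int Ci, int Hi, int Wi,
            int Hf, int Wf, int stride, float *col) {
    int Ho = (Hi - Hf) / stride + 1;
    int Wo = (Wi - Wf) / stride + 1;
    for (int c = 0; c < Ci; ++c)
        for (int fh = 0; fh < Hf; ++fh)
            for (int fw = 0; fw < Wf; ++fw) {
                int row = (c * Hf + fh) * Wf + fw;    /* row in the lowered matrix */
                for (int oh = 0; oh < Ho; ++oh)
                    for (int ow = 0; ow < Wo; ++ow) {
                        int ih = oh * stride + fh;
                        int iw = ow * stride + fw;
                        col[row * (Ho * Wo) + oh * Wo + ow] =
                            in[(c * Hi + ih) * Wi + iw];
                    }
            }
}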

2.4 High Performance Direct Convolution

A naive implementation of direct convolution (see Algorithm 1) is essentially six perfectly-nested loops around a multiply-and-accumulate computational statement that computes a single output element. Any permutation of the ordering of the loops will yield the correct result. However, in order to obtain a high performance implementation of direct convolution, it is essential that these loops and their order are appropriately mapped to the given architecture.

Algorithm 1: Naive Convolution Algorithm
Input: Input I, Kernel Weights F, stride s
Output: Output O
for i = 1 to Ci do
    for j = 1 to Co do
        for k = 1 to Wo do
            for ℓ = 1 to Ho do
                for m = 1 to Wf do
                    for n = 1 to Hf do
                        O[j, k, ℓ] += I[i, k×s+m, ℓ×s+n] × F[i, j, m, n]
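For concreteness, a literal C transcription of Algorithm 1 might look as follows (the flattened array layouts are illustrative only; the layouts actually used are the subject of Section 2.5):

/* Naive direct convolution, a literal transcription of Algorithm 1.
 * Illustrative flattened layouts:
 *   I: Ci x Wi x Hi,  F: Ci x Co x Wf x Hf,  O: Co x Wo x Ho (zero-initialized) */
void conv_naive(const float *I, const float *F, float *O,
                int Ci, int Co, int Ho, int Wo, int Hf, int Wf,
                int Hi, int Wi, int s) {
    for (int i = 0; i < Ci; ++i)
        for (int j = 0; j < Co; ++j)
            for (int k = 0; k < Wo; ++k)
                for (int l = 0; l < Ho; ++l)
                    for (int m = 0; m < Wf; ++m)
                        for (int n = 0; n < Hf; ++n)
                            O[(j * Wo + k) * Ho + l] +=
                                I[(i * Wi + (k * s + m)) * Hi + (l * s + n)] *
                                F[((i * Co + j) * Wf + m) * Hf + n];
}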

2.4.1 Strategy for Mapping Loops to Architecture

Our strategy for mapping the loops to a model architecture is similar to the analytical model for high performance matrix-matrix multiplication [37]. (1) We first introduce the model architecture used by high performance matrix-matrix multiplication. (2) Next, we identify loops that utilize the available computational units efficiently. (3) Finally, we identify the order of the outer loops in order to improve data reuse, which in turn will reduce the amount of performance-degrading stalls introduced into the computation. In this discussion, we use the index variables shown in Algorithm 1 (i, j, k, ℓ, m, n) to differentiate between the loops.

Model architecture

We use the model architecture from the analytical model for high performance matrix multiplication [37]. The model architecture is assumed to have the following features:

• Vector registers. We assume that our model architecture uses single instruction multiple data (SIMD) instruction sets. This means that each instruction simultaneously performs its operation on Nvec scalar output elements. We also make the assumption that Nvec is a power of two. When Nvec is one, this implies that only scalar computations are available. In addition, a total of Nreg logical registers are addressable.

• FMA instructions. We assume the presence of Nfma units that can compute fused multiply-add (FMA) instructions. Each FMA instruction computes a multiplication and an addition. Each of these Nfma units can compute one FMA instruction every cycle (i.e., the units are fully pipelined), but each FMA instruction has a latency of Lfma cycles. This means that Lfma cycles must pass after the issuance of an FMA instruction before a subsequent dependent FMA instruction can be issued.

• Load/store architecture. We assume that the architecture is a load/store architecture where data has to be loaded into registers before operations can be performed on the loaded data. On architectures with instructions that compute directly from memory, we assume that those instructions are not used.

Loops to saturate computations

The maximum performance on our model architecture is attained when all

Nfma units are computing one FMA per cycle. However, because each FMA instruction has a latency of Lfma cycles, this means that there must at least be Lfma independent FMA instructions issued to each computational unit. As each FMA instruction can compute Nvec output elements, this means that

E ≥ NvecNfmaLfma, (2.1) where E is the minimum number of independent output elements that has to be computed in each cycle in order to reach the maximum attainable performance. Having determine that at least E output elements must be computed in each cycle, the next step is to determine the arrangement of these output elements within the overall output of the convolution layer. Notice that the output has three dimensions (Ho × Wo × Co) where Ho and Wo are primarily a function of the input sizes, while Co is a design parameter of the convolution layer. Since E must be a multiple of Nvec, i.e. a power-of-two, and Co can be chosen (and is the case in practice) to be a power-of-two, the j loop is chosen

as the inner-most loop. As the minimum number E is highly dependent on the number and capability of the FMA computation units, we want to ensure that there are sufficient output elements to completely saturate computation. As such, the k loop that iterates over the elements in the same row of the output image is chosen to be the loop around the j loop.¹

¹ It should be noted that the choice of Wo over Ho is arbitrary as the analysis is identical.
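As a purely illustrative example (the numbers are hypothetical and not taken from any of the platforms evaluated later), consider a core with Nvec = 8, Nfma = 2, and Lfma = 5. Then

E ≥ Nvec × Nfma × Lfma = 8 × 2 × 5 = 80,

so at least 80 independent output elements must be kept in flight, far more than a single vector of Nvec outputs provides; this is precisely why additional independent output elements are drawn from the j and then the k loops.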

Loops to optimize data reuse

The subsequent loops are ordered to bring data to the computational units as efficiently as possible. Recall that the inner two loops (j and k) iterate over multiple output elements to ensure that sufficient independent FMA operations can be performed to avoid stalls in the computation units. As our model architecture is a load/store architecture, these output elements are already in registers. Therefore, we want to bring in data that allows us to accumulate into these output elements.

Recall that to compute a single output element, all Hf × Wf × Ci weights are multiplied with the appropriate elements from the input image and accumulated into the output element. This naturally means that the next three loops in sequence from the inner-most to the outer-most are the i, n, and m loops. This order of the loops is determined based on the observation that the input of most convolution layers is the output of another convolution layer. This means that it would be advisable for data from both the input and the output to be accessed in the same order. As such, we want to access the input elements in the channels (i) before the rows (n), which gives us the i, n, m ordering of the loops. Having decided on five of the original six loops, the outer-most loop has to be the ℓ loop. This loop traverses through the different rows of the output. The original loop order shown in Algorithm 1 (i, j, k, ℓ, m, n) is thus transformed into the (ℓ, n, m, i, k, j) loop ordering shown in Algorithm 2.

Algorithm 2: Reordered Convolution Algorithm
Input: Input I, Kernel Weights F, stride s
Output: Output O
for ℓ = 1 to Ho do
    for n = 1 to Hf do
        for m = 1 to Wf do
            for i = 1 to Ci do
                for k = 1 to Wo do
                    for j = 1 to Co do
                        O[j, k, ℓ] += I[i, k×s+m, ℓ×s+n] × F[i, j, m, n]

Blocking for the memory hierarchy

Register blocking. The astute reader will recognize that we have conveniently ignored the fact that E, the minimum number of output elements required to sustain peak performance, is upper bounded by the number of registers, as described by the following inequality:

E ≤ Nreg × Nvec.    (2.2)

This upper bound imposed by the number of available registers means that at most Nreg × Nvec elements can be kept in the registers. This means that instead of iterating over all Co × Wo elements, loop blocking/tiling [38] with

block sizes of Co,b² and Wo,b has to be applied to the two inner-most loops to avoid register spilling that would degrade performance. Applying loop blocking to the original j and k loops decomposes a row from each output channel into smaller output images, each having a row width of Wo,b and an output channel count of Co,b. Since loop blocking decomposes the overall convolution into smaller convolutions, the loop ordering previously described remains applicable. However, we now need to determine how to traverse over the smaller convolutions. The final algorithm applying loop blocking is shown in Algorithm 3. The loops j′ and k′ iterate over the blocks in the channel and row dimensions of the output, respectively. We make the observation that accessing input elements in the same row will also require us to access kernel weights in the same row. This suggests that the ordering of this loop should be similar to that of the loops traversing across the kernel weights. As such, the k′ loop is nested between the ℓ and n loops. The j′ loop is set to be the outer-most loop since it is a parallel loop that facilitates parallelization.

Cache blocking. On architectures with more levels in the memory hierarchy, i.e., architectures with caches, we can further partition the input dataset into smaller partitions such that they fit into the appropriate levels of

² Co,b is chosen to be a multiple of the vector length Nvec so that SIMD instructions can be better used for computation.

the cache. Recall that the loops around jj and kk accumulate Hf × Wf × Ci intermediate results into the output stored in the registers. Since Hf and Wf, i.e., the size of the kernel weights, are typically smaller than Ci, we choose to partition the i loop, which iterates over the Ci input channels, for the next level in the memory hierarchy. The final algorithm for high performance direct convolution is shown in Algorithm 3.

Algorithm 3: Parallelized Direct Convolution Algorithm
Input: Input I, Kernel Weights F, stride s
Output: Output O
for j′ = 1 to Co/Co,b in parallel do
    for i′ = 1 to Ci/Ci,b do
        for ℓ = 1 to Ho do
            for k′ = 1 to Wo/Wo,b do
                for n = 1 to Hf do
                    for m = 1 to Wf do
                        for ii = 1 to Ci,b do
                            for kk = 1 to Wo,b do
                                for jj = 1 to Co,b do
                                    O[j′·Co,b+jj, k′·Wo,b+kk, ℓ] += I[i′·Ci,b+ii, s·(k′·Wo,b+kk)+m, ℓ·s+n] × F[i′·Ci,b+ii, j′·Co,b+jj, m, n]
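A C-level sketch of Algorithm 3 is given below, with the j′ loop parallelized using OpenMP. As in the earlier naive sketch, the flat row-major layouts (O: Co × Wo × Ho, I: Ci × Wi × Hi, F: Ci × Co × Wf × Hf) and the assumption that Co, Ci, and Wo are divisible by Co_b, Ci_b, and Wo_b are illustrative simplifications; the tuned implementation uses the convolution-friendly layouts of Section 2.5.

    #include <omp.h>

    // Blocked, parallelized direct convolution: a C sketch of Algorithm 3.
    void blocked_conv(float *O, const float *I, const float *F,
                      int Ci, int Co, int Ho, int Wo, int Hf, int Wf,
                      int Hi, int Wi, int s, int Ci_b, int Co_b, int Wo_b)
    {
        #pragma omp parallel for   /* j' loop: one block of output channels per thread */
        for (int jp = 0; jp < Co / Co_b; jp++)
          for (int ip = 0; ip < Ci / Ci_b; ip++)
            for (int l = 0; l < Ho; l++)
              for (int kp = 0; kp < Wo / Wo_b; kp++)
                for (int n = 0; n < Hf; n++)
                  for (int m = 0; m < Wf; m++)
                    for (int ii = 0; ii < Ci_b; ii++)
                      for (int kk = 0; kk < Wo_b; kk++)
                        for (int jj = 0; jj < Co_b; jj++) {
                            int j = jp * Co_b + jj;   /* output channel */
                            int k = kp * Wo_b + kk;   /* output column  */
                            int i = ip * Ci_b + ii;   /* input channel  */
                            O[(j * Wo + k) * Ho + l] +=
                                I[(i * Wi + (k * s + m)) * Hi + (l * s + n)] *
                                F[((i * Co + j) * Wf + m) * Hf + n];
                        }
    }

Because different values of j′ write to disjoint blocks of output channels, the parallel loop requires no synchronization.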

2.4.2 Parallelism

In order to identify possible parallel algorithms, we first make the observation that all output elements can be computed in parallel. Since the output is a three-dimensional object (Ho × Wo × Co), this means that parallelism can be extracted in at least three different dimensions.

Our direct convolution implementation extracts parallelism in the output channel (Co) dimension. Each thread is assigned a block of output elements to compute, where each block is of size Ho × Wo × Co/p, and p is the number of threads used.

2.5 Convolution-Friendly Data Layout

We propose new data layouts for the input and kernel data so that data is accessed in unit stride as much as possible. This improves data access and avoids costly stalls when accessing data from lower levels of the memory hierarchy. A key criterion in revising the layout is that the output and the input image should have the same data layout. This is because the input of most convolution layers is the output of another convolution layer. Keeping them in the same data layout avoids costly data reshaping between convolution layers. However, to ensure compatibility with original input images, we do not impose the proposed layout on the inputs to the first convolution layer.

2.5.1 Input/Output Layout

We want to access the output data in unit stride. Therefore, we determine the output data layout by considering how the elements are accessed using the loop ordering shown in Algorithm 3. Data accessed in the inner loops should be arranged closer together in memory than data accessed in the outer loops. Five loops (j′, k′, ℓ, kk, jj) iterate over the output data, which suggests

a five-dimensional data layout. However, this is sub-optimal if we were to use it for the input data. This is because Wf elements in an input row are required to compute one output element. With the five-dimensional layout, a row of the input is blocked into blocks of Wo,b elements. This means that output elements that require input elements from two separate Wo,b blocks will incur a large penalty, as these input elements are separated by a large distance in memory. As such, we do not lay out the data according to the kk loop. The proposed input/output layout is shown in Figure 2.3 (left). The output data is organized into sequential blocks of Ho × Wo × Co,b, where in each block, elements are first laid out in the channel dimension, before being organized into a Ho × Wo row-major-order matrix of pencils of length Co,b.
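The proposed input/output layout can be summarized by a single index calculation. The helper below is a sketch under the assumption of zero-based indices; the function name and argument order are ours, not part of the implementation's interface.

    // Offset of output element (channel c, row h, column w) in the proposed layout:
    // the tensor is stored as Co/Co_b sequential blocks of size Ho*Wo*Co_b; inside each
    // block the channel (pencil) dimension is fastest, and the pencils are arranged in
    // row-major order over (h, w).
    static inline long output_offset(int c, int h, int w, int Ho, int Wo, int Co_b)
    {
        long block  = c / Co_b;                 /* which Co_b-wide channel block */
        long within = c % Co_b;                 /* position inside the pencil    */
        return block * (long)Ho * Wo * Co_b     /* skip earlier blocks           */
             + ((long)h * Wo + w) * Co_b        /* row-major over (h, w) pencils */
             + within;
    }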

2.5.2 Kernel Layout

Similar to the input/output layout, we use the loop ordering to determine how to order the kernel weights in sequential memory. Notice that the ℓ, k′, and kk loops in Algorithm 3 iterate over the height and width of the output within a single output channel. As all output elements in the same output channel share the same kernel weights, these loops provide no information as to how the kernel weights should be stored. As such, we only consider the remaining six loops. The kernel layout derived from the remaining six loops is shown in Figure 2.3 (right). The fastest dimension in the kernel layout is the blocked output channel (Co,b) dimension, which is dictated by the inner-most loop. The remaining dimensions, from fastest to slowest, are the blocked input channel (Ci,b), followed by the columns (Wf) and rows (Hf) of the kernel, the input channels (Ci/Ci,b), and finally the output channels (Co/Co,b).

Figure 2.3: Convolution-friendly layout for input/output (left) and kernel weights (right). The output data is organized into sequential blocks of Ho × Wo × Co,b, where in each block, the fastest dimension is the channel dimension, followed by the column and row dimensions of the output. The kernel weights are organized into blocks of Hf × Wf × Co,b × Ci,b. The fastest dimension is the blocked output channel, followed by the blocked input channel, kernel width and height, input channels, and then the output channels.

2.5.3 Backward Compatibility

Given the successful deployment of convolutional neural networks (CNNs) in the field, the proposed change in data layout means that trained networks are unable to directly benefit from our proposed direct convolution implementation. However, in order for a trained network to use our proposed algorithm, there is only a one-time cost of rearranging the kernel weights into the proposed data layout. Other network layers such as skip layers [39] and activation layers are point-wise operations that should not require any significant change in

the implementation. Nonetheless, reordering the loops used to compute these layers will likely yield better performance.
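The one-time repacking of trained kernel weights can be sketched as follows. The source weight order (Co, Ci, Hf, Wf) is an assumption matching common framework conventions and may need to be adjusted; the destination order follows Section 2.5.2, and Co and Ci are assumed to be divisible by Co_b and Ci_b.

    // Repack trained kernel weights into the proposed layout
    // (fastest -> slowest: Co_b, Ci_b, Wf, Hf, Ci/Ci_b, Co/Co_b).
    void repack_kernel(float *dst, const float *src,
                       int Co, int Ci, int Hf, int Wf, int Co_b, int Ci_b)
    {
        for (int co = 0; co < Co; co++)
          for (int ci = 0; ci < Ci; ci++)
            for (int n = 0; n < Hf; n++)         /* kernel row    */
              for (int m = 0; m < Wf; m++) {     /* kernel column */
                  long jp = co / Co_b, jj = co % Co_b;
                  long ip = ci / Ci_b, ii = ci % Ci_b;
                  long d = ((((jp * (Ci / Ci_b) + ip) * Hf + n) * Wf + m)
                            * Ci_b + ii) * Co_b + jj;
                  long s = (((long)co * Ci + ci) * Hf + n) * Wf + m;  /* assumed source order */
                  dst[d] = src[s];
              }
    }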

Table 2.1: Details of the specific architectures used

                    Intel         AMD            ARM
                    i7-4770K      FX(tm)-8350    Cortex-A57
    Architecture    Haswell       Piledriver     ARMv8
    Frequency       3.5 GHz       4 GHz          1.1 GHz
    Cores           4             4              2
    Nvec            8             8              4

2.6 Experiments

In this section, we present performance results of our direct CNN implementation against existing convolution approaches on a variety of architectures. A mix of traditional CPU architectures (Intel and AMD) and an embedded processor (ARM) is chosen.

2.6.1 Experimental Setup

Platform. We run our experiments on the Intel Core i7-4770K, AMD FX(tm)-8350, and ARM Cortex-A57 architectures. The architectural details of these platforms are shown in Table 2.1. Software. We implement our direct convolution using techniques from the HPC community [40]. We compare the performance of our direct convolution implementation against matrix-multiplication based convolution linked to high performance BLAS libraries. For matrix-multiplication based convolution, the input data is first packed into the appropriate matrix using Caffe's im2col routine before a high performance single-precision matrix-multiplication (SGEMM)

routine is called. The SGEMM routine used is dependent on the architecture. On the Intel architecture, we link to Intel's Math Kernel Library (MKL) [41], while OpenBLAS [21] is used on the other two architectures. We also provide a comparison against the FFT-based convolution implementation provided by NNPACK [32], a software library that underlies the FFT-based convolutions in Caffe 2 [42]. As NNPACK provides multiple FFT-based (inclusive of Winograd) implementations, we only report the performance attained by the best (fastest) implementation. We use the benchmark program supplied by NNPACK to perform our tests. Benchmarks. All implementations were run against all convolution layers found in AlexNet [43], GoogLeNet [44] and VGG [45]. The different convolution layers in these three CNNs span a wide range of sizes of input, output, and kernel weights. They are also commonly used as benchmarks for demonstrating the performance of convolution implementations.

2.6.2 Performance

The relative performance of the different implementations, normalized to the SGEMM + packing method, is shown in Figure 2.4. Our direct convolution implementation outperforms all SGEMM-based convolutions on all architectures by at least 10% and up to 400%. Our direct convolution outperforms SGEMM even when the BLAS library (MKL) optimizes for the appropriate matrix shapes arising from convolution. Against a BLAS library (OpenBLAS) that only optimizes for HPC matrices, we see a minimum of 1.5 times performance

gain on 4 threads. In comparison with the FFT-based implementations provided by NNPACK, the direct convolution implementation significantly outperforms the FFT-based implementations for all layers on the ARM. As FFTs are known to be memory-bandwidth bound, we suspect that the FFT may be the bottleneck on a smaller architecture such as the ARM, where the available bandwidth may be limited. On the Intel architecture, the results are similar, with direct convolution outperforming the FFT-based implementations. However, in this case the

FFT-based implementations are able to outperform the SGEMM-based approach only when the dataset is "sufficiently large" to amortize the cost of performing the FFT itself. The AMD architecture is not supported by NNPACK.

2.6.3 Parallel Performance

In Figure 2.5, we compare the scalability of our convolution implementation as the number of threads increases. On all architectures, we report the performance per core of the multi-threaded implementations, normalized to the performance attained on one thread. Notice that the performance per core of the existing matrix-multiplication based convolutions decreases significantly as we increase the number of threads. This is an indication that, as the number of threads increases, the processors are utilized less efficiently by the existing matrix-multiplication based implementations. Our direct CNN implementation demonstrates minimal drop in performance per core as we increase the number of threads. Only when the number of threads is twice the number of physical cores does the performance per core of our implementation drop significantly. This is expected and important, as it indicates that our implementation utilizes the compute units effectively; increasing the number of threads beyond the number of physical compute units creates excessive contention for the compute resources, thereby resulting in a sharp drop in performance per core.

Figure 2.4: Performance of direct convolution against existing high performance FFT-based and SGEMM-based convolution implementations (speedups for AlexNet, VGG, and GoogLeNet, with panels for the Intel Haswell, ARM Cortex-A72, and AMD Piledriver). The performance of all implementations is normalized to the performance of the SGEMM + im2col routine. Direct convolution is highly competitive against all other implementations, achieving between 10% and 400% improvement in performance, even against a BLAS library (Intel MKL) that optimizes for the matrix shapes arising from convolution layers.

Figure 2.5: Scaling behavior with increasing number of threads (speedup per thread normalized to one thread, for AlexNet, VGG, and GoogLeNet on the Intel Haswell, ARM Cortex-A72, and AMD Piledriver). Our direct convolution implementation retains high GFLOPS per core as we increase the number of threads from 1 to the number of available cores. This is indicative of an efficiently parallelized algorithm. When the number of threads exceeds the number of cores, excessive contention results in a significant drop in performance per core. In contrast, SGEMM has poor scalability even when the number of threads is low (e.g., 2).

2.7 Chapter Summary

In this chapter, we demonstrate that direct convolution, a computational technique largely ignored for computing convolution layers, is competitive with the existing state of the art in convolution layer computation. We show that a high performance direct convolution implementation not only eliminates all additional memory overhead, but also attains higher performance than expert-implemented matrix-matrix-multiplication based convolution. We also show that our implementation scales to a larger number of processors without degradation in performance, as our implementation exploits the dimension of the kernel that has the highest amount of parallelism. In contrast, current high performance matrix-multiply based implementations do not scale as well to a larger number of processors. Our direct convolution implementation currently attains 87.5%, 58.2%, and 88.9% of the theoretical peak of the Intel, AMD, and ARM architectures, whereas SGEMM on HPC matrices attains 89%, 54%, and 92% of peak on the same architectures. While we have shown that our direct convolution

implementation is competitive (within 3% of peak SGEMM performance), we believe that the performance gap between our direct convolution and SGEMM on HPC matrices can be closed by taking an auto-tuning [46, 12] or analytical approach [47, 37] to identifying the blocking parameters of the different loops. These approaches will also allow the exploration of different combinations of parallelism to determine suitable parallelization strategies. This is something we intend to pursue in the near future. Another possible direction arising from this work is to use similar design techniques to optimize the backward pass, which updates both the image and the kernel. Given the similarity of the forward and backward passes, we believe that only minor changes to the loop ordering are required. Finally, we believe that our direct convolution algorithm can be ported to the GPU. Our proposed data layouts are similar to the layout required for the StridedBatchedGemm operation [48]. As this operation and data layout are currently supported on Nvidia GPUs using cuBLAS 8.0 [49], this lends support to our belief that our algorithm can be easily ported to the GPU.

Chapter 3

Sparse Computational Kernels: A Preliminary Study

Contents

3.1 Expressing Computation Kernels in Sparse Applications
3.2 Accelerating Sparse Computation: A Case Study with Triangle Counting
    3.2.1 Introduction
    3.2.2 Algorithm
    3.2.3 Parallel Triangle Counting
    3.2.4 Preliminary Exploration on Optimizations
    3.2.5 Profiling and Analysis
    3.2.6 Understanding the Optimizations
    3.2.7 Performance Results

Sparse computations are important kernels used in many artificial intelligence and machine learning applications. For example, many social analytics tasks such as anomaly detection and community classification are built upon graph methods, and graph computations can be described from the perspective of sparse matrix computations. Graph convolutional neural networks (GNNs) are an emerging class of machine learning models used in tasks such as learning molecular fingerprints and predicting protein interfaces. GNNs work with graph-structured data, and such data is also an example of a sparse matrix format. Sparse neural networks are another sparse application that has attracted tremendous attention in machine learning. Sparse neural networks can greatly reduce the model size of the original dense neural networks by pruning a large fraction of the connecting edges. As a result, the deep neural network models can fit into small systems or devices. Sparse kernels are challenging to optimize compared to dense kernels due to their notoriously irregular computation patterns. In this chapter, we demonstrate that many sparse computations are built upon set primitives such as intersections. We then present strategies that leverage the data parallelism in modern systems to accelerate those operations, which were originally irregular and difficult to parallelize.

3.1 Expressing Computation Kernels in Sparse Applications

Sparse kernels for graph applications. Matrix algebra has been a useful tool in graph theory as an alternative way to formulate graph algorithms [50]. As a matter of fact, the representation of a graph as a collection of vertices and edges and its representation as a matrix are considered interchangeable formats and have been equally important in decades of graph theory research. Researchers have leveraged the duality between the graph and matrix representations and proposed standards for implementing graph libraries [51, 52]. The NIST Sparse Basic Linear Algebra Subprograms (BLAS) was one of the first standards for sparse problems, adopting conventions similar to those of the dense linear algebra subroutines (BLAS) [53]. It classifies the sparse operations into three levels — Sparse Vector (Level 1), Matrix Vector (Level 2), and Matrix Matrix (Level 3) operations. These BLAS were designed for solving the kinds of sparse linear algebra operations that arise in finite element simulation techniques. More recently, researchers have extended such sparse algebra standards to represent more general graph algorithms — GraphBLAS. In particular, GraphBLAS extends the BLAS idea to address the needs of graph algorithms by generalizing the pair of operations involved in the computations to the notion of a semiring. A semiring refers to a certain ⊕ and ⊗ combination over a certain set of scalars. Some common semiring examples include:

• standard arithmetic, where ⊗ = ×, ⊕ = +, and A, B, C ∈ R

• max-plus semiring, where ⊕ = max, ⊗ = +, and A, B, C ∈ {−∞} ∪ R

• max-min semiring, where ⊕ = max, ⊗ = min, and A, B, C ∈ {∞} ∪ R≤0

With this more general form of sparse matrix multiply, a wide range of graph algorithms can be implemented. We will discuss this in more detail in Section 3.2.

Sparse kernels for neural networks. The primary mathematical operation performed by a DNN can be captured using sparse matrix operators as well [54]. In particular, the inference process in a deep neural network is a step executed repeatedly during training to determine both the weight matrices Wk and the bias vectors bk of the neural network. The inference computes the following equation:

yk+1 = h(Wk yk + bk),    (3.1)

where h is a non-linear function applied to each element of the vector. An example of such a function is the rectified linear unit (ReLU), given by h(y) = max(y, 0), which sets values less than 0 to 0 and leaves other values unchanged. When training a DNN, it is common to compute multiple yk vectors at once, which can be denoted as the matrix Yk. In matrix form, the inference step becomes Yk+1 = h(Wk Yk + Bk). Therefore, the inference computation can be rewritten as a linear function over semirings, Yk+1 = Wk Yk ⊗ Bk ⊕ 0, where ⊕ = max and ⊗ = +.
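A scalar sketch of one inference step, Yk+1 = h(Wk Yk + Bk) with h = ReLU, is shown below. The dense row-major matrices and dimension names are chosen purely for illustration; in the sparse setting, Wk would be stored in a compressed sparse format.

    // One ReLU inference step: Y_next = max(W * Y + B, 0).
    // W is (n_out x n_in), Y is (n_in x n_batch), B and Y_next are (n_out x n_batch).
    void relu_inference_step(float *Y_next, const float *W, const float *Y,
                             const float *B, int n_out, int n_in, int n_batch)
    {
        for (int i = 0; i < n_out; i++)
            for (int j = 0; j < n_batch; j++) {
                float acc = B[i * n_batch + j];
                for (int k = 0; k < n_in; k++)
                    acc += W[i * n_in + k] * Y[k * n_batch + j];
                // The bias-add and ReLU correspond to the semiring form
                // Y_{k+1} = (W Y) (+) B (max) 0 with ⊕ = max and ⊗ = +.
                Y_next[i * n_batch + j] = acc > 0.0f ? acc : 0.0f;
            }
    }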

3.2 Accelerating Sparse Computation: A Case Study with Triangle Counting

In this section, we use the triangle counting task as an example to investigate the computational kernels in graph applications. In the subsequent sections, we focus on the acceleration of these sparse computational kernels.

3.2.1 Introduction

As triangle counting is becoming a widely-used building block for a large number of graph analytics applications, there is a growing need to make it run fast and scale on large parallel systems. In this section, we conduct a preliminary exploration of optimizations for triangle counting algorithms on shared-memory systems with large datasets. Graph analytics is becoming increasingly important in a growing number of domains, and there is a need to build fast and scalable systems for graph analytics. Nowadays, there is a growing trend toward designing graph analytics engines for shared-memory systems [4, 5, 6]. Shared-memory systems have lower communication costs and lower data access latency. This can potentially lead to better performance compared with distributed-memory systems. On top of that, a state-of-the-art high-end multi-core machine can integrate hundreds

of cores and terabytes of memory [55]. This further enables shared-memory systems to process large-scale datasets in memory. In this section, we explore one of the basic applications in graph analytics: triangle counting. Triangle counting is one of the frequently computed building block operations in graph analytics, and it is therefore important to make it run fast and scale to large systems. There is a rich set of graph frameworks that target a general set of graph applications. Those frameworks have different frontend designs [2, 3, 4, 5, 6, 7]. For example, [2] is based on a vertex-iterator programming model where a utility function is supplied and executed on each vertex, [3] uses a linear algebra language as its primitives, and others are based on domain-specific languages. However, because those graph frameworks target a broad set of graph applications, their backend optimizations may be too general to be effective for triangle counting. In this section, we study optimizations specific to triangle counting on shared-memory systems. There are different algorithms for triangle counting, and for each algorithm one could have different implementations, including a variety of hashing and merging tweaks. In order to get optimal performance for triangle counting, it is essential to consider system characteristics, as well as the input graph's properties, to find an effective optimization method. This chapter conducts a preliminary exploration of whether such optimizations can be effective for large-scale triangle counting on shared-memory multi-core systems. The structure of this chapter is as follows: the discussion of the triangle counting algorithm is presented in Section 3.2.2.

Then the baseline parallel implementation is given in Section 3.2.3. Section 3.2.4 discusses a variety of optimization techniques and analyzes whether they can be effective in practice. The performance results are presented in Section 3.2.7.

3.2.2 Algorithm

The straightforward algorithm for triangle counting using the linear algebra language can be illustrated as [56, 57]:

A² ◦ A,    (3.2)

where A is the adjacency matrix of the input graph. The square of A gives all the wedges (connected pairs of edges), and the element-wise multiplication with A closes the wedge and forms a triangle. Afterwards, the algorithm sums over each element in the resulting matrix. However, in this algorithm each triangle is counted six times. An improvement that reduces this redundant counting in the baseline algorithm uses only half of the adjacency matrix, so that each triangle is counted once. There is an algorithm that can further reduce the triangle counting complexity. The compact-forward algorithm in [58] can outperform the above two baseline algorithms by greatly reducing the number of false positives. The algorithm consists of two steps. The first step is direction assignment: it assigns a direction to each undirected edge such that the edge points from the

low-degree vertex to the high-degree vertex. The second step then counts the number of triangles based on the directed graph. There are two benefits to introducing a direction on the edges. First of all, it avoids double counting. Second, it reduces the (out-)degree of nodes, especially those with high degrees. As the counting has quadratic complexity in a vertex's (out-)degree, and the high (out-)degree vertices dominate the computation among all vertices, this can greatly reduce the complexity.
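The direction-assignment step can be sketched as follows; the edge-list inputs and the tie-breaking rule (by vertex id) are our illustrative assumptions, and a real implementation would typically build a CSR structure of the resulting directed graph afterwards.

    // Orient each undirected edge (u, v) from the lower-degree endpoint to the
    // higher-degree endpoint, breaking ties by vertex id.
    static int points_from_u(int u, int v, const int *degree)
    {
        return degree[u] < degree[v] || (degree[u] == degree[v] && u < v);
    }

    void assign_directions(const int *edge_u, const int *edge_v, long num_edges,
                           const int *degree, int *out_src, int *out_dst)
    {
        for (long e = 0; e < num_edges; e++) {
            int u = edge_u[e], v = edge_v[e];
            if (points_from_u(u, v, degree)) { out_src[e] = u; out_dst[e] = v; }
            else                             { out_src[e] = v; out_dst[e] = u; }
        }
    }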

3.2.3 Parallel Triangle Counting

The most time consuming part of the compact-forward algorithm is the second step: counting triangles in the directed graph. The pseudocode for this step is shown in Algorithm 4. In this step, the algorithm iterates over every vertex v, and for each one of its (directed) neighbors u, sums up the number of common neighbors. We first run a baseline parallel triangle counting algorithm on the shared-memory systems. The baseline implementation is based on [59]. The implementation simply parallelizes the loop over the vertices v (line 2) and the subsequent loop over the neighborhood of v (line 3); a C sketch of this parallelization is given after Algorithm 4.

Algorithm 4: count triangles in directed graph
Input: directed graph G
Output: count: number of triangles
1  count ← 0
2  for v ∈ G.vertices do
3      for u ∈ v.neighbors do
4          count += common neighbors of u and v
5  return count
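The sketch below illustrates the baseline parallelization described above, assuming a CSR representation (row_ptr/col_idx) of the directed graph and a sorted-merge count_common helper that corresponds to line 4 of Algorithm 4; the names and data layout are ours.

    #include <omp.h>

    static long count_common(const int *a, long na, const int *b, long nb)
    {
        long i = 0, j = 0, c = 0;
        while (i < na && j < nb) {
            if      (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { i++; j++; c++; }
        }
        return c;
    }

    // Parallel triangle counting over a directed graph in CSR form.
    long count_triangles(const long *row_ptr, const int *col_idx, int num_vertices)
    {
        long total = 0;
        #pragma omp parallel for reduction(+:total) schedule(dynamic, 64)
        for (int v = 0; v < num_vertices; v++) {            /* line 2 of Algorithm 4 */
            const int *nv = &col_idx[row_ptr[v]];
            long dv = row_ptr[v + 1] - row_ptr[v];
            for (long t = 0; t < dv; t++) {                  /* line 3 */
                int u = nv[t];
                const int *nu = &col_idx[row_ptr[u]];
                long du = row_ptr[u + 1] - row_ptr[u];
                total += count_common(nv, dv, nu, du);       /* line 4 */
            }
        }
        return total;
    }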

In terms of the counting phase in line 4, the implementation explores two basic ways to implement the counting: a hashing-based method and a sort-merge based method. The author finds the sort-merge based implementation to be faster than the hash-based method.

3.2.4 Preliminary Exploration on Optimizations

Counting the common elements of two vertices' neighbor lists is the most time consuming computation. In this section, we mostly focus on optimizations for the counting phase in line 4. Counting the common elements of two neighbor lists is essentially finding the intersecting elements between two sets. Therefore, optimizations explored previously for set-intersection problems may be applied here.

Algorithm 5: count common elements in two sorted lists
1   i ← 0, j ← 0, count ← 0
2   while i < A.size AND j < B.size do
3       if A[i] < B[j] then
4           i++
5       else
6           if A[i] > B[j] then
7               j++
8           else
9               i++, j++, count++
10  return count

Hashing. One technique to optimize the set intersection operation is to use hashing. The hashing method maps each vertex's neighborhood into a corresponding hash table. To count the common elements in two vertices' neighborhoods, we iterate through the elements of the smaller neighborhood list and probe the hash table of the bigger neighborhood. The cost of the hashing-based set intersection method is min(n1, n2), where n1 is the size of the smaller list and n2 is the size of the larger list.

Merging is another technique to count the common elements in two lists. It first sorts each vertex's neighborhood list in increasing order and counts the common elements of the two sorted lists by a linear scan through them. The baseline code to count the intersection of two sorted lists via the merging method is shown in Algorithm 5. The cost of the merging-based set intersection is max(n1, n2).

Exploiting binary search to accelerate merging. When the two list sizes are quite different, we can binary search for each element of the smaller list in the larger one. The cost of the counting phase is then reduced to n1 log(n2), where n1 is the size of the smaller list. A sketch of this variant is given below.
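The following is our own illustration of the binary-search variant: each element of the (assumed smaller) sorted list A is looked up in the larger sorted list B.

    // Count common elements by binary-searching each element of A in B;
    // cost is roughly n1 * log(n2) when A is the smaller list.
    static int contains(const int *b, long nb, int key)
    {
        long lo = 0, hi = nb - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            if (b[mid] == key) return 1;
            if (b[mid] < key)  lo = mid + 1;
            else               hi = mid - 1;
        }
        return 0;
    }

    long count_common_binary(const int *a, long na, const int *b, long nb)
    {
        long count = 0;
        for (long i = 0; i < na; i++)
            count += contains(b, nb, a[i]);
        return count;
    }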

Exploiting SIMD to accelerate merging. There has been plenty of work exploring how to utilize SIMD units to accelerate set intersection computation [60, 61, 62]. We implemented a baseline SIMD algorithm based on [62]. Assuming the system SIMD width is four elements, the SIMD intersection algorithm is illustrated with an example in Figure 3.1. In this example, during the first iteration the first four elements of lists A and B are compared (the instructions used are explained subsequently). Based on the comparison of the tail elements (i.e., the 4th elements), one of the pointers is advanced.

Figure 3.1: SIMD algorithm, assuming the SIMD width is 4 elements.

In the example, the pointer of list B advances to the next 4-element block because its 4th element is smaller than that of list A. In the second iteration, the first four elements of A are compared with the second 4-element block of B, and finally the second 4-element blocks of A and B are compared. On CPUs with the AVX instruction set architecture, the algorithm exemplified in Fig. 3.1 can be implemented using the _mm_cmpeq and _mm_shuffle intrinsics. _mm_cmpeq compares multiple elements at once and produces a resulting bit mask. In order to conduct all-to-all comparisons between A and B, one _mm_cmpeq is not enough: the SIMD block of B needs to be shifted using the _mm_shuffle intrinsic and then compared with A again, until the right-most element of B's block has been compared with the left-most element of A. The process is illustrated in Fig. 3.2, and a sketch of this comparison step is given below.
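One possible realization of this all-pairs comparison using SSE intrinsics is sketched below; it uses three cyclic rotations of B's block, which is one way to achieve the shifting described above, and it assumes 32-bit integers with no duplicates inside a block. The function name is ours.

    #include <immintrin.h>

    // All-pairs equality test between one 4-element block of A and one
    // 4-element block of B, in the spirit of Fig. 3.2 (a sketch, not the tuned kernel).
    static int count_block_matches(const int *a_block, const int *b_block)
    {
        __m128i va = _mm_loadu_si128((const __m128i *)a_block);
        __m128i vb = _mm_loadu_si128((const __m128i *)b_block);

        // Compare A against B and against the three cyclic rotations of B,
        // so that every element of A is compared with every element of B.
        __m128i m0 = _mm_cmpeq_epi32(va, vb);
        __m128i m1 = _mm_cmpeq_epi32(va, _mm_shuffle_epi32(vb, _MM_SHUFFLE(0, 3, 2, 1)));
        __m128i m2 = _mm_cmpeq_epi32(va, _mm_shuffle_epi32(vb, _MM_SHUFFLE(1, 0, 3, 2)));
        __m128i m3 = _mm_cmpeq_epi32(va, _mm_shuffle_epi32(vb, _MM_SHUFFLE(2, 1, 0, 3)));

        __m128i any = _mm_or_si128(_mm_or_si128(m0, m1), _mm_or_si128(m2, m3));
        int mask = _mm_movemask_ps(_mm_castsi128_ps(any)); // one bit per lane of A that matched
        return __builtin_popcount(mask);                   // GCC/Clang builtin
    }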

Figure 3.2: SIMD algorithm illustration.

Hybrid hashing and merging. A hybrid method represents a node's neighborhood with a hybrid structure between a sparse list and a hash table. The hybrid method partitions the data range into fixed-size (K) blocks. Each block is indexed by its base value and has an associated bit-vector of length K. Each bit of the bit-vector indicates the existence of an element at that position. For example, given the list {0, 1, 4, 52, 102, 493, 534} and K = 64, the list can be compressed into the base list {0, 64, 448, 512}. We can see that the number of entries to compare reduces from 7 elements to 4 blocks. In this way, the number of comparisons can be reduced accordingly. To count the common elements in the compressed block lists, we first scan through the block lists to find blocks with common base values via the merge-based method, and then count the common elements in two matching blocks by comparing the two corresponding indicator bit-vectors. A sketch of the block compression is given below.
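The following sketch builds the compressed block representation (base values plus per-block bit-vectors) with K = 64 so that each bit-vector fits in one 64-bit word; the function and array names are illustrative, and values are assumed non-negative.

    #include <stdint.h>

    // Compress a sorted list into (base, bit-vector) blocks with K = 64.
    // bases[b] holds the block's base value (a multiple of 64) and bits[b] has one bit
    // set per element of the original list that falls into that block.
    // Returns the number of blocks; bases/bits must have room for n entries.
    long compress_blocks(const int *list, long n, int64_t *bases, uint64_t *bits)
    {
        long nblocks = 0;
        for (long i = 0; i < n; i++) {
            int64_t base = (int64_t)(list[i] / 64) * 64;
            if (nblocks == 0 || bases[nblocks - 1] != base) {   /* start a new block */
                bases[nblocks] = base;
                bits[nblocks]  = 0;
                nblocks++;
            }
            bits[nblocks - 1] |= (uint64_t)1 << (list[i] - base);
        }
        return nblocks;
    }

For the example list {0, 1, 4, 52, 102, 493, 534}, this produces the base list {0, 64, 448, 512}, matching the description above.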

3.2.5 Profiling and Analysis

Whether the above algorithms will work largely depends on the properties of the dataset. In this section, we start with the profiling results of some representative graph datasets in order to examine the effectiveness of the above

techniques.

Cit-Patents. This dataset has 3,774,768 vertices and 16,518,947 edges. The original maximum degree is 793. The maximum degree after direction assignment is 73. The degree distribution is shown in Fig. 3.3. We can see that after direction assignment, the out-degree distribution still follows a power-law distribution, and the majority of vertices have small out-degrees.

Friendster. This dataset has 65,608,366 vertices and 1,806,067,135 edges. The original maximum degree among all the nodes is 5214. The maximum out-degree in the directed graph is 2389. The distributions are shown in Fig. 3.4.

Graph500-scale23. This dataset has 4,606,314 vertices and 129,250,705 edges. The original maximum degree is 272176. The maximum degree after direction assignment is 1376. The degree distribution is shown in Fig. 3.5.

We highlight the following observations based on the profiling:

Observation 1. In the directed graph, the number of nodes with high out-degrees is small. However, intersections between those high out-degree nodes' neighbor lists dominate the total run time. This is not difficult to verify. If we assume that, after reordering the rows of the graph's adjacency matrix from highest degree to lowest, the number of non-zero elements per row (nnz, equivalently the out-degree) follows a power-law distribution over the rows, then the comparison cost per row will also be skewed and resemble a power-law distribution; this is plotted in Figure 3.6.


Figure 3.3: Degree distribution of cit-Patents. (a) Original degree distribution. (b) Out-degree distribution after direction assignment.

Figure 3.4: Degree distribution of friendster. (a) Original degree distribution. (b) Out-degree distribution after direction assignment.

Figure 3.5: Degree distribution of graph500-scale23. (a) Original degree distribution. (b) Out-degree distribution after direction assignment.

If the nnz counts follow a power-law distribution of x^{-1}, then across all pairwise list (i.e., row) comparisons, the top 40% of rows account for 50% of the total comparison cost. If the nnz distribution follows x^{-2}, then the top 10% of rows take up over 50% of the total comparison cost. In general, the more skewed the nnz distribution of the graph's adjacency matrix, the more the computation cost is skewed towards the dense part of the matrix. Figure 3.7 shows the measured time and computation cost distribution on some of the given datasets. Consistent with the theoretical results, the computation cost is mostly concentrated in the beginning rows with larger out-degrees.

Observation 2. The density of the high-degree nodes' neighbor lists is still very low. Each node's neighbor list is a sequence of numbers whose values lie in the range from 0 to N − 1. We introduce the metric density to measure the sparsity of a vertex's neighborhood vector: density is the number of elements divided by their maximum value. The smaller the density, the sparser the neighborhood list. Knowing the density is helpful for determining the hashing parameters, as a sparser neighborhood list may require a larger hash table. In the directed graph, the vertices with higher degrees usually have higher density. For the scale23 graph, the highest-density neighborhood has a density of around 0.763, and the top 10% of high out-degree vertices have densities between 0.1 and 0.7. For the friendster graph, the highest density is 0.01, and the top 10% of high out-degree vertices have densities between 0.001 and 0.01. We can see that as a graph scales, the density of a vertex's neighborhood can become sparser.

Figure 3.6: Theoretical computation distribution and non-zero element distribution on graphs with different skewness.

Figure 3.7: Non-zero element distribution and computation time distribution across rows, for the scale23 and friendster datasets.

3.2.6 Understanding the Optimizations

Hashing. Using the hashing method, we did not see any improvement in speed. Although the complexity of the hashing method is min(n1, n2), it is most effective when there is at least an order of magnitude difference between n1 and n2 (i.e., max(n1, n2) ≫ min(n1, n2)). But profiling shows that the comparisons where max(n1, n2) > 10 × min(n1, n2) only make up a small part of all pairwise list comparisons, consistent with our Observation 1. In other words, the majority of the time is spent comparing lists where n1 ≈ n2 and both n1 and n2 are large. Therefore, the hashing technique, which helps accelerate the intersection of a dense list with a sparse list, does not yield an observable performance improvement here.

Exploiting binary search to accelerate merging. Similar to the hashing method, the binary search method does not accelerate the computation, because binary search also requires the sizes of the two sorted lists to be very different in order to yield an observable speedup.

Exploiting SIMD to accelerate merging. This method can be applied regardless of the relative sizes of the two lists. Therefore, it can effectively accelerate the CountCommon step for two big lists. We show the results in the next section.

Hybrid hashing and merging. Whether hybrid hashing and merging can be effective highly depends on the density of the neighbor lists: whether a list has elements condensed together that can be compressed. Otherwise, the size reduction in the compressed block list is not sufficient. We studied the Friendster dataset, and the result suggests the original neighbor lists are not amenable to such compression. As shown in Table 3.1, the base list size after compression with 256-bit blocks only reduces from 1,806,067,135 elements to 1,690,408,139 elements (about a 7% reduction). Moreover, the comparison operation becomes more expensive when there is a match in the base value. One needs a very delicate design in order to draw benefits from such methods.

Table 3.1: Compression savings on the Friendster dataset. Its original edge list size is 1,806,067,135.

                       64-bit          128-bit         256-bit
    Cmpr. list size    1,723,833,467   1,711,082,186   1,690,408,139
    Cmpr. saving       5%              6%              7%

3.2.7 Performance Results

Hardware. The systems used in our experiments were provided by the Pittsburgh Supercomputing Center [55, 63, 64]. The experiments were run on three types of hardware:

• Single Core: A single core out of the small multicore system, used to study the scalar performance.

• Small multi-core: A small multicore system which has two Intel Haswell (E5-2695 v3) CPUs (sockets), each with 14 cores.

• Large multi-core: A large shared-memory multicore system from HPE Integrity Superdome X. It has 16 Intel Xeon E7-8880 v4 CPUs (sockets) with 22 cores per socket and a 55MB last-level cache.

Results. We present the performance results on the above three systems in Tables 3.2-3.4.

Single core: Table 3.2 shows the speedup of our SIMD-based set intersection over a scalar implementation on a single core (Intel Haswell, E5-2695 v3) using the small datasets. We can see that SIMD brings about 1.2x to over 3x speedup over the scalar implementation.

Small multi-core: Table 3.3 shows the SIMD speedup on the small multicore system with medium-size datasets. Similar to the single-core results, our SIMD implementation achieves about 1.4x to 3.6x speedup on this multicore system.

Large multi-core: Table 3.4 shows the performance on the large shared-memory multicore system with large datasets. These large datasets are generated using the Kronecker graph generator [65]. The kron35 dataset ran for over 2 days and had not finished by the given submission time frame, so we are unable to give its execution time; we estimate it to be between 60 and 100 hours.

Performance comparison with distributed systems. Last year's Graph Challenge champion, Pearce et al. [66], presented their triangle counting performance on distributed systems. They used up to 256 nodes, where each node has 24 cores. Their total number of cores is higher than ours, and therefore their overall execution time is shorter. In terms of per-core performance, the highest rate they achieved is 1.9M triangles processed per second per core (for the WDC dataset, they counted 9.65T triangles in 808.7s on 256 nodes, each with 24 cores). Our implementation achieves 3.1M triangles per second per core.

Table 3.2: Single-core performance scalar vs. SIMD

    Dataset             V           E            T              Scalar     SIMD       SIMD
                                                                time(s)    time(s)    speedup
    cit-HepTh           27,770      352,285      1,478,735      0.08       0.03       1.22
    cit-Patents         3,774,768   16,518,947   7,515,023      1.23       1.01       2.66
    flickrEdges         105,938     2,316,948    107,987,357    1.57       0.569      2.76
    graph500-scale18    174,147     3,800,348    82,287,285     3.35       1.17       2.86
    graph500-scale19    335,318     7,729,675    186,288,972    8.51       2.84       3.01
    graph500-scale20    645,820     15,680,861   419,349,784    21.9       7.29       3.00
    graph500-scale21    1,243,072   31,731,650   935,100,883    55.6       19.8       2.80
    graph500-scale22    2,393,285   64,097,004   2,067,392,370  142        52.9       2.68

Table 3.3: Small 28-core system performance

    Dataset             V            E              T               Scalar     SIMD       SIMD
                                                                    time(s)    time(s)    speedup
    Friendster          65,608,366   1,806,067,135  4,173,724,142   96.7       67.1       1.44
    graph500-scale23    4,606,314    129,250,705    4,549,133,002   63         17.7       3.56
    graph500-scale25    17,043,780   523,467,448    21,575,375,802  259        72         3.60

Table 3.4: Performance on an HP Superdome X system with 16 sockets, 352 cores

    Dataset    V              E                  T                  Time(s)    Triangles/(sec·core)
    kron26     68,175,120     6,281,609,376      222,966,186,844    202        3.1M
    kron31     1,090,801,920  33,501,916,672     1,380,824,051,328  2572       1.5M
    kron35     1,380,546,180  1,256,845,342,648  —                  >50h       —

Chapter 4

Accelerating Sparse Computational Kernels on CPUs: A Fast SIMD Set Intersection Approach

Contents

4.1 Abstract
4.2 Introduction
4.3 Background and Related Work
    4.3.1 Scalar Set Intersection Approaches
    4.3.2 Accelerating Set Intersections
4.4 The FESIA Approach
    4.4.1 Overview
    4.4.2 Data Structure
    4.4.3 Intersection Algorithm
    4.4.4 Theoretical Analysis
4.5 Bitmap-Level Intersection
4.6 Segment-Level Intersection
    4.6.1 Runtime Dispatch
    4.6.2 Specialized vs. General Set Intersection Kernels
    4.6.3 Implementing Specialized Intersection Kernels with SIMD
4.7 Discussion
4.8 Experiments
    4.8.1 Experimental Setup
    4.8.2 Result of Specialized Intersection Kernels
    4.8.3 Effect of Varying the Input Size
    4.8.4 Effect of Varying the Selectivity
    4.8.5 Performance of Two Sets with Different Sizes
    4.8.6 Performance on Real-World Datasets
4.9 Conclusion

4.1 Abstract

Set intersection is an important operation that is widely used in both database and graph analytics applications. However, existing state-of-the-art set intersection methods only consider the size of the input sets and fail to optimize for the case in which the intersection size is small. In real-world scenarios,

the size of most intersections is usually orders of magnitude smaller than the size of the input sets, e.g., keyword search in databases and common neighbor search in graph analytics. In this chapter, we present FESIA, a new set intersection approach for modern CPUs. The time complexity of our approach is O(n/√w + r), in which w is the SIMD width, and n and r are the size of the input sets and the intersection size, respectively. The key idea behind FESIA is that it first uses bitmaps to filter out unmatched elements from the input sets, and then selects suitable specialized kernels (i.e., small function blocks) at runtime to compute the final intersection on each pair of bitmap segments. In addition, all data structures in FESIA are designed to take advantage of the SIMD instructions provided by vector ISAs with various SIMD widths, including SSE, AVX, and the latest AVX512. Our experiments on both real-world and synthetic datasets show that our intersection method achieves more than an order of magnitude better performance than conventional scalar implementations, and up to 4x better performance than state-of-the-art SIMD implementations.

4.2 Introduction

Set intersection selects the common elements appearing in all input sets, which is a fundamental operation in database applications. For example, given a search query with multiple keywords, a list of documents containing all input keywords can be computed through a set intersection [71]. In addition, set intersection is becoming a critical building block for a wide range of new applications in graph analytics, such as triangle counting [72, 73], neighborhood discovery [74], subgraph isomorphism [75], and community detection [76, 77]. For example, the common friends of two people on social networks can be computed through a set intersection as well.

Table 4.1: Summary of our approach vs. state-of-the-art set intersection approaches.

                          FESIA          BMiss [67]       Galloping [68]               Hiera [69]       Fast [70]
    Complexity            n/√w + r       n1 + n2          n1 log n2                    n1 + n2          n/√w + r
    Small intersect       ✓              ✓                                                              ✓
    SIMD                  ✓              ✓                ✓                            ✓
    Multicore             ✓
    n1 ≪ n2               min(n1, n2)    n1 + n2          n1 log n2                    n1 + n2          n/√w + r
    k-way intersection    kn/√w + r      n1 + ··· + nk    n1(log n2 + ··· + log nk)    n1 + ··· + nk    n/√w + kr
    Portable              ✓                                                                             ✓

Recent work focusing on accelerating set intersection operations in database systems [67, 69, 68] proposed the use of merge-based approaches, in which the runtime cost usually only depends on the size of the input sets. However, in real-world scenarios, the intersection size is usually dramatically smaller than the sizes of the input sets. For example, 90% of the intersections resulting from search queries in Bing are an order of magnitude smaller than the size of the input sets, and the intersection size of 76% of the queries is even two orders of magnitude smaller than the input [70]. A similar result is also observed in graph analytics [78], in which the size of over 90% of the intersections is smaller than 30% of their input size.

In this chapter, we propose FESIA, a fast and efficient set intersection algorithm targeting modern CPU architectures. Our key insight is that a

large number of the comparisons required by existing merge-based set intersection approaches are redundant. These redundancies result in significant overheads, especially when the intersection size is small. To this end, FESIA accelerates set intersections by avoiding these redundant and unnecessary comparisons. Specifically, we take a two-step approach: in the first step, FESIA builds an auxiliary bitmap data structure to prune unnecessary comparisons. Since this earlier pruning may leave some false-positive matches, our approach further compares these elements to produce the final intersection in the second step. Prior work [70] focusing on new data structures for set intersection achieves lower time complexity but does not take advantage of the SIMD instructions available on modern processors. In contrast, other work [67, 69, 68] focusing on vectorizing set intersections with SIMD instructions has higher time complexity. FESIA is the first approach that considers both aspects at the same time. It achieves fast and efficient set intersections for two reasons. Firstly, the new segmented-bitmap data structure and the coarse-grained filtering step make the complexity depend on the intersection size instead of the size of the input sets. In summary, the time complexity of our approach is O(n/√w + r), as in [70], in which w indicates the SIMD width, and n and r indicate the size of the input sets and the intersection size. Secondly, the data structure and intersection algorithm are designed with SIMD in mind, allowing our approach to exploit more data parallelism and to be portable to different SIMD widths. This includes an efficient design of the coarse-grained filtering implementation with bitwise operations, combined with our specialized

SIMD intersection functions for small input sizes, which are more efficient than existing vectorization methods [79, 67]. In summary, this chapter makes the following major contributions:

• We present FESIA, a fast and efficient set intersection approach targeting modern CPUs, which leverages SIMD instructions and simultaneously achieves better complexity.

• We introduce a coarse-grained pruning approach which can be efficiently implemented with bitwise SIMD operations.

• We design a code specialization mechanism to reduce the cost of the fine-grained intersections required after the initial coarse-grained pruning step.

• We describe an implementation of FESIA for two different Intel platforms with SSE, AVX, and AVX512 instructions.

• Our experiments on both real-world and synthetic datasets show that our intersection method achieves more than an order of magnitude better performance than conventional scalar implementations, and up to 4x better performance than state-of-the-art SIMD implementations.

4.3 Background and Related Work

In this section, we describe conventional scalar set intersection approaches and introduce how SIMD instructions are used to accelerate set intersection.

Figure 4.1: Illustrating the data structure and the set intersection algorithm in FESIA. There are two steps in the set intersection: (1) the bitmaps are used to filter out unmatched elements, and (2) a segment-by-segment comparison is conducted to compute the final set intersection using specialized SIMD kernels.

75 4.3.1 Scalar Set Intersection Approaches

Merge-based set intersection is the most common approach to compute the intersection of two (or more) sorted sets, and a C example is shown in Listing 4.1. Suppose there are two sets L1 and L2 of size n1 and n2 respectively, and we use r to denote the number of common elements. The algorithm starts with two pointers that point to the beginning of the two sorted lists. The pointers are iteratively advanced based on the comparison result of the elements they point to. The algorithm finishes when one pointer reaches the end of its list. The time complexity of merge-based set intersection is O(n1 + n2) for two sets and O(n1 + ··· + nk) for k-way set intersection. One limitation of merge-based set intersection is that it cannot be easily extended to exploit multicore parallelism. This is because there exists a loop-carried dependency when the pointers are advanced.

int scalar_merge_intersection(int L1[], int n1, int L2[], int n2) {
    int i = 0, j = 0, r = 0;
    while (i < n1 && j < n2) {
        if (L1[i] < L2[j]) {
            i++;
        } else if (L1[i] > L2[j]) {
            j++;
        } else {
            i++; j++; r++;
        }
    }
    return r;
}

Listing 4.1: A code example of scalar merge-based set intersection

Hash-based set intersection is another popular approach. It builds a hash table from the elements of one set and then probes the hash table with all elements from the other set. The time complexity of hash-based set intersection is O(min(n1, n2)), which makes it the best method when one set is dramatically smaller than the other. As we will discuss in later sections, the time complexity of our approach is the same as hash-based set intersection when the two input sets have dramatically different sizes. In addition, many other data structures have been proposed to compute set intersections, such as treaps [80], skip lists [81], bitmaps [82, 70], and adaptive data structures [83].
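For completeness, the sketch below illustrates the hash-based variant; the open-addressing table, its sizing, the multiplicative hash constant, and the empty-slot sentinel (-1, assuming non-negative, distinct elements) are our own illustrative choices, not a description of a particular library.

    #include <stdlib.h>

    // Hash-based set intersection sketch: build a table from the smaller set,
    // then probe it with the larger set.
    static long next_pow2(long x) { long p = 1; while (p < x) p <<= 1; return p; }

    long hash_intersection(const int *L1, long n1, const int *L2, long n2)
    {
        if (n1 > n2) { const int *t = L1; L1 = L2; L2 = t; long tn = n1; n1 = n2; n2 = tn; }
        long cap = next_pow2(2 * n1 + 1);        /* keep the load factor below 0.5 */
        int *table = malloc(cap * sizeof(int));
        for (long i = 0; i < cap; i++) table[i] = -1;

        for (long i = 0; i < n1; i++) {          /* insert the smaller set */
            long h = (L1[i] * 2654435761u) & (cap - 1);
            while (table[h] != -1) h = (h + 1) & (cap - 1);
            table[h] = L1[i];
        }
        long r = 0;
        for (long j = 0; j < n2; j++) {          /* probe with the larger set */
            long h = (L2[j] * 2654435761u) & (cap - 1);
            while (table[h] != -1 && table[h] != L2[j]) h = (h + 1) & (cap - 1);
            if (table[h] == L2[j]) r++;
        }
        free(table);
        return r;
    }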

77 4.3.2 Accelerating Set Intersections

Data parallelism using SIMD: SIMD instructions have become the status quo on modern CPUs for exploiting data parallelism. For example, Intel CPUs have SSE/AVX2 instructions to support 128-bit and 256-bit vector operations. Similarly, ARM processors have NEON instructions, and IBM Power processors have AltiVec instructions. The prevailing SIMD widths on modern processors are 128-bit and 256-bit. More recently, the Intel Skylake architecture introduced AVX512 instructions. Additionally, Intel SSE4.2 introduced the STTNI instructions for string comparisons, which have been used for all-pair comparisons between two vectors in parallel.

of k-way intersection with binary search is O(n1(log n2 + ··· + log nk)), in which each element from the smallest set (L1) is used as an anchor point and looked up in all other sets.

Hiera [69] is an approach that leverages the STTNI instruction to accelerate merge-based set intersections. Since it is a merge-based approach, its time complexity is O(n1 + n2) on two sets. In addition, a hierarchical data structure is used in Hiera, since the STTNI instruction supports only 8-bit and 16-bit data types. One limitation of Hiera is that its effectiveness highly depends on the data distribution. For example, it downgrades to a scalar approach when the elements in the input sets are sparse. In addition, it is not portable to processors without the STTNI instruction.

Fast [70] is an approach that leverages bitmaps for fast comparisons, and it performs better when the intersection size is small. Its time complexity is O(n/√w + r), in which n, r, and w denote the size of the input sets, the intersection size, and the word size, respectively. This method has a better time complexity compared to other merge-based methods; however, it fails to consider SIMD instructions and may not have competitive performance compared to merge-based methods with SIMD acceleration.
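For reference, the galloping idea — exponentially growing the probe range before finishing with a binary search — can be sketched in scalar C as follows; this is an illustration of the prior work's idea, not the exact implementation evaluated later in this chapter:

/* Scalar galloping intersection sketch: for each element of the smaller,
 * sorted set L1, gallop through the larger sorted set L2 and finish with
 * a binary search. Illustrative only. */
int scalar_galloping_intersection(const int L1[], int n1,
                                  const int L2[], int n2) {
    int r = 0, lo = 0;
    for (int i = 0; i < n1 && lo < n2; i++) {
        /* Gallop: double the step until L2[lo + step] >= L1[i]. */
        int step = 1;
        while (lo + step < n2 && L2[lo + step] < L1[i]) step <<= 1;
        int hi = (lo + step < n2) ? lo + step : n2 - 1;
        /* Binary search for L1[i] in L2[lo..hi]. */
        int l = lo, h = hi;
        while (l <= h) {
            int mid = l + (h - l) / 2;
            if (L2[mid] < L1[i]) l = mid + 1;
            else if (L2[mid] > L1[i]) h = mid - 1;
            else { r++; break; }
        }
        lo = l;   /* later searches can start past the scanned prefix */
    }
    return r;
}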

4.4 The FESIA Approach

In this section, we start with an overview of our FESIA approach. We next explain our data structure and the set intersection algorithm in detail. Finally, we present a theoretical analysis of our approach.

4.4.1 Overview

The intersection process is built upon our segmented-bitmap data structure. In an offline phase, all elements of a set are compressed and encoded into a bitmap of size m. To exploit data-level parallelism, every s bits of the bitmap are further grouped into a segment. When performing an online intersection, the bitmap serves as a filter that quickly removes the unmatched elements between two sets. The online intersection therefore consists of two steps: (1) bitmap intersection: we compare the segmented bitmaps of the two given sets using a bitwise-AND operation and output the segments that intersect, and (2) segment intersection: given a list of segments whose bitmaps intersect (i.e., the result of the bitwise AND is not zero), we go through their corresponding lists and compare the relevant elements from each segment to compute the final set intersection. A larger bitmap size m eliminates more false-positive intersections, but increases the comparison time in step 1. The segment size s affects the intersection sizes in step 2. Both m and s are chosen to minimize the total time; a theoretical analysis of this choice is presented in Section 4.4.4. We now introduce our data structure and set intersection algorithm in more detail.

4.4.2 Data Structure

FESIA is built on our segmented-bitmap data structure. This data structure encodes the elements of a set using a bitmap, and groups the bits as well as the corresponding elements of the bitmap into segments. Specifically, given a set of n elements, the elements are mapped into a bitmap of size m with a universal hash function h. With a properly designed h, all elements are distributed uniformly over the bitmap. For clarity, we assume for now that all bitmaps have the same size m; we will relax this assumption with a simple transformation at the end of Section 4.4.3. Every s bits of the bitmap are grouped into a segment. Since the size of the bitmap can be less than the number of elements in the set, more than one element can be mapped to the same location in the bitmap. Therefore, we associate a list with each segment such that elements mapped to this segment are inserted into the list. All elements in a list are always kept in increasing order. We now introduce the details of our data structure, as shown in Fig. 4.1.

Bitmap is a 0/1 binary bit vector of size m. Size is an array of size m/s, storing the number of elements in each segment. ReorderedSet is an array containing all elements of the set, but in a different order: intuitively, it is a concatenation of all segments, and the elements are sorted in increasing order within each segment. offset is an array of size m/s, storing the starting index of each segment in ReorderedSet. We now use the example in Fig. 4.1 to illustrate how the data structure works.

Example 1. Suppose there are two sets: A = {1, 4, 15, 21, 32, 34} and B = {2, 6, 12, 16, 21, 23}. The size of the bitmap is 12, and we use h(x) = x mod 12 as our hash function. After mapping all elements into the bitmap, we have BitmapA = {010110001110}, sizeA = {2, 1, 3}, offsetA = {0, 2, 3}, ReorderedSetA = {1, 15, 4, 21, 32, 34}, BitmapB = {101010101001}, sizeB = {2, 2, 2}, offsetB = {0, 2, 4}, and ReorderedSetB = {2, 12, 6, 16, 21, 23}.
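To make the layout concrete, a possible C representation of these four arrays and of the offline construction is sketched below; the hash h(x) = x mod m and segment size s = 4 mirror Example 1, but the field names and the structure itself are assumptions of this sketch, not the exact layout of the evaluated implementation.

#include <stdlib.h>

/* Illustrative segmented-bitmap container (names are ours). */
typedef struct {
    int  m, s;               /* bitmap size and segment size (s divides m) */
    unsigned char *bitmap;   /* m bits, stored one byte per bit for clarity */
    int *size;               /* m/s entries: number of elements per segment */
    int *offset;             /* m/s entries: start of each segment's elements */
    int *reordered;          /* n elements, grouped by segment */
} SegBitmap;

/* Offline construction with h(x) = x mod m, as in Example 1. Elements are
 * assumed distinct; within a segment they are appended in input order
 * (a real implementation would keep each segment sorted). */
static void seg_bitmap_build(SegBitmap *sb, const int *set, int n, int m, int s) {
    int nseg = m / s;
    sb->m = m; sb->s = s;
    sb->bitmap    = calloc(m, 1);
    sb->size      = calloc(nseg, sizeof(int));
    sb->offset    = malloc(nseg * sizeof(int));
    sb->reordered = malloc(n * sizeof(int));

    for (int i = 0; i < n; i++) {              /* count elements per segment */
        int pos = set[i] % m;
        sb->bitmap[pos] = 1;
        sb->size[pos / s]++;
    }
    for (int g = 0, acc = 0; g < nseg; g++) {  /* prefix sum -> offsets */
        sb->offset[g] = acc;
        acc += sb->size[g];
    }
    int *cursor = calloc(nseg, sizeof(int));
    for (int i = 0; i < n; i++) {              /* scatter into ReorderedSet */
        int g = (set[i] % m) / s;
        sb->reordered[sb->offset[g] + cursor[g]++] = set[i];
    }
    free(cursor);
}

Running this sketch on set A of Example 1 with m = 12 and s = 4 reproduces size = {2, 1, 3}, offset = {0, 2, 3}, and reordered = {1, 15, 4, 21, 32, 34}, matching the values listed above.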

4.4.3 Intersection Algorithm

For ease of presentation, we discuss the intersection algorithm for two sets in this section; the k-way intersection is discussed later in Section 4.7. Our algorithm first compares the associated bitmaps to quickly eliminate unmatched segments between the two sets. It streams through the two bitmaps and compares them bit by bit to find the segment pairs that intersect. Next, it intersects the lists associated with those segments using our specialized intersection kernels. A kernel is a specialized intersection function block for a certain pair of sizes. We now use Example 1 to demonstrate our intersection algorithm.

In the first step, we compare the bitmaps of the two sets with a bitwise AND operation to get a list of segments. In Example 1, we compare BitmapA and BitmapB with a bitwise AND operation, and the result is {000010001000}. In the second step, we go through each segment to compute the final intersection result, as shown in Algorithm 6. This is necessary because false positives are possible, i.e., a segment whose bitmap intersects with another one even though the two segments do not share common elements. In Example 1, we focus only on the non-zero segments, i.e., the second and the third segment. The elements mapped to these two segments are {4}, {6, 16} and {21, 32, 34}, {21, 23}. Finally, we use the 1-by-2 kernel to compare {4} against {6, 16}, and the 2-by-3 kernel to compare {21, 23} against {21, 32, 34} to compute the final result of the intersection. The details of these kernels are discussed in Section 4.6.

Algorithm 6: The intersection algorithm in FESIA

Input: List LA, List LB, BitmapA, BitmapB, ReorderedSetA and ReorderedSetB
Output: the intersection size r
1  r = 0
2  N = m/s        // the number of segments
3  for i ∈ 0 ... N − 1 do
4      if BitmapA_i & BitmapB_i != 0 then
5          r = r + Intersect(ReorderedSetA_i, ReorderedSetB_i)
6  return r
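For concreteness, a minimal C rendering of this driver loop might look as follows. The per-segment bits are packed here as one byte per segment (i.e., s = 8), and intersect_pair is a hypothetical stand-in for the specialized kernels of Section 4.6 — both are assumptions of this sketch rather than details of the actual implementation.

/* Sketch of Algorithm 6 (two-set case). bitmapA/bitmapB hold one byte per
 * segment; setA/setB are the ReorderedSet arrays. */
int fesia_intersect(const unsigned char *bitmapA, const unsigned char *bitmapB,
                    const int *offsetA, const int *sizeA, const int *setA,
                    const int *offsetB, const int *sizeB, const int *setB,
                    int num_segments,
                    int (*intersect_pair)(const int *, int, const int *, int)) {
    int r = 0;
    for (int i = 0; i < num_segments; i++) {
        /* Step 1: cheap bitmap test -- skip segments that cannot intersect. */
        if ((bitmapA[i] & bitmapB[i]) == 0)
            continue;
        /* Step 2: intersect the two short element lists of this segment. */
        r += intersect_pair(setA + offsetA[i], sizeA[i],
                            setB + offsetB[i], sizeB[i]);
    }
    return r;
}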

Different bitmap sizes: Any two sets may have bitmaps of different sizes. When one bitmap has a different size from the other, our algorithm requires that the bitmap size m1 of the larger set be divisible by the bitmap size m2 of the smaller set. Therefore, we round each bitmap size to the nearest power of two. When comparing two bitmaps of different sizes, the ith segment of the larger set only needs to be compared with the kth segment of the smaller set, in which k = i mod (m2/s), i.e., segment indices wrap around the smaller bitmap.

4.4.4 Theoretical Analysis

We now give an analysis of the time complexity of our intersection algorithm.

For clarity, we start our analysis with lists L1 and L2 of the same size n. We use w to denote the word size and r to denote the intersection size.

Proposition 1. The time complexity of Algorithm 6 is O(n/√w + r).

Proof. There are two steps in the algorithm: (1) the bitwise AND on the two bitmaps, and (2) computing the intersection on the matched segments. The time spent on the bitwise AND of the two bitmaps is linearly proportional to m, the size of a bitmap. SIMD allows us to conduct the bitwise operation in parallel: given the SIMD width w, the time for the bitwise operation is m/w. We now analyze the time spent on computing the intersection of the matched segments, which depends on the number of matched segments. There are two types of matches: (1) true matches, and (2) false-positive matches. The number of true matches equals the intersection size r. A false-positive match happens when there are elements in both lists mapped to the same segment but they are not identical. More precisely, the number of true matches is

E(I_T) = r, while the expected number of false-positive matches is

E(I_FP) = Σ_{e1∈L1, e2∈L2, e1≠e2} P(h(e1) = h(e2)) ≤ C(n, 2) × (1/m) = n(n − 1) / (2m)

Note that E(I_FP) depends on the choice of m, which affects the time complexity of Algorithm 6. For example, when m = 1, the time spent in the second step is O(n^2). When m = n^2, the time spent in the second step is O(1); however, the time spent in the first step then becomes O(n^2). When m = n√w, E(I_FP) equals n/√w. Therefore, the total number of both true and false-positive matches is

E(I) = E(I_T) + E(I_FP) = n/√w + r

In summary, when m = n√w, the time complexity of Algorithm 6 is O(n/√w + r). It achieves the same theoretical bound as Fast [70]. As we will show in later sections, due to the use of SIMD instructions, our approach also has the best performance in practice.
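To make the choice of m concrete, consider a hypothetical configuration with n = 10^6 elements per set and a SIMD width of w = 256 bits. Since √w = 16, the recommended bitmap size is m = n√w = 1.6 × 10^7 bits (about 2 MB per set), the bitwise-AND step processes m/w = 62,500 vector words, and the expected number of false-positive segment matches is n/√w = 62,500 — both far smaller than the roughly 2 × 10^6 element comparisons a scalar merge would perform.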

4.5 Bitmap-Level Intersection

In this section, we discuss how to use SIMD instructions to conduct the bitmap-level intersection, which prunes out unmatched elements. The segment-level intersection, which computes the final result, is discussed in Section 4.6.

Given two sets and the corresponding bitmaps BitmapA and BitmapB, there are three steps in the bitmap-level intersection to generate the list of segments where one bitmap intersects with the other:

Step 1: Bitwise AND on bitmaps: The key idea is to perform a bitwise AND operation on BitmapA and BitmapB with SIMD instructions. The SIMD instruction that we use is the vandps instruction, which operates on w bits at a time. Note that w is the SIMD width of the processor and it is a multiple of the segment size s. For example, if w = 256 and s = 8, a single vandps instruction performs the bitwise AND operation on 256/8 = 32 segments in parallel.

Step 2: Segment transformation: We next summarize the output of the bitwise AND operation from the step above by applying a transformation to each segment, i.e., every s bits of the bitwise AND output. A collection of SIMD comparison instructions (e.g., pcmpeqw, pcmpeqq, etc.) is provided on Intel processors such that comparisons can be performed on every 8, 16, 32, or 64 bits at the same time. Similar instructions are provided on other processors, such as ARM or IBM Power processors. With the support of these SIMD instructions, we can use a single instruction to compare s bits against integer 0 for multiple segments at the same time. Note that the SIMD instruction we use depends on the value of s, i.e., different values of s lead to the use of different SIMD instructions. The output for each segment also has s bits. Suppose s = 16: if one segment intersects with the other one, the result is 0xFFFF; otherwise, the result is 0x0000.

Step 3: Non-zero segment index extraction: We now pick the segments with value 0xFFFF. Given a list of segments, we use the pextrb instruction to extract the non-zero segments. The instruction uses one bit to represent the output for each segment. Suppose the segment size is s = 16 and there are two segments 0xFFFF0000. We will get 0x2 (i.e., 10 in binary) after applying the instruction to the two segments above.

We next use the tzcnt instruction to extract the indices of the bits that are set to one. Given an integer, the tzcnt instruction returns the number of trailing zeros (i.e., the index of the least-significant 1-bit). We then clear this bit and iteratively apply the process until all set bits have been extracted.
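The following sketch illustrates steps 1–3 for one 128-bit chunk of the bitmaps, assuming 16-bit segments (s = 16). It uses _mm_movemask_epi8 and _tzcnt_u32 rather than the exact instruction mix named above, so it should be read as an illustration of the flow rather than the tuned implementation.

#include <immintrin.h>

/* Bitmap-level intersection sketch for one 128-bit chunk (eight 16-bit
 * segments). Writes the indices of segments whose bitwise AND is non-zero
 * into out_segments and returns how many there are. */
static int bitmap_chunk_intersect(const __m128i *chunkA, const __m128i *chunkB,
                                  int base_segment, int *out_segments) {
    /* Step 1: bitwise AND of the two bitmap chunks. */
    __m128i and_bits = _mm_and_si128(_mm_loadu_si128(chunkA),
                                     _mm_loadu_si128(chunkB));
    /* Step 2: per-segment transformation -- mark 16-bit segments equal to 0. */
    __m128i zero_mask = _mm_cmpeq_epi16(and_bits, _mm_setzero_si128());
    /* Step 3: collect one bit per byte, then keep one bit per 16-bit segment.
     * After the negation, a set bit means "this segment is non-zero". */
    unsigned mask = (~_mm_movemask_epi8(zero_mask)) & 0xFFFF;
    mask &= 0x5555;                      /* keep one bit per segment */

    int count = 0;
    while (mask) {
        unsigned bit = _tzcnt_u32(mask);           /* index of lowest set bit */
        out_segments[count++] = base_segment + (int)(bit / 2);
        mask &= mask - 1;                          /* clear the lowest set bit */
    }
    return count;
}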

4.6 Segment-Level Intersection

In the previous section, we discussed how to use the bitmap intersection to generate a list of segment pairs that may have common elements. In this section, we focus on using SIMD instructions to accelerate the intersection of each pair of segments. Suppose there are N segments in the output of the bitmap intersection, indexed by i1 ... iN. The task in this step is to find the intersection size over the elements from ReorderedSetA,i1 ... ReorderedSetA,iN and ReorderedSetB,i1 ... ReorderedSetB,iN. Prior work has proposed different ways to vectorize set intersections with SIMD instructions, but most of them target the general intersection problem in which the input size is sufficiently large. However, in our segment-level intersection, the number of elements in each segment is usually very small, so a solution to the general intersection problem may suffer from undesirable performance. As a result, we design specialized SIMD intersection kernels for inputs with small sizes. These specialized kernels are more efficient, since they are able to avoid unnecessary computations. For example, if one set has two elements and the other set has four elements, a general SIMD set intersection approach usually assumes that both sets have four elements, which introduces unnecessary computations.

Figure 4.2: Illustrating the difference between a general and a specialized SIMD set intersection kernel. A general kernel is shown on the upper part of the figure; it is implemented with SSE-128 instructions and can be used for any intersection with input size no larger than 4-by-4. A specialized 2-by-4 kernel is shown on the bottom; it reduces unnecessary computation and memory accesses (highlighted in purple).

In this section, we first discuss how to dispatch each segment pair to the corresponding specialized intersection kernel (Section 4.6.1). The implementation of each kernel is described in Section 4.6.2.

4.6.1 Runtime Dispatch

To achieve the best performance, we implement and compile set intersection kernels in advance for all possible scenarios. For example, if there are at most S elements in a segment, we will have (S + 1)^2 different set intersection kernels, i.e., from the 0-by-0 up to the S-by-S set intersection. Note that some kernels may never be used, e.g., the 0-by-i kernels (0 ≤ i ≤ S). With all set intersection kernels in place, a runtime dispatch mechanism is needed, which allows us to use the correct set intersection kernel given the input segment sizes (i.e., sizeA,ik and sizeB,ik). We now describe how the runtime dispatch mechanism is implemented with a switch statement. Note that the GCC compiler generates a jump table in assembly for such a switch instead of a sequence of if-else statements, which would be prohibitively expensive.

/* the dispatch control code */
int ctrl = ((Sa << 3) | Sb);
/* jump to a specialized intersection kernel */
switch (ctrl & 0x7F) {
    /* Each specialized kernel is a macro */
    case 0: break;   /* Kernel0x0 */
    case 1: break;   /* Kernel0x1 */
    case 2: break;   /* Kernel0x2 */
    /* case 3 to 62 are omitted ... */
    case 63: Kernel7x7; break;
    default: GeneralIntersection(); break;
}

Listing 4.2: The jump table for specialized kernels

We implement and place all set intersection kernels in memory. Their starting addresses are referenced by a jump table, as shown in Listing 4.2. The entries of the jump table are arranged such that the correct entry can be located through a control code, which is computed with simple arithmetic from a pair of segment sizes. Given the kth pair of segments A and B from the list of segments, we use Sa and Sb to denote their sizes sizeA,ik and sizeB,ik. The control code for the jump table is computed by concatenating the two integers Sa and Sb. We now use a concrete example to show how we compute the control code. Assume that there are 64 different set intersection kernels (from the 0-by-0 up to the 7-by-7 set intersection), as in Listing 4.2. Since the maximum segment size is 7, it is sufficient to use three bits to represent Sa or Sb. As a result,

we concatenate Sa and Sb into one integer through the following statement: ctrl ← (Sa << ⌈log2 S⌉) | Sb.

In summary, bits 3 to 5 of the control code indicate size Sa. Similarly, bits 0 to 2 of the control code indicate size Sb.
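For instance, with hypothetical segment sizes Sa = 2 and Sb = 4, the control code is ctrl = (2 << 3) | 4 = 20, and entry 20 of the jump table dispatches directly to the 2-by-4 kernel.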

4.6.2 Specialized vs. General Set Intersection Kernels

The reasons that a specialized set intersection kernel achieves better performance than a general one are twofold: (1) fewer computations, and (2) more efficient memory accesses. We illustrate the difference between a general and a specialized set intersection kernel in Fig. 4.2. The general 4-by-4 intersection kernel is implemented with SSE-128 instructions and can be used for any pair of sets with input size no larger than 4-by-4 (e.g., 2-by-4). However, it is not as efficient as the specialized 2-by-4 intersection kernel. The specialized kernel is also implemented with SSE-128 instructions, but it avoids unnecessary computations and memory accesses (highlighted in purple in the figure). In addition, specialized kernels can take better advantage of data reuse. For example, when the set size is larger than the vector length, some elements may be used in multiple comparisons. With a specialized kernel, these elements can be kept in registers to avoid redundant memory accesses and, at the same time, minimize the cost of comparisons.

4.6.3 Implementing Specialized Intersection Kernels with SIMD

We now describe how we implement the specialized intersection kernels with SIMD for different input sizes. We start with the scenario in which the size of both input sets equals the vector length (i.e., Sa = Sb = V). Note that Sa and Sb denote the sizes of the input sets and V = w/Se, in which w is the SIMD word width and Se is the size of each element in the set.

V-by-V set intersection (Sa = Sb = V): We first describe the key idea of implementing a V-by-V intersection kernel, which performs all pairwise comparisons between the V elements of the two input sets. Our SIMD implementation iteratively picks one element from set A and compares it with all V elements in set B simultaneously. The comparison results for all V elements of set A are combined into a single vector (e.g., cmp as shown in Fig. 4.2). There are four types of SSE instructions on 32-bit integers in our SIMD implementation, namely, (1) load: _mm_load_si128, (2) broadcast: _mm_set1_epi32, (3) compare: _mm_cmpeq_epi32, and (4) bitwise OR: _mm_or_si128. The details are as follows. First, the V elements of set B are loaded into one vector register vb. Second, each element of set A is broadcast to a different vector vi (i = 1, 2, 3, 4) and compared with vb. The comparison result between vi and vb is stored in vector ci. Finally, the four comparison results c1, ..., c4 are combined into one mask register with bitwise OR instructions.
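As an illustration of these four instruction types, a 4-by-4 kernel can be sketched with C intrinsics as follows; the final movemask/popcount step is one reasonable way to produce a match count, not necessarily the exact code generated in FESIA.

#include <immintrin.h>

/* 4-by-4 intersection kernel sketch: compares four 32-bit elements of A
 * against four 32-bit elements of B and returns the number of matches. */
static int intersect4x4(const int *A, const int *B) {
    __m128i vb = _mm_loadu_si128((const __m128i *)B);     /* load B1..B4 */

    __m128i c1 = _mm_cmpeq_epi32(_mm_set1_epi32(A[0]), vb);
    __m128i c2 = _mm_cmpeq_epi32(_mm_set1_epi32(A[1]), vb);
    __m128i c3 = _mm_cmpeq_epi32(_mm_set1_epi32(A[2]), vb);
    __m128i c4 = _mm_cmpeq_epi32(_mm_set1_epi32(A[3]), vb);

    /* Combine the four comparison results and count the matching lanes. */
    __m128i cmp = _mm_or_si128(_mm_or_si128(c1, c2), _mm_or_si128(c3, c4));
    int mask = _mm_movemask_ps(_mm_castsi128_ps(cmp));
    return _mm_popcnt_u32(mask);
}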

Figure 4.3: Illustrating the specialized kernels: (1) a 2-by-7 intersection kernel (small-by-large), (2) a 4-by-5 intersection kernel (small-by-large, with Sb slightly larger than V), and (3) a 6-by-6 intersection kernel (large-by-large).

We next describe how we implement the specialized intersection kernels for other input sizes. Without loss of generality, we divide all intersection kernels into three categories based on their input sizes: (1) small-by-small set intersection (Sa ≤ Sb ≤ V): the sizes of both input sets are at most the vector length; (2) small-by-large set intersection (Sa ≤ V < Sb): the size of one input set is at most the vector length and the size of the other input set is larger than the vector length; and (3) large-by-large set intersection (V < Sa ≤ Sb): the sizes of both input sets are larger than the vector length.

Small-by-small set intersection (Sa ≤ Sb ≤ V): When Sa and Sb are both less than the vector length V, the implementation of our specialized kernels removes unnecessary operations, as shown in Fig. 4.2. Compared to the general 4-by-4 kernel, the specialized 2-by-4 kernel does not need to perform the complete 4-by-4 comparisons. Instead, it only compares the two elements in set A against the four elements in set B, which takes only two broadcasts, two comparisons, and one bitwise OR instruction. In summary, the number of comparisons and memory access operations in the specialized kernel is only half of that in the general 4-by-4 kernel.
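Under the same conventions as the 4-by-4 sketch above, a specialized 2-by-4 kernel (again a sketch, not the generated code) drops half of the broadcasts and comparisons:

#include <immintrin.h>

/* Specialized 2-by-4 kernel sketch: two broadcasts, two comparisons, and
 * one OR -- roughly half the work of the general 4-by-4 kernel. */
static int intersect2x4(const int *A, const int *B) {
    __m128i vb  = _mm_loadu_si128((const __m128i *)B);
    __m128i c1  = _mm_cmpeq_epi32(_mm_set1_epi32(A[0]), vb);
    __m128i c2  = _mm_cmpeq_epi32(_mm_set1_epi32(A[1]), vb);
    __m128i cmp = _mm_or_si128(c1, c2);
    int mask = _mm_movemask_ps(_mm_castsi128_ps(cmp));
    return _mm_popcnt_u32(mask);
}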

Small-by-large set intersection (Sa ≤ V < Sb): The kernels for small-by-small intersections can be used to build the specialized kernels for small-by-large intersections, in which the size of one set is at most the vector length V and the size of the other set is larger than V. The key idea is to first compare all elements in set A with the first V elements of set B. We use the 2-by-7 set intersection (shown on the left side of Fig. 4.3) as an example. We start by comparing the two elements in set A against the first four elements in set B. In particular, the two elements A1 and A2 are broadcast into vector registers v1 and v2. They are then compared with the first four elements of set B in vector register vb. Lines 7–9 of the kernel combine the comparison results through a bitwise OR operation. We next apply the same process to compare the two elements in set A with the remaining three elements in set B, by loading those three elements together into one vector register, as shown at lines 10–15. Note that the two elements A1 and A2 can be reused from registers v1 and v2. Compared to the general kernel, this specialized kernel reduces computation, has better data reuse, and avoids redundant loads. In summary, the number of broadcasts in the specialized kernel equals the size of the small set (i.e., Sa = 2). The number of loads equals the size of the large set divided by the vector length, rounded up (i.e., ⌈Sb/V⌉ = ⌈7/4⌉ = 2). The number of comparisons is 2Sa, since each element in the small set requires two comparisons with the large set. However, this specialized kernel may be sub-optimal in some cases due to extra comparisons. For example, when the size of set A is 4 and the size of set B is only slightly larger than the vector length V (e.g., Sb = 5), the fifth element in set B requires four comparisons against set A. This is because the four elements A1, ..., A4 are in four different vector registers due to the broadcast operations. The Intersect4by5 example in the middle of Fig. 4.3 illustrates a better way to implement such a scenario. First, set A is loaded into a vector register. Second, the elements in set B are broadcast and compared with this register. In total, the number of comparisons is only five instead of eight.

Large-by-large set intersection (V < Sa ≤ Sb): When both Sa and Sb are larger than the vector length V, the implementation of our specialized kernels is built on top of the small-by-small and small-by-large kernels. Let us take the 6-by-6 set intersection (shown on the right side of Fig. 4.3) as an example. We first apply a 4-by-4 intersection to compare the first V elements of set A and set B. There are two scenarios in the second step. (1) A4 ≤ B4: To maximize data reuse, it is more efficient to load all elements from set B into one vector and broadcast all elements from set A. Therefore, we apply a 2-by-6 intersection between the fifth and sixth elements of set A and all six elements of set B. The example on the right side of Fig. 4.3 corresponds to this scenario. (2) A4 > B4: This is the symmetric scenario; it is more efficient to load all elements from set A into one vector and broadcast all elements from set B. Note that we have implemented both kernels, and the correct one is selected at runtime based on the comparison result between A4 and B4.

4.7 Discussion

In this section, we first describe how our approach is used for k-way intersection and analyze its time complexity. We then discuss the case of inputs with dramatically different sizes, and how to extend our approach to exploit multicore parallelism and wider vector widths.

k-way intersection: Similar to the set intersection of two sets, our approach leverages the segmented bitmaps to quickly prune mismatches among k sets. Given k sets L1, L2, ..., Lk, the two-step set intersection proceeds as follows: (1) for each set Li, there is a corresponding Bitmap_i. In this step, we compare the k bitmaps (i.e., Bitmap_1 ... Bitmap_k) using a bitwise AND operation to quickly filter out segments whose bitmaps do not intersect with the others. As discussed in Section 4.4, the output is a list of non-zero segments; (2) we then apply the specialized k-way intersection kernels to this list of segments to perform the intersections over their associated elements. Since the expensive k-way intersection kernels are only performed on the matched segments (i.e., the output from the first step), the complexity of the k-way set intersection with our data structure is proportional to the intersection size, instead of the entire input size. The performance advantage is more significant when the final intersection size is small. We now give a theoretical analysis of the time complexity of our k-way intersection approach. For clarity, we assume the size of each set is n, and we use w to denote the word size and r to denote the intersection size.

Proposition 2. The time complexity of the k-way intersection algorithm is O(kn/√w + r).

Proof. There are two steps in the algorithm: (1) the bitwise AND on the k bitmaps, and (2) computing the intersection on the matched segments. The analysis of the time spent in each step is similar to that in Proposition 1. The time for the bitwise operations with SIMD instructions is O(km/w). The time spent in the second step depends on the number of false-positive matches. As in Proposition 1, the expected number of false-positive matches is

E(I_FP) = Σ_{e1∈L1, ..., ek∈Lk} P(h(e1) = ··· = h(ek)) = n^k / m^(k−1)

When m = n√w, the total number of both true and false-positive matches is

E(I) = E(I_T) + E(I_FP) = n/(√w)^(k−1) + r

In summary, the time complexity of the k-way intersection algorithm is O(kn/√w + r).

Input with dramatically different sizes: When two sets have dramatically different sizes (i.e., n1 ≪ n2), we adapt our approach to a different strategy so that we achieve the same time complexity as a hash-based method, i.e., O(min(n1, n2)) = O(n1). The key idea is to go through each element of the smaller set and check for its existence in the larger set. Note that if the larger set contains an element v, then v must be mapped to bitmap position h(v) mod m2 (and hence to a known segment), where m2 is the size of the larger set's bitmap. If the corresponding bit in the bitmap is not set, meaning the element is not in the larger set, all subsequent comparisons are avoided. Otherwise, the element is compared against all the elements mapped to that position.

Multicore parallelism: Our set intersection approach can be easily extended to exploit more parallelism on multicore processors, since there are no cross-iteration dependencies. The bitmaps of the input sets can be partitioned and distributed onto different CPU cores such that each core independently performs the bitwise AND operation on its partition and uses the specialized kernels to compute the intersection result on the matched segments. Note that when the two input sets have dramatically different sizes, we only partition and distribute the elements of the small set so that each core can then run our approach against the bitmap of the large set independently.

Wider vector width: Intel introduced the AVX-512 instructions in the Skylake and Cannon Lake architectures. As discussed earlier, the bitwise AND operation in the first step can achieve linear speedups with a wider SIMD width. In addition, our specialized kernels are designed to support an arbitrary vector length V. However, directly applying our approach with wider SIMD instructions can lead to significant performance degradation. This is because increasing the segment size s leads to more elements per segment on average. As the segment size goes up, more specialized intersection kernels for larger sizes are needed, and the size of the jump table grows as well. When the size of the jump table exceeds the instruction cache size, severe performance degradation occurs. To reduce the number of specialized intersection kernels in the jump table, some specialized kernels are omitted and not implemented. In particular, instead of enumerating intersection kernels for all size pairs, we only implement intersection kernels at sampled sizes (e.g., the even sizes). For a segment whose size falls between sampled sizes, its size is rounded up to the next larger sampled size, meaning we use a slightly larger specialized kernel for this segment. Although this causes some redundant computation, it significantly reduces the number of intersection kernels in the jump table.

4.8 Experiments

In this section, we study the performance of our approach compared to state-of-the-art set intersection algorithms on both synthetic and real-world datasets.

Table 4.2: Hardware specifications

                    Processor 1       Processor 2
Processor Model     Intel E5-2695     Intel i7-7820X
Architecture        Haswell           Skylake
Frequency           2.3 GHz           4.3 GHz
Cores               28                8
L1 cache/core       32 KB             32 KB
L2 cache/core       256 KB            1 MB
L3 cache            35 MB             11 MB
SIMD feature        SSE, AVX2         SSE, AVX2, AVX-512

4.8.1 Experimental Setup

Platforms: We implement our algorithms on platforms with SSE/AVX/AVX-512 instructions and compare them with state-of-the-art set intersection methods. We use two Intel platforms in our evaluation: (1) Intel E5-2695, a Haswell architecture that supports SSE (128-bit) and AVX (256-bit) instructions, and (2) Intel i7-7820X, a Skylake architecture with AVX-512 support. All SSE/AVX experiments are performed on the Haswell processor, and all AVX-512 experiments are performed on the Skylake processor. The detailed hardware specifications are given in Table 4.2.

Datasets: We perform the evaluation on both synthetic and real-world datasets, focusing on how the following three key factors affect performance: (1) the input size (n), (2) the selectivity (r/n), and (3) the skew in the input sizes (n1/n2).

Methods: We study and compare the performance of our approach against the following state-of-the-art implementations: (1) Scalar: an optimized scalar merge implementation similar to the code in Listing 4.1, but with the expensive if-else branch statements replaced by conditional moves; (2) Shuffling [79]: a SIMD implementation for set intersection similar to what is presented in Fig. 4.2; it uses the SIMD shuffling instruction to perform all pairwise comparisons between two input vectors by creating all variations of one vector; (3) scalarGalloping [84]: a binary-search-based intersection method; (4) SIMDGalloping [68]: a SIMD-optimized version of scalarGalloping; and (5) BMiss [67]: a merge-based intersection approach with SIMD and optimizations for reducing branch mispredictions. Note that the FESIA data structure is built offline; the build time is not included in our experiments.

4.8.2 Result of Specialized Intersection Kernels

As discussed in Section 4.6, our system adopts different strategies and generates specialized kernels for intersections of different sizes. The specialized kernels take advantage of SIMD instructions and perform fewer memory accesses, shuffles, and comparison operations. We now study the performance of our specialized intersection kernels compared to the generalized SIMD implementation. Fig. 4.4 shows the results for the SSE kernels. We generated kernel sizes from 1-by-1 up to 7-by-7, which is almost twice the SSE vector length. We observe that our specialized kernels are up to 70% faster than the general SIMD intersection implementation. Similarly, Fig. 4.5 shows the results for the AVX kernels. The kernel size goes up to 15-by-15, and the specialized kernels are faster than the general AVX intersection implementation in all scenarios. The performance advantage is even more significant when the size of one set is much larger than the other set.

Figure 4.4: Performance of SSE kernels

Figure 4.5: Performance of AVX kernels

4.8.3 Effect of Varying the Input Size

We next study the performance of different intersection approaches with a varying input size. In this experiment, we evaluate each intersection method with two synthetic sets. We make the sizes of the input sets identical and set their intersection size to 1% of the input size. We vary the number of elements in the input from 400K to 3.2M. Our approach is implemented on two different Intel processors with three different SIMD instruction sets: (1) SSE, (2) AVX, and (3) AVX-512. The performance of our SSE and AVX implementations is measured on the Intel Haswell architecture and reported in Fig. 4.6a. The performance of our AVX-512 implementation is measured on the Intel Skylake architecture and reported in Fig. 4.6b. In Fig. 4.6, the y-axis shows the CPU time in millions of cycles (i.e., the lower, the better). We observe that the relative performance of these methods remains consistent as we increase the input size. On the Haswell architecture, our SSE and AVX implementations achieve up to 7.6x speedup compared to scalar methods, and 1.4x–3.5x speedup compared to other methods with SIMD. On the Skylake architecture, our AVX-512 implementation achieves 2–4x speedup compared to other SIMD methods. On both architectures, Scalar and ScalarGalloping are the slowest, since scalar implementations cannot leverage data-level parallelism. SIMDGalloping performs poorly as well, because it is based on binary search, which has higher complexity when the two input sets have similar sizes.

Figure 4.6: Performance comparison with a varying input size. (a) Performance comparison on Intel Haswell; (b) performance comparison on Intel Skylake.

4.8.4 Effect of Varying the Selectivity

Selectivity describes how large the intersection size is relative to the input size. It is defined as the ratio of the intersection size to the input size (i.e., r/n). We now study the performance of different intersection approaches when varying the selectivity. In this experiment, we fix the size of the two input sets to one million and report the relative speedup of each approach over the Scalar intersection method in Fig. 4.7 and Fig. 4.8. Fig. 4.7 shows the performance of each approach on the Haswell architecture using SSE and AVX instructions. We observe that our approach achieves up to 7.6x speedup compared to the Scalar intersection method, and up to 1.8x speedup compared to state-of-the-art SIMD intersection methods. Fig. 4.8 shows the performance of each approach on the Skylake architecture using AVX-512 instructions. We observe that our approach achieves up to 6x speedup compared to the Scalar intersection method, and 1.4–3x speedup compared to state-of-the-art SIMD intersection methods. In addition, our method's speedup is higher when the selectivity becomes lower. Note that in most real-world scenarios, the intersection size is usually less than 10% of the input set (i.e., the selectivity is less than 0.1).

K-way intersection: We also study the performance of each method on three-way intersection. We fix the input size to one million as in the experiment above, and report the result in Fig. 4.9. The y-axis is the relative speedup over the Scalar method. The x-axis shows the set density, which describes how clustered the elements are within a given range. For example, the elements of a dense set are drawn randomly from a small range, while a sparse set has elements drawn from a larger range. Density affects the intersection size; for example, two dense sets are more likely to have common elements. When k = 3, the selectivity is proportional to the third power of the set density. We observe that FESIA achieves up to 17.8x speedup compared to scalar intersection methods on 3-way intersection, and up to 4.8x speedup compared to state-of-the-art SIMD set intersection approaches. The speedup is higher when the density is lower. When the density approaches zero, the maximum speedup achieved in 3-way intersection is higher than that in 2-way intersection, which shows that the speedup of our approach is more prominent for larger k. This is because multi-way comparisons are more expensive for larger k, while our approach avoids the cost of unnecessary multi-way comparisons through cheap bitmap intersections.

Figure 4.7: Performance on different selectivity

Figure 4.8: Performance on different selectivity (AVX-512)

Figure 4.9: Performance of three-way intersection

4.8.5 Performance of Two Sets with Different Sizes

We now study the performance of each approach when the two input sets have different sizes. In this experiment, we fix the size of the larger input set to one million and the selectivity to 0.1. The x-axis is the skew, i.e., the ratio of the smaller set size to the larger set size (n1/n2). The y-axis is the relative speedup over the Scalar method. Note that our approach can adopt different strategies depending on the value of the skew. We use FESIAmerge to denote the strategy used when the inputs have similar sizes, and FESIAhash to denote the strategy used when the inputs have dramatically different sizes. We report the performance of both strategies in Fig. 4.10.

Figure 4.10: Performance comparison on varying skew

Theoretically, when the skew is small (n1 ≪ n2), the hash-based method has the lowest average complexity and the merge-based method has the highest average complexity; the complexity of the binary-search method lies between the two. When the skew is large (n1 ∼ n2), the average complexity of the merge-based method is comparable with that of the hash-based method, and the binary-search method has the highest complexity. As shown in Fig. 4.10, the actual performance of the intersection methods matches the theoretical analysis. When the skew is small, our method

based on hashing (i.e., FESIAhash) has the best performance, which is 2–3x faster than the binary-search-based SIMDGalloping, while SIMDGalloping is in turn faster than the two SIMD merge-based methods (Shuffling and BMiss). As the skew grows beyond 1/4, FESIAmerge starts to outperform FESIAhash and achieves the best performance among all approaches; among the other methods, the SIMD merge-based methods then outperform the binary-search-based ones. In summary, FESIA adapts its strategy to the skew of the input and achieves better performance than the other approaches, each of which performs well only when the skew is either small or large.

Figure 4.11: Results on the database query task

Table 4.3: The details of each graph dataset

Dataset        # of nodes    # of edges
Patents        3,774,768     16,518,948
HepPh          34,546        421,578
LiveJournal    3,997,962     34,681,189

4.8.6 Performance on Real-World Datasets

We next study the performance of each intersection approach on two real-world tasks: (1) a database query task, and (2) a triangle counting task in graph analytics.

The database query task: In Fig. 4.11, we report the result of each approach on a real-life dataset called WebDocs from the Frequent Itemset Mining Dataset Repository [85]. The WebDocs dataset is a web crawl built from a collection of HTML documents. The whole collection has about 1.7 million documents with 5,267,656 distinct items. To simulate the low selectivity of real-world queries, we generate random queries from the dataset and keep the set intersection size below 20% of the input size. The top of Fig. 4.11 shows the result of set intersection with two and with three input sets. We observe that our approach achieves close to 4x speedup compared to the Scalar method, 2x speedup compared to the SIMD Shuffling method, and 3.8x speedup compared to SIMDGalloping. The bottom of Fig. 4.11 shows the performance of each approach when the sizes of the input sets are skewed. Overall, our approach achieves up to 3x speedup compared to the other approaches.

Figure 4.12: Results on the triangle counting task

The triangle counting task: We now study the performance of each approach on the triangle counting task with three graph datasets. The three datasets are from the Stanford Large Network Dataset Collection [86], and we report the details of each dataset in Table 4.3. In Fig. 4.12, we observe that our approach achieves up to 12x speedup compared to the Scalar method and up to 1.7x speedup compared to the SIMD Shuffling approach. In addition, the speedup scales linearly with the number of CPU cores.

4.9 Conclusion

In this chapter, we presented FESIA, an efficient SIMD-vectorized approach for set intersection on modern CPUs. In many real-world tasks such as database queries and graph analytics, the intersection size is usually orders of magnitude smaller than the size of the input sets. FESIA leverages this observation by adopting a bitmap-based approach to efficiently prune unnecessary comparisons and produce a list of small segments with intersecting elements. These segments are then processed by specialized SIMD kernels for the final intersection. Experiments on both real-world and synthetic datasets show that our approach achieves an order of magnitude speedup compared to scalar set intersection methods and is up to 4x faster than state-of-the-art SIMD implementations.

Chapter 5

Accelerating Sparse Computational Kernels on GPUs

Contents
5.1  Abstract                                                    114
5.2  Introduction                                                115
5.3  Background and Related Work                                 118
     5.3.1  GPUs                                                 118
     5.3.2  SpGEMM and Masked-SpGEMM Problem                     121
     5.3.3  Sparse Data Structure                                122
     5.3.4  SpGEMM Computations                                  123
5.4  The Family of SpGEMM Algorithms                             124
     5.4.1  Matrix-Vector SpGEMM                                 125
     5.4.2  Inner-Product-Based SpGEMM                           127
     5.4.3  Outer-Product-Based SpGEMM                           129
5.5  Join-Based SpGEMM                                           130
     5.5.1  The Baseline Join-Based SpGEMM                       130
     5.5.2  The Data-Parallel Join-Based SpGEMM                  131
     5.5.3  Join Using Wider Data Types                          133
5.6  Join-Based SpGEMM Using Hash Methods                        137
5.7  Online Scheduling                                           139
5.8  Experiments                                                 142
     5.8.1  Experimental Setups                                  142
     5.8.2  Performance of GPU Joins with Very Sparse Matrices   145
     5.8.3  Performance of Join-Based SpGEMM                     146
     5.8.4  Comparison on Different Densities                    147
     5.8.5  Performance on the Hash-Based Join Kernels           149
     5.8.6  Performance on Real-World Dataset                    153
     5.8.7  Performance on Real-World Graph Datasets with Skewed Distribution   156

5.1 Abstract

Accelerating applications that involve sparse and irregular computational patterns has received increasing attention in recent years. Sparse computational kernels appear in a variety of domains, from the social sciences to machine learning, and the data in these domains can have drastically different sparsity characteristics, which may lead to different algorithm and implementation choices. Meanwhile, with advances in processor technology, modern processors offer more computing power and higher memory bandwidth, yet sparse computational kernels tend to perform poorly on them due to irregular computational patterns and random memory accesses. There is a lack of understanding of how sparse problems should be accelerated on modern processors with massive parallelism. In this chapter, we investigate the performance of parallel sparse matrix-matrix multiplication (SpGEMM) implementations on GPUs. We focus on (masked-)SpGEMM as used in many social network analytics tasks. We demonstrate how the sparse matrix computation can be broken down into a primitive set of join/union operations, based on which we propose several optimizations for join-based SpGEMM implementations. We perform experimental and theoretical analysis with both synthetic and real-world datasets. Finally, we propose an online scheduling algorithm that performs a lightweight calculation online to adaptively pick the right implementation for iterative SpGEMM problems.

5.2 Introduction

Sparse computations are at the core of a variety of domains. They have been employed in a broad range of tasks, such as graph analytics [13], neural network compression [14], genome sequencing [15], and recommendation systems [16]. Many applications in these domains work with irregular and nonuniform data that can be represented and structured as sparse matrices. Meanwhile, the percentage of non-zero elements in these applications can vary drastically, ranging from 10^-6% to 50% depending on the problem domain.

With the growing popularity of sparse computations, there is a growing effort to optimize and accelerate sparse computations on modern processors. For example, a number of graph analytics systems have been proposed recently to tackle social-network-related sparse computations [87, 88, 89, 90, 91, 92, 93]. A common feature of these frameworks is that they accelerate high-level sparse problems by providing an interface to a set of primitive sparse operations. At the heart of these frameworks are the implementations of primitive sparse operations such as sparse matrix-vector multiplication and generalized sparse matrix-matrix multiplication. For example, GraphBLAS defines a specification that formulates many sparse computations as sparse matrix computations over semirings. Such a specification simplifies the sparse problem and provides a unified interface that separates the high-level problem from the design and optimization of the underlying implementations. Among these primitives, sparse-matrix-times-sparse-matrix (SpGEMM) and masked sparse-matrix-times-sparse-matrix are two important operators.

One challenge in optimizing sparse computations is that it is hard for them to attain high performance on modern processors. Unlike their counterparts in dense and regular applications, sparse applications have locally varying non-zero (nnz) patterns, leading to unpredictable control flow and unbalanced workloads that are detrimental to performance. In addition, sparse applications have irregular compute patterns that are inherently sequential, together with random and uncoalesced memory accesses. These issues can result in low occupancy as well as low resource utilization on GPUs, whereas modern GPUs are designed to attain peak performance only when high throughput and parallelism can be achieved.

On the other hand, there is a lack of understanding of the optimizations for sparse matrix multiplication problems. It remains an open question which algorithm or data structure best fits sparse data. One reason for this is the vast range of problem domains and data characteristics (e.g., density, distribution), which can vary drastically across domains. Additionally, there are a large number of different methods to process sparse data, including different algorithms, parallelism schemes, etc. This broad design space naturally leads to questions such as which algorithm is best suited for a given sparse input, how the parallelism schemes should adapt to the algorithms, and how we should make cheap online decisions for iterative SpGEMM.

In this work, we investigate the optimization of sparse computations on GPU platforms. Existing work has proposed different optimizations for accelerating sparse computation on platforms such as GPUs, CPUs, or ASIC designs [94, 95, 96, 97]. However, that work mainly focuses on designing new data structures to resolve load-imbalance issues, or on designing new hardware to improve the data flow. In this work, we take a different perspective: we focus on the algorithm choices. We investigate and compare different SpGEMM implementations on GPUs and explore how different algorithms, such as join and merge, as well as different implementations, affect performance. In this work, we focus on the sparse-matrix-times-sparse-matrix (identical matrix) problem. The contributions of this chapter include:

• We investigate different methods to compute sparse matrix multiplication, including union-based and intersection-based methods.

• We propose and implement data-parallel join-based SpGEMM methods on GPUs.

• We propose and implement hash-based SpGEMM join methods on GPUs.

• We propose a sparsity-aware scheduler that is capable of choosing the best method according to the data distribution for iterative SpGEMM computations.

5.3 Background and Related Work

5.3.1 GPUs

Modern GPUs rely on large-scale multithreading to attain high computational throughput and hide memory access time. A GPU contains many “streaming multiprocessors” (SMs) with up to hundreds of arithmetic logic units (ALUs). An SM is a “multiprocessor” that contains many CUDA cores, and each CUDA core is an execution unit for the integer and floating-point operations of one thread.

Figure 5.1: The family of SpGEMM algorithms

Figure 5.2: Complexity comparison on different sparse matrices. N denotes the maximum dimension of the sparse matrix (i.e., the maximum nonzero value). k denotes the average number of nonzeros per row/column: for a tall-skinny matrix, k is the average number of nonzeros per row; if the matrix is short and fat, k is the average number of nonzeros per column.

SM contains hierarchical memories, including shared memory for fast data interchange between threads, and L1, texture, constant caches. For example, the latest Nvidia Turing architecture has 72 SMs. Each Turing SM includes 4 warp-scheduler units, each SM provides 64 FP32 cores, 64 INT32 cores. GPU programs are called kernels, which run a large number of threads in parallel in a single-program, multiple-data (SPMD) fashion. The threads are arranged in hierarchies: Each kernel is split into a grid of threads called thread blocks or concurrent thread arrays (CTA). A CTA is a basic workload unit assigned to an SM in a GPU. Threads in the same block have access to shared memory and their execution can be synchronized. Threads in a CTA are sub-grouped into warps, the smallest execution unit sharing the same program counter, which contains 32 threads. These 32 threads execute

the same instruction on each clock cycle in lockstep. For a program to achieve high performance on a GPU, a good implementation needs coalesced memory accesses, because memory bandwidth can be maximized only when the threads of a warp access consecutive memory addresses. Branch divergence is another factor that can degrade performance: when threads of a warp take different control-flow paths, the warp executes the branches serially, introducing extra instructions and scheduling overheads.

5.3.2 SpGEMM and Masked-SpGEMM Problem

SpGEMM computes the problem of:

C_{ij} = \sum_{k} A_{ik} \cdot B_{kj}    (5.1)

The computation involves multiplications between every row/column pair of matrix A and matrix B. Masked-SpGEMM, in contrast, computes:

C_{ij} = M_{ij} \sum_{k} A_{ik} \cdot B_{kj}    (5.2)

Different from general SpGEMM, masked-SpGEMM applies a mask matrix M on top of the A x B result. Elements of the A x B result are filtered out if their corresponding positions in M are zero. In other words, the computation of masked-SpGEMM only involves multiplications between a fraction of the row/column pairs of A and B. Many real-world problems can be formulated as SpGEMM or masked-SpGEMM problems. For example, triangle counting is a fundamental operator widely used in many graph

analytic tasks in social network applications [98]. The number of triangles can be used as an approximation of the closeness between neighbors. Triangle counting can be formulated as a masked-SpGEMM problem. Suppose A is the adjacency matrix of the given graph; the i-th row of A holds all the edges adjacent to node i, and the j-th column of A^T holds all the edges incident to node j. The multiplication between the i-th row of A and the j-th column of A^T generates all the "wedges" between node i and node j, and the subsequent masking operation filters out the wedges that do not have a closing edge. Therefore, the total number of triangles can be obtained by computing the masked product A ⊙ (A · A^T). K-truss is another computation widely used in graph analytics. A K-truss is a subgraph in which every edge participates in at least K − 2 triangles. K-truss is computed by iteratively applying the triangle counting procedure — each iteration counts the triangle support of the edges in the current graph, removes the edges with insufficient support, and feeds the remaining edges as the input graph for the next iteration.
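To make the connection concrete, the following C sketch counts triangles by intersecting adjacency lists: for every edge (i, j), the intersection of the neighbor lists of i and j counts the wedges closed by that edge, which mirrors the masked product A ⊙ (A · A^T) restricted to the nonzero positions of A. The adjacency-list layout and the function name are assumptions made for this illustration and are not taken from the thesis code.

/* Count triangles of an undirected graph given as sorted adjacency lists.
 * adj[v] points to the sorted neighbor list of vertex v; deg[v] is its length.
 * Self-loops are assumed absent. Each triangle is seen from 6 directed edge
 * endpoints, so the accumulated count is divided by 6. */
long count_triangles(int n, const int *const *adj, const int *deg) {
    long total = 0;
    for (int i = 0; i < n; i++) {
        for (int e = 0; e < deg[i]; e++) {
            int j = adj[i][e];                   /* edge (i, j): a nonzero of the mask */
            /* |N(i) ∩ N(j)| = wedges i-k-j closed by the edge (i, j) */
            int a = 0, b = 0;
            while (a < deg[i] && b < deg[j]) {
                if (adj[i][a] < adj[j][b])      a++;
                else if (adj[i][a] > adj[j][b]) b++;
                else { total++; a++; b++; }
            }
        }
    }
    return total / 6;
}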

5.3.3 Sparse Data Structure

Existing literature has proposed different schemes to store sparse data. Among them, the compressed sparse row (CSR) format is one of the most popular storage schemes. CSR stores the column indices and values of the nonzeros in row order. It represents the sparse matrix using three arrays: an array storing the cumulative length of each row (row pointer), an array storing the nonzero column indices (col index), and a third array

that stores the nonzero values (val). CSR format requires n + 2 nnz storage, where n is the number of rows and nnz is the number of nonzeros of the sparse matrix. Similarly, if the matrix is stored in column-major order — when successive elements in the same column are contiguous in memory — it is called the compressed sparse column (CSC) format.
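As a concrete illustration, a minimal CSR container in C might look as follows. The field names (row_ptr, col_idx, val) are chosen for this sketch only and are not the names used elsewhere in this chapter.

/* Minimal CSR container: row i occupies positions row_ptr[i] .. row_ptr[i+1]-1
 * of col_idx and val, so row_ptr has n+1 entries. */
typedef struct {
    int    n;        /* number of rows                      */
    int    nnz;      /* number of stored nonzeros           */
    int   *row_ptr;  /* size n+1, cumulative row lengths    */
    int   *col_idx;  /* size nnz, column index per nonzero  */
    float *val;      /* size nnz, value per nonzero         */
} csr_t;

/* Example traversal: sum the values of row i of a CSR matrix. */
static float csr_row_sum(const csr_t *A, int i) {
    float s = 0.0f;
    for (int p = A->row_ptr[i]; p < A->row_ptr[i + 1]; p++)
        s += A->val[p];
    return s;
}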

5.3.4 SpGEMM Computations

Researchers have proposed a variety of optimizations for sparse computations over the years. Early work mainly focused on accelerating the sparse matrix-vector multiplication problem (SpMV) [99, 100, 101, 102, 103, 104, 105]. It has been shown that there can be different computation methods for SpMV. For example, previous work has proposed push-or-pull optimizations for SpMV: "push" and "pull" compute the matrix-vector product in two different orderings and can result in different performance depending on the input sparsity. Existing literature has investigated how push and pull methods affect performance on different platforms [106, 107, 108]. Recently, more attention has been given to SpMM (sparse-matrix-times-dense-matrix) and SpGEMM computations [109, 110, 111, 112, 113]. Much of this literature uses merge-based methods to compute SpGEMM [114, 115, 116, 117, 118]. The merge-based method iterates over the sparse matrix by rows.

It takes one row of matrix A (Ai) and multiplies it with the matrix B, which generates the result corresponding to one row of matrix C (Ci). Specifically,

the sparse row-vector-matrix multiplication is performed by iterating through each nonzero element in row i and multiplying this element with the corresponding row of B. The partial results are then merged together. This merge-based method works well for a sparse matrix that is tall and skinny.

The notation nnz denotes the number of non-zero elements in the sparse matrix. Lowercase n denotes the row dimension. A subscript denotes a row of the matrix (e.g., Ai represents the i-th

row of matrix A), and A^T_i denotes the i-th row of the transposed matrix A^T, which is the i-th column vector of A.

5.4 The Family of SpGEMM Algorithms

In this section, we discuss in detail the three ways to compute SpGEMM illustrated in Figure 5.1. The complexity of an SpGEMM method depends on the number of non-zero elements of the sparse matrices. Note that in this work we primarily focus on the SpGEMM problem with two identical sparse matrices, i.e., A = B; in other words, the SpGEMM problem we are interested in is A · A^T, and the subsequent complexity analyses are all for this problem. The distribution of the non-zero elements in the sparse matrix is an important factor that affects the amount of computation performed. For example, the non-zero elements could be concentrated in a few columns. In other cases, graph datasets may have unevenly distributed elements where the non-zeros are concentrated around a few hub nodes. We categorize the

sparse matrices into three cases: tall-and-thin sparse matrices are matrices whose row dimension is much larger than the column dimension (and hence than the average number of nonzeros per row); short-and-fat sparse matrices are matrices whose column dimension is much larger than the row dimension. In both the tall-and-thin and short-and-fat cases, the nonzeros are uniformly distributed. The third case refers to matrices whose nonzeros are skewedly distributed (we use the power-law distribution to model the skewed case, because it is widely acknowledged that many skewed graph datasets follow such a distribution [119, 120]). The later analysis targets these cases separately.

5.4.1 Matrix-Vector SpGEMM

The first way to compute SpGEMM is based on vector-matrix multiplication. This method breaks down the matrix multiplication into a sequence of row-vector-matrix multiplications between a row of matrix A and the matrix B. Each row-vector-matrix multiplication generates the result corresponding to one row of matrix C.

Specifically, multiplying row Ai with matrix B generates row Ci. The computation of the vector-matrix multiplication leverages the merge operation: to compute the sparse-vector-sparse-matrix multiplication, the algorithm iterates over every non-zero element in the row of matrix A, indexing into the corresponding row vector of matrix B. The resulting row is obtained by merging all those rows together. The pseudocode in Algorithm 7 illustrates the idea of the matrix-

vector based computation. For the matrix-vector-multiplication based SpGEMM implementation, when the non-zero elements are uniformly distributed and the sparse matrices are tall-and-thin, the computational complexity is O(nkk), where n is the number of rows of A and k is the average number of non-zero elements per row. Specifically, the vector-matrix-multiplication method computes matrix C row by row. For each row of matrix C, the process iterates over each non-zero element of the corresponding row of matrix A, which involves nk iterations in total.

Inside each iteration is the merging step that merges the row Bele into row Ci. Merging takes time linear in the list size — k in this case. The total time is therefore O(nkk). For the second case, in which the sparse matrix is short-and-fat and the nonzeros are uniformly distributed among the columns, the computational complexity is also O(nkk): there are on average k nonzeros in each column, and since these nonzeros are uniformly distributed over the dimension, the average number of elements per row is also k. The outer two loops that iterate over each row and the nonzeros of that row take nk iterations, considering that there are n rows and each row has on average k elements, and the merging operation inside the loop takes k as well. For the third case, where the nonzero distribution is skewed and can be modeled by a power-law curve (βx^α), the complexity of this method is less than nζ(α). Note that this is a rough estimate and gives only a loose upper bound.

Algorithm 7: matrix-vector based SpGEMM

Input: Sparse matrix A stored in csr format: OffsetA, ColA

Sparse matrix B stored in csr format: OffsetB, ColB
Output: the result matrix C

1 for i ∈ N do

2 for ele ∈ Ai do

3 Ci ← merge(Ci,Bele)

4 end

5 end
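The merge in line 3 of Algorithm 7 is a union of two sorted index lists, with duplicate indices combined. The following scalar C sketch shows one way such a merge could be implemented for the symbolic (index-only) case; the function name and signature are illustrative and are not part of the thesis code.

/* Merge two sorted index lists L1 (length n1) and L2 (length n2) into out,
 * keeping each index once. Returns the length of the merged list.
 * out must have room for n1 + n2 entries. */
int scalar_merge_union(const int L1[], int n1,
                       const int L2[], int n2, int out[]) {
    int i = 0, j = 0, r = 0;
    while (i < n1 && j < n2) {
        if (L1[i] < L2[j])       out[r++] = L1[i++];
        else if (L1[i] > L2[j])  out[r++] = L2[j++];
        else { out[r++] = L1[i]; i++; j++; }   /* common index kept once */
    }
    while (i < n1) out[r++] = L1[i++];   /* drain the remainder of L1 */
    while (j < n2) out[r++] = L2[j++];   /* drain the remainder of L2 */
    return r;
}

In the numeric case, the same loop would additionally accumulate the scaled values of Bele into the value array of Ci.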

5.4.2 Inner-Product-Based SpGEMM

The second way to compute SpGEMM is what we call the inner-product-based method. This method is based on join operations. It breaks down the matrix multiplication into a sequence of inner products between one row of matrix A and one column of matrix B. Each inner product generates the result corresponding to one element of matrix C. This algorithm is illustrated in Algorithm 8.

For the inner-product-based SpGEMM implementation, when the non-zero elements are uniformly distributed and the sparse matrices are tall-and-thin, the computational complexity is O(nnk). Specifically, the method computes matrix C element by element (Cij for i in 1..n and j in 1..n), and this process involves n^2 iterations in total. Inside each iteration is the inner product that joins row Ai with column Bj.

Algorithm 8: inner-product based SpGEMM

Input: Sparse matrix A stored in csr format: OffsetA, ColA

Sparse matrix B stored in csr format: OffsetB, ColB
Output: the result matrix C

1 for i ∈ N do

2 for j ∈ N do

3 Cij ← join(Ai,Bj)

4 end

5 end

The inner product between two sparse lists is essentially a join operation, and a join implementation takes time linear in the list size — k in this case. The total time is therefore O(nnk). For the second case, in which the sparse matrix is short-and-fat and the nonzeros are uniformly distributed among the columns, the computational complexity is O(nkk): since the row dimension of A is k, the dimension of A · A^T is k · k, so the outer two loops that iterate over the rows (i) and columns (j) of matrix C have k^2 iterations. The complexity of the join between

Ai and Bj is n, as there are on average n nonzeros in Ai and Bj. Therefore the total complexity is O(nkk). For the third case, where the distribution is skewed and can be modeled by a power-law curve (βx^α), the complexity of this method is less than O(n^3). Again, this is a rough estimate and gives only a loose upper bound. The complexity of this method depends on the matrix dimension N and is largely independent of the number of non-zero elements, because it requires iterating over

every element in the resulting matrix. Compared to the matrix-vector based SpGEMM method above, when the sparse matrix has only a few nonzero elements, the inner-product-based SpGEMM may not have an advantage; however, the inner-product method can have the advantage when the matrix is dense. Furthermore, the join operation can be more efficient than the merge operation, because it requires less intermediate memory and lends itself better to data parallelism. Moreover, if the problem is masked-SpGEMM, the inner-product method's complexity becomes linear in the number of non-zero elements of the mask, because it can skip the intersections for elements that fall outside the mask.

5.4.3 Outer-Product-Based SpGEMM

The third way to compute SpGEMM is based on outer products. It breaks down the matrix multiplication into a sequence of outer products between one column vector of matrix A and one row vector of matrix B. The partial results are then merged into matrix C. Algorithm 9 illustrates the pseudocode of this method. For the outer-product-based SpGEMM implementation, when the non-zero elements are uniformly distributed and the sparse matrices are tall-and-thin, the computational complexity is O(nkk). Specifically, the method computes matrix C by iterating over every column-row vector pair of A and A^T, which involves n iterations in total (since there are n different columns).

Inside each iteration is the outer product between the i-th column of A and the i-th row of B.

The outer product takes k · k time, because the number of nonzeros in a column of A is on average k, as is the number of nonzeros in a row of B.

Algorithm 9: Outer-product based SpGEMM

Input: Sparse matrix A stored in csr format: OffsetA, ColA

Sparse matrix B stored in csr format: OffsetB, ColB
Output: the result matrix C

1 for i ∈ N do

2 C ← (A^T_i × B_i) ∪ C

3 end

The total time is therefore O(nkk). For the second case, in which the sparse matrix is short-and-fat and the nonzeros are uniformly distributed among the columns, the computational complexity is also O(nkk); the analysis is similar to the first case. For the third case, where the distribution follows a power-law curve (βx^α), the complexity of this method is less than O(n^3).

5.5 Join-Based SpGEMM

5.5.1 The Baseline Join-Based SpGEMM

Suppose the matrix has N rows and M non-zero elements. The baseline join-based SpGEMM iterates through each element of matrix C by rows and columns, and the element cij is obtained by intersecting the i-th row of A (Ai) with the j-th column of B (Bj). Listing 5.1 shows the merge-based intersection method.


int scalar_merge_intersection(int L1[], int n1, int L2[], int n2) {
    int i = 0, j = 0, r = 0;
    while (i < n1 && j < n2) {
        if (L1[i] < L2[j]) {
            i++;
        } else if (L1[i] > L2[j]) {
            j++;
        } else {
            i++; j++; r++;   /* common element found */
        }
    }
    return r;
}

Listing 5.1: A code example of scalar merge-based set intersection

5.5.2 The Data-Parallel Join-Based SpGEMM

Previous research on CPU set intersection has proposed methods that leverage SIMD instructions for acceleration. To use SIMD, the method performs all-pair comparisons between the elements held in two vector registers. The all-pair comparison introduces more computation, yet it is actually faster than the scalar intersection method on CPUs with vector instructions. The reasons are two-fold: first, the all-pair comparisons

can replace the compare-and-advance operations of the scalar intersection. The all-pair comparison is inherently data-parallel, which reduces the number of branch instructions executed and thus the mispredictions that cause expensive pipeline stalls. Second, it can leverage SIMD instructions, which provide more compute capability than scalar instructions alone. In the following, we introduce two different data-parallel join-based implementations for intersection-based SpGEMM on GPUs.

All-pair comparisons. The first method allocates 32 threads in a block to perform the join using 32-by-32 all-pair comparisons. Specifically, the 32 threads first load 32 elements each from row i of matrix A and from column j of matrix A^T, and keep the loaded elements in local registers. Next, the threads compare the elements in a round-robin fashion. In the first iteration, the first thread broadcasts its local value A0 to all the other threads using the shuffle instruction (shfl). Each other thread t compares A0 with its local element Bt; if A0 == Bt, it increments its local counter by 1. This process is repeated for all 32 elements of A. After the 32-by-32 comparison is done, the algorithm proceeds to the next 32 elements. To decide which list advances, the algorithm compares A31 with B31, and the pointer of the list with the smaller value advances by 32 elements. Once either iterator reaches the end of its list, the comparison stops. Afterward, the local counters of all threads in the block are aggregated together as the final result and

stored into Cij. Figure 5.3 illustrates this method. There can also be different blocking choices for the all-pair comparison based joins. For example, we can launch 64 threads in a block and arrange them as 8-by-8 tiles. Figure 5.4 illustrates this variant.

Figure 5.3: All-pair comparison based join implementation: a 32-by-32 implementation. (The 32 threads of a warp perform 32x32 all-pair comparisons by broadcasting each element to all 32 threads and comparing locally.)
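The following CUDA sketch illustrates how such a warp-level all-pair intersection could be written with warp shuffles. It is a simplified illustration of the idea, assuming sorted lists without duplicate indices whose lengths are multiples of 32, and omitting the write-back into Cij; it is not the exact kernel used in the evaluation.

// Warp-level all-pair intersection of two sorted lists (CUDA sketch).
// All 32 threads of a warp call this function together.
__device__ int warp_allpair_intersect(const int *listA, int n1,
                                      const int *listB, int n2) {
    const unsigned full = 0xffffffffu;
    int lane = threadIdx.x & 31;
    int ia = 0, ib = 0, count = 0;
    while (ia < n1 && ib < n2) {
        int a = listA[ia + lane];           // each lane holds one element of A's tile
        int b = listB[ib + lane];           // and one element of B's tile
        // Broadcast every element of the A tile and compare it with the local B element.
        for (int src = 0; src < 32; src++) {
            int a_bcast = __shfl_sync(full, a, src);
            if (a_bcast == b) count++;
        }
        // Advance the tile whose last (largest) element is smaller.
        int a_last = __shfl_sync(full, a, 31);
        int b_last = __shfl_sync(full, b, 31);
        if (a_last < b_last)      ia += 32;
        else if (b_last < a_last) ib += 32;
        else                    { ia += 32; ib += 32; }
    }
    // Reduce the per-lane counts to a warp total (result valid on all lanes).
    for (int off = 16; off > 0; off >>= 1)
        count += __shfl_down_sync(full, count, off);
    return __shfl_sync(full, count, 0);
}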

5.5.3 Join Using Wider Data Types

CUDA also provides wider vector data types (e.g., int2, int3, int4). Loads and stores of these types move more data per instruction than conventional 32-bit accesses — up to 128 bits for int4 — so higher effective memory bandwidth can be achieved. As the scalar join method is memory bound, we can leverage these wider types for acceleration.

Figure 5.4: All-pair comparison based join implementation: an 8-by-8 implementation.

The wider integer types require the arrays to be memory-aligned. One way to fulfill this requirement is to pad the matrix. However, padding comes with extra overhead, and it may not always be available if the input is provided through an external interface. In such situations, we need to take care of the data alignment manually. We break the neighbor list into three regions: the middle region has aligned starting and ending addresses, the prolog region spans from the beginning of the list to the first memory-aligned point, and the epilog region spans from the last memory-aligned point to the end of the list. The idea is to compute the middle region using int4 data types with all-pair data-parallel comparisons, and to compute the prolog and epilog regions using the scalar intersection. Listing 5.4 shows how the intersection using int4 data types is computed. First, we need to compute the prologs of the two lists.

__device__ uintptr_t compute_prelog(int* addr) {
    char* p = reinterpret_cast<char*>(addr);
    // Round the start address up to the next 16-byte (int4) boundary.
    uintptr_t p_ru = (((uintptr_t)(p + 15) >> 4) << 4);
    return p_ru;
}
Listing 5.2: function to get the aligned address of the list start

__device__ uintptr_t compute_epilog(int* addr) {
    char* p = reinterpret_cast<char*>(addr);
    // Round the end address down to the previous 16-byte (int4) boundary.
    uintptr_t p_rd = (((uintptr_t)p >> 4) << 4);
    return p_rd;
}
Listing 5.3: function to get the aligned address of the list end

The compute_prelog function in Listing 5.2 obtains the prolog region by rounding the list's starting address up to the closest memory-aligned point. Elements from the beginning of each list to its aligned rounding point are intersected using the scalar method. Similarly, the compute_epilog function in Listing 5.3 obtains the epilog region by rounding the list's ending address down to the closest memory-aligned point, and the epilog regions are intersected using the scalar intersection. The middle regions between the memory-aligned starting and ending points are perfectly aligned and can therefore be intersected using 4-by-4 all-pair data-parallel intersections.

__device__ int int4_merge_intersection(int L1[], int n1, int L2[], int n2) {
    int r = 0;
    int *p1 = L1, *end1 = L1 + n1;
    int *p2 = L2, *end2 = L2 + n2;
    uintptr_t prelog1 = compute_prelog(L1);
    uintptr_t prelog2 = compute_prelog(L2);
    uintptr_t epilog1 = compute_epilog(end1);
    uintptr_t epilog2 = compute_epilog(end2);

    // compute the prolog via scalar intersection (up to the aligned starting points)
    while ((uintptr_t)p1 < prelog1 && (uintptr_t)p2 < prelog2) {
        if (*p1 < *p2)       p1++;
        else if (*p1 > *p2)  p2++;
        else               { p1++; p2++; r++; }
    }

    // compute the middle using wide data types:
    // convert the aligned addresses to int4* pointers
    int4* q1    = reinterpret_cast<int4*>(prelog1);
    int4* q2    = reinterpret_cast<int4*>(prelog2);
    int4* q1end = reinterpret_cast<int4*>(epilog1);
    int4* q2end = reinterpret_cast<int4*>(epilog2);
    while (q1 < q1end && q2 < q2end) {
        int4 a = *q1;
        int4 b = *q2;
        r += (a.x==b.x) + (a.x==b.y) + (a.x==b.z) + (a.x==b.w);
        r += (a.y==b.x) + (a.y==b.y) + (a.y==b.z) + (a.y==b.w);
        r += (a.z==b.x) + (a.z==b.y) + (a.z==b.z) + (a.z==b.w);
        r += (a.w==b.x) + (a.w==b.y) + (a.w==b.z) + (a.w==b.w);
        // advance the block whose largest element is smaller
        if (a.w < b.w)       q1++;
        else if (a.w > b.w)  q2++;
        else               { q1++; q2++; }
    }

    // compute the epilog using scalar intersection:
    // convert the int4 pointers back to int*, starting from where the middle left off
    int* pi = (int*)q1;
    int* pj = (int*)q2;
    while (pi < end1 && pj < end2) {
        if (*pi < *pj)       pi++;
        else if (*pi > *pj)  pj++;
        else               { pi++; pj++; r++; }
    }
    return r;
}

Listing 5.4: GPU intersection with wide data types

5.6 Join-Based SpGEMM Using Hash Methods

We have discussed the join-based SpGEMM methods above; the primitive operation there is the intersection (join) between rows of matrix A and columns of matrix B. The previous section mainly focused on the pointer-based merge intersection, which takes no additional memory but is inherently sequential, and we explored different ways to make this scalar join data-parallel within a warp. Another way to intersect sparse data is to use a hash table, and there are different ways to organize it. Our hash table is based on linear-probing open addressing, with a fixed-size table per row. One way to leverage hashing is to build a hash table for a row of one operand and then probe it with the elements of the other operand to complete the row-matrix multiplication. There are different ways to implement hash-based join SpGEMM depending on how the threads are assigned. The first way is to assign all the threads in a block to process the same row. Figure 5.5 shows the details of this process, which consists of three steps. The first step builds a hash table for K rows of matrix B (K is the blocking parameter); to differentiate elements from different rows, we use rowid · N + columnid as the hash key. Next, the algorithm performs the multiplication between each row of A and the block of B by indexing into the hash table. Finally, the results of all the threads in this block are aggregated

together into Cij. Another way is to assign the threads in a block to process different rows. Step one is the same as in the method above: build a hash table for K rows of matrix B (K is the blocking parameter). But in step two, when indexing the nonzeros of A, this approach assigns each thread to a different row of A. In step three, unlike the previous method, no block-wise reduction is needed — the local sum at each thread already corresponds to the result of one row. Figure 5.6 shows the details of this process.

Figure 5.5: Hash-based SpGEMM implementation: parallel across columns. (Step 1: build a hash table for the block in shared memory; Step 2: index the nonzeros of A into the hash table; Step 3: perform a warp-wise reduction.)

Figure 5.6: Hash-based SpGEMM implementation: parallel across rows. (Step 1: build a hash table for the block in shared memory; Step 2: index the nonzeros of A into the hash table; Step 3: perform a per-thread reduction.)

5.7 Online Scheduling

As discussed in Section 5.3.2, many applications perform SpGEMM iteratively. Since the density of the input sparse matrix can change across iterations, the SpGEMM algorithm should adapt to the density pattern of the input. We therefore need an online scheduling mechanism that is capable of picking the best-performing SpGEMM algorithm in each iteration. The design of the online scheduling mechanism should take two aspects into account. First, as the scheduling happens online, it must be light-weight, i.e., the time to compute the scheduling decision should be negligible compared to the SpGEMM computation time. Second, the scheduling mechanism should respond appropriately to changes in the input density characteristics.

There are two approaches to designing the online scheduling mechanism. The first approach is based on analytical modeling: one builds a theoretical model that works out which SpGEMM method performs best under a multitude of situations, and during the online computation one simply picks the solution suggested by the analysis. The advantage of this approach is the assurance it provides about the outcome; the disadvantage is that it requires constructing the analytical model, and evaluating the analysis online may be too expensive. More recently, a lot of research has investigated using machine learning models to replace traditional analytical models; for example, learning-based approaches have been applied in database tasks such as indexing and searching to replace traditional data structures. Compared to analytical modeling, a learning-based approach does not offer strong correctness guarantees. But machine learning models have two advantages: first, their computation is usually regular; second, they are robust to noise in the input data. In this work, we use machine-learning-based scheduling, for two reasons: first, there are a variety of SpGEMM algorithms and data structures, and it is impractical to construct analytical solutions for all the combinatorial cases; second, the computation of a machine learning model can be fast and light-weight, and previous work has demonstrated that it can substitute for complex analytical models in online scheduling. The online scheduling procedure is illustrated in Figure 5.7.

Figure 5.7: Online scheduling strategy: the rows of the sparse input are sorted by degree and subsampled to obtain a density feature vector, the neural network scheduler maps this vector to an SpGEMM method choice (e.g., join_hash, merge, join_int4), and the chosen method computes the SpGEMM A · A^T.

We use a fully-connected neural network as our scheduling model. The reasons are two-fold: a fully-connected neural network is a standard building block for a classifier, and it is simple and cheap to evaluate, making it a light-weight choice for online scheduling. The input to the scheduling model is density information about the sparse matrix. If the sparse matrix is pre-sorted by row degrees, the power-law parameter can be approximated from the row density histogram; therefore, the number of elements per row can be used to approximate the distribution. However, the neural network requires a fixed-size input, while the dataset dimension can vary. To obtain a fixed input dimension, we subsample the rows of the input matrix uniformly, taking a fixed number of rows and using their densities (the number of nonzeros divided by N) as the input to the NN model. The model acts as

a simple classifier and outputs a single value that represents the SpGEMM method choice; the corresponding SpGEMM implementation is then selected to compute the sparse matrix multiplication.
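A minimal sketch of the feature-extraction step is shown below: it uniformly subsamples a fixed number of rows of a degree-sorted CSR matrix and records each sampled row's density as the scheduler's input vector. The function name, the argument layout, and the feature count of 100 are assumptions for this sketch (the text above only specifies a fixed-size density input).

#define NUM_FEATURES 100

/* Build the scheduler's input: densities of NUM_FEATURES rows sampled
 * uniformly from a degree-sorted CSR matrix with n rows and N columns.
 * row_ptr has n+1 entries; features must hold NUM_FEATURES floats. */
void extract_density_features(const int *row_ptr, int n, int N,
                              float *features) {
    for (int f = 0; f < NUM_FEATURES; f++) {
        /* pick row indices spread uniformly over [0, n) */
        int row = (int)((long long)f * n / NUM_FEATURES);
        int nnz_row = row_ptr[row + 1] - row_ptr[row];
        features[f] = (float)nnz_row / (float)N;   /* density = nonzeros / N */
    }
}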

5.8 Experiments

5.8.1 Experimental Setups

Dataset

In the experiments, we use both synthetic and real-world datasets. The synthetic datasets are generated using the Boost power-law graph generator [121]. The generator produces graphs whose node degrees follow a power-law distribution:

P(x) = \beta x^{-\alpha}    (5.3)

The parameter α controls the skewness of the data distribution. The larger the α, the more skewed the dataset is. When α = 0, the generated matrix has uniformly distributed nonzeros. β controls the sparsity of the matrix by changing the average number of elements per row.

Methods

In the experiments, we compare our four join-based SpGEMM implementations with a baseline merge-based SpGEMM implementation from cuSPARSE. The four join-based SpGEMM implementations are join_baseline,

join_int4, join_dp, and join_hash. These four implementations are all based on the SpGEMM-using-intersection method illustrated in Algorithm 8. More specifically, we evaluate two SpGEMM computations in our experiments: A · A^T (non-masked SpGEMM) and A ⊙ (A · A^T) (masked SpGEMM). These two computations are primitive operations in many graph analytics tasks, where A is the input graph stored as a sparse matrix. In our experiments, we assume A is stored in CSR format. To compute A · A^T, the join-based approach first allocates a dense matrix to store the temporary result. The dense matrix takes O(N^2) space, where N is the number of vertices. Next, the join-based approach computes the intersection for every row-column pair of A and A^T. Note that the i-th column of A^T is the i-th row of A, so no explicit transpose is needed. How the four join-based SpGEMM methods perform the intersection is summarized below. After intersecting every row-column pair, the result matrix C, temporarily stored in the dense O(N^2) matrix, is compacted and transformed into CSR format. To compute the masked SpGEMM A ⊙ (A · A^T), the join-based SpGEMM methods first allocate memory of size O(nnz), where nnz is the number of nonzeros of matrix A. Next, the method performs the intersections only at the positions of A that hold nonzero elements.

• Join_baseline. The join_baseline method computes SpGEMM by intersecting every row and column of the matrices A and B, as illustrated in Algorithm 8. The intersection is based on the sequential intersection method shown in Listing 5.1. The computation is parallelized such that a warp is assigned to compute one row of matrix C, and each thread is assigned to a different column. In the following experiments, unless noted otherwise, the block size is set to 64 and the grid size is set to 1024.

• Join_int4. The join_int4 method is similar to join_baseline. It uses the same join-based SpGEMM idea illustrated in Algorithm 8, but when performing the two-list intersection it uses the wider data types (int4) discussed in Section 5.5.3. Listing 5.4 illustrates this method.

• Join_dp. Join_dp refers to the methods that use the data-parallel all-pair comparison intersection scheme described in Section 5.5.2. Among them, join_dp8x8 assigns the 32 threads of a warp to perform 8-by-8 all-pair comparisons, and join_dp32x1 assigns the 32 threads to perform 1-by-32 all-pair comparisons.

• Join_hash. Join_hash is the join-based SpGEMM method that uses a hash table for the intersection, as described in Section 5.6.

• Merge_cusparse. Merge_cusparse refers to the cuSPARSE library implementation, which is based on the merge method.

System

The experiments in this section are run on a workstation with a TITAN V GPU. The TITAN V GPU has 12 GB HBM2 memory with 5120 cores, 80

streaming multiprocessors. The total memory bandwidth is 652.8 GB/s. The CPU is a 4-core Intel Core i7-4770K running at 3.5 GHz. The GPU programs were compiled with NVIDIA's nvcc compiler (version 8.0.44). The C code was compiled using gcc 4.9.3. All results ignore transfer time (disk-to-memory and CPU-to-GPU). The merge-path operation is from the Modern GPU library [122]. The version of cuSPARSE used is 8.0 [123].

5.8.2 Performance of GPU Joins with Very Sparse Matrices

In this experiment, we compare our join-based SpGEMM implementations on very sparse matrices (i.e., the average number of nonzeros per row is fixed and small). The methods compared include the sequential join, the two data-parallel join variants, and the merge-based SpGEMM. We fix the matrix dimension N to 1000 while varying the average nnz per row from 0 to 50. The experiment is based on masked-SpGEMM. Figure 5.8 shows the performance results. We can see that merge is slower than the join-based approaches. There are two reasons for this: first, the merge approach does not efficiently utilize all the threads in a block when the number of nonzeros per row is small, as the merge-based method assigns an entire thread block to process one row. When the nonzeros in a row are fewer than the thread

block size, some threads will be idle. Second, when the nonzeros per row are very small, the overheads of the hash-based merge kick in: iterating over the hash table takes time linear in the hash table size, which is much larger than the number of non-zeros.

Figure 5.8: Comparison of join implementations on very sparse matrices (synthetic data, masked-SpGEMM).

5.8.3 Performance of Join-Based SpGEMM

Figure 5.2 presented the analysis of how the computational complexities of these algorithms are affected by the density of the matrix. In Figure 5.9 and Figure 5.10, we show execution time comparisons of these algorithms on synthetically generated power-law matrices with different densities. We control the density of the sparse matrices by changing the β value

of the power-law generator when generating these synthetic datasets. From the results, we can see that the relative performance of the SpGEMM methods can differ considerably depending on the matrix density. Figure 5.9 shows the execution time on sparse matrices of different sizes whose density is 0.3. For this density, the join-based methods generally perform better than the merge-based methods: the sequential join-based method is faster than the merge-based method (cuSPARSE) by 40% to 10x, and the optimized join method leveraging the wider data types (join_int4) can be up to 2x faster than the sequential join. Figure 5.10 shows the execution time on sparse matrices whose density is 0.01. For this density, the join-based methods do not perform as well as the merge-based methods, because join-based SpGEMM has much higher complexity when the matrices are sparse: it checks and performs many unnecessary joins for elements that do not appear in the result.

5.8.4 Comparison on Different Densities

Figure 5.11 shows the performance of different SpGEMM methods at varying densities. The result is based on general SpGEMM (without a mask). We can see that for sparse matrices whose density is less than 0.1, the merge-based methods are better; this is because their complexity is lower than that of the join methods, which perform many unnecessary joins even for elements that are zero in the result. When the density is above 0.1, join-based SpGEMM methods are faster than merge-based methods. This is because the amount of computation

Figure 5.9: Comparison of join implementations on relatively dense synthetic data in which the density is 0.3.

Figure 5.10: Comparison of join implementations on relatively sparse synthetic data in which the density is 0.01.

performed by the merge method and the join method is comparable when the matrices are relatively dense. On top of that, the merge method incurs additional overheads; in particular, when the result is relatively dense, there is heavy contention in the hash table during the merge operation.

Figure 5.11: Comparison of join implementations at varying densities.

5.8.5 Performance on the Hash-Based Join Kernels

In this section, we compare the performance of our hash-based SpGEMM implementations. The results are shown in Figure 5.12 and Figure 5.13. Hash1 refers to the method where the threads in a warp collectively process the columns of the same row, and Hash2 refers to the method where the threads in a block process different rows. Experiments are conducted on synthetic

datasets whose density is 0.05. Our implementation uses an open-addressing hash. In the implementation, we build a 1K hash table for each row, and warps build hash tables for blocks of rows simultaneously. The results in Figure 5.12 and Figure 5.13 demonstrate the impact of different blocking parameters on performance. Both experiments have similar settings, but the hash table size per row changes; Figure 5.13 uses a larger hash table size per row. We can see from the results that Hash2 (threads parallel across different rows) is faster than Hash1 (threads access different columns), and that increasing the hash table size improves performance, because a larger hash table reduces contention during hash construction.

Performance on the Hash-Based Join Kernels with Varying Densities

In this experiment, we compare the performance of our hash-based SpGEMM implementations against the merge method. The general SpGEMM (without a mask) problem is used in this experiment. The results are shown in Figure 5.14 and Figure 5.15. Figure 5.14 shows the performance on synthetic datasets whose density is 0.3; the hash-based implementation can be 10x faster than the merge-based method. Figure 5.15 shows the performance on datasets whose density is 0.01; for such sparse datasets, the merge-based cuSPARSE is faster than the hash-based implementations. This is because for very sparse datasets, the complexity of the join-based methods is related to N^2, which is much higher than that of the merge-based method.

Figure 5.12: Comparison of hash-based join implementations on datasets of density 0.05. The hash table size is 1KB per row; tblsz = 1K/2K/4K refers to building one hash table for 1/2/4 rows simultaneously.

Figure 5.13: Comparison of hash-based join implementations on datasets of density 0.05. The hash table size is 2KB per row; tblsz = 2K/4K/8K refers to building one hash table for 1/2/4 rows simultaneously.

Figure 5.14: Comparison of hash-based join implementations on relatively dense synthetic data in which the density is 0.3.

5.8.6 Performance on Real-World Dataset

We evaluate our implementations on real-world datasets collected from the SuiteSparse Matrix Collection; their details are shown in Table 5.1. In this experiment, we compare our iterative SpGEMM implementation with machine-learning-based scheduling against the baseline method, which keeps using the merge-based SpGEMM implementation across iterations. Specifically, our scheduling method is based on the neural network

Figure 5.15: Comparison of hash-based join implementations on relatively sparse synthetic data in which the density is 0.01.

scheduler described in Section 5.7. The scheduler is a two-layer fully connected network; each layer has 100 neurons and is followed by a softmax layer at the end. The output of the network is a scalar that is rounded to an integer between 0 and 5, representing the join_baseline, join_int4, join_dp, merge_cusparse, and join_hash based methods respectively. In each iteration, we first execute the scheduling neural network and, based on its output, choose the corresponding SpGEMM method. From the experimental results, we can see that our scheduling-based implementation can be up to 3.6x faster than the non-scheduling one after running for three iterations. The speedups come from the fact that the matrix becomes progressively denser over the iterations, and our scheduling mechanism is capable of capturing the density change and switching to a faster SpGEMM implementation accordingly. For example, the bcsstk13 dataset has a sparse input matrix A with on average about 20 elements per row. The first A · A^T iteration is performed using the merge-based method, but after the first iteration the density of the resulting matrix increases significantly. Our scheduling mechanism adapts to this change in density and switches to the hash-based SpGEMM method (join_hash) for the subsequent iterations, resulting in over 3x speedup compared to the baseline method that sticks to the merge-based implementation. On the other hand, we did not observe speedups for the datasets bcsstk17 and cant. These two datasets did not trigger a change in SpGEMM method, either because their density did not change much across iterations or because the neural network scheduler did not produce the correct prediction. In these cases there is no speedup, and there can even be a slight performance drop because of the extra overhead incurred by running the scheduler.

Table 5.1: Performance comparison on iterative SpGEMM computations with real-world datasets.

Dataset      No. nodes   No. edges   baseline       scheduling     speedup
bcsstk13     2003        42943       23.777984      6.475968       3.675
bcsstk17     10974       428650      79.097119      89.09          0.88
cant         62451       2034917     4088.257021    4085.100191    1.00
opt1         15450       973052      4681.783642    951.42         4.92
msc          10849       620313      7175.93        522.25         13.74

5.8.7 Performance on Real-World Graph Datasets with Skewed Distribution

Many graph datasets have skewed distributions, in which some rows have far more elements than others, and this introduces a load balancing problem in sparse matrix multiplication. To tackle the load balancing problem, we introduce a diverse set of thread assignment strategies. Specifically, we divide the input matrix into regions of different row densities and assign different thread blocks to process different regions. For example, we could group rows whose density exceeds 50% together and assign large thread blocks to process such rows, while assigning smaller thread blocks to the sparser rows to achieve better resource utilization. Therefore,

we introduce the following assignment plans to tackle skewed datasets. We assume the input matrix is presorted by row degree, and we have a pool of pre-defined thread assignment strategies as listed below (a launch sketch follows the list). At run time, a light-weight analysis is performed on the input matrix densities and the best-fit thread assignment strategy is selected based on the analysis:

• Assignment strategy 1. The whole matrix is treated uniformly. Each row is assigned one thread block of 32 threads.

• Assignment strategy 2. The whole matrix is treated uniformly. Each row is assigned one thread, and each thread block has 32 threads.

• Assignment strategy 3. The matrix is divided into a "dense" region and a "sparse" region. The "dense" region consists of the top twenty percent of rows (after sorting by row degree in descending order), and the "sparse" region is the rest of the rows. The top twenty percent of rows are assigned thread blocks of size 256, and the remaining rows are assigned thread blocks of size 32.

• Assignment strategy 4. The matrix is divided into a "dense" region and a "sparse" region. The "dense" region consists of the top ten percent of rows after degree sorting, and the "sparse" region is the rest of the rows. The top ten percent of rows are assigned thread blocks of size 256, and the remaining rows are assigned a single thread per row with a thread block size of 32.
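A minimal CUDA sketch of how a two-region strategy (e.g., strategy 3) could be launched is shown below: the degree-sorted rows are split at a density threshold, and the two regions are processed by the same block-per-row kernel launched with different block sizes. The kernel body, names, and the symbolic (count-only) output are assumptions made for this illustration, not the tuned kernels of this chapter.

#include <cuda_runtime.h>

// Hypothetical block-per-row kernel: the threads of a block cooperate on one row
// of C = A * A^T by intersecting row (row_offset + blockIdx.x) of A with every row of A.
__global__ void row_spgemm_kernel(const int *rowPtr, const int *col,
                                  int n, int *nnzC, int row_offset) {
    int i = row_offset + blockIdx.x;
    int count = 0;
    for (int j = threadIdx.x; j < n; j += blockDim.x) {     // each thread takes columns j
        int a = rowPtr[i], b = rowPtr[j], c = 0;
        while (a < rowPtr[i + 1] && b < rowPtr[j + 1]) {    // scalar merge intersection
            if (col[a] < col[b])       a++;
            else if (col[a] > col[b])  b++;
            else                     { a++; b++; c++; }
        }
        if (c > 0) count++;                                 // count nonzeros of row i of C
    }
    atomicAdd(&nnzC[i], count);
}

// Two-region launch (rows presorted by degree, descending): the top dense_frac of
// rows gets 256-thread blocks, the rest gets 32-thread blocks, one block per row.
void launch_two_region(const int *d_rowPtr, const int *d_col,
                       int n, int *d_nnzC, float dense_frac) {
    int split = (int)(dense_frac * n);
    if (split > 0)     row_spgemm_kernel<<<split, 256>>>(d_rowPtr, d_col, n, d_nnzC, 0);
    if (n - split > 0) row_spgemm_kernel<<<n - split, 32>>>(d_rowPtr, d_col, n, d_nnzC, split);
    cudaDeviceSynchronize();   // wait for both regions to finish
}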

The experiment in this section is based on the masked-SpGEMM problem.

We extend join_baseline with the above assignment strategies. Figure 5.16 shows the performance of these different assignment strategies on real-world graph data collected from the SNAP datasets [86]; the graph dataset statistics are shown in Table 5.2. We can see that different datasets favor different assignment strategies; there is no one-size-fits-all strategy. To pick the best assignment strategy, we leverage the neural network scheduler to perform a light-weight analysis of the input matrix densities. We re-train the neural network scheduler with these additional assignment options. The input to the neural network is the same as before: a 1x100 vector that represents the densities of 100 rows subsampled from the sparse matrix. The output now becomes an integer value from 0 to 9, instead of from 0 to 5; the values 6, 7, 8, and 9 refer to assignment strategies 1, 2, 3, and 4 respectively. After retraining, the accuracy of our neural network predictor drops from 83.8% to 72.2%, but it is able to incorporate the different assignment strategies for skewed datasets. Note that the above experiment is a preliminary step towards solving the load balancing issue for skewed datasets. We have only demonstrated four assignment strategies, and only on the join_baseline implementation, but we believe this idea can easily be extended to more assignment strategies and incorporated into more join- or merge-based SpGEMM methods. Additionally, how to better design the neural network scheduler for higher accuracy and speedup is an interesting open question that can be left as future work.

Table 5.2: Graph dataset characteristics.

Dataset         N           M           avg nnz/row
cit-HepPh       34,546      421,578     12.2
cit-HepTh       27,770      352,807     12.7
web-Stanford    281,903     2,312,497   8.2
com-dblp        317,080     1,049,866   3.3
ca-CondMat      18,772      198,110     10.5
web-Google      875,713     5,105,039   5.8
roadNet-PA      1,088,092   1,541,898   1.4
p2p-Gnutella    10,876      39,994      3.6

Figure 5.16: Comparison of the four thread assignment strategies on real-world graph datasets (performance in GFLOPS).

Chapter 6

Summary and Future Directions

Contents

6.1 Summary...... 162

6.2 Takeaways and Reflections...... 164

6.3 Future Directions...... 167

In this dissertation, we investigate the acceleration and performance tuning of the motifs in machine learning tasks on modern processors. The motifs form the backbone for understanding the interplay between software optimizations and hardware architecture, which is a fundamental step towards building portable and efficient end-to-end ML systems. In particular, this work focuses on two sets of computational kernels: the dense computational kernels in convolutional neural networks, and the sparse kernels in data analytics tasks such as social network analysis. These kernels are fundamental to a majority of machine learning tasks and have received tremendous attention

recently in both academia and industry. Figure 6.1 gives an overview of this dissertation.

Figure 6.1: Overview of this dissertation. Traditional high-performance scientific libraries fail to optimally address ML kernels. For dense convolutional neural networks, the reason is suboptimal implementations with degraded performance and memory overheads; this work shows how to apply software optimization techniques correctly given the underlying architectures. For sparse kernels in graph analytics, the reason is implementations not optimized for modern architectures, or interfaces that do not cover general sparse operations; this work contributes vectorization and specialization acceleration methods and the investigation and acceleration of general sparse matrix kernels.

6.1 Summary

In chapter 2, we investigate the optimization of dense kernels in machine learning using convolutional neural networks as an example. The computation of convolution layers in deep neural networks typically relies on high-performance routines that trade space for time, using additional memory (either for packing purposes or as part of the algorithm) to improve performance. The contributions are:

• We demonstrate that a high-performance direct convolution implementation can not only eliminate all additional memory overhead, but also attain higher performance than expert-implemented matrix-matrix-multiplication based convolution.

• We also show that our implementation scales to a larger number of processors without degradation in performance, as it exploits the dimension of the kernel that has the highest amount of parallelism. In contrast, current high-performance matrix-multiply based implementations do not scale as well to a larger number of processors.

In chapter 4 and chapter 5, we investigate the optimization of sparse kernels. Sparse kernels present a different set of challenges compared to dense computations. We present techniques that improve the data parallelism of the irregular computations in sparse kernels, and we investigate the effects of algorithm and data structure choices on performance. Overall, in these chapters, we demonstrate how to accelerate the notoriously irregular sparse operations on modern processors (i.e., CPUs and GPUs).

• We primarily focus on sparse matrix multiplication with another sparse matrix (SpGEMM), as it is a dominant and versatile component in social science analytics tasks, yet it typically attains very low fractions of peak processor performance.

• We demonstrate how the implementation of SpGEMM can be broken down into different combinations of primitive operations such as

merge/join, and we explore how the different implementation choices affect performance under varying inputs.

6.2 Takeaways and Reflections

Algorithm and implementation codesign

Algorithms. Algorithm choice is an important factor that determines the foundation of computational complexity. For sparse applications, we demonstrate that there is no one-size-fits-all algorithm, and that the algorithm should adapt to the input. Implementations. Although the choice of algorithm is a fundamental factor affecting the total complexity, it is not the only determining factor; the implementation has an equally important influence. In performance optimization and tuning, the implementation can sometimes override the algorithm choice — an algorithm with lower complexity may end up performing poorly on modern architectures. Evidence for this appears in both our dense and sparse work. In the optimization of dense convolutional neural networks, we demonstrated that algorithms with lower complexity, such as FFT-based and Winograd-based implementations, do not have an advantage over direct convolution methods in either memory or performance. This is because, despite their low complexity, FFT and Winograd implementations have computational patterns that fail to make the most of the computational resources of today's processors, and they also come

with extra memory overheads which can result in inefficiencies. Similarly, for the sparse applications, we see that all-pair comparisons (the inner-product based method), despite their higher complexity, can sometimes be faster than merge-based methods on real systems. The results obtained in this dissertation emphasize the importance of algorithm and implementation co-design when tuning performance on modern processors.

AI vs. scientific computational motifs

Scientific computational kernels appear in domains such as linear algebra, signal processing, and differential equations, and there are decades of research investigating the computational motifs in these domains. AI and machine learning, as an emerging domain, share similarities as well as differences with traditional scientific computing in their computational motifs. Broadly speaking, the motifs of both domains can be categorized by their access patterns as dense and sparse. For example, neural networks are similar to traditional linear algebra computations in scientific domains, because both are dense and regular. Graph and data-analytics tasks in the learning and data science domains are similar to traditional sparse scientific problems such as differential systems, in that both work on sparse data with random and irregular computational patterns. Despite the dense/sparse similarities, however, when it comes to the computational details, AI/machine learning tasks and scientific tasks are not exactly the same. For example, the data they work on can have different dimensions, the computation can have different

orderings, and the data types can be different. All of these factors result in different data reuse and blocking choices, so the optimized implementation strategies differ. Thus, traditional libraries designed for scientific applications cannot be used directly for the emerging AI applications. On the other hand, the general lessons and methods learned from optimizing scientific applications can be adapted to the emerging domains when it comes to general design principles, performance models, and optimization approaches.

Performance tuning

Performance comes from the co-design of algorithms and implementations. To attain high performance on a given platform, traditional approaches fall into two categories. The first is the auto-tuning approach, which treats the application as a black box and enumerates the possible design parameters to find the best configuration. The second is analytical modeling, which constructs a model of the underlying system and attempts to derive the best configuration from the analysis. The auto-tuning approach can be inefficient, because the search space can be explosively large and impractical to enumerate, especially with the growing complexity of today's hardware designs. In contrast, an analytical model can leverage expert knowledge to alleviate the search burden; but for new applications and systems it may fail to accurately capture the system or application characteristics. Additionally, it cannot incorporate the actual system

166 response. Model verification can be a nontrivial task. Therefore, a combina- tion of analytical modeling and auto-tuning may be a more practical choice for the future performance tuning research.
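A minimal sketch of such a combination is shown below, assuming a hypothetical blocked kernel whose block sizes form the search space: an analytical cache-capacity model prunes candidate configurations, and empirical timing of the survivors supplies the system response that the model alone cannot capture. The structure, names, and cache model are illustrative, not the tuning infrastructure used in this work.

    #include <chrono>
    #include <limits>
    #include <vector>

    struct Config { int mb, nb, kb; };   // hypothetical block sizes to be tuned

    // Analytical pruning: keep only configurations whose working set fits in an
    // assumed cache budget, expressed here in number of floats.
    static bool fits_in_cache(const Config& c, long cache_floats) {
      long working_set = (long)c.mb * c.kb + (long)c.kb * c.nb + (long)c.mb * c.nb;
      return working_set <= cache_floats;
    }

    // Empirical step: time the kernel under each surviving configuration and
    // keep the fastest one.
    template <typename Kernel>
    Config hybrid_tune(const std::vector<Config>& candidates, long cache_floats,
                       Kernel&& run_kernel) {
      Config best{};
      double best_time = std::numeric_limits<double>::max();
      for (const Config& c : candidates) {
        if (!fits_in_cache(c, cache_floats)) continue;  // model prunes this point
        auto t0 = std::chrono::steady_clock::now();
        run_kernel(c);                                  // measure the real response
        auto t1 = std::chrono::steady_clock::now();
        double t = std::chrono::duration<double>(t1 - t0).count();
        if (t < best_time) { best_time = t; best = c; }
      }
      return best;
    }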

6.3 Future Directions

We foresee a variety of directions and topics that could benefit from the work and ideas in this dissertation. In this last section, we discuss and suggest future research opportunities.

Accelerating diverse machine learning “motifs”

With the booming growth of machine learning research, learning models are progressing rapidly. New models appear one after another, and existing models are constantly updated, optimized, and adapted to different scenarios. For example, much research has proposed ways to prune and compress CNN models. Some work focuses on fine-grained pruning, which removes neurons with small values from the network and results in irregular layer structures where nonzeros appear at random positions. Other work focuses on structured pruning, which removes entire rows, columns, or layers of neurons. In addition, a great deal of research over the years has studied how to compress networks using fixed-point number representations. On top of this, network model structures keep evolving to fit different domains; examples include 3D neural networks and networks with mixed filter shapes. All of these models require further acceleration and performance tuning, and we believe there are plenty of opportunities for hardware/software co-design. Such acceleration needs to account for both hardware features and computational patterns, and we expect these follow-ups to benefit from the optimization principles discussed in this thesis.
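To make the two pruning styles concrete, the sketch below contrasts fine-grained magnitude pruning with a simple structured (row-wise) variant on a row-major weight matrix; the function names, thresholds, and pruning criteria are illustrative assumptions rather than any particular published scheme.

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <vector>

    // Fine-grained pruning: zero every weight whose magnitude falls below a
    // threshold. Nonzeros end up scattered irregularly, which pushes the
    // resulting layers toward sparse formats and kernels.
    void prune_fine_grained(std::vector<float>& w, float threshold) {
      for (float& x : w)
        if (std::fabs(x) < threshold) x = 0.0f;
    }

    // Structured pruning: zero entire rows (e.g., whole filters) whose L1 norm
    // falls below a threshold. The surviving structure stays dense and regular.
    void prune_rows(std::vector<float>& w, int rows, int cols, float threshold) {
      for (int r = 0; r < rows; ++r) {
        float* row = w.data() + static_cast<long>(r) * cols;
        float norm = std::accumulate(row, row + cols, 0.0f,
                                     [](float s, float x) { return s + std::fabs(x); });
        if (norm < threshold)
          std::fill(row, row + cols, 0.0f);
      }
    }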

Implications for AI/ML accelerator design

An increasing amount of work focuses on the design of accelerators for learning applications, and hardware companies in industry have released new architectures and accelerator engines for them. Many of these designs target matrix or tensor operations, since matrix/tensor multiplication is believed to be at the heart of machine learning models such as neural networks. The work in this dissertation can provide insights for such accelerator designs: we have provided a model-based analysis of convolutional neural networks and demonstrated their data reuse patterns and parallelism structure, which can be applied directly when designing memory hierarchies and execution engines.
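As one example of the kind of reuse analysis that can inform such designs, the snippet below estimates the arithmetic intensity of a single convolution layer under the idealized assumption that inputs, weights, and outputs each cross the off-chip interface exactly once; the layer shape is an arbitrary example chosen for illustration.

    #include <cstdio>

    int main() {
      // Assumed example layer (unit stride, no padding); sizes are illustrative.
      long C_in = 128, C_out = 128, H = 28, W = 28, K = 3;
      long H_out = H - K + 1, W_out = W - K + 1;

      double flops = 2.0 * C_out * C_in * K * K * H_out * W_out;  // multiply + add
      double bytes = 4.0 * (C_in * H * W                 // input, read once
                            + C_out * C_in * K * K       // weights, read once
                            + C_out * H_out * W_out);    // output, written once
      std::printf("arithmetic intensity ~ %.1f flop/byte\n", flops / bytes);
      // A high ratio like this suggests how much on-chip buffering is needed
      // before the execution engine, rather than memory bandwidth, sets the pace.
      return 0;
    }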

Implications for cross-stack system optimizations

Over recent years, a number of machine learning libraries and systems have been built and released, such as PyTorch, TensorFlow, and GraphLab. An efficient system would benefit from cross-stack optimization, for two reasons. First, cross-stack optimization enables maximal customization from hardware through compiler to algorithm, and this thorough customization ensures that resources are fully utilized. Second, cross-stack design can enable performance portability. Performance portability across platforms and processors is a nontrivial task: different platforms and processors have distinct architectural characteristics, there is hardly a one-size-fits-all implementation that can overcome all the differences, and each platform has its own languages and instructions, so programs optimized for one platform cannot be migrated directly to another. In our work, however, we have demonstrated that implementation strategies can share similarities across processors (across different GPUs, and also across CPU and GPU). Moreover, in the optimization of sparse applications, we demonstrate how a computational kernel (SpGEMM) can be broken down into a series of primitive operations (join, union, etc.), which admits a variety of implementations. There is therefore an opportunity for a high-level DSL that can express different algorithms, lower the chosen algorithm to implementation methods, and then to platform-specific code. Such a layered DSL transformation separates low-level implementations from high-level algorithms, facilitating performance portability across platforms.
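As a sketch of this decomposition (illustrative only, not the dissertation's implementation), the code below expresses one output entry of an inner-product SpGEMM as a join over two sorted index lists: a CSR row of A and a CSC column of B, each given as an index vector and a value vector.

    #include <cstddef>
    #include <vector>

    // Join primitive: walk two sorted index lists and accumulate the product of
    // values wherever the indices match (a sorted-merge intersection).
    float join_dot(const std::vector<int>& ia, const std::vector<float>& va,
                   const std::vector<int>& ib, const std::vector<float>& vb) {
      float acc = 0.0f;
      std::size_t p = 0, q = 0;
      while (p < ia.size() && q < ib.size()) {
        if (ia[p] == ib[q])     { acc += va[p] * vb[q]; ++p; ++q; }
        else if (ia[p] < ib[q]) { ++p; }
        else                    { ++q; }
      }
      return acc;
    }

A DSL could keep this algorithmic description fixed while swapping the join primitive for a merge-, hash-, or SIMD-based variant on different platforms; only the implementation layer changes, not the algorithm expressed above it.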
