Performance Analysis for Machine Learning Applications


Citation Wang, Yu. 2020. Performance Analysis for Machine Learning Applications. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.

Citable link https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37365132

Performance Analysis for Machine Learning Applications

a dissertation presented by Yu Wang to the School of Engineering and Applied Sciences

in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Computer Science

Harvard University
Cambridge, Massachusetts
September 2019

© 2019 Yu Wang. All rights reserved.

Thesis advisors: Professors David Brooks and Gu-Yeon Wei
Yu Wang

Performance Analysis for Machine Learning Applications

Abstract

Performance analysis has been driving the advancement of software and hardware systems for decades. Proper analysis can reveal system and architectural bottlenecks, provide essential information for choosing frameworks and platforms, and further lead to performance optimizations. Systematic performance analysis is difficult because it involves various problem domains, algorithms, software stacks, systems, and hardware. Recently, the surge of machine learning applications and domain-specific architectures has stimulated rapid evolution along all of those dimensions, making performance analysis increasingly challenging.

To tackle this problem, this dissertation conducts deep and systematic performance analysis for a variety of workloads, software frameworks, and hardware systems, and demonstrates proper ways to apply several performance analysis methodologies. First, we study the performance analysis methodology for general-purpose processors, a traditional and mature industry, based on CPU benchmarks, and demonstrate the importance of using representative benchmarks when drawing insights about hardware systems. Next, with the lessons learned from traditional methods, we rigorously study deep learning frameworks by applying the proper analysis methods in the corresponding scenarios. We extract the performance implications of key design features from those frameworks, and the insights are distilled into a set of simple guidelines for tuning framework features. The proposed guidelines nearly close the performance gap between the state of the art and the global optimum. Further, we propose a systematic methodology to facilitate performance analysis for rapidly evolving deep learning models and platforms. The proposed methodology can reveal deeper insights that are difficult to discover with traditional approaches. We demonstrate its utility with deep learning by comparing two generations of specialized hardware (Google's Tensor Processing Unit v2 and v3), three heterogeneous platforms (TPU, GPU, and CPU), and different versions of specialized software (TensorFlow and CUDA). Finally, since machine learning techniques advance rapidly and architects need to be aware of emerging applications, we take the first step towards analyzing Bayesian inference, an important branch of machine learning. Optimization mechanisms are proposed based on the analysis.

With the methodologies and analysis presented in this dissertation, we hope to encourage researchers and engineers to apply our methodologies to new platforms, software, and applications for systematic performance analysis. We envision this will help resolve existing performance bottlenecks and design better software and hardware for current and future applications.

Contents

1 Introduction and Background
  1.1 The Role of Performance Analysis
  1.2 Challenges of Analysis for Deep Learning
  1.3 Bayesian Inference
  1.4 Thesis Overview
  1.5 Thesis Contributions

2 General-Purpose Platforms and Workloads
  2.1 Introduction
  2.2 Intel Processor Statistics
  2.3 Methodology
  2.4 Case Studies Overview
  2.5 Prediction Accuracy
  2.6 SKU Ranking Comparison
  2.7 Related Work
  2.8 Summary

3 Parallelism in Deep Learning Frameworks
  3.1 Introduction
  3.2 Framework Design Overview
  3.3 Experimental Setup
  3.4 Scheduling Mechanism
  3.5 Operator Design
  3.6 Library Choice
  3.7 Beyond One Socket
  3.8 Framework Design Tuning
  3.9 Summary

4 A Systematic Methodology for Performance Analysis
  4.1 Introduction
  4.2 Methodology
  4.3 Hardware Platforms
  4.4 TPU Performance Implications
  4.5 Cross-Platform Comparison
  4.6 Software Stack Advances
  4.7 Limitations
  4.8 Related Work
  4.9 Summary

5 Demystifying Bayesian Inference Workloads
  5.1 Introduction
  5.2 Bayesian Inference
  5.3 BayesSuite: Bayesian Inference Workloads
  5.4 Performance Analysis
  5.5 Bottleneck Resolution
  5.6 Algorithm Convergence
  5.7 Implications for Future Acceleration
  5.8 Related Work
  5.9 Summary

6 Future Directions
  6.1 Deep Learning
  6.2 Bayesian Inference

Listing of figures

1.1 Increasing Bayesian inference publications in top machine learning conferences from 2007 to 2016.

2.1 The number of SKUs released by Intel for 17 microarchitectures between 1993 and 2016. (Data from http://ark.intel.com.)
2.2 Means and standard deviations of SPEC relative run times.
2.3 SKU breakdown in the SPEC (top) and Geekbench (bottom) datasets.
2.4 Means and standard deviations of SPEC workloads' relative run times on the SKUs in our dataset.
2.5 Principal component analysis of SPEC workloads' performance scaling on 639 configurations of 352 SKUs. The green stars are centroids of two clusters identified by k-means. The outlier near the top is 481.wrf.
2.6 Results of predicting the performance of SPEC (top) and Geekbench (bottom) on new SKUs.
2.7 The results of case 2. The workloads are sorted by the MAE after adding 50 SKUs.
2.8 Correlation of SPEC workloads' outlierness (distance) with MAE after adding 50 SKUs. The distance of a workload is its distance from the centroid of its cluster. Centroids are in Figure 2.5.
2.9 Case 3 is about cross-prediction. SPEC workloads are predicted with the Geekbench dataset (top). Geekbench workloads are predicted with the SPEC dataset (bottom). Cross-prediction has higher errors than self-prediction (case 2).
2.10 SKU ranking results of case 1. Our DNN model has the highest top-3 accuracy.
2.11 Based on SPEC (top) and Geekbench (bottom), comparison of the SKU rankings by self-prediction (case 2), cross-prediction (case 3), and the best baseline (Frequency).
2.12 Comparison of SKU rankings based on SPEC and Geekbench averages.

3.1 Time breakdown for Inception v3 training.
3.2 An overview of the framework design features studied in this chapter.
3.3 Examples of (a) synchronous scheduling, (b) asynchronous scheduling, and (c) using one and four pools, with the same total hardware resources.
3.4 (Bar chart) The speedup of using asynchronous scheduling over synchronous. (Table) The maximum computational graph width and best numbers of thread pools. Workloads with more branches benefit from more pools.
3.5 Asynchronous scheduling benefits large-batch inference and small-batch training.
3.6 (a) Inception v2 architecture contains modules with (b) four and (c) three independent branches. Area 1 exhibits inter- and intra-op parallelism and area 2 only intra-op.
3.7 Performance of Inception v2 with different numbers of inter-op pools and MKL threads per pool. The best configuration balances intra- and inter-op parallelism.
3.8 Execution time breakdown of four cases.
3.9 Execution traces of three cases in Figure 3.8. Color-coded areas 1 and 2 correspond to the operators in Figure 3.6.
3.10 Speedup from using 24 MKL threads instead of one. TensorFlow exhibits lower speedups than MKL.
3.11 Run-time breakdown for all CPU cores. Data preparation overhead causes the poor scalability in Figure 3.10.
3.12 Run-time breakdown of TensorFlow workloads with 1 (left bar) and 24 (right bar) intra-op threads. Both cases use 24 MKL threads on a 24-core CPU.
3.13 Run-time breakdown of all CPU cores. Intra-op threads parallelize the overhead in cores 24-47.
3.14 (a) The FLOPS of TensorFlow compiled with MKL. FLOPS increases faster with large input size. (b) Intel version of TensorFlow with MKL. Operators with large batch size get higher speedups.
3.15 Comparison of three matrix multiplication libraries. (a) Cycle breakdown (bottom axis) and IPC (top axis) for matrices of various sizes. MKL has the highest retiring ratio and IPC, because it is the least back-end bound. (b) MKL has the lowest LLC misses per kilo instructions (MPKI). (c) The memory bandwidth consumption. MKL's prefetching is the most effective: almost all memory traffic is prefetching.
3.16 FLOPS of MKL matrix multiplication by varying three dimensions of a matrix multiplication, m, k, and n. When m is mapped to the batch size dimension, (a) m=64 represents typical inference workloads and (b) m=1024 represents typical training workloads. T: Transformer; Y: YouTube recommendation model; F: Facebook recommendation model.
3.17 Thread pool overhead, measured as time to run 10k micro tasks. Folly outperforms std::thread and Eigen.
3.18 A two-socket platform speeds up ResNet-50 by 1.43×. The bottleneck is the UPI bandwidth, increasing TF data preparation time as part of the TF native operator time.
3.19 (a) Speedup of a two-socket platform over one socket. (b) Measured peak UPI bandwidth consumption on the two-socket platform is close to 100GB/s.
3.20 Run-time breakdown of all CPU cores. MatMul-512's parallelism cannot hide data preparation time, blunting its scalability. UPI saturation limits scalability of MatMul-4k and 8k.
3.21 Performance using the recommended TensorFlow settings75 (baseline), the Intel blog20, our work, and the global optimum obtained by exhaustive search. Our work outperforms the Intel and TensorFlow suggestions and nearly closes the gap between the state of the art and the global optimum.

4.1 The numbers of trainable parameters for all models.
4.2 (a)–(c) ParaDnn's FLOPS utilization and (d)–(f) its sensitivity to hyperparameters.
4.3 TPU rooflines for FCs and CNNs. (a) and (c): ParaDnn and real-world models. (b) and (d): their operation breakdown. The ridge point of the roofline has x = 75. The legends also show the total execution time percentages of corresponding operators in Transformer and ResNet-50.
4.4 Communication overhead in a multi-chip system is non-negligible, but is reduced with large batch sizes.
4.5 The FLOPS utilizations and data infeed percentages of ParaDnn (dots) and real-world (stars) CNNs.
4.6 Speedup of TPU v3 over v2 for (a) FC operations, (b) CNN operations, and (c) ParaDnn models. The red line (75 ops/byte) is the inflection point in the TPU v2 roofline (Figure 4.3).
4.7 Examples/second of ParaDnn FC models with fixed (64). Larger memory allows the CPU to run the largest model.
4.8 (a) Sensitivity analysis of TPU over GPU speedups. Speedups color-coded by (b) batch size and (c) FC nodes per layer.
4.9 (a) The sensitivity analysis of (b) GPU over CPU speedups for FCs.
4.10 The sensitivity analysis and speedups of TPU over GPU for (a)–(c) CNNs and (d)–(e) RNNs.
4.11 (Top) TPU over GPU speedups of all workloads. (Bottom) FLOPS utilization comparison for all platforms.
4.12 (Top) TPU performance with TensorFlow updates. (Bottom) GPU performance with CUDA and TF updates.

5.1 Runtime statistics of BayesSuite.
5.2 The IPC, LLC miss rates, and speedups of running the workloads on 1 to 4 cores of a Skylake processor.
5.3 LLC miss rate prediction. Points with labels suffixed -h and -q are for runs using half and a quarter of the original modeled data, respectively.
5.4 Performance comparison of the platforms.
5.5 The convergence of 12citi in log scale. The blue line is R̂, for detecting convergence. The green line is the KL divergence between the current result and the ground truth. The orange dots mark the convergence point.
5.6 Design space exploration examples.
5.7 A summary of energy savings of our design points and the energy oracle.
5.8 Overall speedup of the techniques proposed in this chapter. Note that the oracle points are with respect to energy, not performance.

6.1 The runtime characteristics of five LDA algorithms, using one and three sampling threads, and one updater thread.

List of Tables

1.1 Summary of workloads, software, and hardware studied in this thesis.
1.2 Summary of advantages and constraints of performance analysis methods.

2.1 Data in this chapter. The data include CPU specifications (from the Intel processor dataset) and dynamic workload features (from the SPEC and Geekbench datasets), and what we predict (run time).

3.1 Hardware platforms under study. An additional platform is large.2 that contains two sockets of large with a 120GB/s bi-directional UPI bandwidth at peak.
3.2 Average model width, i.e., the number of pools selected for Figure 3.21 based on our guidelines. Intra-op and MKL threads = total physical cores divided by those numbers.

4.1 The ranges of the hyperparameters and dataset variables (italic) chosen in this chapter.
4.2 Hardware platforms under study.

5.1 A summary of BayesSuite workloads.
5.2 A summary of experiment platforms.

Dedicated to my mentors and peers who have been working relentlessly to advance science one step at a time.

The unexamined life is not worth living.

Socrates

1 Introduction and Background

Performance analysis quantifies performance, profiles workloads, and reveals system bottlenecks, usually for two non-exclusive purposes: cross-comparing and improving existing platforms. Platform cross-comparison provides essential information for choosing proper platforms when the marketplace of computing hardware offers numerous choices for similar functionalities, such as general-purpose processors and specialized accelerators. Understanding architectural and system bottlenecks through analysis is necessary to improve current systems or design new ones.

1.1 The Role of Performance Analysis

Performance analysis is usually the first step towards performance optimization. Given a plethora of workloads running on a platform, performance analysis needs to be conducted to extract important computation kernels, analyze their computational characteristics, and find system or architectural bottlenecks, after which software or hardware optimizations for that platform can be developed.

Performance analysis has been driving general-purpose compiler and architecture advancements.

For example, through data dependence analysis of non-numerical workloads, researchers find that most communication is to update shared memory locations, and proactive communication reduces execution latency by decoupling communication from computation. Helix-RC37 is therefore designed to decouple communication from thread execution, by adding a lightweight architectural enhancement co-designed with a parallelizing compiler. It speeds up SPEC CINT2000 benchmarks by 6.85× on average. Performance analysis sometimes leads to software enhancements. Engineers in

Google analyze the temporal locality of memory references, and find large fractions of cold memory pages, i.e., pages not used for a given time interval. A software-defined far memory126 based on zswap* was then proposed, and it has delivered over 67% memory cost reduction since its deployment in Google's datacenters in 2016.

As the end of Dennard scaling and Moore's law has slowed the performance improvement of general-purpose microprocessors91, the design of domain-specific hardware is becoming more and more relevant. Google's Tensor Processing Unit (TPU) is a prominent example of domain-specific hardware111,54,200. Its development was motivated by the observation that, with conventional

CPUs, Google would have had to double their datacenter footprint to meet the internal demand for machine learning workloads. Since matrix multiplication is one of the most important kernels of such workloads, the core of the TPU, the matrix units (MXUs), is built for high-throughput matrix multiplication

* zswap: https://www.kernel.org/doc/Documentation/vm/zswap.txt

with systolic arrays124. Google has been using TPUs for their large-scale production systems, including Search, Translate, and Gmail. Compared to the acceleration of killer applications like machine learning, datacenter workloads are usually too diverse and need broad acceleration for low-level routines including remote procedure calls, data serialization, and memory allocation113. With such observations, Mallacc114, an in-core hardware accelerator, is designed for high-performance memory allocators, and it reduces malloc latency by up to 50%.

As the key part of performance analysis, benchmarks have been the driving force for compiler and architecture design for decades. For example, it has been over thirty years since the first version of

the SPEC CPU suite was released in 1988, and it is still widely used for measuring CPU performance. It is hard to make any progress without benchmarks or performance analysis. Benchmark suites have been developed for various platforms. Examples include SPEC92, Geekbench125, and

PARSEC33 for CPUs; Rodinia42, SHOC53, and Parboil179 for GPUs; MachSuite161 for accelerators;

Rosetta78 for FPGAs. Some benchmark suites are designed for certain applications, instead of for platforms. Domain-specific benchmark suites are especially popular for deep learning. Examples are CortexSuite186, TonicSuite84, Sirius85, Fathom15, DAWNBench49, and MLPerf151. Categorizing benchmark suites involves many dimensions: some suites mentioned above include microbenchmarks, and some include selected workloads that represent real-world applications.

Performance analysis is challenging; it is a multi-disciplinary field that involves various problem domains, algorithms, software stacks, systems, and hardware. Performance analysis using different types of benchmarks provides different advantages, while showing different constraints. As an important type of workload, deep learning receives a deep dive in this thesis; its performance analysis is especially difficult and should be conducted with caution.

1.2 Challenges of Analysis for Deep Learning

Performance analysis studies the interactions of workload characteristics, software implementations, and architecture of hardware platforms, and the problem becomes more challenging when the three dimensions evolve rapidly. That is the case for deep learning.

The state-of-the-art models for deep learning have been updated every few months. Taking vision models as an example, ever since AlexNet123 was published in 2012, new models have kept appearing, including CaffeNet106 (2014), VGG176 (2015), ResNet87 (2015), GoogleNet181 (2015), Inception182

(2016), DenseNet100 (2016), and ResNext208 (2016). Model evolution is fast for other model types as well, such as sequence-to-sequence translation193 and recommendation models79,65,145.

The training and inference of such models are extremely compute-intensive, and their performance is crucial, especially in production. As a result, a plethora of specialized platforms for deep learning have been developed, including Google's TPU111,54, Nvidia's GPUs with Tensor Cores, Facebook's Zion platform130, Intel's Nervana NNP-T211, and Cerebras' Wafer-Scale Engine40, to name just a few. There is an increasing trend in the number of AI hardware platforms announced at the Hot Chips conference over the past four years: the count grew from four (2016) to seven (2017), seven (2018), and nine

(2019). More and more AI hardware startups, as well as companies not initially known for hardware, have started to present their latest AI hardware. It is predicted that the deep learning hardware market will reach at least 60 billion dollars by 2025187,141.

Given the variety of workloads and hardware platforms, software stacks need to be flexible and efficient to take advantage of the performant hardware. Software and hardware co-design is the way to further scale up performance, which leads to specialized software. Such software includes frameworks and their components for specialized platforms, such as TensorFlow, Caffe2, and PyTorch, and libraries including MKL, MKL-DNN, cuDNN, and CUDA. Software stacks are updated frequently, enhanced with specialized optimizations for certain hardware platforms and models.

Faced with numerous choices of software and hardware for deep learning, the first and foremost question from users is which one to choose. Performance is one of the most important factors. On the other hand, software and hardware engineers would like to further optimize existing systems for current and emerging deep learning workloads. Taking a step back, architects need to decide how to design better software-hardware systems, given the lessons learned from existing designs. Proper performance analysis can help to find answers to those questions.

Beyond the expert knowledge already required in the multi-disciplinary field of performance analysis, deep learning poses extra challenges:

• Deep learning models are diverse, covering various problem domains, models, and datasets.

• The pace of this field is fast and the state-of-the-art models evolve every few months.

• Deep learning software and hardware systems are diverse and rapidly evolving as well.

• There are varying performance metrics, such as accuracy, time to train, and inference latency. Metrics need to be chosen properly.

Performance analysis needs to be conducted carefully to avoid bias and extract deep insights.

A proper performance analysis methodology should include a comprehensive set of benchmarks that represent state-of-the-art and emerging workloads and datasets. Researchers need a solid understanding of cross-stack workload resource requirements and bottlenecks. Meanwhile, researchers need to be aware of emerging workloads.

1.3 Bayesian Inference

The recent surge of machine learning has motivated computer architects to focus intently on accelerating related workloads, especially deep learning. However, Bayesian methods often work better than deep learning when data are unlabeled or limited, can leverage informative priors, and produce interpretable models. Bayesian inference has been growing increasingly popular over the years because it is a great complement to deep learning. Some recent revolutionary deep learning research adopts Bayesian techniques; notable examples include generative adversarial networks (GANs), Bayesian neural networks (BNNs), and variational autoencoders.

Figure 1.1: Increasing Bayesian inference publications in top machine learning conferences from 2007 to 2016.

Figure 1.1 shows the number of publications focusing on Bayesian inference in top machine learning conferences, including the conference on Knowledge Discovery and Data Mining (KDD), the International Conference on Machine Learning (ICML), and the conference on Neural Information Processing Systems

(NeurIPS). Publications about Bayesian inference increased steadily over the ten years between 2007 and 2016. While many architects focus on optimizing deep learning, we also need to be prepared for upcoming workloads, and Bayesian inference is a promising example.

1.4 Thesis Overview

The goal of this thesis is to systematically understand popular and emerging applications so that their performance can be optimized. The progress of this thesis is driven by two major factors: workloads and analysis methods.

Table 1.1: Summary of workloads, software, and hardware studied in this thesis.

Stack       Chapter 2     Chapter 3            Chapter 4          Chapter 5
WORKLOADS   Traditional   Deep Learning        Deep Learning      Bayesian
SOFTWARE    C/C++         TensorFlow, Caffe2   TensorFlow, CUDA   Stan
HARDWARE    CPU           CPU                  TPU, GPU, CPU      CPU

1.4.1 Workloads Overview

From the workload perspective, this thesis starts with general-purpose workloads, then thoroughly analyzes popular deep learning models, and finally takes the first step towards understanding the emerging workload of Bayesian inference. The workloads, software (frameworks), and hardware platforms of this thesis are summarized in Table 1.1.

We start by learning from the past experience of traditional workloads on traditional platforms

(Chapter 2), by analyzing general-purpose microprocessors' performance on CPU benchmark suites, SPEC92 and Geekbench125. Their performance repositories have existed for decades and provide abundant data to study. The data repositories represent the maximum information available from such benchmark suites after the industry has matured for years. Having been developed with similar purposes, MLPerf is likely, after a few years, to serve the community in a similar way as SPEC.

We then dive into currently popular deep learning workloads and their specialized software and hardware systems (Chapters 3 and 4). Deep learning has been widely adopted in products and is changing our lives by improving our daily commute, communication, shopping, and traveling experiences.

We explore a variety of analysis methods, software, and hardware, and propose optimization techniques to improve current deep learning systems.

Besides turning current deep learning workloads into better products, architects need to be prepared for the upcoming workloads, since machine learning techniques have been evolving so rapidly, and the industrialization experience of deep learning provides perfect lessons for emerging workloads. In Chapter 5, we take a step further to conduct a seminal analysis of Bayesian inference, one of the emerging applications.

7 Table 1.2: Summary of advantages and constraints of performance analysis methods.

Method                       Chapter    Advantage                            Constraint
Using Real Workloads         2,3,4,5    representative (if designed          long time to develop and update,
                                        properly), decades of experience     may not represent some users' interests,
                                                                             only represent current workloads
μbenchmarking                2,3        interpretable, stress limit          not representative of real workloads
Parameterized Benchmarking   4          interpretable, stress limit,         not designed to be representative
                                        explore design space                 of users' interests

1.4.2 Performance Analysis Overview

Table 1.2 summarizes analysis methods. Benchmarking with real workloads and microbenchmarking are two common methods. This thesis (Chapter 4) proposes a generic methodology, parameterized benchmarking, to complement traditional approaches. The three methods are complementary and should be used in different scenarios.

Performance analysis with realistic workloads has been the most widely used approach for decades, and is presented in all chapters of this thesis. Notable examples are SPEC92, PARSEC33, MLPerf151, and

the BayesSuite201 proposed in this thesis. Our community has decades of experience benchmarking with such suites, and the procedures have been largely standardized. However, the usability of such benchmark suites largely depends on whether the workloads are representative. Two scenarios can make such suites unrepresentative: being obsolete or not covering certain users' interests. Thus new realistic workloads need to be added to those suites regularly, which is time-consuming. It is especially challenging for rapidly evolving machine learning workloads. For example, the development of MLPerf requires regular meetings between tens of companies and universities, and it took over six months to upgrade MLPerf from version 0.5 to 0.6.

Microbenchmarks are used to stress-test system bottlenecks. Geekbench125 is a widely used

microbenchmark suite. Each microbenchmark is designed to test one or several system components, such as floating-point capability, memory latency, or memory bandwidth. Microbenchmarks are usually computation kernels, such as matrix multiplication, image decoding, or memory streaming; thus microbenchmarking results are usually easier to interpret than those of real workloads. Microbenchmarks are also relatively stable, so they can be updated much less frequently. The major limitation is that it is hard to extrapolate or simulate real workloads' behaviors from microbenchmarks, which is discussed in detail in Chapter 2.

The parameterized analysis method proposed in Chapter 4 combines the advantages of the above approaches, with the goal of providing large "end-to-end" workloads covering current and future applications, and parameterizing the workloads to stress systems and explore a much larger design space. The performance of such benchmarks is easy to interpret because of the parameterized high-level workload attributes. Several data analysis and visualization techniques are coupled with those benchmarks to interpret their performance. Since they are designed to explore a much larger design space than that of real workloads, including current and possible future workloads, they do not represent specific workloads of interest to certain users. This is both a feature and a constraint of the method, because one thing cannot be both general and specific at the same time.

1.5 Thesis Contributions

1.5.1 General-purpose workloads and platforms (Chapter 2)

The marketplace for general-purpose microprocessors offers hundreds of functionally similar models, differing by traits like frequency, core count, cache size, memory bandwidth, and power consumption. Many benchmark suites have been developed to measure processor performance, and their results for large collections of CPUs are often publicly available, such as for the SPEC CPU2006 and

Geekbench 3 benchmark suites. In this chapter, we study those CPU performance repositories.

Specifically, we study how useful those repositories are when consumers need performance data for new processors or new workloads. We have developed a deep neural network model, which can generate useful predictions for new processors and new workloads. Additionally, we observe that although benchmark suites are designed to cover a broad spectrum of workload types, they are self-similar, i.e., workloads in one suite are more similar to each other than to workloads in other suites. Due to this self-similarity, different benchmark suites do not give consistent performance rankings for processors, and therefore their aggregate scores can be misleading. Users should be cautious when basing purchasing decisions exclusively on Geekbench 3, or evaluating research using

the SPEC CPU suite alone, if the workloads of interest are very different from the benchmark suites in use.

This work reveals fundamental limitations of traditional benchmark suites, and future benchmark suites should avoid such pitfalls by regularly adding up-to-date workloads and removing obsolete ones.

1.5.2 The Design of Deep Learning Frameworks (Chapter 3)

This thesis then shifts its focus to deep learning, which has revolutionized many application domains and exhibits enormous opportunities for performance analysis and optimization. State-of-the-art deep learning frameworks, such as TensorFlow14, PyTorch118, and MXNet44, have eased the programmability burden on machine learning developers. They support a wide variety of design features and enable a flexible machine learning programming interface, and therefore frameworks are the first choice for most developers to program deep learning models. Identifying and using a performance-optimal setting in feature-rich frameworks, however, involves a non-trivial amount of performance characterization and domain-specific knowledge. This thesis takes a deep dive into analyzing the performance impact of key design features and the role of parallelism. The observations and insights are distilled into a simple set of guidelines that one can use to achieve much higher training and inference speedups. The guidelines extracted from analyzing a set of microbenchmarks and

real-world deep learning models are applied to a set of different and later models. The evaluation results show that our proposed performance tuning guidelines outperform both the Intel- and TensorFlow-recommended settings and nearly close the gap between the state of the art and the global optimum. This work shows that performance analysis and optimization can be conducted with carefully selected real workloads and microbenchmarks, and the insights can be extrapolated to a different set of workloads.

1.5.3 Parameterized Performance Analysis (Chapter 4)

To better facilitate performance analysis for rapidly evolving deep learning models and platforms, this thesis takes a step further to propose a systematic performance analysis methodology and demonstrates its utility in deep learning. The methodology comprises a set of analysis techniques and parameterized workloads. As a special case of this methodology, we develop a tool for deep learning, ParaDnn, that generates end-to-end models for fully connected, convolutional, and recurrent neural networks. This methodology can be applied to analyze various hardware and software systems, and is a very good complement to traditional methods. This chapter demonstrates its utility by analyzing two generations of specialized platforms (Google's Cloud TPU v2/v3), three heterogeneous platforms (TPU, GPU, and CPU), and specialized software stacks (TensorFlow and

CUDA). This systematic methodology explores a much larger design space of deep learning models, stresses system limits, covers state-of-the-art models, and reveals deep insights that are impossible to obtain with traditional analysis approaches. One goal of this thesis is to encourage researchers to adopt this methodology to analyze other platforms.

1.5.4 Emerging Workloads: Bayesian inference (Chapter 5)

This thesis then takes the first step towards characterizing Bayesian inference workloads, an important branch of machine learning. We facilitate the study of Bayesian inference with the development of BayesSuite, a collection of seminal Bayesian inference workloads. We characterize the power and performance profiles of BayesSuite across a variety of current-generation processors and find significant diversity. Some workloads are bottlenecked by the last-level cache (LLC), and some workloads contain large fractions of computation that do not improve the inference quality. Based on the analysis, we then introduce a scheduling and optimization mechanism and a computation elision technique. Our proposed techniques are able to increase Bayesian inference performance by 5.8× on average. With this work, we would like to encourage the analysis and optimization of emerging workloads in our community, and Bayesian techniques are promising to drive the AI revolution.

2 General-Purpose Platforms and Workloads

To compare processor performance, the results of many benchmark suites for large collections of

CPUs are publicly available. However, repositories of benchmark results are not always helpful when consumers need performance data for new processors or new workloads. To address these problems, this chapter presents a deep neural network (DNN) model, and uses it to learn the performance of Intel CPUs running the SPEC CPU2006 and Geekbench 3 benchmark suites. This chapter also quantifies the self-similarity of these suites for the first time in the literature.

2.1 Introduction

Computer hardware evolves rapidly. In 2015 alone, Intel released over 200 CPU SKUs.* The SKUs have different characteristics like frequency, cache size, memory bandwidth, and core count. Better configurations usually cost more. The performance of a CPU is not solely a function of its microarchitecture; it depends critically on the nature of the workloads running on it. Thus both

CPU microarchitecture and workloads need to be taken into account when quantifying actual CPU performance. Consumers can then estimate whether investing in a costly configuration helps to meet their particular goals.

To help study processor performance, a variety of benchmark suites have been developed. For the more popular suites, such as Geekbench, SPEC CPU, and Passmark, sizable repositories of performance data have been collected, allowing the public to compare the behaviors of known configurations on standard workloads. However, there are some issues in using these repositories. The first is that consumers need to wait for comprehensive benchmark results of new processors to be contributed to the public repositories. The second issue is that for new workloads, consumers have to test a large number of possible configurations to find the most effective hardware. Daunted by this challenge, some consumers rely blindly on aggregate CPU scores in benchmark repositories.

To resolve these issues, we use statistical methods and machine learning techniques to analyze data from the SPEC CPU2006 and Geekbench 3 repositories.† SPEC CPU2006 is widely reported in academic studies. Geekbench 3 is used by many consumers to make purchasing decisions. We use deep neural networks (DNN) to predict the performance of Intel CPUs, and we compare the

DNN prediction with that of linear regression (LR). The DNN predictive model learns interactions between processor specifications for different workloads, sharpening the usefulness of benchmark

*Stock Keeping Unit. In this chapter, a SKU is a CPU with a distinct processor number. †The datasets in this chapter are available at https://github.com/Emma926/cpu_selection.git.

data repositories.

This chapter makes the following contributions:

• We use DNN to predict the performance of SPEC and Geekbench workloads on new CPU SKUs. (The mean error is 5% for SPEC and 11% for Geekbench.) We show that DNN is more accurate than traditional LR.

• With the same accuracy, we show that the performance of new workloads on just 10 SKUs can predict their performance on other SKUs.

• For the first time in the literature, we quantify the self-similarity of these widely used suites, by using DNN to cross-predict the results between the suites (with 25.9% and 14.9% mean error when predicting SPEC and Geekbench, respectively), and by comparing their CPU rankings (which are inconsistent, with footrule distance as high as 0.59).

2.2 Intel Processor Statistics

In this section, we present some statistical facts about Intel processors, SPEC performance for the

Intel SKUs, and the SKUs’ microarchitectures and types in our SPEC and Geekbench datasets. We show the large performance variations and a rich variety of SKUs in our datasets.

We construct our Intel processor specification dataset from http://ark.intel.com, where Intel publishes the specifications of all its processors. We collect specifications including microarchitecture, launch year, number of cores, base frequency, cache size, power, and memory type. Figure 2.1 shows the release years of Intel microarchitectures. The size of a bubble corresponds to the number of SKUs released in the corresponding year. The figure contains 17 microarchitectures that Intel has released since 1993.

Intel has a "Tick-Tock" model for microarchitecture code names. A tick represents a shrinking of the process technology and a tock represents a new microarchitecture. In Figure 2.1, for example, Westmere and Sandy Bridge have 32nm feature size, and Sandy Bridge and Ivy Bridge have the same microarchitecture. Every year, there is expected to be a tick or a tock. Starting in 2014, however, Intel decided to add a "refresh" cycle, to update the current microarchitecture. That is why for 2014 in Figure

2.1 there are only a few Broadwell SKUs but more Haswell SKUs. After 2016, Intel switched their model to a three element cycle known as “Process-Architecture-Optimization”.

In this chapter, we use relative run time as a measure of performance. We take a Sandy Bridge processor (E3-1230) as the reference machine. We choose E3-1230 because it is common across datasets.

The relative runtime of a workload is defined as

\[ \text{relative runtime} = \frac{\text{runtime} - \text{runtime}_{\text{ref}}}{\text{runtime}_{\text{ref}}} \tag{2.1} \]

where runtime_ref is the run time of the corresponding workload on the reference machine. For

Figure 2.1: The number of SKUs released by Intel for 17 microarchitectures between 1993 and 2016. (Data from http://ark.intel.com.)

brevity, this chapter will consistently use “performance” to indicate relative run time. Thus the scaled runtime of E3-1230 is 0. In Figure 2.2, the E3-1230 performance is slightly below the average of all Sandy Bridge SKUs, so the line is centered on a value greater than 0. Since all results are pre- sented relative to the same reference processor, the results are unaffected by the particular choice of the reference.

To show the diverse performance of different SKUs, we take the average SPEC performance on

SKUs, and we gather the SKUs based on their microarchitecture code names. In Figure 2.2, the dots


Figure 2.2: Means and standard deviations of SPEC relative run times.

show the mean relative run time of the SKUs with the same microarchitecture code name, and the lengths of the bars show the standard deviations. Longer bars indicate that the performance of the SKUs is more diverse. The SKUs have the same microarchitecture but different frequency, cache size, memory size, and so on. These lead to variations in each bar, and they contribute much to the performance variation within the microarchitectures themselves. The bars overlap horizontally. That means that when selecting SKUs to optimize performance, there can be many choices with different microarchitectures but similar performance.

The SKUs in the public repositories are random, depending on users’ submissions. SPEC and

Geekbench have their own setup and compilation instructions. The setup of each data point is included in the repositories. We summarize the SKUs in the SPEC and Geekbench datasets in Fig- ure 2.3. Our work focuses on performance prediction on the SKUs currently on the market, thus we discard SKUs older than Sandy Bridge. We first do preprocessing to aggregate duplicate data.

Before explaining preprocessing, we first define a CPU configuration as a SKU running at a certain frequency with a certain memory size. In this chapter, we say that a configuration (SKU, frequency, memory size) determines a workload's performance. Other factors such as compiler and OS affect performance as well, but we do not consider them in this chapter. In the repositories, multiple result submissions may have the same configuration. We average the data points that have the same workload and configuration.

Figure 2.3 shows the microarchitectures and types of the SKUs in our SPEC and Geekbench datasets. There are 352 SKUs in total for SPEC, and 119 SKUs for Geekbench. In both datasets, we have more SKUs of Sandy Bridge, Ivy Bridge, and Haswell than Broadwell and Skylake. Figure 2.3 shows that we have a fairly rich collection of SKUs.

Users can customize the memory size with a chosen SKU. We refer to a SKU with a certain memory size as a configuration. In total, we have 639 configurations of 352 SKUs for the SPEC workloads.

To help visualize the performance of SPEC workloads on 639 configurations, we first show, in Figure 2.4, the means and standard deviations of the relative run times of the workloads. The blue bars are SPECfp; the red bars, SPECint. SPECint workloads are very similar to each other, except for 462.libquantum. SPECfp workloads have more diverse means and standard deviations. That indicates FP workloads are more diverse in terms of performance on the 639 configurations. The performance of 462.libquantum is more similar to SPECfp. If clustering the workloads, we would like to see that they form two clusters, with SPECint and some SPECfp workloads in one cluster and the other FP workloads, which have quite different means and standard deviations in Figure 2.4, in the other cluster.

Assuming the performance of the 639 CPU configurations describes unique characteristics of a workload, we can use a vector consisting of the performance of all configurations as the fingerprint of each workload. Since it is a space of 639 dimensions, we do principal component analysis (PCA) to visualize the fingerprints of SPEC workloads in a two-dimensional space. We first construct a matrix of size [28 × 639], where each row is the fingerprint of one SPEC workload and whose length equals the number of CPU configurations. We then do PCA to reduce the number of dimensions to two and visualize the space. Figure 2.5 shows the PCA space. We only show two principal components (PCs) because the first and second PCs account for 82% of the variance.

The closer two workloads are, the more similar their performance scaling across the 639 configurations. Blue dots are SPECfp and red are SPECint. SPECfp is more spread out than SPECint. That means SPECfp workloads are more sensitive to SKU configuration differences, such as microarchitecture, frequency, and memory size. It is also an indication that it is harder for current microarchitectures to extract performance from SPECint than from SPECfp. SPECint workloads have less instruction-level parallelism and more data dependencies. This is also observed by Campanoni et al.37.

The points at the bottom of Figure 2.5 form two clusters, and the point at the top (481.wrf) is clearly an outlier. We want to find the centroids of the two clusters while leaving 481.wrf alone. Running a k-means algorithm with three clusters leaves 481.wrf in a cluster by itself. We plot the two centroids of the other two clusters using stars. 462.libquantum (the red dot at the bottom left) is clustered with SPECfp. The centroids will be used to explain our results in Section 2.5.2.
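As a concrete illustration, the pipeline sketched below follows the steps described above (PCA on a [28 × 639] fingerprint matrix, then k-means with three clusters). The variable names and random placeholder data are assumptions for illustration, not the original analysis scripts.

```python
# Minimal sketch of the fingerprint analysis described above.
# `perf` stands in for the [28 x 639] matrix of relative run times
# (one row per SPEC workload, one column per CPU configuration).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
perf = rng.standard_normal((28, 639))        # placeholder data

pca = PCA(n_components=2)
coords = pca.fit_transform(perf)             # 2-D fingerprint per workload
print(pca.explained_variance_ratio_.sum())   # fraction of variance kept by two PCs

# Three clusters so that an outlier such as 481.wrf can sit in a cluster by itself;
# the centroids of the remaining two clusters are the stars in Figure 2.5.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)
centroids = kmeans.cluster_centers_
```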

Although Figures 2.4 and 2.5 lead to consistent conclusions, they characterize the performance space differently. Figure 2.4 focuses on the means and standard deviations of the performance data, while PCA in Figure 2.5 emphasizes different performance on different SKU configurations. This can be explained with an example using two workloads (A, B) and three SKU configurations. Workload A's relative run times are 0.5, 0.7, -0.4, respectively. On the same SKU configurations, workload

B's relative run times are 0.7, -0.4, and 0.5. The two workloads could look exactly the same in Figure 2.4, but PCA would show their difference.

Figure 2.3: SKU breakdown in the SPEC (top) and Geekbench (bottom) datasets.


Figure 2.4: Means and standard deviations of SPEC workloads' relative run times on the SKUs in our dataset.

Figure 2.5: Principal component analysis of SPEC workloads' performance scaling on 639 configurations of 352 SKUs. The green stars are centroids of two clusters identified by k-means. The outlier near the top is 481.wrf.

2.3 Methodology

The previous section shows a large variation in performance. Thus a predictive model can be helpful.

This section shows the method we use to build a predictive model. We present our features, data preprocessing, model selection, and metrics (baselines) for evaluating the results.

A naive way to make a prediction is to use a previous generation to predict its successor. To do this, one needs the exact configuration of interest from the previous generation, because sometimes other features are more important than microarchitecture generations, as shown by the bars in Figure 2.2. For the same reason, using other SKUs in one generation to predict a new SKU in the same generation is difficult. We have tried that, and our current predictive models achieve better accuracy.

We eventually decide to use machine learning to learn the relations between CPU specifications and performance, to avoid the missing data problem of the naive methods.

2.3.1 Features

Table 2.1 summarizes the data in this chapter. The data include features fed to the predictive model, and the value we predict, the run time. The features include CPU specifications from the Intel CPU dataset and dynamic workload features from the SPEC and Geekbench datasets.

CPU SPECIFICATIONS: The microarchitecture code names include microarchitecture changes and technology scaling. Within one microarchitecture generation, the SKUs are different in their frequencies, last level cache sizes, TDP, and number of cores. We also add the release year as a feature.

The type of the SKU is also an important feature. Intel categorizes the SKUs as mobile, desktop, server, and embedded. L2 and L1 caches per core do not change much across SKUs. Thus we only include last level cache sizes.

For discrete data, we do integer encoding and one hot encoding. Integer encoding means mapping discrete values to integers. Each feature needs only one entry. Discrete names and types usually

23 obey the order of the assigned integers. One hot encoding uses one feature entry for one discrete value. It implies nothing more than that the discrete values are different. The encoding methods are specified in Table 2.1. We choose to map the microarchitecture code names, memory types, and instruction set extensions into consecutive integers, with their natural order. For the SKU types, it is easier to do one hot encoding. We also distinguish the workloads using one hot encoding. Before training and testing, numerical and integer data need to be standardized, but one hot encoding data do not.
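The snippet below is a small illustration of the two encoding schemes and the standardization step; the column names and the category ordering are hypothetical, chosen only to mirror the features in Table 2.1.

```python
# Illustrative encoding of a few Table 2.1-style features (assumed column names).
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "uarch":    ["Sandy Bridge", "Ivy Bridge", "Haswell"],   # ordered -> integer encode
    "sku_type": ["server", "desktop", "mobile"],             # unordered -> one hot encode
    "l3_mb":    [20.0, 25.0, 35.0],                          # numerical
})

# Integer encoding preserves the natural order of microarchitecture code names.
df["uarch"] = df["uarch"].map({"Sandy Bridge": 0, "Ivy Bridge": 1, "Haswell": 2})

# One hot encoding implies nothing beyond "these values are different".
df = pd.get_dummies(df, columns=["sku_type"])

# Numerical and integer-encoded columns are standardized; one hot columns are not.
df[["uarch", "l3_mb"]] = StandardScaler().fit_transform(df[["uarch", "l3_mb"]])
```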

DYNAMIC WORKLOAD FEATURES: Every entry in the SPEC and Geekbench datasets contains a configuration (SKU, current frequency, memory size), the name of the workload, and performance (relative run time, which is the value to predict). We then get the CPU specifications from the

Intel processor dataset by the SKU processor number. We do not use the scores provided in SPEC and Geekbench as performance metrics. For SPEC, we use the run time in seconds. For Geekbench, we assume that the total amount of work is constant for a given workload and we use the inverse of throughput to estimate run time. Details on the data preprocessing are in Section 2.3.2.

2.3.2 Data Preprocessing

The SPEC2006 repository provides run time in seconds. Geekbench 3 provides throughput, e.g., as GB/second. We first average the performance of the workloads with the same (CPU SKU, Frequency, Memory Size) configuration. Variations of system software such as OS or compiler are not considered in this chapter. Geekbench 3 run time is taken to be the inverse of throughput, assuming that one workload has a constant total amount of work. The average run time is then converted into relative run time for the chosen reference machine based on Equation 2.1. The specific choice of the reference machine does not affect the prediction. The relative performance is normalized to have mean 0 and standard deviation 1. The preprocessing steps are typical of machine learning experiments143.
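The following sketch strings these preprocessing steps together; the column names, the handling of the reference SKU, and the pandas layout are assumptions for illustration rather than the original pipeline.

```python
# Sketch of the preprocessing described above (assumed column names).
import pandas as pd

def preprocess(df: pd.DataFrame, ref_sku: str = "E3-1230") -> pd.DataFrame:
    # Geekbench reports throughput; run time is taken as its inverse,
    # assuming a constant total amount of work per workload.
    if "throughput" in df.columns:
        df = df.assign(runtime=1.0 / df["throughput"])

    # Average submissions that share the same (workload, SKU, frequency, memory size).
    keys = ["workload", "sku", "frequency", "memory_size"]
    df = df.groupby(keys, as_index=False)["runtime"].mean()

    # Relative run time with respect to the reference machine (Equation 2.1).
    ref = df[df["sku"] == ref_sku].groupby("workload")["runtime"].mean()
    df["rel_runtime"] = (df["runtime"] - df["workload"].map(ref)) / df["workload"].map(ref)

    # Normalize to mean 0 and standard deviation 1.
    df["rel_runtime"] = (df["rel_runtime"] - df["rel_runtime"].mean()) / df["rel_runtime"].std()
    return df
```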

24 2.3.3 Model Selection

In this section, we explain our choice of deep neural networks (DNN) and linear regression (LR) for this chapter. We then illustrate our approach for tuning DNN topology and the hyperparameters of

DNN and LR.

DEEP NEURAL NETWORKS: Deep neural networks (DNN) have proven successful in many domains, from regression and classification to game playing. Compared with traditional methods,

DNN excels when data size is large, scales well with more data, and it does not require expertise in feature engineering. The advantages of DNN make it a very good fit for the performance prediction scenario. Thus in this chapter, we use DNN as our predictive model.

The topology and hyperparameters are tuned through model selection. To do model selection, we randomly split the dataset into a training set, a validation set, and a testing set. The testing set is the final held-out set, to avoid using all data to do model selection. We start by sweeping the hyperparameters in all experiments, including the number of layers, the nodes per layer, the learning rate, the number of training epochs, the activation function, and the optimizer. The models are trained with the training set. The model with the lowest loss on the validation set is selected.

We find that the best set of hyperparameters is always the same. Therefore, in all experiments we use a DNN with 3 hidden layers and 100 nodes per layer. The learning rate is 0.001 over 300 training epochs. The activation function is tanh and the optimizer is RMSprop. Thus, for example, the number of weights for predicting SPEC is 50 million. Fortunately, training is offline and has a one-time cost.
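A minimal Keras sketch of this selected architecture is shown below (3 hidden layers of 100 tanh units, RMSprop with a 0.001 learning rate, and the L1 regularization of 0.01 mentioned in the next subsection). Treating MAE as the training loss and the input width are assumptions on our part, not details stated by the dissertation.

```python
# Sketch of the selected DNN in Keras (hyperparameters as stated above;
# the MAE training loss and input width are assumptions).
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_dnn(n_features: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(100, activation="tanh", kernel_regularizer=regularizers.l1(0.01)),
        layers.Dense(100, activation="tanh", kernel_regularizer=regularizers.l1(0.01)),
        layers.Dense(100, activation="tanh", kernel_regularizer=regularizers.l1(0.01)),
        layers.Dense(1),   # predicted (normalized) relative run time
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
                  loss="mean_absolute_error")
    return model

# model = build_dnn(n_features=X_train.shape[1])
# model.fit(X_train, y_train, epochs=300, validation_data=(X_val, y_val))
```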

LINEAR REGRESSION: LR has been shown to work well in the CPU performance prediction literature129, while DNN is known to be able to explore non-linear interactions between features, and it has been shown to be more accurate than LR at performance prediction102,103. Unlike DNNs, LR has interpretable weight parameters. In realistic scenarios, people can choose LR or DNN based on

the parameters, accuracy, and the value of interpretable weights. In order to make it a fair comparison, we reimplement LR with MAE as the loss function. That is, the model is optimized to yield the lowest MAE. Minimizing the MAE entails assuming the noise distribution to be Laplacian48. In our experiments, both DNN and LR have L1 regularization with a weight of 0.01. They take exactly the same training sets, testing sets, and features.
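Re-implemented the same way, the LR baseline reduces to a single linear layer trained with the MAE loss and L1 weight of 0.01 described above. This is only a sketch; the choice of optimizer is an assumption.

```python
# Sketch of the LR baseline: one linear layer, MAE loss, L1 regularization of 0.01.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_lr(n_features: int) -> keras.Model:
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(1, kernel_regularizer=regularizers.l1(0.01)),  # interpretable weights
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
                  loss="mean_absolute_error")
    return model
```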

OTHER MODELS: Popular predictive models other than DNN include linear regression (LR), support vector machines (SVM), k nearest neighbors (kNN), principal component analysis (PCA), and genetic algorithms (GA). In Section 2.7, we list studies using some of those models. However

SVM is not suitable for large-scale training because of its computational complexity199. kNN, PCA, and GA are better when there are fewer data points (< 100) and simpler feature relations. Besides, traditional methods need carefully selected features, and accuracy does not benefit from more training data. Therefore we choose to use DNN and compare it with one traditional method, LR.

2.3.4 Metrics

MEAN ABSOLUTE ERROR: We use mean absolute error (MAE) as the measure of accuracy, rather than mean squared error (MSE). The reasons are as follows.

1. There may be outliers and errors in the submitted data. With MSE, the optimizer penalizes the outliers much more than normal data. However, we do not want to overfit the model to the outliers143.

2. If a prediction is x and the MAE of the model is 0.01, the real value of x is very likely to be x ± 0.01. MSE is not as easy to interpret.

Because the MAE is in a standardized space, we use two baselines to help understand it. We assume two independent and identically distributed random variables, a and b, from a Gaussian distribution N(0, 1). The expected distance between a and b is

\[ \frac{2}{\sqrt{\pi}}, \tag{2.2} \]

which is approximately 1.13. Therefore a prediction with MAE of 0.25 is a very good prediction.
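For completeness, a one-line derivation of Equation 2.2 (standard Gaussian algebra, added here as a sketch):

```latex
% a - b is Gaussian with mean 0 and variance 1 + 1 = 2; for Z ~ N(0, sigma^2),
% the mean absolute value is E|Z| = sigma * sqrt(2/pi).
\[
  a - b \sim \mathcal{N}(0,\, 2), \qquad
  \mathbb{E}\,|a - b| \;=\; \sqrt{2}\,\sqrt{\tfrac{2}{\pi}} \;=\; \tfrac{2}{\sqrt{\pi}} \;\approx\; 1.128 .
\]
```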

Another way to interpret the MAE is to compare it with the standard deviation of the testing set. A naive way to do prediction is to always predict the mean of the dataset. This way, by definition the MAE is the standard deviation

\[ \mathrm{MAE} = \frac{\sum_{i=0}^{n} |x_i - m|}{n} = \text{Standard Deviation} \tag{2.3} \]

where n is the number of data points, m is the mean of the testing set, and x_i is the real value of predicted performance, after normalization. Note that while the whole dataset is normalized to have mean 0 and standard deviation 1, the testing sets may be in different scales. In the captions of Figures 2.6 and 2.7, we compare the MAEs with the standard deviations of the testing sets, and show that the MAEs are much smaller. However, what we just discussed are naive baselines. The next metric can show that our prediction is good enough in realistic scenarios.

TOP-K ACCURACY: To bring the MAE metric into realistic context, we use the predictive model to help select SKUs. We predict performance and use that prediction to rank the SKUs.

We choose top-k accuracy, which measures whether the known best SKU is among the top k predicted SKUs. It is a popular metric for image classification tasks87,123. Other metrics for comparing rankings include Pearson correlation, Spearman footrule distance, and Kendall rank correlation. They measure the difference between the tested ranking and the expected ranking by considering every item in each ranking. Top-k accuracy is a better metric for this chapter, because users care more about whether the best SKU is in the top k choices than about how similar one SKU ranking is to another. In this context, top-k accuracy is more interpretable than the ranking correlations.

We construct three baselines to compare with our predictive model. Without the predictive model, it is customary to compare microarchitecture code names (newer is better), frequencies (higher is better), and cache sizes (larger is better). Thus we construct the baselines taking those factors into consideration. Though the baselines are simple, this is how most customers make purchasing decisions.

1. Frequency: Rank the SKUs based on frequencies. If frequencies are the same, rank them with cache sizes.

2. Cache: Rank the SKUs based on cache sizes. If cache sizes are the same, rank them with frequencies.

3. Frequency + Cache: Rank the SKUs based on a weighted summation of frequency and cache size. The weights are discussed in Section 2.5.1. If the values are the same, rank them with microarchitecture code names.

To measure the top-k accuracy, we shuffle our training or testing set, find the best SKU in the testing set using real performance from the datasets, and measure how often the best SKU is one of the top k SKUs as ranked by our predictive model and the baselines.
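A minimal sketch of this measurement is shown below; the arrays are hypothetical per-SKU scores (higher is better, e.g., inverse run time), and the noisy predictions stand in for model output.

```python
import numpy as np


def top_k_accuracy(pred_perf, real_perf, k=3):
    # Returns 1 if the SKU that is truly best (by real performance) appears
    # among the top-k SKUs as ranked by the predicted performance.
    best_sku = int(np.argmax(real_perf))
    top_k_pred = np.argsort(pred_perf)[::-1][:k]
    return int(best_sku in top_k_pred)


# Hypothetical example: 10 SKUs, predictions perturbed by noise.
real = np.array([1.0, 3.2, 2.1, 0.9, 2.8, 1.5, 2.2, 3.0, 1.1, 2.5])
pred = real + np.random.default_rng(0).normal(0, 0.3, size=real.size)
print(top_k_accuracy(pred, real, k=3))
```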

Table 2.1: Data in this chapter. The data include CPU specifications (from the Intel processor dataset) and dynamic workload features (from the SPEC and Geekbench datasets), and what we predict (run time).

Data | Description | Data Type
Workload | The names of the workloads | One-hot encode
Microarchitecture Code Name | Intel code names | Integer encode
Type | Server, desktop, mobile, or embedded | One-hot encode
L3 Cache Size | Last-level cache size | Numerical
Instruction Set Extensions | SSE or AVX | Integer encode
Memory Type | DDR, DDR2, DDR3, DDR4 | Integer encode
Memory Channels | Number of memory channels | Numerical
Max Memory Bandwidth | Max memory bandwidth | Numerical
ECC Memory Support | Whether ECC memory is supported | Binary
Base Frequency | Nominal frequency | Numerical
Turbo Frequency | Turbo frequency | Numerical
Turbo Boost Technology | Whether Turbo Boost is supported | Binary
Cores | Number of cores | Numerical
Threads | Number of threads | Numerical
Hyper-Threading | Whether hyper-threading is supported | Binary
TDP | Thermal design power | Numerical
Year | Year of release | Numerical
Frequency | The dynamic frequency | Numerical
Memory Size | The size of off-chip memory | Numerical
Run Time | The run times of the workloads | Numerical
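A sketch of how the encodings in Table 2.1 could be applied with pandas is shown below; the CSV path, column names, and "Yes"/"No" values are illustrative placeholders, not the actual dataset schema.

```python
import pandas as pd

df = pd.read_csv("cpu_workload_data.csv")          # hypothetical merged dataset

# One-hot encode categorical columns such as the workload name and SKU type.
df = pd.get_dummies(df, columns=["Workload", "Type"])

# Integer encode ordered categories: code name, ISA extensions, memory type.
for col in ["CodeName", "ISAExtensions", "MemoryType"]:
    df[col] = df[col].astype("category").cat.codes

# Binary columns (ECC, Turbo Boost, Hyper-Threading) mapped to {0, 1}.
for col in ["ECC", "TurboBoost", "HyperThreading"]:
    df[col] = (df[col] == "Yes").astype(int)

# Numerical columns are standardized so targets live in a normalized space.
num_cols = ["L3Cache", "BaseFreq", "TurboFreq", "Cores", "Threads", "TDP", "RunTime"]
df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
```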

2.4 Case Studies Overview

In this section, we sketch the case studies, including prediction for new SKUs and for new workloads, and cross-prediction between suites. We implement our models with Keras (https://keras.io).

2.4.1 Case 1: Prediction for New SKUs

In this case, we show that we can predict the performance of SPEC and Geekbench workloads on new SKUs. Using our open-source model, trained with SPEC and Geekbench data for existing

SKUs from online repositories, users can predict Geekbench and SPEC performance on new SKUs of interest and choose the ones likely to perform the best.

In order to test the model on new SKUs, we adopt repeated random sub-sampling validation, a widely-used cross-validation method62. We hold out several SKUs from the dataset, train the model with the remaining data, and test the model with the hold out set.

We randomly choose 10 percent of the SKUs from one microarchitecture at a time as our testing set. We use the rest of the data as our training set. We repeat this process 20 times for every microarchitecture. We choose 20 repetitions after trying 50 and finding that the results were statistically consistent. Comparing the MAEs of 20 DNN repetitions versus 50, the maximum absolute difference is 0.005 and the standard deviation difference is 0.009. Both are minimal.
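The split procedure can be sketched as follows; the data frame and column names are hypothetical stand-ins for the merged dataset described above.

```python
import numpy as np


def held_out_splits(df, uarch, frac=0.1, repeats=20, seed=0):
    # Repeated random sub-sampling validation: hold out 10% of the SKUs of one
    # microarchitecture, train on all remaining data, and repeat several times.
    rng = np.random.default_rng(seed)
    skus = df.loc[df["CodeName"] == uarch, "SKU"].unique()
    n_hold = max(1, int(frac * len(skus)))
    for _ in range(repeats):
        held_out = rng.choice(skus, size=n_hold, replace=False)
        test = df[df["SKU"].isin(held_out)]
        train = df[~df["SKU"].isin(held_out)]
        yield train, test
```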

The premise of this case study is that the predictive models are able to learn the performance function of a workload using the CPU specifications in Table 2.1. Specifically, the training set may contain SKUs with the same microarchitecture as the new SKU, and SKUs with similar features such as L3 cache, frequency, and memory size, but different microarchitecture. The predictive model then predicts the new SKU’s performance as though interpolating the training set.

Note that the prediction is nontrivial, even if the same microarchitecture or the full feature set of a different microarchitecture is in the training set. As shown in Figure 2.2, both microarchitecture and other SKU features (such as frequency) introduce large variations in performance. Knowing one of them does not make the prediction easy.

2.4.2 Case 2: Prediction for New Workloads

In this study, we show that after measuring performance of a new workload on just a handful of SKUs, we can train a model to predict the workload's performance on other SKUs. Armed with such predictions, consumers can choose the best processor for the workload based on both performance and price.

To demonstrate the approach, we split our dataset, taking out one workload at a time and treating it as the new workload. Then we train the model with all the other workloads' data plus a certain amount of data from the chosen workload. We perform the experiment separately for SPEC and Geekbench. We refer to this case as self-prediction of benchmark suites.

The premise of this and the next case study is that, given the data of a new workload on several SKUs, the model can find workloads with similar performance on those SKUs, and use the performance of those workloads to predict the new workload's performance on other SKUs. Therefore the prediction accuracy should be related to the similarity of the new workload to the workloads in the training set.

For every workload A, we remove the data for workload A from the overall dataset. We call the remaining data data_rest. We then sample a fixed set of SKUs for workload A as a testing set, and we refer to the rest of the data for workload A as train_pool. We randomly pick n SKUs from train_pool (where n is 1, 5, 10, or 50) and combine their data with data_rest to form a training set. We repeat the process for 20 iterations.
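A sketch of this case-2 split is given below; the column names, the test fraction, and the way the fixed testing SKUs are chosen are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def case2_split(df, workload, n, test_frac=0.3, seed=0):
    # Leave workload A out (data_rest), reserve a fixed set of its SKUs for
    # testing, and add data for n randomly chosen SKUs from train_pool back
    # into the training set (n is swept over 1, 5, 10, and 50 in the text).
    rng = np.random.default_rng(seed)
    target = df[df["Workload"] == workload]
    data_rest = df[df["Workload"] != workload]

    skus = target["SKU"].unique()
    rng.shuffle(skus)
    n_test = max(1, int(test_frac * len(skus)))
    test_skus, train_pool = skus[:n_test], skus[n_test:]

    picked = rng.choice(train_pool, size=min(n, len(train_pool)), replace=False)
    train = pd.concat([data_rest, target[target["SKU"].isin(picked)]])
    test = target[target["SKU"].isin(test_skus)]
    return train, test
```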

31 2.4.3 Case 3: Cross-Prediction Between Suites

The experiments in case 2 are conducted within each benchmark suite. Case 3 is an extension of case 2 that we call cross-prediction. We predict one workload in Geekbench using all data for SPEC plus that collected using several SKUs for the chosen workload. Then we repeat with Geekbench and SPEC switched.

For every workload A in Geekbench, we split the whole dataset into the data of workload A and the rest. We then sample a fixed set of SKUs from workload A as a testing set, and we refer to the remaining data of workload A as train_pool. We randomly pick n SKUs from train_pool and combine their data with the SPEC data as a training set. We sweep n over 1, 5, 10, and 50, and we repeat the process 20 times.

Figure 2.6: Results of predicting the performance of SPEC (top) and Geekbench (bottom) on new SKUs.

2.5 Prediction Accuracy

In this section, we show prediction accuracy measured by mean absolute error (MAE) in the case studies. We show that DNN is more accurate than LR, and case 2 (self-prediction) is more accurate than case 3 (cross-prediction).

2.5.1 Case 1: Prediction for New SKUs

Figure 2.6 compares the MAEs of LR and DNN for SPEC and Geekbench, respectively. The heights of the bars represent the average MAEs of the 20 random test sets. The error bars show one standard deviation of the MAEs. A smaller error bar indicates that different selections of the testing set give only small differences in MAEs, and therefore that the model is more robust when predicting different SKUs.

For both SPEC and Geekbench, DNN always has lower MAEs and is less sensitive to the selection of the test set than LR. That means the relationships between features and performance are not simply linear. In Figure 2.6 (bottom), LR has a very large MAE and standard deviation when predicting Geekbench on Broadwell SKUs. That is because in Figure 2.3 (bottom), there are only five Broadwell SKUs, and one is a desktop part while the others are servers. Using servers to predict desktop performance leads to large MAEs for LR. Apparently DNN handles this better. Comparing the average MAEs (the last set of bars in Figure 2.6), LR is 19.9% and DNN is 5.5% for SPEC, and LR is 20% and DNN is 11.2% for Geekbench. DNN is a better choice when the data set is large and interactions between features are complicated.

The average MAE is 5.5% for SPEC and 11.2% for Geekbench using DNN. The MAE for Geekbench is higher than for SPEC because Geekbench users are not like SPEC users, who are mostly computer architects and system engineers benchmarking on dedicated machines. One can easily install Geekbench and produce scores on personal devices. Results may be noisy if other applications are running at the same time, causing contention for computing and memory resources. We observe larger variance in the Geekbench data. We compute the average standard deviation of the performance for the same workload with the same (SKU, frequency, memory) configuration. It is 4.4% for Geekbench and 1.1% for SPEC. The larger noise in the data makes Geekbench harder to predict.

The accuracy of simulation-based prediction is about 4%129, which is slightly better than our 5.5% MAE for SPEC. Compared to the simulation-based approach, the extra noise in our approach comes from the performance variation introduced by compilers and operating systems in the datasets, and from the absence of architectural features. Using no architectural features makes the prediction harder, but it also makes the method easier to apply to any other public dataset, without the need for simulation.

2.5.2 Case 2: Prediction for New Workloads

In this section, we show the prediction results of case 2. We show that the DNN prediction accu- racy is largely determined by the similarity of the predicted workload to the rest of the benchmark suite. We quantify the outlierness of a workload and show that it is positively correlated with the prediction MAE, with a Pearson correlation of 0.69.

PREDICTION RESULTS: The results for SPEC are shown in Figure 2.7 (left). The workloads are sorted based on the MAEs after adding 50 SKUs (brown line). (The line for 5 added SKUs is omitted but the average bar is shown.) The standard deviations of all the testing sets are larger than 20%, and our MAEs in all cases are lower than the standard deviation. For 5, 10, and 50 added SKUs, the MAEs are 5.7%, 5.3%, and 4.7%, respectively. The MAE improvement when moving from 10 SKUs to 50 is similar to that when moving from 5 SKUs to 10. Since adding 40 SKUs is more demanding than adding 5, we conclude that using 10 SKUs for the new workload is a reasonable choice.

Similarly, the Geekbench prediction results are shown in Figure 2.7 (right). The workloads are sorted and the rightmost workloads are outliers with very high MAEs. The standard deviations of Geekbench testing sets are all above 11%. The outliers' standard deviations are at least twice their MAEs. That means our model shrinks the confidence interval by at least 50%.

Figure 2.7: The results of case 2. The workloads are sorted by the MAE after adding 50 SKUs.

Case 2 is harder than case 1. In case 1, the average MAE is 11%. In case 2, the average MAE is 14.2%, with two obvious outlier workloads (BlackScholes and AES). We conclude that case 2 works reasonably well, apart from the two outliers. Also, for the same reason as with SPEC, we choose 10 as the proper number of SKUs to use.

INSIGHTS ON BENCHMARK SIMILARITY: By studying MAEs across SPEC workloads, we find that the prediction MAE is correlated with the similarity of the predicted workload to the rest of the benchmark suite. This observation supports the statement in Section 2.4.2 that the model finds similar workloads based on the data points of the new workload, and uses those similar workloads to predict the performance of the new workload on other SKUs.

In Figure 2.7 (left), after adding 50 SKUs it is still relatively hard to predict the workloads on the right. We call these workloads outliers. The workload with the highest MAE is 481.wrf, and it is also the outlier lying above all other workloads in Figure 2.5, which shows the workloads' relative performance scaling across the varied SKU configurations.

Some outliers are well studied in prior work. Phansalkar et al. show that 436.cactusADM is an outlier in terms of memory access characteristics in Figure 9 of reference 154. However, the workload features collected in prior works are unable to explain why 481.wrf is such an outlier in the SKU scaling space, as shown in Figures 2.7 and 2.5. Future work will try to answer that question.

To study the outlierness of the workloads in Figure 2.7 (left), we ran a k-means algorithm to split the workloads into two clusters, cluster1 and cluster2. Their centroids C1 and C2 are plotted as stars in Figure 2.5. We compute the distance of a workload to its cluster's centroid as

$$\text{distance} = I(x \in \text{cluster}_1)\,\lVert x - C_1 \rVert_2 + I(x \in \text{cluster}_2)\,\lVert x - C_2 \rVert_2, \qquad (2.4)$$

where x stands for the location of the workload in Figure 2.5 and I yields 1 if its argument is true and 0 otherwise. The distance quantifies the outlierness of a workload.
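A sketch of this computation with scikit-learn is shown below; the input points are assumed to be per-workload coordinates in the scaling-behavior space of Figure 2.5, which is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans


def outlierness(points, seed=0):
    # Split the workloads into two clusters and return each workload's distance
    # to the centroid of its own cluster, as in Equation 2.4.
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(points)
    assigned_centroids = km.cluster_centers_[km.labels_]
    return np.linalg.norm(points - assigned_centroids, axis=1)


# The correlation with per-workload MAEs can then be checked with, e.g.,
# np.corrcoef(outlierness(points), mae_after_50_skus)[0, 1].
```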

Figure 2.8 plots each workload's MAE after 50 added SKUs (y-axis) against its distance (x-axis). Distance and MAE have a Pearson correlation of 0.6898. We can use distance to identify workloads with high errors. Workloads with distance higher than 5 have MAEs higher than 10%. This correlation explains why the outliers have higher MAEs than the others even after adding 50 SKUs.

CONCLUSION: This case shows that we are able to predict a new workload’s performance by run- ning it on 10 SKUs. The accuracy of the prediction depends on the similarity of the workload to those in the training set. Even if it is very different, our model is still able to shrink the confidence interval.

Figure 2.8: Correlation of SPEC workloads' outlierness (distance) with MAE after adding 50 SKUs. The distance of a workload is its distance from the centroid of its cluster. Centroids are shown in Figure 2.5.

2.5.3 Case 3: Cross-Prediction Between Suites

Figure 2.9 shows the results of cross-prediction: using Geekbench to predict SPEC (top) and vice versa (bottom). Because the MAEs of the workloads and the outliers are similar to case 2 (self-prediction), we only show the average MAE bars in this case.

To compare self-prediction and cross-prediction, we first compare Figure 2.9 (top) with the bars in Figure 2.7 (left), both of which predict SPEC workloads. Cross-prediction gives higher MAEs for any number of added SKUs. After adding 50 SKUs, the average MAE of cross-prediction is 16.9%, while the average MAE of self-prediction is 4.6%. Comparing the predictions of Geekbench in Figure 2.9 (bottom) and the bars in Figure 2.7 (right), after adding 50 SKUs the average MAE for cross-prediction is 12.6% and that for self-prediction is 10%.

Cross-prediction gives consistently higher errors because the workloads in SPEC and Geekbench are very different, more different than the workloads within either benchmark suite itself. In other words, the workloads in a benchmark suite are more similar to each other than to workloads outside of the benchmark suite.

Benchmark suites like SPEC CPU2006 are expected to be diverse and representative of real-world workloads. However, a benchmark suite is developed for a purpose, and different suites usually have different purposes. The purpose itself is the reason why the workloads in a suite are similar. The SPEC CPU2006 benchmark suite is intended for computer designers and architects to use for pre-silicon design analysis. SPEC workloads are drawn from and representative of real workloads.

In contrast, Geekbench is intended to stress a certain part of the hardware. For example, its memory microbenchmarks perform many memory streaming operations to study the bandwidth limit of the hardware system. From this point of view, it is reasonable that SPEC and Geekbench are very different and that the workloads within each suite are more similar.

CONCLUSION: Cross-prediction is harder than self-prediction. To predict performance for new workloads, one needs a training set with a very diverse collection of workloads. Benchmark suites tend to be self-similar, because each serves a specific purpose. Even so, cross-prediction is useful, as shown in the next section.


Figure 2.9: Case 3 is about cross-prediction. SPEC workloads are predicted with the Geekbench dataset (top). Geekbench workloads are predicted with the SPEC dataset (bottom). Cross-prediction has higher errors than self-prediction (case 2).

2.6 SKU Ranking Comparison

In this section, we use the performance prediction results of the DNN model to rank SKUs. We then compare the SKU rankings with the baseline metrics described in Section 2.3.4. The baselines represent the customary ways of making purchasing decisions. Our methodology makes SKU selection more accurate without requiring expertise in workload characterization or computer architecture specifications.

2.6.1 Case 1: Prediction for New SKUs

In this section, we show that our DNN model outperforms the three baselines in Section 2.3.4 when using SPEC and Geekbench to select new SKUs.

The results for SPEC and Geekbench are shown in Figure 2.10. For baseline C, a combination of frequency and cache, we swept 100 combinations of weights. We combined frequency (GHz) × w with cache (MB) × (1 − w), and swept w from 0.01 to 1 with a step size of 0.01. In Figure 2.10, we use 0.9 × frequency + 0.1 × cache so that the frequency and cache values are in about the same range. The top-3 accuracy of baseline C, combining frequency and cache, is no better than that of A (frequency) or B (cache).
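The weight sweep can be expressed as a short sketch; the per-SKU frequency, cache, and performance arrays are hypothetical inputs, and top-k accuracy is used as the score.

```python
import numpy as np


def sweep_weighted_baseline(freq_ghz, cache_mb, real_perf, k=3):
    # Sweep w from 0.01 to 1.00 in steps of 0.01 and score each weighted
    # combination of frequency and cache size by whether the truly best SKU
    # lands in its top-k ranking.
    best_sku = int(np.argmax(real_perf))
    results = {}
    for w in np.arange(0.01, 1.01, 0.01):
        score = w * freq_ghz + (1 - w) * cache_mb
        top_k = np.argsort(score)[::-1][:k]
        results[round(float(w), 2)] = int(best_sku in top_k)
    return results
```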

Frequency is the best metric among the three baselines. The following subsections show similar results, so we present baseline A and drop the others in the rest of the chapter.

Our predictive model outperforms all of the workload-indifferent baselines. For Geekbench, the frequency metric achieves accuracy comparable to that of our DNN model. The Geekbench suite as a whole is quite frequency-sensitive. The SPEC suite, on the other hand, is sensitive to neither frequency nor cache, which suggests that choosing SKUs for complicated workloads requires more than simply comparing frequency or cache.

CONCLUSION: When selecting a SKU that optimizes SPEC or Geekbench performance, our predictive model works better than simply comparing frequencies, cache sizes, or a combination of the two. Selecting a SKU by frequency is better than by cache size, and is more successful for microbenchmarks like Geekbench than for complicated workloads like SPEC.

Figure 2.10: SKU ranking results of case 1. Our DNN model has the highest top-3 accuracy.

2.6.2 Cases 2 and 3: Self- and Cross-Prediction

In this section, we compare the top-k accuracies of cases 2 and 3 and baseline A (frequency). Results using baselines B and C are consistent with those presented, so we omit them. We show case 2 and case 3 in the same plots to compare their results. Both cases use 10 additional SKUs for the new workload. The results appear in Figure 2.11.

It is very difficult to find the best SKU for SPEC workloads in cases other than SPEC self-prediction (case 2). The top-3 accuracy of case 3 is very small, and that of frequency is 0. We only start to observe more accuracy from them when we relax the choices to the top-7 SKUs. This indicates that building the predictive model is the only reasonable way to find good SKUs for SPEC. The situation can get worse for the workloads in people's daily lives, such as web browsers, the Office suite, the Adobe suite, and other customized workloads from individual companies.

Self-prediction has higher accuracies than cross-prediction for most of the workloads, suggesting self-prediction is easier. The ranking accuracies further support what has been observed from the prediction accuracy studies in Figures 2.7 and 2.9, and the self-similarity statement in Section 2.5.3. Exceptions are AES, Dijkstra, and Stream Scale from Geekbench, which have higher accuracies when trained with SPEC. Among them, AES is the outlier with the highest MAE in the self-prediction results in Figure 2.7 (right). The results indicate that those three workloads are more similar to SPEC in terms of selecting the best SKUs.

Figure 2.11: Based on SPEC (top) and Geekbench (bottom), a comparison of the SKU rankings by self-prediction (case 2), cross-prediction (case 3), and the best baseline (Frequency).

Self- and cross-prediction are more accurate than frequency. Note that in cases 2 and 3, the test set is fixed, and the only variation across the 50 iterations is the random additional 10 SKUs from the new workload, as described in Section 2.4. Therefore, for the frequency bars in Figure 2.11, there is only one ranking of SKUs to test, and the top-k accuracy is either 0 or 1, depending on whether the best SKU is ranked top-k or not. By averaging the accuracies over the whole suite, we find that frequency performs the worst. Though the previous subsection shows the accuracy of frequency to be comparable to DNN prediction for Geekbench (Figure 2.10), individual workloads may or may not be sensitive to frequency. Frequency is not a good way to select SKUs.

Figure 2.12: Comparison of SKU rankings based on SPEC and Geekbench averages.

CONCLUSION: SPEC self-prediction is the only reasonable way to find the best SKU for SPEC workloads. Self-prediction is easier than cross-prediction in most cases. Frequency has the lowest accuracy, so frequency alone should not be the deciding factor when selecting SKUs for new workloads.

2.6.3 Benchmark Suite Comparison

In this section, we compare the benchmark suites in terms of SKU ranking. While there is no prediction involved, it is natural to compare the two benchmark suites after comparing the individual workloads in them.

Geekbench is very popular on computer enthusiast websites, with many articles claiming that "Geekbench suggests A outperforms B". Such articles affect many consumers' choices when purchasing computing products. Most people use laptops to run workloads such as web browsers, Microsoft's Office suite, Adobe software, and so on. We describe these as realistic workloads. Consumers choose CPUs based on Geekbench scores because they assume that Geekbench predicts the performance of realistic workloads. Here we view SPEC workloads as falling between realistic workloads and Geekbench in terms of complexity and real-world relevance.

To show how the two suites produce different rankings, we rank the SKUs that the SPEC and Geekbench datasets have in common by averaging each benchmark suite's performance on each such SKU. Figure 2.12 shows the ranking comparison. The x-axis is the rank based on SPEC and the y-axis is the same SKU's rank based on Geekbench. The footrule distance between the two rankings is 0.59. (The range is from 0 to 1.) If the ranks were consistent across both suites, the points would lie along the gray line (x = y). We observe two patterns in Figure 2.12. One is the cluster of points above the gray line; Geekbench ranks those SKUs lower than SPEC does (gives them higher ranking indexes). The other pattern is the cluster of points below the gray line and close to the x-axis. SPEC ranks those SKUs anywhere from 0 to 120, while Geekbench ranks them from 0 to around 50. In other words, those SKUs have very different SPEC performance but are poorly differentiated by Geekbench. For example, a CPU SKU highly ranked by Geekbench, e.g., with a ranking index of 1-20, may be ranked anywhere between 1 and 120 by SPEC. Users picking such SKUs may not get the desired performance.
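The normalized footrule distance referenced above can be computed with a small sketch; the example rankings are hypothetical, and the normalization by floor(n^2/2) is the standard maximum of the Spearman footrule over permutations.

```python
import numpy as np


def normalized_footrule(rank_a, rank_b):
    # Spearman footrule distance between two rankings of the same SKUs,
    # normalized by its maximum value floor(n^2 / 2) so the result is in [0, 1].
    rank_a, rank_b = np.asarray(rank_a), np.asarray(rank_b)
    n = rank_a.size
    return np.abs(rank_a - rank_b).sum() / (n * n // 2)


# Hypothetical example with 6 SKUs ranked by average SPEC and Geekbench scores.
spec_rank = np.array([0, 1, 2, 3, 4, 5])
geek_rank = np.array([2, 0, 5, 1, 4, 3])
print(normalized_footrule(spec_rank, geek_rank))
```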

The Geekbench SKU ranking is not consistent with the SPEC ranking. That means that if an enthusiast article says "Geekbench shows that A outperforms B", SPEC may suggest that B outperforms A.

The fundamental cause is that those benchmark suites contain very different sets of workloads. Similarly, Geekbench and SPEC workloads can be different from the realistic workloads of end users (such as web browsers, the Windows Office suite, and the Adobe suite) and of datacenters112 (such as ads, storage, and search). As a result, both suites may be poor indicators for end users or datacenter performance engineers.

This suggests that people use Geekbench incorrectly. The aggregate score of a benchmark suite is misleading. Geekbench is designed to stress floating point, integer, and memory operations. It makes more sense to compare the performance of the individual microbenchmarks that are relevant to one's realistic workloads. If one runs a lot of matrix operations and image processing workloads, such as MATLAB and Adobe Photoshop, the floating point benchmarks in Geekbench are very helpful.

CONCLUSION: SKU rankings by Geekbench are inconsistent with rankings by SPEC, and probably also with rankings for realistic workloads. Aggregate benchmark suite scores are misleading. Benchmarking relevant individual workloads makes more sense. For a non-expert consumer, the best approach is to take all available datasets, plus the new workload's performance on 10 SKUs, and train a predictive model as in cases 2 and 3.

2.7 Related Work

2.7.1 Performance Prediction

For performance prediction, our work uses deep neural networks (DNNs), regarded as the state-of-the-art machine learning approach. A DNN has been proven to be able to approximate any function52.

Unlike simulation-based performance prediction129,102,103,169, our approach does not use microarchitecture features such as instruction issue width, write buffer size, or number of pipeline stages. We use only the CPU specifications that manufacturers provide. Microarchitecture features would make performance prediction easier and more accurate, but our approach is more helpful for consumers choosing CPUs. Even when manufacturers describe microarchitecture features, consumers need expert knowledge to make use of such information.

Performance prediction is a well-studied field. There are mechanistic models115,67,46,191,190,82,68, empirical models64,132,177, and machine learning models217,21. Zheng et al. used performance counters and a "stochastic dynamic coupling" heuristic for aligning host and target execution samples to estimate performance using only binary code218. In contrast, we focus on CPU SKU selection for consumers, rather than performance prediction for hardware/architecture designers. Our method does not use performance counters or architecture features other than those accessible to all consumers.

Piccart et al. use data transposition to rank commercial machines157. Their work exploits machine similarity rather than workload similarity. Our work has different objectives and uses different metrics. It exploits workload similarity and benchmark suite self-similarity to help users pick the best CPU SKU for their needs.

2.7.2 Workload Characterization

Phansalkar et al. develop a series of SPEC workload characterizations154,153,155,156. However they do not focus on workload performance scaling as we do, so their classification results are not consistent with ours. Our approach provides an orthogonal way to characterize workloads.

Some researchers use static workload features to predict performance. Shelepov et al. use spatial locality, cache size, and frequency to predict speedup on a variety of processors170,171. Hoste et al. collect microarchitecture-independent features to classify workload performance97,98. Delimitrou et al. study workload characteristics for cloud scheduling57,58. In our work, we predict a new workload's performance on hundreds of CPU SKUs, and the number of workloads is much smaller than the number of hardware platforms. Because the workload features are static, we argue that it is hard to use such static features in this chapter.

2.8 Summary

To help customers select proper CPU SKUs and overcome the limitations of public benchmark datasets, we have presented statistical and predictive analysis of workload performance, with data collected from the SPEC CPU and Geekbench 3 suites and Intel processor specifications.

We demonstrated performance prediction for new SKUs and new workloads with deep neural networks (DNNs). For new SKUs, the accuracy is 5% (SPEC) and 11% (Geekbench) mean absolute error (MAE). With the same accuracy, we find that we can predict a new workload's performance by running it on 10 SKUs. We compared our predictive methods against workload-indifferent metrics for selecting processors. Our predictive model is the only approach that achieves reasonable accuracy in all three case studies. Notably, workload-indifferent methods of SKU selection do not work for new workloads.

For the first time in the literature, we showed benchmark suite self-similarity quantitatively, by cross-prediction and by comparison of SPEC and Geekbench SKU rankings. The accuracy of cross-prediction is lower than that of prediction within the suites because the suites are so different: 25.9% mean error to predict SPEC and 14.2% to predict Geekbench.

Rankings based on average Geekbench scores are inconsistent with those based on average SPEC scores. Geekbench rankings do not imply SPEC rankings. It is also hard to draw any conclusions about realistic workloads from a benchmark suite. We suggest studying the Geekbench or SPEC workloads of interest, rather than relying on aggregate scores.

3 Parallelism in Deep Learning Frameworks

This chapter takes a deep dive into analyzing the performance impact of key design features of machine learning frameworks and the role of parallelism. The observations and insights distill into a simple set of guidelines that one can use to achieve much higher training and inference speedup. The evaluation results show that our proposed performance tuning guidelines outperform both the Intel and TensorFlow recommended settings, by 1.29× and 1.34× respectively, across a diverse set of real-world deep learning models.

3.1 Introduction

The popularity of deep learning (DL) has spawned a plethora of domain-specific frameworks for machine learning (ML), including Caffe/Caffe2106, PyTorch118, TensorFlow14, and MXNet44. These frameworks all provide high-level APIs for the building blocks of DL models, largely reducing the prototyping cycle through substantial use of libraries. This greatly improves the productivity of developers building end-to-end DL models. In addition to programmability benefits, these frameworks also provide many DL-specific optimizations to improve performance and portability across software stacks and new hardware systems. The net result is an explosion in the development of ever more complex DL models and a concomitant increase in the computation costs of training. Recent analysis shows a 3.5-month doubling time of AI training computation for popular DL models, exceeding the traditional performance growth of Moore's Law17.

Computing performance is especially important for deep learning. When models go into production at scale, individual performance improvements can affect datacenter resources and whether a model can be deployed to performance- and energy-constrained mobile devices86,178,205. Even during the prototyping phase, when human overhead is a larger bottleneck than machine overhead, model training time is still on the critical path for DL developer productivity. Since all popular DL frameworks provide high-level abstractions, one important question is how much performance overhead this improved programmability is adding. A further question is whether we can reduce this "programmability tax" by tuning the complex set of design choices available in current frameworks.

Answering both questions requires a comprehensive understanding of framework performance.

Previous work compared popular DL frameworks24, without revealing the root causes of the performance difference. We believe the performance comparison of specific frameworks is not the key issue. When we break widely-used frameworks into components, i.e., design features, we find that frameworks have significantly overlapping features, which affect performance in the same way.

Figure 3.1: Time breakdown for Inception v3 training into library kernels, framework native operators, and overhead, under successive framework configurations: (1) the MKL-DNN backend, (2) inter-operator parallelism, the TensorFlow recommended setting, and (3) inter- plus intra-operator parallelism (our recommended setting). Overhead is the increasing synchronization time caused by workload imbalance among threads.

These design features include the rich set of parallelism opportunities within and between model layers and across input data (e.g., batches). Other key choices available to users include back-end math kernels, threading libraries, and scheduling policies.

Figure 3.1 shows an example of Inception v3 training on a dual-socket CPU platform, whose final performance is improved by 3.6× with properly tuned framework configurations. Fine-tuning of framework knobs requires expert knowledge of their performance impact. In this example, tuning the inter-operator parallelism speeds up the framework native operators by 1.54×. Exploiting intra-operator parallelism speeds them up by another 25×, translating to a 2.4× performance improvement for the entire model. Compared to the recommended TensorFlow setting75, our chosen setting speeds up the native operators by 6.5× and the overall workload by 1.15×. In our evaluation results, we speed up some models by over 2×. Selecting the optimal setting is not straightforward, especially for at-scale CPU platforms that serve a large and diverse set of DL use cases in production datacenter fleets86. This motivates in-depth studies for better performance.
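As an illustration of the kinds of knobs being tuned, the sketch below shows how the inter- and intra-operator thread settings and the MKL/OpenMP environment variables can be configured in TensorFlow 1.x; the specific values are illustrative placeholders, not the tuned settings derived later in this chapter.

```python
import os
import tensorflow as tf

# MKL-DNN / OpenMP knobs (illustrative values; tuning is discussed in this chapter).
os.environ["OMP_NUM_THREADS"] = "12"                      # MKL threads per pool
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

# TensorFlow 1.x threading knobs: the number of asynchronous scheduling pools
# (inter-op) and the number of threads available to each operator (intra-op).
config = tf.ConfigProto(inter_op_parallelism_threads=2,
                        intra_op_parallelism_threads=12)
with tf.Session(config=config) as sess:
    pass  # build and run the model here
```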

This chapter provides three major contributions.

• We provide a detailed analysis of the fundamental performance implications of key framework design features on CPU platforms, including scheduling mechanisms, operator designs, library back ends, and parallelism mechanisms. We find that the programmability tax ranges from 63% to 1.3%.

• Using insights from this analysis, we propose simple guidelines for tuning framework parameters to maximize parallelism and reduce framework overheads.

• We demonstrate the usability of our approach by integrating our methodology with TensorFlow, which we will open source. Our settings achieve the same average performance as the globally optimal settings, and outperform the suggested settings from the Intel20 and TensorFlow75 performance guides by 1.29× and 1.34×, respectively, across a set of real-world DL models, including several from the MLPerf suite151.

Figure 3.2: An overview of the framework design features studied in this chapter.

3.2 Framework Design Overview

Deep learning frameworks enable high-level expression of model layers and operations, portability across platforms, and integration with convenient debugging and profiling tools, which have lowered the programming effort for DL researchers and decreased prototyping time for new models. The abstraction provided by these frameworks hides some design choices, such as run-time scheduling and the actual implementation of operators, which may not be noticeable to users but are important for performance. An efficient framework design should be able to exploit the parallelism exposed by deep learning workloads. In this section, we describe how deep learning framework design choices (Section 3.2.1) exploit parallelism opportunities exposed in deep learning workloads (Section 3.2.2), and we give an overview of our framework parameter tuning methodology (Section 3.2.3). We also discuss related work.

Our performance analysis focuses on CPUs, which are the most widely used platforms serving ML workloads in datacenters86,79. Performance improvement directly translates into capacity efficiency improvement. As we will see, DL frameworks offer a large set of tradeoffs to exploit parallelism that lead to significant optimization opportunities on CPUs.

3.2.1 Design Features

Figure 3.2 presents the stack of DL frameworks and design features that we study. This work focuses on frameworks with opaque (manually implemented) operators, a design adopted by popular DL frameworks. Frameworks that do not use such operators do not share the same structure and fea- tures as in Figure 3.2, including Tensor Comprehensions192, TVM45, and Julia30. Such frameworks claim to have better modularity and to extend more easily to new operators and new platforms26, but their designs differ in important ways that put them out of the scope of this chapter.

SCHEDULER: Here "scheduler" refers to the operator scheduler, not the process scheduler of the OS. It takes a computational graph representing the DL workload and schedules its operators based on dependencies and hardware resources. Two common approaches are synchronous and asynchronous scheduling. Synchronous scheduling schedules one operator at a time. Asynchronous scheduling schedules all operators in the ready state, so that operators can execute in parallel if hardware resources are available. In the example of Figure 3.2, asynchronous scheduling is faster than synchronous scheduling, assuming unlimited hardware units. However, as we will show in Section 3.4, given limited hardware resources the optimal mechanism usually falls between the two extremes.

OPERATOR: Frameworks include both native operators and operators based on library kernels. The way operators make use of kernels can have a surprisingly large impact on performance. For example, Figure 3.2 shows two potential framework-level implementations of a MatMul operator based on the library kernel MATMUL. The right one passes its arguments as is to MATMUL. The left one splits matrix x into smaller blocks and passes each block to a thread in a thread pool. We will show in Section 3.5.2 that the latter performs better because it parallelizes data preparation before entering the MATMUL kernel.
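A simplified sketch of the block-splitting design in Python is shown below: the left-hand matrix is split into row blocks and each block is handed to a thread pool before the underlying math kernel is invoked (numpy's matrix multiply stands in for the library's MATMUL kernel). This illustrates the idea rather than reproducing the framework's actual operator code.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def blocked_matmul(x, w, pool, n_blocks=4):
    # Split x into row blocks and dispatch each block to a framework-level
    # thread pool; the math kernel (here "@", i.e., np.matmul) runs per block.
    blocks = np.array_split(x, n_blocks, axis=0)
    results = list(pool.map(lambda blk: blk @ w, blocks))
    return np.vstack(results)


x, w = np.random.rand(1024, 512), np.random.rand(512, 256)
with ThreadPoolExecutor(max_workers=4) as pool:
    y = blocked_matmul(x, w, pool)
```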

LIBRARY: Libraries are the most basic components, providing fundamental building blocks such as mathematical kernels and thread pools. Mathematical libraries provide efficient parallel implementations of common kernels. We study three widely-used libraries: MKL, MKL-DNN, and Eigen. A thread pool manages a number of threads that execute tasks upon request. Math libraries use thread pools, such as the one provided by OpenMP, for parallelization. Besides the thread pools used by math libraries, DL frameworks use additional thread pools to parallelize computations outside of the math kernels. We study the thread pool implementations in Eigen and Folly.

BEYOND ONE SOCKET: Parallelism mechanisms need to be applied based on workloads. Com- mon mechanisms include data and model parallelism.

3.2.2 Parallelism Opportunities

A DL workload can be expressed with a computational graph, where a node represents an operator, and an edge indicates the dataflow dependencies between operators14. DL workloads expose the parallelism within an operator (intra-operator), between operators (inter-operator), and among requests. Efficient framework designs should exploit such opportunities.

Parallelism within an Operator

Operators manipulate tensors, i.e. n-dimensional arrays. The parallelism within an operator (intra- operator parallelism) can be exploited with the following techniques.

SIMD: The use of single instruction multiple data (SIMD) architectures, e.g., Intel's AVX instructions137, is implemented in mathematics libraries. The MKL, MKL-DNN, and Eigen libraries can use AVX2 and AVX512 instructions.

MULTI-THREADING: Multi-threading is implemented at the operator and library levels. For example, MKL uses OpenMP for multi-threading. At the operator level, a framework may have a separate thread pool for further parallelism.

DATA PARALLELISM: Data parallelism splits one batch of data into multiple smaller batches.

Thus it can improve performance of large-batch workloads.

Parallelism between Operators

SCHEDULING: The parallelism across operators (inter-operator parallelism) can be exploited by asynchronous scheduling, which places independent operators on different hardware units.

MODEL PARALLELISM: Model parallelism is realized by scheduling different operators (or the same operator after splitting along the model size dimension) on different hardware sockets or nodes, such as distributing large embedding tables across hardware nodes65.

MODEL PIPELINING: Model pipelining is a special case of model parallelism34. One hardware node receives data from a previous node and operates on it while the previous node is computing the next training step. It allows hardware nodes to work on different training iterations at the same time. In contrast to data parallelism, model parallelism lowers the memory requirement for large models. This chapter does not study model pipelining.

Parallelism among Requests

Parallelism among requests can be exploited by batching, which transforms request-level parallelism into intra-op parallelism. For example, multiple image classification requests can be combined and executed in a single session, such that the number of requests is mapped to the batch size dimension.
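As a minimal sketch of this mapping, independent requests can be stacked along a new leading batch dimension; the request shape below is an assumed image-classification input, not a specific model's requirement.

```python
import numpy as np

# Three independent image-classification requests (e.g., 224x224 RGB inputs)
# stacked along a new batch dimension and served in one session run.
requests = [np.random.rand(224, 224, 3).astype(np.float32) for _ in range(3)]
batch = np.stack(requests, axis=0)      # shape: (3, 224, 224, 3)
```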

3.2.3 Framework Parameter Tuning

Based on our analysis of design features and parallelism opportunities, we reduce the number of design features needing to be selected from five (scheduling mechanism, operator design, math library, thread pool library, parallelism mechanism) to one: the number of asynchronous scheduling thread pools. Other features, such as operator parallelism, follow from that choice. We propose simple guidelines for tuning framework features based on a model's inter-operator parallelism, as reflected in its computational graph. To demonstrate their usability, we integrate the guidelines with TensorFlow and achieve 1.29× and 1.34× speedups over the Intel20 and TensorFlow75 recommended settings, respectively. We also achieve the same average performance as the globally optimal settings, and 95% of the globally optimal performance in the worst case. We have developed a TensorFlow plugin that sets the framework parameters automatically, which we will open source. Details are in Section 3.8.

Previous work proposed to tune TensorFlow parameters automatically83, but it treats the tuning process as a black box and therefore does not explain how parameters affect performance. Our work differs in three ways.

• First, our tuning method is supported by strong analysis. A deep understanding of framework designs and the root causes of performance differences makes our tuning method intuitive and lightweight.

• Second, performance with our worst-case settings differs by less than 5% from the global optimum obtained by exhaustive search, while previous work83 reports large performance degradation.

• Finally, the robustness of our guidelines is validated on a different set of real-world DL workloads using a state-of-the-art two-socket system. Our evaluation highlights the importance of framework parameter tuning for state-of-the-art translation and recommendation models that exhibit immense inter-operator parallelism.

3.3 Experimental Setup

We will open-source our scripts and workloads.

CPU PLATFORMS: We use three Intel Skylake CPU platforms: small, large, and large.2. large.2 contains two sockets of large with a peak bi-directional bandwidth of 120 GB/s. large and large.2 represent widely-used datacenter servers. small has fewer cores; we use it to eliminate threading overhead for certain studies. We use large and small for most of the analysis from which the performance tuning guidelines are summarized, and large.2 for the evaluation of the guidelines. small has 32 fused multiply-add (FMA) units per core, while large has 64 per core. Although all of the platforms support hyperthreading, each core has only one set of FMA units, which limits the benefits of hyperthreading if both hyperthreads need FMA units. The specifications are summarized in Table 3.1. large and large.2 are Amazon Web Services m5.metal instances. Unless otherwise specified, we use the large platform.

Platform | SKU | Cores | TFLOPS | Freq | LLC
small | i7-6700k | 4 | 0.423* | 4 GHz | 8 MB
large | Platinum 8175M | 24 | 1.64* | 2.5 GHz | 33 MB

* Estimated with GeekBench v4125.

Table 3.1: Hardware platforms under study. An additional platform is large.2, which contains two sockets of large with a 120 GB/s bi-directional UPI bandwidth at peak.

FRAMEWORKS: Because TensorFlow supports all of the features we study, we use TensorFlow v1.13 with the MKL-DNN back end, unless otherwise specified. Conducting the same experiments with PyTorch (Caffe2 module) shows similar trends, so in this chapter we focus on TensorFlow. We set thread affinity to prioritize binding one software thread to one physical core20.

WORKLOADS: We use a set of production-size deep learning models, including three from MLPerf151

58 (ResNet-5087, Transformer193, neural collaborative filtering (NCF)89), as well as DenseNet100, SqueezeNet101,

Inception182, GoogLeNet181, CaffeNet106, ResNext208, and Google’s Wide & Deep Learning model47.

To deeply understand the design features, we use micro-benchmarks such as matrix multiplication.

We use a subset of the aforementioned models to focus our evaluation on the respective design features (Sections 3.4 and 3.5). We hold out all the non-vision models for Section 3.8 to evaluate the proposed method.

METHODOLOGY: Our profiling methodology enables thorough analysis. We produce time breakdowns (stacked bars) for individual CPU cores using Linux's perf record, profiling one core at a time. Using floating-point performance counters, we measure performance as floating-point operations per second (FLOPS). We trace execution with performance counters by sampling instructions per cycle (IPC) every few milliseconds and ordering the samples by time stamp. The perf stat command with the topdown option produces the top-down breakdown. LLC misses, memory traffic, and UPI traffic come from the corresponding performance counters.

TERMINOLOGY: Here are some terms used often later on.

• MKL Threads: the threads for MKL and MKL-DNN.

• Intra-Operator Threads: the threads for an operator at the framework level. Abbreviated intra-op threads.

• Inter-Operator Pools: the independent thread pools in a framework, the size of each set by intra-op threads. Abbreviated inter-op pools, or pools.

Figure 3.3: Examples of (a) synchronous scheduling, (b) asynchronous scheduling, and (c) using one and four thread pools, with the same total hardware resources.

3.4 Scheduling Mechanism

A deep learning model is expressed using a computational graph that represents the data flow between operators. At run time, operator scheduling exposes optimization opportunities, such as scheduling independent operators simultaneously. In this section we study the trade-offs of using such inter-operator parallelism.* We show that not all models benefit from asynchronous scheduling, and that the best setting depends on a model's inter-operator parallelism, quantified by the width of its computational graph.

Figure 3.3 shows examples of synchronous and asynchronous scheduling of an Inception module182, and the use of one and four thread pools. Scheduling an operator means submitting its job to a thread pool. Sometimes asynchronous scheduling, i.e., running multiple operators simultaneously, can improve performance. For the simple example shown, scheduling one operator at a time takes nine steps to finish (Figure 3.3a); scheduling four operators at a time reduces the steps to five (Figure 3.3b). One simple implementation is to create several thread pools of the same size to share the computing hardware (Figure 3.3c), and to schedule independent operators asynchronously to the thread pools. This design is adopted by popular DL frameworks. In TensorFlow, the number of asynchronous thread pools is called the number of inter-operator threads. Caffe2 calls it the asynchronous thread pool size. In this chapter, we refer to it as the number of inter-op pools, as opposed to intra-op threads. We will show that asynchronous scheduling is beneficial in both single-socket and multi-socket systems. The best performance is achieved by balancing intra- and inter-operator parallelism.

*The experiments in this section are conducted with the real-world workloads implemented in Caffe2, because the Inception architecture is important for this study and the Caffe2 model zoo makes it more convenient to use Inception.

Figure 3.4: (Bar chart) The speedup of using asynchronous scheduling over synchronous scheduling, with batch size 16 (inference uses 3 pools, training uses 2). (Table) The maximum computational graph width and the best numbers of thread pools. Workloads with more branches benefit from more pools.

Workload | Caffe | Squeeze | Google | IncepV1 | IncepV2 | ResNext | FC-512* | FC-4k**
Max Width | 1 | 2 | 4 | 4 | 4 | 2 | 2 | 2
Best Pools | 1 | 1,2 | 2,3 | 2,3 | 2,3 | 1,2 | 1,2 | 1

* Similar in size to the FC layers in YouTube and Facebook recommendation models.
** Similar in size to the FC layers in Transformer.

3.4.1 Datacenter Platform Performance

We show that the best number of thread pools is no more than the maximum number of parallel operators for a model. We use the large platform in Table 3.1.

Figure 3.4’s bar chart shows the speedup of asynchronous scheduling on different production- size inference and training workloads. The baseline is synchronous scheduling, using one thread pool of size 24. Inference uses three thread pools, each with 8 threads; training uses two pools, each with 12 threads. Workloads benefit from asynchronous scheduling differently. Inception v1 and v2,

61 GoogLeNet, ResNet, and FC-512 speed up more than others.

The performance difference is due to models' intrinsic inter-op parallelism, which is quantified by the width, or the number of branches, of their computational graphs. It measures the number of operators that can be scheduled in parallel. The table at the bottom of Figure 3.4 summarizes the maximum graph width and the best numbers of pools for varying batch sizes. We distinguish between inference and training workloads because the computational graphs of training workloads contain gradient and sum-weight operators, which doubles the number of parallel operators. An intrinsic model limitation is that the best numbers of pools (for varying batch sizes) do not exceed the maximum graph width.
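One way to estimate this width, assuming the model is available as an operator dependency graph, is to count how many operators share the same topological level; this is a simplified heuristic sketch rather than the exact procedure used in the dissertation, and the example graph is hypothetical.

```python
from collections import defaultdict, deque


def max_graph_width(edges, nodes):
    # edges: list of (src, dst) operator dependencies; nodes: all operator names.
    # The width is the largest number of operators at the same topological level,
    # i.e., operators that could be scheduled in parallel.
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1

    level = deque(n for n in nodes if indeg[n] == 0)
    width = 0
    while level:
        width = max(width, len(level))
        nxt = deque()
        for n in level:
            for m in succ[n]:
                indeg[m] -= 1
                if indeg[m] == 0:
                    nxt.append(m)
        level = nxt
    return width


# A module with four parallel branches between a split and a concat has width 4.
nodes = ["split", "b1", "b2", "b3", "b4", "concat"]
edges = [("split", b) for b in ["b1", "b2", "b3", "b4"]] + \
        [(b, "concat") for b in ["b1", "b2", "b3", "b4"]]
print(max_graph_width(edges, nodes))   # -> 4
```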

The best number of pools varies with batch size. Figure 3.5 shows the performance of Inception v2 inference and ResNext training, with batch sizes of 1, 16, and 256. The use of inter-op pools benefits large-batch inference and small-batch training. This is because the extra parallelism from training operators is only beneficial for small batches. When the batch size gets larger, gradient operators are much more compute-intensive than sum-weight operators, so allocating computing resources evenly to the two kinds of operators hurts performance. On the other hand, the parallel operators in inference workloads stay balanced at large batch sizes, and large batch sizes make the operators more compute-intensive, so asynchronous scheduling improves performance.

3.4.2 Inception v2 Case Study

To highlight parallelism opportunities at the intra- and inter-op levels, we use Inception v2 as an example. Its model architecture contains operator branches that can execute in parallel. In the baseline implementations, which either schedule each branch naively to one CPU core or schedule one operator to all CPU cores, workload imbalance and synchronization overhead significantly reduce performance. We show that synchronization overhead can be mitigated by choosing the number and size of thread pools to better balance intra- and inter-operator parallelism. We use the small platform in Table 3.1, as it enables an exhaustive study of possible cases.

Figure 3.5: Asynchronous scheduling benefits large-batch inference and small-batch training.

INCEPTION V2 ARCHITECTURE: To simplify the explanation of later results, we first summarize the Inception v2 architecture182 in Figure 3.6. Figure 3.6a shows the top-level architecture, color coded as areas 1 and 2. Area 1 exhibits both intra- and inter-op parallelism, while area 2 only has intra-op parallelism. Area 1 contains two types of inception modules. Module 4 has four branches (Figure 3.6b) and module 3 has three (Figure 3.6c). Area 1 contains eight instances of module 4 and two of module 3. Operators in area 2 are sequential, including convolution, MatMul, and mathematical operators. The convolution operators are converted to MatMul using im2col(), so the intensive computation is mostly MatMul.

PERFORMANCE SCALING WITH POOLS AND THREADS: Figure 3.7 shows the relative performance of Inception v2 with a batch size of 16, sweeping the number of inter-op pools and MKL threads per pool. The total number of threads on the system is the product of the two. Hyperthreads are used when more than four threads are created. Exceeding eight threads is labeled over-threading because there are then more software threads than hardware threads. (Scaling is similar for batch sizes from 1 to 128.)

Hyperthreading does not improve performance significantly, as seen by comparing [4,1] vs. [4,2] and [1,4] vs. [2,4] ([Threads, Pools]). The compute-intensive operators, Convs and MatMuls, are bottlenecked by the fused multiply-accumulate (FMA) units, which are shared between hyperthreads on the same core.



Figure 3.6: (a) The Inception v2 architecture contains modules with (b) four and (c) three independent branches. Area 1 exhibits inter- and intra-op parallelism and area 2 only intra-op parallelism. "Math" operators are single-threaded, including SpatialBN, Mul, Add, Relu, and MaxPool; Conv is converted to MatMul using im2col(), and the compute-intensive operators use the MKL sgemm kernel with AVX instructions.

As expected, over-threading, i.e., using more software threads than hardware threads, hurts the performance, because threading overhead increases with more software threads, and computing resources are saturated. As a result, simply setting all framework knobs to the maximum does not yield the best performance.

For Inception v2, performance is best with two pools and two threads per pool. Using four total threads in other ways, such as four pools with one thread each, or one pool with four threads, cannot achieve such performance. This is because the [2,2] configuration balances the intra- and inter-op parallelism provided by Inception v2 and reduces its synchronization overhead. Our profiling methodology reveals and visualizes the underlying causes in Figures 3.8 and 3.9.

Figure 3.7: Performance of Inception v2 with different numbers of inter-op pools and MKL threads per pool. The best configuration balances intra- and inter-op parallelism.

RUN-TIME BREAKDOWN AND EXECUTION TRACES: To explain the best-performing case

[2,2], we select four cases, a baseline that uses only one thread, and three cases that each use four threads in total. One software thread is bound to one CPU core. Figure 3.8 shows the aggregate time breakdown, and Figure 3.9 shows the corresponding execution traces. In Figure 3.9, one iteration of Inception v2 inference is marked with red bars, and operators in execution are labeled with the corresponding color in Figure 3.6, where area 1 exhibits intra- and inter-op parallelism, and area 2 has only intra-op parallelism. The fraction of time each core spent executing (rather than synchronizing) is indicated to the right of each trace. It matches the breakdowns in Figure 3.8.

The first case, using four pools of size one, incurs high synchronization overhead in Figure 3.8 and Figure 3.9a, primarily because the operators in area 2, with only intra-op parallelism, are assigned only one core (thread), and other cores are waiting to synchronize. The trace in area 1 is slightly better, with every core executing one branch of each inception module.

The second case, using one pool with four threads, does not perform well either, because the Caffe2 operators are single-threaded, and other cores are stalled by core 0, as shown by the long synchronization time in the rightmost three bars of Figure 3.8 and the traces of cores 1–3 in Figure 3.9b.

Figure 3.8: Execution time breakdown of four cases.

The third case, using two pools, each with two threads, strikes a better balance. It reduces synchronization time in both area 1 and area 2 compared to the first case. Area 1 is improved because the number and size of operators in each inception branch are not even, as shown by Figure 3.6b, and allocating one core for each branch (the first case) makes the core running the smallest branch wait for a long time. Area 2 is improved because its operators are sequential and compute-intensive, and having more cores improves performance.

OPTIMIZATION OPPORTUNITY: Operators in a complex model come in different sizes and have different dependencies. Fixing each thread pool's size usually incurs synchronization overhead because of work imbalance. Thus there is an opportunity to implement a global thread pool, allowing the scheduler to determine dynamically how many threads to schedule for each operator. For example, in the traces of Figure 3.9, providing area 1 with two pools of two threads each and area 2 with one pool of four threads can lead to higher performance.
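To sketch what such dynamic scheduling could look like, the toy Python policy below sizes each operator's thread allocation by how many independent operators are currently runnable; it is a hypothetical illustration, not Caffe2's or TensorFlow's scheduler.

# A hypothetical sketch of a global thread pool whose scheduler dynamically
# decides how many threads each operator gets (policy and names are
# illustrative assumptions).
from concurrent.futures import ThreadPoolExecutor

TOTAL_CORES = 4
pool = ThreadPoolExecutor(max_workers=TOTAL_CORES)

def run_operator(op_fn, work_items, runnable_branches):
    # runnable_branches: how many independent operators are ready, e.g.,
    # several branches in area 1 of Figure 3.6, or a single operator in area 2.
    threads = max(1, TOTAL_CORES // runnable_branches)
    # Partition the operator's work into `threads` roughly equal chunks.
    chunks = [work_items[i::threads] for i in range(threads)]
    futures = [pool.submit(lambda c: [op_fn(x) for x in c], chunk)
               for chunk in chunks]
    return [r for f in futures for r in f.result()]

# Toy "operator": two runnable branches get 2 threads each; a lone
# sequential operator gets all 4 threads.
square = lambda x: x * x
print(run_operator(square, list(range(8)), runnable_branches=2))
print(run_operator(square, list(range(8)), runnable_branches=1))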

Figure 3.9: Execution traces of the three cases in Figure 3.8: (a) 4 inter-op pools with 1 MKL thread each, (b) 1 inter-op pool with 4 MKL threads, and (c) 2 inter-op pools with 2 MKL threads each. Color-coded areas 1 and 2 correspond to the operators in Figure 3.6.


Figure 3.10: Speedup from using 24 MKL threads instead of one. TensorFlow exhibits lower speedups than MKL.

3.5 Operator Design

Operators are building blocks of DL frameworks, providing basic semantics through high-level language (like Python) APIs to ease the development process for framework users. As shown in

Figure 3.1, a workload built with a DL framework involves library kernels and framework native computation. In this section, we show that efficient operator design can speed up framework native computation, and yields up to 4.2× performance improvement for real-world models.

IMPLEMENTATIONS OF FRAMEWORK NATIVE OPERATORS: We first describe framework native operators, the operators that do not use library kernels. Native operators handle control flow as well as tensor reshaping, broadcasting, and preprocessing. Some, like those for control flow or input image preprocessing, are necessary. Others can fairly be described as a framework programmability tax. The overhead stems from preparing inputs for library kernels, or computing how to parallelize a given workload in the main thread. One example of the latter kind is

Eigen::ParallelFor, used by TensorFlow.

IMPLEMENTATIONS OF COMPUTE-INTENSIVE OPERATORS: A framework operator must sometimes do more than simply pass arguments to library kernels. Data preparation is often

required, for example. Taking matrix multiplication (MatMul) as an example, we list three implementations below. Note that these implementations do not cover all possible cases, but we do find hints of them being adopted in popular frameworks, which we discuss in the experimental results. In the context of deep learning, x is an input matrix of size [batch size × number of activations] and w is a weight matrix of size [activations in current layer × activations in next layer].

We assume MatMul is the interface of a framework operator, and MATMUL is the corresponding library kernel, e.g., in MKL.

// Studied in Section 3.5.1
MatMul1(x, w):
    data_prep(x, w)
    return MATMUL(x, w)

// Studied in Section 3.5.2
MatMul2(x, w):
    // Reshape x, w into blocks bx and bw
    for bx, bw in x, w:
        threadpool.run(MatMul1, bx, bw)
    return threadpool.join_results()

// Studied in Section 3.5.3
MatMul3(x, w):
    data_prep(x, w)
    return MATMUL(wT, xT)

MatMul1 conducts data preparation and passes the matrices to the library kernel. MatMul2 uses an additional thread pool that we call the intra-operator thread pool, to distinguish it from the

MKL thread pool. The operator splits the matrices into smaller ones, and passes those small matrices to the intra-op thread pool. The intra-op thread pool then executes multiple copies of MatMul1 in parallel. This way, data_prep() of the whole matrix can be parallelized. MatMul3 transposes the matrices, which can improve performance when the two dimensions of a matrix differ greatly, e.g., batch size >> activations or vice versa, assuming the library kernel focuses on parallelizing one dimension over the other. Note that the transposing overhead can be minimized with optimizations, so we do not include the transposing computation in MatMul3. Before and after the library call, the data formatting and result gathering work is part of the programmability tax. We study the three implementations in the following subsections.

Figure 3.11: Run-time breakdown for all CPU cores. Data preparation overhead causes the poor scalability in Figure 3.10.
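As a runnable illustration of the MatMul2 and MatMul3 ideas, the sketch below uses NumPy in place of the MKL kernel; it blocks only x (not w) and omits data_prep(), so it shows the scheme under simplifying assumptions rather than any framework's actual implementation.

# A simplified sketch of MatMul2 (blocked, intra-op parallel) and MatMul3
# (transposed). NumPy stands in for the MKL GEMM kernel.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def MATMUL(x, w):
    # Stand-in for the library GEMM kernel (e.g., MKL sgemm).
    return x @ w

def matmul2(x, w, num_blocks=4, workers=4):
    # Split x along the batch dimension and run blocks on an intra-op pool,
    # so per-block preparation and compute are parallelized.
    blocks = np.array_split(x, num_blocks, axis=0)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda bx: MATMUL(bx, w), blocks))
    return np.vstack(results)

def matmul3(x, w):
    # Compute (w^T x^T)^T, which equals x w; useful if the kernel
    # parallelizes one matrix dimension better than the other.
    return MATMUL(w.T, x.T).T

if __name__ == "__main__":
    x = np.random.rand(4096, 128).astype(np.float32)  # batch >> activations
    w = np.random.rand(128, 128).astype(np.float32)
    ref = MATMUL(x, w)
    assert np.allclose(matmul2(x, w), ref, rtol=1e-4, atol=1e-3)
    assert np.allclose(matmul3(x, w), ref, rtol=1e-4, atol=1e-3)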

3.5.1 MKL Threads

By analyzing the overhead and scalability of the first operator implementation, MatMul1, we show that both the TensorFlow (TF) operator and the MKL kernel suffer from data preparation overhead, which prevents them from scaling linearly with the number of CPU cores. The results here can also apply when convolution operators are converted to MatMuls using im2col(). We use the large

platform in Table 3.1.

PERFORMANCE SCALING: Both TF and MKL have scaling issues, and TF is slightly worse.

Figure 3.10 shows the speedup of using 24 MKL threads over using one, for both TF operators and

MKL kernels. The matrices are square, so each is represented by a single dimension; the total number of floating-point operations is the cube of that number. Figure 3.10 shows that the speedup of TF is always lower than that of MKL, especially for small matrices. TF speedup is comparable to MKL when matrices are larger than 4k. The maximum speedup achievable is about 16×, which is lower than the number of cores, 24.

CAUSES OF THE POOR SCALABILITY: Our profiling methodology reveals that data preparation overhead causes suboptimal performance scaling. We pick two variants of MatMul, MatMul-512 and MatMul-4k, that operate on medium- and large-size matrices, respectively. MatMul-512 represents the fully-connected (FC) layers from YouTube51 and Facebook recommendation145,79 models, while MatMul-4k represents the FC layers in Transformer193. Figure 3.11 shows the run-time breakdown of all CPU cores running the MatMuls, using 1 and 24 MKL threads. With multiple threads, the thread tasked with lengthy TF data preparation is the main thread, labeled CPU Core 0. The latency of each MatMul workload is normalized to that of using one MKL thread.

The TF parts of Figure 3.11 show that TF's scaling issue is caused by framework overhead, due mainly to TF data preparation for MKL kernels. Using one MKL thread, MatMul-512 spends over 10% of its time in TF data preparation; using 24 MKL threads, the overhead exceeds 72%. Overhead is much lower for MatMul-4k: less than 3% in both cases. Without TF overhead, speedup can clearly be much greater. The MKL parts of Figure 3.11 show that MKL data preparation causes the scaling issue for MKL kernels. MKL kernel execution time is roughly 1/24 of the original run-time for MatMul-512. Speedup drops with time spent in MKL data preparation.

Figure 3.12: Run-time breakdown of TensorFlow workloads with 1 (left bar) and 24 (right bar) intra-op threads. Both cases use 24 MKL threads on a 24-core CPU. MatM-512 is of similar size to the FC layers in the YouTube and Facebook recommendation models; MatM-4k is of similar size to the FC layers in Transformer.

THE ROLE OF FRAMEWORK DESIGN: The Amdahl's law bottleneck of DL frameworks is non-negligible. The overhead of a MatMul of size n × n × n scales linearly with n (O(n)), while the number of floating-point operations scales cubically (O(n³)). Thus the speedup of large MatMuls (e.g., 4k) is closer to the ideal speedup. Realistically, however, the most commonly used FC layers are not always that large. Smaller ones are common in commercial workloads, including YouTube's51 recommendation model (of size 256 to 1k) and Facebook's145,79 (of size 64 to 512). Especially as hardware platforms are upgraded to higher floating-point computation capability, even larger matrices are needed to amortize the overhead. Thus it is key to focus optimization efforts on mitigating framework overhead.

3.5.2 Intra-Operator Threads

After decades of optimizing the GEMM kernel, the performance bottleneck has shifted to the overhead of using such kernels, e.g., the data preparation overhead in Figure 3.11. A natural approach to reducing the overhead in framework design is to parallelize the framework native computation, with an intra-operator thread pool implemented at the framework level, as MatMul2 does. We show that when library kernels are using FMA units, intra-op threads improve performance by utilizing other computational units on the same physical core, thereby benefiting from Intel's hyperthreading technology. We use the large platform from Table 3.1.

Figure 3.13: Run-time breakdown of all CPU cores. Intra-op threads parallelize the overhead in cores 24-47.

PERFORMANCE IMPROVEMENT: We first show how much performance improvement intra-op threads can yield and where it comes from. Figure 3.12 summarizes speedup and time breakdown when using 1 (left bar) and 24 (right bar) intra-op threads. Both cases use 24 MKL threads. MatMul-512 and MatMul-4k are the same operators as in previous sections. Using 24 intra-op threads reduces the execution time of TF native operators, while that of other parts stays similar. The speedup ranges from 1.05× (DenseNet) to 4.21× (SqueezeNet).

Workloads bottlenecked by TF native operators benefit more from intra-op threads. Such workloads, including MatMul-512 and SqueezeNet, have small to medium MatMul or convolution operations, so TF native operators are likely to consume larger fractions of computation time. For example, SqueezeNet has a small percentage of MKL computation since it is designed to have fewer parameters than AlexNet by using many small (1x1) convolution kernels101.

PROGRAMMABILITY TAX: Figure 3.12 also quantifies the framework programmability tax. We estimate the tax using the non-MKL fractions, since they are not compute-intensive and could be largely optimized away if written in high-performance language/code, as the MKL kernels are. After optimizing with intra-op threads, the programmability tax ranges from 1.3% (DenseNet) to 63% (MatMul-512). SqueezeNet (47%) is higher than ResNet-50 (26%), and MatMul-4k (11%) is slightly lower. This is the price we pay for using frameworks.

FULL-SYSTEM PROFILING: Our profiling methodology visualizes the execution of every core on the CPU platform to expose the reasons for performance improvement. Figure 3.13 shows the time breakdown for all 48 hyperthreads on the large platform. We focus on the two MatMuls, since they are the simplest workloads. Because cores 24-47 are not active when using one intra-op thread, the third bar is omitted for that case. Core 0 of each case is the same as in Figure 3.12.

Figure 3.13 shows that with 24 intra-op threads, TF data preparation is distributed to cores 24 through 47. (The bottom of the third bar for MatMul-512 with 24 intra-op threads shows a tiny TF data preparation cost.) That shortens TF data preparation time in core 0, so that the TF barrier time of cores 1 to 23 is shortened. With only one intra-op thread, cores 1 to 23 spend about 60% (MatMul-512) and 40% (MatMul-4k) of time waiting in barrier, which is not optimal for using computing resources.

HYPERTHREADING: Using intra-op threads takes advantage of Intel's hyperthreading technology by colocating an intra-op thread and an MKL thread on the same physical core. Since they need different hardware resources, they can execute in parallel without contention. The critical path is the MKL thread. The intra-op thread adds no execution time to the overall workload. For example, in Figure 3.13, logical cores 0 and 24 are on the same physical core. Core 0 executes mostly MKL floating-point operations with FMA units, which core 24 does not need. Even without hyperthreading, the implementation of intra-op threads improves performance through parallelization. In that case, the physical core's critical path combines the intra-op thread with the MKL thread.

Figure 3.14: (a) FLOPS of TensorFlow compiled with MKL (TF-MKL); FLOPS increases faster with larger input size. (b) Speedup of the Intel version of TensorFlow (TF-Intel) over TF-MKL; operators with large batch sizes get higher speedups.

3.5.3 Transposing Matrices

This section shows that if the performance of an operator implementation is more sensitive to one matrix dimension than another, transposing the matrices can sometimes lead to better performance. To simplify the problem, we consider two dimensions of a MatMul operator: input size and batch size. We studied MatMuls with input size = batch size in the previous section. This section uses the small platform in Table 3.1 to avoid the scalability overhead and to focus on the matrix dimensions.

We study the performance of TF-MKL and TF-Intel. TF-MKL is TensorFlow compiled with MKL. TF-Intel additionally applies a set of Intel-specific optimizations, such as replacing TensorFlow operations with Intel-optimized ones, eliminating unnecessary data layout conversions, and operation fusion6. The Intel documentation also mentions that matrix transposing may improve performance, but the details are not disclosed in the vendor's proprietary library. The experimental results here show that, for MatMuls of certain sizes, TF-Intel likely adopts the third MatMul implementation discussed at the beginning of Section 3.5.

Figure 3.14 studies the performance sensitivity of TF-MKL and TF-Intel to input size and batch size. Figure 3.14a shows the FLOPS of TF-MKL MatMuls with different batch and input sizes. The FLOPS increases with larger batch and input sizes, and the highest FLOPS is achieved by the MatMul at the upper right corner. The FLOPS increases faster with increasing input size, as the color shift is more pronounced from left to right than from bottom to top. The fact that TF-MKL performance is more sensitive to input size exposes optimization opportunities for workloads with large batch sizes and small input sizes.

To compare TF-MKL with TF-Intel, each cell of Figure 3.14b shows the speedup of TF-Intel over TF-MKL on the same MatMul operator. The speedup is larger than 1 for MatMul input sizes less than 1k. The speedup increases with batch size for workloads with small input sizes, which is likely due to TF-Intel transposing the matrices of such workloads.

The highest speedup (1.5×) is achieved by a MatMul with a very small input size (128) and a very large batch size (≥4k). The input size (activations) of the Facebook and YouTube recommendation models ranges from 64 to 1k, with one exceptional layer of 5k51,145,79. Using TF-Intel, the training performance of the smaller layers can be significantly improved.

Figure 3.15: Comparison of three matrix multiplication libraries. (a) Cycle breakdown (bottom axis) and IPC (top axis) for matrices of various sizes. MKL has the highest retiring ratio and IPC because it is the least back-end bound. (b) MKL has the lowest LLC misses per kilo instructions (MPKI). (c) Memory bandwidth consumption. MKL's prefetching is the most effective: almost all of its memory traffic is prefetching.

3.6 Library Choice

Thanks to mathematics and thread pool libraries, deep learning framework developers do not have to implement every basic function from scratch. However, there are many choices for the same functionality, and the choice of libraries can have a large impact on performance. Thus, to study frameworks, we need to understand the performance differences among existing libraries and the reasons why they differ. In this section we study libraries for machine learning and thread pools. We show that optimization can improve a GEMM kernel's performance by up to 25%, owing to more efficient data prefetching. We also compare three popular thread pool libraries. We find that robust thread pools such as Eigen and Folly are better able to keep production-critical workloads running with little variation, so investing in sophisticated implementations is worthwhile for service providers.

77 3.6.1 Machine Learning Library

SETUP: We compare MKL, MKL-DNN, and Eigen with GEMM (general matrix multiplication) microbenchmarks on the small platform (Table 3.1) to expose architectural bottlenecks while minimizing the scaling overhead on large systems. The microbenchmarks perform GEMM with matrices of different sizes. GEMM is the most frequently used machine learning kernel: it supports all vision models, all deep recommendation models, and many sequence-to-sequence models, and it consumes 42% of machine learning cycles in Facebook datacenters150.

TOP-LEVEL ANALYSIS: We conduct top-down analysis213 for single-threaded GEMM kernels with a variety of matrix sizes. Figure 3.15a shows the cycle breakdown (stacked bars) and IPC (dots).

The three bars shown for each matrix size are for Eigen, MKL-DNN, and MKL, from top to bottom. Cycles are broken down into retiring and stalling due to bad-speculation, front-end, and back-end bottlenecks. By definition, instructions per cycle (IPC) is proportional to the retiring cycles.

Floating-point operations per second, which is not shown here, is consistent with IPC and the retiring ratio because the three libraries have comparable numbers of dynamic instructions. For GEMM, MKL performs the best, followed by MKL-DNN. The performance difference is larger for large matrices (about 0.7 IPC), and it is minimal for matrices with sizes less than 2k. (That is why we use the small platform. Large platforms are not easily stressed, even with matrices of size 32k.) With matrices larger than 4k, about 25% of cycles are back-end bound for Eigen and MKL-DNN.

CAUSES: The back-end bottleneck is caused by last-level-cache (LLC) misses, shown as LLC misses per thousand instructions (MPKI) in Figure 3.15b. Eigen and MKL-DNN have much higher LLC

MPKI than MKL. Owing to the latency of moving data from main memory, the higher LLC miss rate reduces workload performance. The LLC miss rate difference is caused by the aggressiveness and effectiveness of data prefetching, shown by the memory traffic in Figure 3.15c, where the right ends of the bars show memory traffic incurred by LLC misses. MKL's memory traffic is close to that of MKL-DNN, and its much lower LLC miss rate shows that MKL's software prefetching is more effective. Eigen's memory bandwidth is the lowest, and its LLC miss rate is the highest, indicating that Eigen does not prefetch as aggressively as the others. Still, it is impressive that a portable lightweight library like Eigen can perform about as well as MKL for medium and small matrices.

Figure 3.16: FLOPS of MKL matrix multiplication, varying the three dimensions m, k, and n. With m mapped to the batch size dimension, (a) m = 64 represents typical inference workloads and (b) m = 1024 represents typical training workloads. T: Transformer; Y: YouTube recommendation model; F: Facebook recommendation model.

OTHER KERNELS: We compare the GEMM kernels to demonstrate how and why kernel performance can vary. For other kernels, MKL-DNN likely outperforms the other libraries, because it is a library targeting deep learning with DL-specific optimizations, including operation fusion, folding, and filter caching for convolutions31. For GEMM itself, MKL-DNN includes MKL as a submodule to use its GEMM kernels, which outperform the other libraries as shown in Figure 3.15. As a result, MKL-DNN reportedly outperforms other frameworks31, so more frameworks have started to support MKL-DNN.

MKL PERFORMANCE SENSITIVITY: A matrix multiplication kernel takes two matrices of sizes

m × k and k × n as inputs, and outputs a matrix of size m × n. Figure 3.15 studies the performance of square matrices with m = k = n. Here we show the FLOPS of MKL running non-square matrices in Figure 3.16, by varying the three dimension sizes. Intuitively, we assume m is the batch size dimension in deep learning. Figure 3.16a shows batch size = 64, representing typical inference workloads150,79; Figure 3.16b shows batch size = 1024, representing typical training workloads193,145.

The total number of floating point operations of a matrix multiplication is m × k × n, and the computation is embarrassingly parallel. Intuitively, increasing the size of any dimension should increase the FLOPS. However, Figure 3.16 shows that the performance of MKL does not increase monotonically with matrix size. This is likely caused by different prefetching or matrix blocking mechanisms being chosen dynamically based on matrix size. In particular, the heatmap in Figure 3.16b appears to split into four quadrants around its middle point, suggesting four different optimization choices. We also annotate the matrix sizes of Transformer193 and of YouTube's51 and Facebook's145,79 recommendation models. These realistic matrix sizes fall in neither the best- nor the worst-performing spots of MKL.

3.6.2 Thread Pool Library

We compare three thread pools: one simple implementation using std::thread,† and the thread pools in Eigen63 and Folly13. The Eigen thread pool is used in TensorFlow63. Folly is a C++ library open-sourced by Facebook13. Our microbenchmark creates a thread pool and starts 10k tasks that increment the value of a globally shared variable. We benchmark in two scenarios: (1) setting the number of threads to be the same as the number of physical cores, and (2) using many more threads than the number of cores. We use the small platform from Table 3.1.
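For reference, a minimal Python analogue of this microbenchmark is sketched below; the study itself benchmarks C++ pools (std::thread, Eigen, Folly), so this version only illustrates the structure of the measurement, not the pools under test.

# A minimal analogue of the thread pool microbenchmark: 10k tiny tasks that
# increment a shared counter, timed for two pool sizes.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_TASKS = 10_000
counter = 0
lock = threading.Lock()

def task():
    global counter
    with lock:                  # the shared-variable increment is the "work"
        counter += 1

def bench(num_threads):
    global counter
    counter = 0
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(task) for _ in range(NUM_TASKS)]
        for f in futures:
            f.result()
    return time.perf_counter() - start

for n in (4, 64):               # pool size = cores, and heavily oversubscribed
    print(f"{n} threads: {bench(n):.3f} s, counter = {counter}")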

Figure 3.17 compares the overall latency of running 10k micro tasks. In both cases, Folly outperforms Eigen and Eigen outperforms std::thread. When thread pool size is four, Folly outperforms Eigen by 16%, and Eigen outperforms std::thread by 24%. When thread pool size is 64, greatly exceeding the number of CPU cores, Folly and Eigen perform consistently well as with four threads, and oversubscribing the system does not drastically increase the overhead. But the overhead of std::thread increases by over 3×, with every CPU core spending about 60% of its time in synchronization.

†Std::thread is compiled to pthread on POSIX platforms.

Figure 3.17: Thread pool overhead, measured as time to run 10k micro tasks. Folly outperforms std::thread and Eigen.

Figure 3.18: A two-socket platform speeds up ResNet-50 by 1.43×. The bottleneck is the UPI bandwidth, which increases TF data preparation time as part of the TF native operator time.

3.7 Beyond One Socket

In previous sections we have explored the framework designs on one-socket CPUs. In this section, we study how those design features can be applied to scale out the workloads beyond one socket.

Unsurprisingly, the bottleneck of a two-socket system is the UPI bandwidth between sockets. We study scaling one operator to two sockets and scheduling multiple operators to different sockets, as scaling-out versions of intra- and inter-op parallelism studies. The experiments are conducted on the large and large.2 platforms from Table 3.1.

3.7.1 Data Parallelism

We study data parallelism by setting the numbers of intra-op and MKL threads to the total number of physical cores and the number of inter-op threads to one.

RESNET PERFORMANCE: Figure 3.18 shows the execution time breakdown of ResNet-50 running on one- and two-socket platforms. The latter speeds up ResNet by 1.43×, less than the two-fold hardware increase. The bottleneck is that UPI traffic peaks at 91.4 GB/s, compared to the theoretical maximum of 120 GB/s. UPI saturation increases the latency of TF native operators on a two-socket platform, which now includes both data preparation and transfer time between sockets.

MATMUL PERFORMANCE: To test the limit of the large.2 platform, we conduct microbenchmarking using TensorFlow MatMul operators. Similarly, the UPI bandwidth is the bottleneck for large MatMuls. Figure 3.19 shows two-socket speedup and corresponding UPI bandwidth consumption. The speedup and UPI throughput increase with larger MatMul sizes, and peak for MatMul-8k. For MatMul-16k, speedup decreases and bandwidth saturates, indicating empirically that the maximum UPI bandwidth is around 100 GB/s for such workloads.

The speedup of a workload is determined by its intrinsic parallelism and UPI bandwidth saturation. Figure 3.20 shows the time breakdown of MatMuls running on one- and two-socket platforms. For medium MatMul sizes like 512, the poor scalability is caused by the limited parallelism of the workload, which cannot hide data preparation overhead. For larger MatMuls, 4k and 8k, the data preparation time of both TF and MKL increases on the two-socket platform because of UPI saturation. MatMul-8k has the best balance of intrinsic parallelism and UPI throughput. It leads to the highest speedup (1.8×, i.e., 44% less execution time than with just one socket), which is close to perfect scaling.

3.7.2 Model Parallelism

We study model parallelism of a two-socket platform by using two inter-op pools, one per CPU socket. Model parallelism improves performance significantly when the parallel operators are on critical paths and have similar sizes, as with multiple embedding operators in neural collaborative filtering (NCF). Performance and model parallelism mechanisms will be discussed in the next section.

Model parallelism does not always improve performance. One example is the inter-op parallelism from training workloads, as in Section 3.4. Assigning gradient and weight sum operators one socket each causes workload imbalance between two sockets when batch size is large. Two-socket platforms are not beneficial when the intra-op thread pool is not implemented at the framework level, since the workload bottleneck is single-threaded operators.

Figure 3.19: (a) Speedup of a two-socket platform over one socket. (b) Measured peak UPI bandwidth consumption on the two-socket platform is close to 100 GB/s.

Figure 3.20: Run-time breakdown of all CPU cores. MatMul-512's parallelism cannot hide data preparation time, blunting its scalability. UPI saturation limits the scalability of MatMul-4k and 8k.

3.8 Framework Design Tuning

At the outset, we identified five design features: scheduling mechanism, operator implementation, math library, thread pool library, and the parallelism mechanism for platforms larger than one socket. Our analysis has shown that the most effective setting for users to determine is the proper number of inter-op pools based on the model architecture. Intra-op parallelism configurations follow from that setting.

DEFINITIONS: To determine the number of inter-op pools for a model, we need its average width, which quantifies its inter-op parallelism. (Section 3.4.1 and Figure 3.4 mention the maximum width.)

The average width of a model is the floor of the ratio of the total number of (heavy) operators divided by the maximum number of layers. A heavy operator is a compute-intensive or embedding operator that usually takes significantly longer execution time than other operators. Examples are the Conv operators in Figure 3.6, as opposed to lightweight math operators, which are not considered. The average model width of Figure 3.6b is ⌊7/3⌋ = 2.

GUIDELINES: The number of inter-op pools (p) is chosen to be the average model width. After p is chosen, we choose the numbers of MKL and intra-op threads such that the entire system is split into p partitions without redundant threads (Section 3.4.2 and Figure 3.7). Therefore, the number of MKL threads and the number of intra-op threads for each thread pool should be equal to the total number of physical cores on the system divided by p. That way one MKL thread and one intra-op thread can share the same physical core. MKL threads can use the FMA units and the intra-op threads can use other units via hyperthreading (Section 3.5.2 and Figure 3.13).
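The guidelines can be written down mechanically; the sketch below is an illustrative helper under our own naming (it is not the open-sourced plugin mentioned later), assuming the heavy-operator count and the maximum layer count are known for the model.

# An illustrative helper implementing the guidelines above.
def recommend_config(num_heavy_ops, max_layers, physical_cores):
    # Average model width = floor(heavy operators / maximum layers);
    # use it as the number of inter-op pools.
    pools = max(1, num_heavy_ops // max_layers)
    # Split the system into `pools` partitions: each pool gets
    # physical_cores / pools MKL threads and the same number of intra-op
    # threads, which share physical cores via hyperthreading.
    per_pool = physical_cores // pools
    return {"inter_op_pools": pools,
            "mkl_threads": per_pool,
            "intra_op_threads": per_pool}

# Example: Figure 3.6b has 7 heavy operators across at most 3 layers, so on a
# 48-core system the recommendation is 2 pools with 24 threads of each kind.
print(recommend_config(7, 3, 48))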

Model:          Dense  Squeeze  ResNet  IncepV3  W/D  NCF  Trans
Average width:  1      1        1       2        3    4    4

Table 3.2: Average model width, i.e., the number of pools selected for Figure 3.21 based on our guidelines. Intra-op and MKL threads = total physical cores divided by those numbers.

Figure 3.21: Performance using the recommended TensorFlow settings75 (baseline), the Intel blog settings20, our work, and the global optimum obtained by exhaustive search. TensorFlow: intra-op/MKL threads = physical cores; inter-op pools = sockets. Intel: intra-op/MKL threads = physical cores per socket; inter-op pools = sockets. Our work outperforms the Intel and TensorFlow suggestions and nearly closes the gap between the state of the art and the global optimum.

EVALUATION SETUP: We integrate our guidelines with TensorFlow v1.13. (TensorFlow refers to inter-op pools as inter-op parallelism threads.) We will open source the TensorFlow plugin to the public, to let users apply our guidelines automatically. The size of the design space encompassing the numbers of MKL, intra-op, and inter-op threads is the cube of the number of logical cores. For the large.2 system, that means 96³ = 884,736 design points. Our guidelines suggest picking only one of those 884,736 possibilities.
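As a concrete illustration (a hedged sketch, not the plugin itself), these knobs are typically applied in TensorFlow v1.x as below; the values correspond to a width-3 model such as wide-and-deep on the 48-core large.2 system, and OMP_NUM_THREADS is the usual way to control the MKL (OpenMP) thread count.

# A hedged sketch of applying the guideline settings in TensorFlow v1.x:
# 3 inter-op pools, 16 MKL threads, 16 intra-op threads.
import os
import tensorflow as tf

os.environ["OMP_NUM_THREADS"] = "16"   # MKL threads; set before TF/MKL initialize
config = tf.ConfigProto(
    intra_op_parallelism_threads=16,   # intra-op thread pool size
    inter_op_parallelism_threads=3)    # inter-op pools (average model width)
sess = tf.Session(config=config)
# ... build and run the model with `sess` as usual ...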

We evaluate our guidelines by applying the rules to a fresh set of workloads and by performing the evaluation on a different platform from that used to develop the guidelines. Our analysis uses microbenchmarks and vision models that run with images; for evaluation, we add Inception v3182,

the wide-deep recommendation model47, the neural collaborative filtering model (NCF)89, and Transformer193, covering recommendation and translation workloads. The bulk of our analysis uses platforms small and large from Table 3.1; here we evaluate the guidelines on the large.2 platform, the largest AWS bare metal instance.

SPEEDUP: Figure 3.21 summarizes the speedup of this chapter over TensorFlow75 (baseline) and

Intel20 recommended settings. It also compares our settings’ performance to the global optimum obtained by exhaustively sweeping the design space. TensorFlow suggests setting the number of

MKL and intra-op threads to the physical core count, and inter-op pools to the socket count. Intel suggests setting MKL and intra-op threads to the number of physical cores per socket, and inter-op pools to the socket count. Our analysis shows that TensorFlow suggests more threads than needed, and Intel’s setting is suitable for models with an average width of two.

Overall, our performance guidelines perform consistently better than the settings recommended by Intel and TensorFlow. Our method bridges the performance gap between those state-of-the-art settings and the global optimum for all evaluated workloads except Inception inference and SqueezeNet training. In those two cases, our guidelines achieve 95% of the performance offered by the globally optimal setting. On average, this chapter achieves the same performance as the global optimum, and 1.34× and 1.29× better performance than TensorFlow's and Intel's suggestions, respectively.

We summarize the average model width in Table 3.2. It is the same as the number of inter-op pools in use. The models shown have average widths between one and four, which is diverse. The numbers of MKL and intra-op threads for each model are the total number of physical cores (48) divided by the model width. For example, the setting for the W/D (wide and deep) model is 3 inter-op pools, 16 MKL threads, and 16 intra-op threads, which is also the globally optimal setting. The difference in speedup across models can be observed from the difference between our settings and the baseline settings. We are able to get higher speedups for models with embedding tables, including wide-deep, NCF, and Transformer. This is because those models are less compute-intensive and have abundant inter-op parallelism, so smart tuning becomes especially important. On the other hand, most vision models do not have much inter-op parallelism, which makes our settings similar to the baseline settings. Thus such models have relatively lower speedups.

The performance guides from Intel and TensorFlow are general, aiming to make it easy for most users to get reasonable performance, and they perform reasonably well. For vision models, TensorFlow's settings perform as well as the global optima and our guidelines, while Intel's do not perform well for the vision models except Inception, because Intel's setting favors models with inter-op parallelism that the other vision models do not have. Intel's settings perform better than TensorFlow's for recommendation and translation models, which have several parallel embedding operators and thus an average width of no less than two. The default TensorFlow setting performs much worse than both the Intel and TensorFlow recommendations: TensorFlow naively sets all parameters (MKL threads, intra-op threads, and inter-op pools) to the number of logical cores. As pointed out in our earlier analysis, this is sub-optimal. Thus TensorFlow users who run only one model and one session at a time should at least set the number of inter-op pools to one instead of using the default setting.

3.9 Summary

We presented detailed evaluation and analysis of key design features and the role of parallelism in a machine learning framework, focusing on scheduling, operator implementation, and library back ends. To maximize parallelism, we proposed simple guidelines for tuning framework parameters, distilled from detailed domain-specific design feature knowledge and analysis. We demonstrated the usability and the additional performance improvement of this approach by integrating and evaluating our methodology with TensorFlow. On average, our method outperformed the suggested settings from the Intel and TensorFlow performance guides by 1.29× and 1.34×, respectively, across a set of real-world DL models.

Science advances based on data, not hunches.

David Brooks

4 A Systematic Methodology for Performance Analysis

Training deep learning models is compute-intensive, and there is an industry-wide trend toward improving its performance. To systematically compare deep learning systems, we introduce a methodology composed of a set of analysis techniques and parameterized models. It can be applied to analyze various hardware and software systems, and it is a strong complement to traditional methods.

4.1 Introduction

With the end of Moore’s law, academic and industrial research efforts have shifted from general- purpose processors to domain specific architectures (DSAs)91. Deep learning, which has revolu- tionized many application domains174,100,16,206, is a promising field for DSAs56. New customized training hardware, software stacks, and optimization tools are being developed to support ever more sophisticated deep learning models. Thus there is a great need to concurrently develop a systematic and scientific methodology for comprehensive performance analysis of hardware and software sys- tems customized for deep learning.

The rapid evolution of deep learning models and corresponding hardware and software plat- forms requires new analysis techniques that go beyond simply running today’s well-known deep learning models on individual platforms. A systematic methodology must expose interactions be- tween hardware and software platforms across the spectrum of model attributes (e.g., hyperparame- ters), so that the resulting insights can be applied to future models. The methodology itself needs a fast development cycle to rapidly target new platforms, and it should include large enough models to stress the limits of emerging platforms.

Recent analysis efforts have been limited to relatively small collections of seemingly arbitrary

DNN models151,15,43,183. The development of such suites is very time-consuming. It took half a year to release MLPerf v0.6, and months to add a new model. Even so, the shelf life of such models is sel- dom more than a couple of years. Moreover, using a collection of individual models such as ResNet-

5087 and Transformer193 can lead to misleading conclusions. For example, Transformer is a large FC model that trains 3.5× faster on the Tensor Processing Unit (TPU) than on a GPU, yet focusing on this single model would not reveal the severe TPU memory bandwidth bottleneck that arises with

FCs larger than 4k nodes.

We propose a comprehensive performance evaluation methodology that combines parameterized

deep learning benchmarks with systematic analysis techniques. We introduce ParaDnn, a tool that generates thousands of parameterized multi-layer models, including fully-connected models, convolutional neural networks, and recurrent neural networks, with model parameter sizes that vary by almost five orders of magnitude, far beyond the range of existing benchmarks. Systematic analysis techniques then learn the sensitivity of performance to model hyperparameters and explore various dimensions of the design space. We show that this parameterized analysis methodology complements the use of real-world workloads (e.g., MLPerf), leading to insights that traditional approaches either cannot expose or cannot fully explain.

We conduct case studies in three diverse performance evaluation scenarios: homogeneous plat- forms, heterogeneous platforms, and software stacks. We hope to motivate researchers to apply our methodology to other platforms. In Section 4.4, we analyze and compare two generations of homo- geneous specialized platforms, TPU v2 and v3. Our methodology provides insights for designing and upgrading ML accelerators in production-scale systems. In Section 4.5, we perform cross com- parison of three architectures (CPU, GPU, and TPU) that span the continuum between general purpose processors and specialized accelerators, and the methodology reveals individual strengths and weaknesses of each platform. In Section 4.6, we explore the performance evolution of special- ized software stacks, TensorFlow and CUDA.

It is important to identify limitations of the study. This chapter presents a benchmarking method- ology, which is able to reveal optimization opportunities in current architecture and system designs, as they provide valuable lessons for future design. Optimization details are beyond its scope. The analysis focuses on training and not inference. We do not study accuracy, the performance of multi-

GPU platforms or 256-node TPU systems, which may lead to different conclusions. We intention- ally leave these topics to future work, as each deserves in-depth study. Section 4.7 discusses these and other limitations of the study, which also motivate future work.

4.2 Methodology

Current deep learning (DL) performance analysis methods have limitations in terms of the insights they are able to reveal. They often leverage two distinct types of benchmark suites: real-world suites such as MLPerf151, Fathom15, BenchNN43, and BenchIP183, and micro-benchmark suites, such as

DeepBench163 and BenchIP. Each real-world suite contains a handful of popular DL models span- ning a variety of model architectures. Such suites have a long development cycle. Their shelf-life is unknown since they only contain today’s deep learning models, which may become obsolete as DL models evolve rapidly. Further, they fail to reveal deep insights into interactions between DL model attributes and hardware performance, since the benchmarks are sparse points in the vast space of deep learning models. Micro-benchmark suites exercise basic operations (e.g., matrix multiplication or convolution) in neural networks, but they cannot simulate complex dependencies between differ- ent operations in end-to-end models.

To complement existing performance analysis methods, we introduce a systematic methodology, composed of a tool, ParaDnn, and a set of analysis methods. ParaDnn has the advantages of the above approaches, with the goal of providing large “end-to-end” models and parameterizing the models to explore a much larger design space of DNN model attributes. ParaDnn supports future performance analysis by covering current and possible future applications, and conveying analysis results with model hyperparameters such that the analysis can be extrapolated to future models. In this chapter, we will show that our methodology can stress the upper and lower bounds of hardware and software systems in various dimensions, including floating-point computation capability, mem- ory bandwidth, inter-chip bandwidth, and host-device balance. For cross-platform comparisons, the methodology can also discover cases favoring one platform over another and describe the DL hyper- parameters of such cases. With ParaDnn, such studies can be conducted on new platforms compre- hensively, quickly, and conveniently. The utility of parameterized analysis is not limited to those

cases, and one goal of this chapter is to motivate application of our methodology to new platforms.

4.2.1 ParaDnn

We first introduce ParaDnn, a tool that generates parameterized end-to-end models to run on target platforms. ParaDnn creates models encompassing fully-connected models (FC), convolutional neural networks (CNN), and recurrent neural networks (RNN). The models are parameterizable, so ParaDnn models are equal to or greater in size than today's real-world models. For example, a single end-to-end CNN model from ParaDnn contains a mixture of many different layers with different sizes of convolution, batch normalization, pooling, and FC layers. The complexity of

ParaDnn workloads is comparable to that of real-world models (e.g., ResNet-50 and Transformer), as will be shown in Figure 4.1. Insights about hardware performance sensitivity to model attributes allow interpolating and extrapolating to future models of interest. These insights could not be discovered with either the small point-space exploration of the real-world suites or microbenchmarks, which do not capture inter-operation dependencies as ParaDnn does. The model types of ParaDnn cover 95% of Google's TPU workloads111, all of Facebook's deep learning models86,79,145, and eight out of nine MLPerf models151, with the exception of minigo, the reinforcement learning model.

FULLY-CONNECTED MODELS: FC models comprise multiple fully-connected layers. The architecture is

Input → [Layer[Node]] → Output,

where [Layer] means the number of layers is variable. We can sweep the number of layers, the number of nodes per layer, and the numbers of input and output units of the datasets.
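To make the parameterization concrete, a minimal sketch of how such an FC model could be generated with TensorFlow/Keras follows; the function name and defaults are illustrative, not the released ParaDnn implementation.

# A minimal, illustrative generator for ParaDnn-style FC models
# (assumes TensorFlow 2.x / Keras).
import tensorflow as tf

def make_fc_model(num_layers, nodes_per_layer, input_units, output_units):
    # Input -> [Layer[Node]] -> Output, with depth and width as parameters.
    layers = [tf.keras.layers.InputLayer(input_shape=(input_units,))]
    layers += [tf.keras.layers.Dense(nodes_per_layer, activation="relu")
               for _ in range(num_layers)]
    layers.append(tf.keras.layers.Dense(output_units, activation="softmax"))
    return tf.keras.Sequential(layers)

# One point in the Table 4.1 sweep: 8 layers of 1024 nodes,
# 2000 input units and 200 output units.
model = make_fc_model(num_layers=8, nodes_per_layer=1024,
                      input_units=2000, output_units=200)
print(model.count_params())   # trainable-parameter count, cf. Figure 4.1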

Variable    Layer   Nodes   Input    Output   Batch Size
Min         4       32      2000     200      64
Max         128     8192    8000     1000     16384
Inc         ×2      ×2      +2000    +200     ×2
(a) Fully Connected Models

Variable    Block   Filter   Image   Output   Batch Size
Min         1       16       200     500      64
Max         8       64       300     1500     1024
Inc         +1      ×2       +50     +500     ×2
(b) Conv. Neural Nets: Residual and Bottleneck Blocks

Variable    Layer   Embed   Length   Vocab   Batch Size
Min         1       100     10       2       16
Max         13      900     90       1024    1024
Inc         +4      +400    +40      ×4      ×4
(c) Recurrent Neural Networks: RNN, LSTM, GRU

Table 4.1: The ranges of the hyperparameters and dataset variables (italic) chosen in this chapter.

CONVOLUTIONAL NEURAL NETWORKS: CNN models are residual networks. The architecture of ParaDnn CNNs is

Input → [Residual/Bottleneck Block] × 4 → FC → Output.

A residual network contains four groups of blocks87. Each can be a residual block or a bottleneck block, followed by a fully-connected layer. Residual blocks have two convolutional layers and two batch normalization layers, while bottleneck blocks have three of each. ParaDnn treats the minimum number of filters as a variable, and it doubles in every group. An input image is square with three channels, represented by its length.

RECURRENT NEURAL NETWORKS: RNNs are comprised of multiple layers of basic RNN,

LSTM, or GRU cells:

Input → [RNN/LSTM/GRU Cell] → Output.

96 Each token of the input sequence is embedded within a fixed length vector, and the length of the vector is the embedding size. We sweep the number of layers and the embedding size. The variables in the dataset include the maximum length per input sequence and vocabulary size.

RANGE OF HYPERPARAMETERS AND DATASETS: We choose the range of hyperparameters and datasets to cover the real models (Section 4.2.2), and make sure the design space is tractable.

Table 4.1 summarizes how hyperparameters are swept. We focus on large batches, and extremely small batches may lead to different conclusions. By default, this chapter uses CNNs with bottleneck blocks and basic RNNs.

4.2.2 Real-World Models

In addition to ParaDnn, we study six real-world models. We show that ParaDnn and those mod- els are complementary—ParaDnn explores a larger design space, and real models represent several currently popular design points.

This chapter focuses on TensorFlow, the native framework for TPU. We include two of the three workloads in TensorFlow from MLPerf151, i.e., Transformer193 and ResNet-5087. We also select other real-world workloads162, including RetinaNet135, DenseNet100, MobileNet99, and SqueezeNet101.

We refer to them as real workloads/models.

Figure 4.1 shows the numbers of trainable parameters across all workloads to quantify the sizes of the models. The ParaDnn workloads are shown as ranges and the real workloads as dots. ParaDnn covers a large range of models, from 10k to nearly a billion parameters. Transformer is the largest real FC, and RetinaNet is the largest real CNN. The small models, SqueezeNet and MobileNet, are typical of models targeting mobile applications.


Figure 4.1: The numbers of trainable parameters for all models.

4.2.3 Analysis Methods

ParaDnn enables a set of analysis methods that can quantify, compare, and visualize the DL design space in various dimensions. We apply those methods after running all ParaDnn workloads on the platforms under study to collect performance metrics of interest. All analysis methods and results distinguish ParaDnn from suites of individual models, because the real-world suites do not support sensitivity analysis, and ParaDnn covers a much larger design space.

HEAT MAP: With ParaDnn, we can measure performance sensitivity to hyperparameters. Heat maps are an intuitive approach. A heat map uses colors to show how a performance metric of inter- est responds to model hyperparameters (on x- and y-axes). The rate of color change across the map reflects the sensitivity of performance to hyperparameters.

QUANTIFICATION WITH LINEAR REGRESSION: Table 4.1 shows five hyperparameters of each model type under study, and a heat map can only visualize two hyperparameters. Also, observing sensitivity via heat maps is more qualitative than quantitative. We propose to use linear regression (LR) to quantify the sensitivity. We train an LR model using hyperparameters to predict performance, and use the weights of the hyperparameters as a measure of sensitivity. Other metrics, including T- and F-tests, may be used for this purpose96, but they only report positive values of importance. LR reports the signs of the weights, indicating positive or negative correlations. Note that this LR model is not for prediction. Section 4.4.1 presents a detailed example.

ROOFLINE MODEL: Roofline models are useful to study memory and computation bottlenecks204,111. A roofline represents the upper bound of floating-point operations per second (FLOPS) for workloads with different arithmetic intensity. Roofline model analysis shows that ParaDnn's models range from extremely bandwidth-bound to compute-bound. Such a range is hard to achieve with existing real models, especially to reach the limits of TPUs. Section 4.4.2 presents a detailed introduction.
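As a reference for the discussion that follows, the standard roofline bound can be written as below, where I denotes arithmetic intensity (FLOPs per byte of memory traffic); plugging in the TPU v2 numbers from Table 4.2 (180 TFLOPS peak, 2400 GB/s), the ridge point sits at 180e12 / 2400e9 = 75 FLOPs per byte, so workloads with lower arithmetic intensity are bandwidth-bound.

\[
\text{Attainable FLOPS} = \min\left(\text{Peak FLOPS},\; \text{Memory Bandwidth} \times I\right)
\]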

OTHER DESIGN SPACE EXPLORATION METHODS: Roofline models study the design space of FLOPS and arithmetic intensity. We study the design space in other dimensions as well, by vi- sualizing ParaDnn and real-world models on scatter plots with various x- and y-axes. Section 4.4.4 studies FLOPS and data infeed time. Section 4.5 studies model size and speedup.

BOX PLOTS: We use box plots to summarize the performance of each ParaDnn model type. Box plots show that the performance of a ParaDnn model type spans a large range, and they highlight the risk of overly optimizing hardware and software systems for certain models.

4.3 Hardware Platforms

Our selection of hardware reflects the latest configurations widely available in cloud platforms at paper submission time. Platform specifications are summarized in Table 4.2.

CPU PLATFORM: The CPU is an n1-standard-32 instance from Google Cloud Platform with the Skylake architecture. It has 16 cores and 32 threads. It has large memory (120 GB) and the lowest peak FLOPS (2 TFLOPS) among the three platforms. GeekBench 4 produced the bandwidth measurement.

GPU PLATFORM: The GPU is an NVIDIA V100 in a DGX-1 GPU platform that contains 8 V100 packages (SXM2) connected via a 300 GB/s NVLink 2.0 interconnect. We currently measure the performance of a single SXM2 node. One node has 16 GB of memory and 900 GB/s memory bandwidth. A V100 has 640 tensor cores and is able to run mixed precision training using float16 to compute and float32 to accumulate, making its peak performance 125 TFLOPS.

TPU PLATFORM: We use Cloud TPU v2 instances to which we were given academic access in

February 2018. Each TPU board contains four TPU packages (the default configuration)54. One package contains 2 cores and one core has one matrix unit (MXU). A Cloud TPU v2 platform sup- ports 180 TFLOPS at peak. Memory size is 8 GB per core, or 64 GB per board, with 2400 GB/s overall memory bandwidth. TPU v2 supports mixed precision training using bfloat16 and float32.

TPU v3 has twice the number of MXUs and twice the HBM capacity per core of v274. Its memory bandwidth has not been disclosed, but empirical results show that it has increased by 1.5×. TPU v3 has a peak of 420 TFLOPS, 2.3× greater than v2.

This is the first research paper to study TPU v2/v3, which support training, while TPU v1 only runs inference111. To enable training, TPU v2 supports operations beyond matrix multiplication, such as gradient and various optimizer operations. It also puts more pressure on the memory system, since weights are accessed again in the backward pass. Also, TPU v2 has scalar/vector units, which do not exist in v1. TPU v2 has MXUs of size 128×128 with 32- or 16-bit data types; v1 has MXUs of size 256×256 with 8-bit data types.

Platform   Unit        Version   Mem (GB)   Mem Bdw (GB/s)   Peak FLOPS
CPU        1 VM        Skylake   120        16.6             2T SP†
GPU        1 Pkg       V100      16         900              125T
TPU        1 Board     v2        64         2400             180T
TPUv3      (8 cores)   v3        128        3600*            420T
† 2 FMA × 32 single-precision × 16 cores × 2 GHz = 2 SP TFLOPS
* Estimated based on empirical results (Section 4.4.5).

Table 4.2: Hardware platforms under study.

UNDERSTANDING TPU MEMORY SIZE: The TPU implements data parallelism by splitting each batch of training data evenly among the 8 cores. Every TPU core keeps a whole copy of the model. Therefore memory size per core determines the maximum model supported (Sec 4.5.1), while total memory determines the maximum batch size (Sec 4.5.2).

COMPARISON RATIONALE: One V100 package and one TPU board (4 packages) are the minimal units available. On Cloud TPU, distribution of computation across its four packages happens automatically, while multi-GPU performance depends largely on the user's implementation. Conclusions here do not apply to systems with multiple GPUs or TPU boards41.

4.4 TPU Performance Implications

Google has been using TPUs for their large-scale production systems, including Search, Translate, and Gmail. Analyzing the architecture of such systems can provide valuable insights into future deep learning accelerator design. In this section, we use our methodology to study the performance characteristics of TPU v2 and v354,74, with a focus on v2, from the computation capability of the core to system balance. We show that ParaDnn can reveal system bottlenecks in a more comprehen- sive way than real-world models by probing upper and lower system limits. Based on such obser- vations, we discuss possible steps to improve TPU performance, which can be generalized to other deep learning accelerator systems.

The following is a summary of our key observations and insights enabled by our methodology:

• FLOPS (Section 4.4.1): TPU makes good use of the parallelism exposed by batch size and model width, but parallelism due to model depth is under-exploited, suggesting opportunities for model pipelining34.

• Memory bandwidth (Section 4.4.2): Memory bandwidth is the performance bottleneck of many models. Even highly-optimized compute-bound models show a significant fraction of memory-bound operations (13% in ResNet-50). Improving memory access for such operations is key to further performance improvement.

• Multi-chip overhead (Section 4.4.3): Communication overhead in a multi-chip system is non-negligible (up to 13% for CNNs with sizes similar to ResNet-50) but can be amortized with large batch sizes. Reducing the communication overhead can lead to performance gain.

• Host-device balance (Section 4.4.4): Smaller models are more likely to be data-infeed-bound from hosts.

• TPU v3 (Section 4.4.5): The maximum speedup of TPU v3 over v2 is up to 3×, exceeding the 2.3× FLOPS increase. TPU v3 benefited from its doubled memory capacity (which allows twice the batch size of v2) as well as increased memory bandwidth.

4.4.1 FLOPS Utilization

We use our methodology to study the TPU's floating-point operations per second (FLOPS) utilization, which is the ratio of workload average FLOPS to platform peak FLOPS, measuring how efficiently the computation capacity of a platform is used. We measure the FLOPS of ParaDnn models, sweeping the hyperparameters listed in Table 4.1. To visualize FLOPS, we use heat maps.
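As a concrete (and deliberately simplified) illustration of the metric, the sketch below computes FLOPS utilization for a hypothetical fully-connected layer, using the standard 2 × m × n × k operation count for a dense matrix multiply; the layer sizes and timing are made up, and only the TPU v2 peak of 180 TFLOPS is taken from Table 4.2.

def fc_layer_flops(batch, n_in, n_out):
    # Forward matrix multiply of one dense layer: 2 * batch * n_in * n_out floating-point ops.
    return 2 * batch * n_in * n_out

def flops_utilization(total_flops, elapsed_s, peak_flops):
    achieved = total_flops / elapsed_s       # average FLOPS of the workload
    return achieved / peak_flops             # fraction of the platform peak

if __name__ == "__main__":
    ops = fc_layer_flops(batch=16384, n_in=4096, n_out=4096)   # hypothetical layer
    print(f"utilization: {flops_utilization(ops, elapsed_s=0.005, peak_flops=180e12):.1%}")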

HEAT MAPS: Figures 4.2(a)–(c) present heat maps of FLOPS utilization for FC, CNN, and RNN ParaDnn models. For each model type, we choose two hyperparameters (as described below) that affect FLOPS utilization the most, sweeping their ranges to create a map grid while keeping other hyperparameters fixed. FLOPS utilization of all three model types increases with batch size, indicating that the TPU is capable of leveraging the parallelism within a batch. FLOPS utilization of FCs also increases with node count per FC layer; that of CNNs also increases with filter count; and that of RNNs, with embedding size. So the TPU also exploits parallelism within the widths of the models.

QUANTIFYING WITH LINEAR REGRESSION: To quantify these effects, we use the weights of a linear regression (LR) model. For FC, the LR model is

FLOPS = w0 × layer + w1 × node + w2 × input + w3 × output + w4 × batch size,

where w0–w4 are the hyperparameter weights. When training the LR model, we normalize the weights to the same scale, so that the magnitude of a weight reflects the importance of the corresponding hyperparameter. For example, a positive w1 shows that node count affects performance positively.

Figure 4.2: (a)–(c) ParaDnn's FLOPS utilization and (d)–(f) its sensitivity to hyperparameters.

Figures 4.2(d)–(f) show the LR weights of the model hyperparameters. Batch size and model width have the highest absolute weights, shown on the x- and y-axes in Figures 4.2(a)–(c). Figure 4.2(d) shows that the FLOPS utilization of FCs is largely affected by batch size and node count, while layer count and input and output unit counts matter less. Similarly, Figure 4.2(e) shows that filter count and batch size are most important for CNNs. For RNNs, utilization is most affected by batch and embedding sizes.
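A minimal sketch of this kind of sensitivity analysis is shown below. It fits an ordinary linear regression on standardized features so that the learned weights are comparable in scale; the synthetic data, the use of scikit-learn, and the feature standardization are illustrative assumptions rather than the exact procedure used in this chapter.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(4, 128, n),        # layer count
    rng.integers(64, 8192, n),      # nodes per layer
    rng.integers(1000, 8000, n),    # input size
    rng.integers(100, 1000, n),     # output size
    2 ** rng.integers(6, 15, n),    # batch size
])
# Synthetic response: FLOPS grows mostly with batch size and node count.
y = 3.0 * X[:, 4] + 2.0 * X[:, 1] + 0.1 * X[:, 0] + rng.normal(0, 1e3, n)

Xs = StandardScaler().fit_transform(X)          # put features on the same scale
w = LinearRegression().fit(Xs, y).coef_
for name, weight in zip(["layer", "node", "input", "output", "batch"], w):
    print(f"{name:>6}: {weight: .3g}")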

TAKEAWAYS: ParaDnn enables systematic study of hyperparameter sensitivity and shows that it is natural for an ML system to utilize parallelism arising from large batch size and model width. It is especially intuitive to map batch size and model width to the two dimensions of systolic arrays. Meanwhile, parallelism opportunities opened by large numbers of layers remain to be explored via model parallelism55,107 and pipelining34, while such techniques may change model numerics.

4.4.2 Roofline Model Analysis

The computation capacity of the TPU's core is only one source of its performance. Memory bandwidth also has a significant impact. In this section, we apply the roofline model204 to ParaDnn FCs and CNNs to analyze the TPU's computation and memory bandwidth. We omit RNN models because the TPU profiler reports incorrect numbers for RNN memory bandwidth.

THE ROOFLINE MODEL: Figure 4.3 shows the roofline plots. The y-axis is FLOPS and the x-axis is arithmetic intensity, i.e., floating-point operations per byte transferred from memory. The roofline (the red line in Figure 4.3) consists of a slanted part and a horizontal part. It represents the highest achievable FLOPS at a given arithmetic intensity. Any data point (x, y) on the slanted part has y/x = memory bandwidth. The horizontal part is the hardware peak FLOPS. A workload or operation (a point in Figure 4.3) close to the slanted roofline is memory-bound; one close to the horizontal part is compute-bound. A workload or operation not close to the roofline stresses neither the memory interconnect nor the compute units. Figures 4.3(a) and 4.3(c) show all the ParaDnn FCs and CNNs (dots) plus Transformer and ResNet-50 (stars). Figures 4.3(b) and 4.3(d) show their operation breakdowns. The triangles in Figures 4.3(a) and 4.3(c) are selected memory-bound models.
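The roofline bound itself is simple to state in code. The sketch below uses the TPU v2 peak FLOPS and memory bandwidth quoted in this section; the sample arithmetic intensities are arbitrary.

PEAK_FLOPS = 180e12      # TPU v2 peak: 180 TFLOPS
BANDWIDTH = 2400e9       # TPU v2 memory bandwidth: 2.4 TB/s

def attainable_flops(intensity_ops_per_byte):
    # Attainable FLOPS = min(peak, bandwidth * arithmetic intensity).
    return min(PEAK_FLOPS, BANDWIDTH * intensity_ops_per_byte)

def classify(intensity, ridge=PEAK_FLOPS / BANDWIDTH):   # ridge point: 75 ops/byte
    return "compute-bound" if intensity >= ridge else "memory-bound"

if __name__ == "__main__":
    for ai in [0.125, 10, 75, 300]:   # ops/byte
        print(f"AI={ai:>7}: {attainable_flops(ai) / 1e12:7.1f} TFLOPS, {classify(ai)}")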

The design space shown with roofline models indicates that ParaDnn is a superset of the real-world models. ParaDnn models span a much larger range in the design space, from extremely memory-bound to compute-bound. Therefore, performance analysis with ParaDnn can comprehensively test the limits of platforms at both extremes. An exception is that some operations of Transformer do not align closely with those of FCs. This results from a choice in this chapter: ParaDnn uses the RMSProp optimizer and keeps nodes per layer uniform for FCs, while Transformer uses the Adafactor optimizer and has layers with 4k, 2k, and 512 nodes.

Figure 4.3: TPU rooflines for FCs and CNNs. (a) and (c): ParaDnn and real-world models. (b) and (d): their operation breakdown. The ridge point of the roofline has x = 75. The legends also show the total execution time percentages of the corresponding operators in Transformer and ResNet-50.

PARADNN ANALYSIS: We first discuss the insights enabled by ParaDnn, of which the real-world models are a subset. Figure 4.3(a) shows that large batch sizes make FCs more compute-bound, and more nodes make FCs more memory-bound. That is because FCs with more nodes need to transfer more weights/activations from memory, and large batch sizes increase the computation per weight/activation transferred, i.e., the arithmetic intensity. Specifically, FCs with ≥ 2k nodes per layer and ≥ 8k batch size are compute-bound. Transformer is close to compute-bound; it uses a 4k batch size, so it overlaps with FCs having 4k batch sizes. Figure 4.3(c) shows that models close to ResNet-50 are compute-bound, while a majority of the CNNs are bottlenecked by memory bandwidth. The CNNs' higher FLOPS comes from higher arithmetic intensity caused by more filters. When memory bandwidth is the bottleneck, the way to increase FLOPS is to increase arithmetic intensity.

Figures 4.3(b) and 4.3(d) show the TensorFlow operations that take more than 1% of the workload execution time and have nonzero FLOPS. The arithmetic intensity of such operations can be as low as 0.125.* The TensorFlow breakdown in Figure 4.3 is generated after operation fusion, which combines and executes several operations together for higher efficiency. In Figures 4.3(b) and 4.3(d), the only compute-bound operation is large fused MatMul (MatMul fused with other operations), so a compute-bound model needs large MatMuls. Other operations are closer to the slanted line, constrained by memory bandwidth. Transformer and ResNet-50 are compute-bound (Figures 4.3(a) and 4.3(c)) because they have compute-bound MatMuls (Figures 4.3(b) and 4.3(d)).

REAL-WORLD MODEL ANALYSIS: ParaDnn and real-world models are complementary. By analyzing ParaDnn, we explore the design space and reach the limits of platforms. Analyzing real-world models puts the design space study into realistic context by highlighting popular representative designs. The tables in Figure 4.3 show the operation breakdown of Transformer and ResNet-50, and indicate that even compute-bound models contain a noticeable fraction of memory-bound operations. Transformer has three memory-bound operations: (1) input fusion (9.0%), which includes multiply, subtract, and reduce; (2) loop fusion (7.0%), which consists of control flow operations (e.g., select and equal-to); and (3) CrossReplicaSum (3.9%), which sums up the values across multiple weight replicas. These three operations contribute 19.9% of the total execution time. (Another 12.3% of the execution time is for data formatting, which has no arithmetic intensity or TPU FLOPS.) ResNet-50 has memory-bound loop fusion (9%), MaxPoolGrad (2.9%), and CrossReplicaSum (1.1%), which sum to 13%, showing the need for end-to-end optimization for DL accelerators.

TAKEAWAYS: ParaDnn explores the design space and stresses platform limits; real-world models represent the currently important design points. ParaDnn shows that the design space is composed of very diverse models, from extremely memory-bound to compute-bound; real-world models show that even compute-bound models contain non-negligible fractions of memory-bound operations (19.9% for Transformer and 13% for ResNet-50), which suggests that memory bandwidth can affect other ML systems originally designed to optimize computation. Researchers can test system memory-boundness with ParaDnn. Approaches for speeding up memory-bound operations include caching91, operation fusion12,45,164, aggressive data quantization25, and compression81,136.

*An activation accumulation operation (CrossReplicaSum in TensorFlow) uses float32 even with bfloat16 model weights. In this case, the arithmetic intensity is 1/(2 × 4) = 0.125, i.e., one floating-point addition for every two data points loaded.

Figure 4.4: Communication overhead in a multi-chip system is non-negligible, but is reduced with large batch sizes.

4.4.3 Multi-Chip Overhead

Computing speed and memory bandwidth of a TPU core are not the only factors affecting training performance, because typical large-scale systems use multiple chips55. This section evaluates the scalability of a multi-chip TPU system with ParaDnn. We quantify the multi-chip overhead by comparing the FLOPS utilization of 1-core (x-axis) and 8-core TPUs (y-axis) in Figure 4.4. If there were no multi-chip overhead, FLOPS utilization of 1-core and 8-core should be the same, i.e., all points should lie on the dashed line in Figure 4.4 showing x = y.

ParaDnn allows us to explore models with a wide range of communication overhead. Figure 4.4 shows that an 8-core TPU exhibits noticeably lower FLOPS utilization than a 1-core TPU, reflecting significant inter-core communication overhead. For FC, the maximum FLOPS utilization in an 8-core TPU is 62%, compared to 100% in a 1-core TPU. Multi-chip overhead is less noticeable in CNNs, with FLOPS utilization decreasing from 55% in the 1-core TPU to 40% in the 8-core. It is worse for FCs because there are more weights to synchronize across cores than for CNNs.

Our analysis method indicates that large workloads can amortize the parallelism overhead, and it highlights batch size as the key hyperparameter that affects communication overhead. Increasing batch size reduces the FLOPS utilization gap by increasing computation without increasing weight synchronization. On the 8-core TPU, FCs need at least a 16k batch size to achieve more than 50% FLOPS utilization. Specifically, FCs with ≥ 256 nodes and ≤ 512 batch size run faster on a TPU with one core than on one with eight. Thus we consider FCs with larger than 1024 batch size in Figure 4.4. Based on Amdahl's law, the maximum non-parallel fraction of the workloads is up to 60% for FC and up to 40% for CNN. Using the largest batch size shown in Figure 4.4, the 90th-percentile non-parallel fractions are 16% for FC and 8.8% for CNN.
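The non-parallel fraction can be backed out from the two utilization numbers with Amdahl's law, as in the sketch below; the utilization values in the example are hypothetical, not taken from Figure 4.4.

def non_parallel_fraction(util_1core, util_ncore, n=8):
    # Effective speedup of the n-core system over one core, from FLOPS utilization.
    speedup = n * util_ncore / util_1core
    # Amdahl's law, speedup = 1 / ((1 - p) + p / n), solved for the parallel fraction p.
    parallel = (1 - 1 / speedup) / (1 - 1 / n)
    return 1 - parallel

if __name__ == "__main__":
    # Hypothetical model: 50% utilization on 1 core, 15% on 8 cores.
    print(f"{non_parallel_fraction(0.50, 0.15):.1%} of the work is effectively serial")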

TAKEAWAYS: With diverse ParaDnn models, we observe that communication overhead in multi-chip systems is non-negligible even for large FCs and CNNs. Using large batch sizes can reduce overhead by increasing parallel computation without increasing weight transfers. Possible optimizations include relaxed synchronization, model parallelism55, gradient compression136, and weight pruning and compression81.

4.4.4 Host-Device Balance

Previous subsections have focused on the performance of the accelerator itself. We now turn to "data infeed," the process of preparing and moving input data to the TPU board. ParaDnn in other sections uses data synthesized from CPU hosts, which avoids most of the data infeed overhead. Here we use ParaDnn CNN models with the ImageNet dataset123.

Figure 4.5: The FLOPS utilizations and data infeed percentages of ParaDnn (dots) and real-world (stars) CNNs.

The TPU system includes a CPU host and a TPU device74. For image datasets, the host fetches images from the network, decodes and preprocesses them, and feeds them to the device; we refer to this as data preparation. The device then performs the training computation on the images. Data infeed includes network overhead, host compute, and the transfer between host and device.
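A host-side pipeline of this kind can be expressed with the tf.data API, as in the hypothetical sketch below. It is written against a recent TensorFlow 2 interface for clarity (the TF 1.x versions studied in this chapter expose a similar but not identical API), and the TFRecord feature keys, image size, and parallelism settings are illustrative placeholders rather than our actual configuration.

import tensorflow as tf

def parse_example(serialized):
    # Decode and preprocess one training example on the host (data preparation).
    features = tf.io.parse_single_example(
        serialized,
        {"image/encoded": tf.io.FixedLenFeature([], tf.string),
         "image/class/label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, features["image/class/label"]

def make_dataset(file_pattern, batch_size):
    files = tf.data.Dataset.list_files(file_pattern)
    ds = files.interleave(tf.data.TFRecordDataset,
                          num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size, drop_remainder=True)
    # Prefetching overlaps host-side preparation with device-side training steps.
    return ds.prefetch(tf.data.AUTOTUNE)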

For each ParaDnn CNN and the ImageNet dataset, we use the TPU profiler to collect FLOPS utilization and infeed time percentage, which is the fraction of time the accelerator spends waiting for data. Figure 4.5 shows the results as dots, along with the real-world CNNs as stars. ParaDnn models are very diverse, ranging from 0 to 50% FLOPS utilization and 0 to 90% infeed time. ParaDnn shows that many CNNs have significant infeed time and that larger CNNs tend to have lower infeed time. Large CNNs, those with more filters and/or more layers, are the most suitable for the TPU system, because the accelerator spends more time training each image while the CPU infeed time per image is fixed. The high-performance TPU system targets large workloads. Consistent with the communication overhead study in Section 4.4.3, small workloads do not have enough parallelism to utilize the TPU efficiently.

Some real models show opportunities to optimize for data-infeed bottlenecks. SqueezeNet has the highest infeed time because it is designed to accommodate mobile devices by using small numbers of filters per layer. While MobileNet also targets mobile devices, it has one convolution layer that uses up to 1k filters, eliminating time lost to data infeed. RetinaNet, ResNet, and SqueezeNet show potential for improvement through system optimizations. The performance without data preparation shows that resolving the infeed bottleneck can lead to 37%, 34%, and 180% performance improvements, respectively.

Figure 4.6: Speedup of TPU v3 over v2 for (a) FC operations, (b) CNN operations, and (c) ParaDnn models. The red line (75 ops/byte) is the inflection point in the TPU v2 roofline (Figure 4.3).

TAKEAWAYS: ParaDnn shows the design space of FLOPS and data-infeed time, and reveals that large workloads with abundant parallelism are not host-bound on large accelerator systems. Real-world models show that the performance of some workloads can be improved. When designing an accelerator system, scaling the performance of the CPU host to match the accelerator is crucial for utilization of the accelerator's computation resources.

4.4.5 TPU v3

In this section, we systematically quantify the differences between TPU v2 and v3. Figure 4.6 compares the two using ParaDnn (dots), plus ResNet and Transformer (stars). Batch size for v3 is twice that for v2, thanks to its doubled memory capacity. Figures 4.6(a) and 4.6(b) use a variation of the roofline model, showing arithmetic intensity on the x-axis and operation speedup on the y-axis.

Data point colors representing operation types are consistent with those in Figures 4.3(b) and 4.3(d).

As a reference, the red dashed line is the inflection point in the TPU v2 roofline from Figure 4.3, where arithmetic intensity is 75 ops/byte (180 TFLOPS / 2.4 TB/s). The operations on the left of the red line are memory-bound; those on the right are compute-bound. We group the operations in four classes, as follows.

COMPUTE-BOUND OPS: The peak FLOPS of TPU v3 is 2.3× that of v2, so the performance of compute-bound operations is improved by about 2.3× on v3. Such operations are on the right of the red dashed line in Figure 4.6(b).

MEMORY-BOUND OPS (2× BATCH SIZE): The maximum speedup of the memory-bound operations (mainly the MatMuls in Figures 4.6(a) and 4.6(b)) is 3×. This 3× speedup comes from the doubled batch size (owing to doubled memory capacity) and the memory bandwidth improvement. The memory bandwidth increase of v3 over v2 has not been officially disclosed, but we can estimate it. Doubled batch size means doubled arithmetic intensity. On the slanted line of a roofline model, that means doubled FLOPS, because the ratio of FLOPS to arithmetic intensity is fixed. Switching from v2's roofline to v3's thus increases FLOPS by twice the bandwidth improvement. So the 3× overall speedup suggests that the v3 bandwidth improvement over v2 is 3/2 = 1.5×, i.e., 3.6 TB/s.
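The estimate amounts to a few lines of arithmetic, reproduced below for clarity; only the 2.4 TB/s v2 bandwidth, the 2× batch-size increase, and the observed 3× speedup are taken from the text.

V2_BANDWIDTH_TBS = 2.4      # TPU v2 memory bandwidth
observed_speedup = 3.0      # memory-bound MatMul speedup on v3
intensity_ratio = 2.0       # doubled batch size doubles ops/byte

# On the slanted part of the roofline, FLOPS = bandwidth * intensity,
# so speedup = bandwidth_ratio * intensity_ratio.
bandwidth_ratio = observed_speedup / intensity_ratio          # 1.5x
v3_bandwidth = V2_BANDWIDTH_TBS * bandwidth_ratio             # 3.6 TB/s
ridge_v3 = 420 / v3_bandwidth                                 # ~117 ops/byte
print(f"estimated v3 bandwidth: {v3_bandwidth:.1f} TB/s, ridge point: {ridge_v3:.0f} ops/byte")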

OTHER MEMORY-BOUND OPS: The 1.5× bandwidth improvement estimate is corroborated by the 1.5× speedup of other memory-bound operations, represented by the non-MatMul FC operations (such as CrossReplicaSum, RMSProp, and control flow operations) in the lower left corner of Figure 4.6(a). The performance of those operations does not increase with larger batch size, as shown by the vertical alignment of each operation type in Figure 4.3(b). Thus the 1.5× speedup in Figure 4.6(a) comes from the bandwidth improvement.

BOUNDARY CASES: The compute-bound MatMuls in Figure 4.6(b) become memory-bound on TPU v3, so their speedup is < 2.3×. Such operations have arithmetic intensity between 75 and 117, because the roofline inflection point of v3 is at x = 420/(2.4 × 1.5) = 117. CrossReplicaSum (yellow dots) is slowed down on TPU v3, which may be because of more replicas across more MXUs.

END-TO-END MODELS: In Figure 4.6(c) the maximum speedups are 2.83× (FC), 2.31× (CNN), and 3.11× (RNN). Speedup increases with model width (second column of Table 4.1), and the maximum speedup is achieved by the largest width. FCs with close to 3× speedup are dominated by memory-bound MatMuls. Exceptions are RNNs with more than 3× speedup; these have the largest embedding size (900), indicating that TPU v3 optimizes large embedding computations.

TAKEAWAYS: ParaDnn allows examining new platforms with a wider range of workloads, from memory-bound to compute-bound, than using real-world models alone. Comparing TPU v3 to v2 as an example, ParaDnn can show how the system upgrade benefits operations with different arithmetic intensities. TPU v3 shows three main levels of speedup: 2.3× for compute-bound operations, 3× for memory-bound MatMuls, and 1.5× for other memory-bound operations. This is the result of its 2.3× FLOPS, 2× memory capacity, and 1.5× memory bandwidth.

Figure 4.7: Examples/second of ParaDnn FC models with a fixed layer count (64). Larger memory allows the CPU to run the largest models.

4.5 Cross-Platform Comparison

In this section, we show ParaDnn’s utility in cross-platform comparison, with CPU, GPU, and TPU, exemplars along the continuum between general purpose processors and specialized accelerators.

ParaDnn shows the sensitivity of speedup to model hyperparameters, allowing users to choose platforms based on model characterizations that happen to have been reported. ParaDnn also reveals the fundamental architectural differences between the platforms and shows the trade-offs between flexibility and specialization. There are scenarios where each of the platforms is valuable:

• The TPU is highly-optimized for large batches and CNNs, and has the highest training throughput.

• The GPU is more flexible and programmable for irregular computations, such as small batches and non-MatMul operations. Training of large FC models benefits from its sophisticated memory system and higher bandwidth.

• The CPU has the best programmability, so it achieves the highest FLOPS utilization for RNNs, and it supports the largest model because of its high memory capacity.

4.5.1 Fully-Connected DNNs

Examples/second measures the number of examples trained per second, a proxy for end-to-end performance. Heat maps in Figure 4.7 compare ParaDnn FCs on the three platforms, with varying node counts and batch sizes but a fixed layer count (64). We use LR weights as in Section 4.4.1 to quantify the hyperparameter effects. Layer and node counts have negative weights because it is more time-consuming to train larger models with many layers and nodes. Batch size greatly improves throughput on the GPU and TPU, but not on the CPU, because the parallelism available even with small batch sizes can fully utilize a CPU.

In Figure 4.7, the white squares indicate models that encounter out-of-memory issues. It is interesting to note that only the CPU supports the largest models, and the GPU supports larger models than the TPU. This is because every hardware core keeps one copy of the model, so memory per core determines the largest model supported, as explained in Section 4.3. The CPU has the highest memory per core (120 GB), and the GPU (16 GB) is higher than the TPU (8 GB). While TPUs and GPUs may draw more attention, as of today the only choice for extremely large models is the CPU, which supports all model sizes. For example, Facebook uses dual-socket CPU servers with large memories to train ranking models (FC networks)86. That fact highlights the need for model parallelism and pipelining on the GPU and TPU55,107,34 to allow those powerful platforms to support larger models.

TPU OVER GPU: To further investigate the best hardware platform for FC models, we analyze TPU over GPU speedups. Figure 4.8(a) plots the linear regression weights across FC hyperparameters for TPU over GPU speedup. Figures 4.8(b)–4.8(c) show the design space of FCs as scatter plots, with the number of model parameters on the x-axis and speedup on the y-axis. To display the effects of hyperparameters, we color-code data points to reflect batch size (Figure 4.8(b)) and node count per layer (Figure 4.8(c)). Overall, 62% of the FCs perform better on the TPU.

Figure 4.8: (a) Sensitivity analysis of TPU over GPU speedups. Speedups color-coded by (b) batch size and (c) FC nodes per layer.

The TPU is well suited for large-batch training because systolic arrays excel at increasing throughput124. The positive weight of batch size in Figure 4.8(a) and the horizontal color bands in Figure 4.8(b) indicate that large batch size is the key to higher speedup. This suggests that the TPU MXUs, implemented with systolic arrays, need large batches to reach full utilization. The GPU is a better choice for small batches, because it executes computation in warps, so it packs small batches and schedules them on streaming multiprocessors more easily148.

The GPU is a better choice for large models, suggesting that it is optimized for the memory reuse and streaming requirements of large FCs. This is shown by the negative weights of node count, layer count, and input size in Figure 4.8(a) and the trend in Figure 4.8(c), corroborated by the overall negatively-correlated trend of speedup with parameter count in Figure 4.8. FCs have minimal weight reuse, and large models have more weights, so they put a lot of pressure on the memory system. The GPU has a more mature memory system and higher memory bandwidth than the TPU, which makes it better suited to the memory requirements of large FCs.

GPU OVER CPU: Figure 4.9(a) shows the LR weights of GPU-over-CPU speedup. Figure 4.9(b) shows the design space color-coded by node count. The GPU is a better platform for large FCs because its architecture can better exploit the parallelism available with large batches and models. Recall from Figure 4.8 that large FCs prefer the GPU over the TPU. So the GPU is the best platform for large FCs, but models with large batches perform best on the TPU.

Figure 4.9: (a) Sensitivity analysis and (b) GPU over CPU speedups for FCs.

4.5.2 CNN and RNN

We now describe the speedup of ParaDnn CNNs and RNNs. Since our conclusions for CPUs are similar to those in the previous section, we omit them in the interest of brevity.

CNN: Figures 4.10(a)–4.10(c) show the speedups of the TPU over the GPU. All CNNs perform better on the TPU. Batch size is still the key to better TPU-over-GPU speedup for CNNs, shown by its positive LR weight in Figure 4.10(a) and the increasing speedup with batch size in Figure 4.10(b).

The TPU is the best platform for large CNNs, suggesting that the TPU architecture is highly optimized for the spatial reuse characteristics of CNNs. This is shown by the positive weights in Figures 4.10(a) and 4.10(c), where models with more filters and blocks have higher speedups. This differs from Section 4.5.1, which showed that the TPU is not preferred for large FCs. It suggests that the TPU handles large CNNs better than large FCs because CNNs reuse weights while FCs seldom do, which results in greater memory traffic for FCs. The GPU is a feasible choice for small CNNs. These conclusions only apply to a single GPU; multi-GPU cases may be different.

Figure 4.10: Sensitivity analysis and speedups of TPU over GPU for (a)–(c) CNNs and (d)–(e) RNNs.

RNN: Figures 4.10(d)–4.10(e) show the speedup of TPU over GPU. We display the embedding size in Figure 4.10(e) because the magnitude of its weight is the greatest in Figure 4.10(d). Embedding size has a negative weight, and embedding computation is more sparse than matrix multiplication. This suggests that the TPU is less flexible for non-MatMul computations than the GPU; the TPU is better at dense computations like MatMuls. Even so, RNNs are still up to 20× faster on the TPU. Optimizing non-MatMul computations is another opportunity for TPU enhancement.

4.5.3 Overall Comparison

This section summarizes the speedup of TPU over GPU and the FLOPS utilization of all ParaDnn and real models. We use box plots (Figure 4.11) to summarize each ParaDnn model type because performance has a wide range. The bar in the box shows the median. The upper and lower boundaries of the box are the 9th and 91st percentiles. The upper and lower bars outside of the box are the 2nd and 98th percentiles. Outliers are shown as dots. We do not show the results of using CPUs to train CNNs, because it is extremely time-consuming and unlikely to contribute additional insights.

TPU OVER GPU SPEEDUP: Figure 4.11(top) summarizes the TPU over GPU speedups of all models. Note that the real models use larger batch sizes on the TPU than on the GPU. The speedup of TPU over GPU depends heavily on the nature of the workload measured. The speedup of parameterized models varies widely, from less than 1× to 10×, while the speedup of real workloads ranges from 3× (DenseNet) to 6.8× (SqueezeNet). ParaDnn represents a more complete view of potential workloads, and each real workload represents the concerns of certain users. Benchmarking platforms with both kinds of workloads offers a more systematic understanding of their behavior than using only one kind.

To further compare the TPU with the GPU while relaxing the constraint on the GPU's software stack, we also include the speedup relative to the GPU performance of ResNet-50 reported in NVIDIA's Developer Blog39 (annotated as NVIDIA in Figure 4.11(top)). Note that NVIDIA's version of ResNet-50 uses unreleased libraries, and we were unable to reproduce the results. The speedup is 6.2× using Google's ResNet-50, compared with 4.2× relative to NVIDIA's version, which suggests that software optimization can significantly impact performance.

FLOPS UTILIZATION: Figure 4.11(bottom) shows the FLOPS utilization of all workloads and platforms. On average, the maximum FLOPS utilization of the TPU is 2.2× that of the GPU for all CNN models, and the ratio is 3× for RNNs. The TPU FLOPS utilization of Transformer is consistent with FCs with 4k batch size, as in Figure 4.2. For RNNs, the TPU has less than 26% FLOPS utilization and the GPU has less than 9%, while the CPU has up to 46% because of its better programmability. RNNs have more irregular computations than FCs and CNNs, due to the temporal dependency in the cells and the variable-length input sequences. Advanced RNN optimizations may be able to increase utilization on the GPU and the TPU. Real models with more filters have higher FLOPS, which is why ResNet-50 and RetinaNet have higher FLOPS than DenseNet and SqueezeNet.

Figure 4.11: (Top) TPU over GPU speedups of all workloads. (Bottom) FLOPS utilization comparison for all platforms.

4.6 Software Stack Advances

Custom DL hardware opens opportunities for dramatic software optimizations. ParaDnn is also useful for comparing software performance, by analyzing the performance of different TensorFlow (TF) and CUDA versions. We study data-type quantization together with software versions, because it depends on software support. Software versions are summarized in the legends of Figure 4.12. ParaDnn allows more comprehensive analysis of software updates than real models alone. It can also reveal the focus of software optimizations (e.g., TF 1.9 optimizes small batches); we omit these details for brevity.

4.6.1 TensorFlow Versions and TPU Performance

The compiler for the TPU is XLA128, shipped with TF. Figure 4.12(top) shows TPU speedups obtained by running TF 1.7 to 1.12, treating 1.7 with float32 as the baseline. Moving from TF 1.7 to 1.12 improves performance for all ParaDnn models. Although FC and CNN encounter a performance regression with TF 1.8, TF 1.9 fixes this anomaly and improves overall performance. RNN performance is not improved much until TF 1.11, which shows a 10× speedup for RNNs. Transformer, ResNet-50, and RetinaNet are improved continuously over TF updates. Interestingly, SqueezeNet is improved starting from TF 1.11, while the performance of DenseNet and MobileNet sees little benefit. In the 7 months (222 days) between the release of TF 1.7.0 (03/29/2018) and that of TF 1.12.0 (11/05/2018), software stack performance improved significantly. The 90th-percentile speedup of the TPU is 7× for FC, 1.5× for Residual CNN, 2.5× for Bottleneck CNN, 9.7× for RNN, and 6.3× for LSTM and GRU.

Bfloat16 enables significant performance improvement for ParaDnn FCs and CNNs. The 90th-percentile speedups are up to 1.8× for FC and Bottleneck CNN, and 1.3× for Residual CNN. The TPU can support doubled batch sizes with 16 bits. Transmitting fewer bits also relieves bandwidth pressure, speeding up memory-bound operations. Larger performance increases may be possible with further bitwidth reductions.

Figure 4.12: (Top) TPU performance with TensorFlow updates. (Bottom) GPU performance with CUDA and TF updates.

4.6.2 CUDA Versions and GPU Performance

Figure 4.12(bottom) shows GPU performance across versions of CUDA and TF. The baseline is TF 1.7 and CUDA 9.0 with float32. TF 1.8 does not improve GPU performance. By lowering memory traffic and enabling larger batch sizes, bitwidth reduction can speed up CNNs by more than 2×. We note that CUDA 9.2 speeds up ResNet-50 significantly more (8%) than other real workloads (< 1%), and it speeds up ParaDnn CNNs more than FCs or RNNs. CUDA 10 speeds up other models, but not SqueezeNet. It significantly speeds up ParaDnn FCs and CNNs, but not RNNs. The overall 90th-percentile improvement for FCs, CNNs, and RNNs is up to 5.2×, 2.9×, and 8.6%, respectively.

CUDA updates have less impact on the GPU than TF updates have on the TPU, likely because CUDA and GPU platforms have greatly matured since becoming popular before 2010, while TPU v2 for training was only announced in May 2017.

4.7 Limitations

SCOPE: This chapter does not study DL inference, cloud overhead, multi-node systems, accuracy, or convergence. Specifically, NVIDIA's eight-GPU DGX-1 and Google's 256-TPU systems are not studied. We intentionally leave these topics to future work, as each deserves in-depth study. They need different metrics, such as latency, and different setups, such as tuning the number of hardware nodes, inter-node bandwidth, and synchronization mechanisms. Cloud overhead, including virtualization and resource allocation, may be more acute and brings up more research questions.

The validity of extrapolating training throughput to time-to-accuracy remains an open question. Recent work studied the number of training steps to accuracy as a function of batch size168. It shows that very large batch sizes result in sub-linear scaling, and the best batch size depends largely on the model and optimizer. In a multi-node system, synchronization becomes more complicated, which results in different convergence behavior.

TRACTABILITY: To keep the study tractable, we constrain the parameters in this chapter, including the ParaDnn hyperparameters (Table 4.1) and the TPU iterations. We focus on large batches, as the platforms were designed for large-batch training, and extremely small batches may lead to different conclusions. TPU performance of ParaDnn models should be considered lower bounds, because we use the RMSProp optimizer, and SGD with momentum runs faster than RMSProp. Also, the datasets in ParaDnn do not model data-infeed overhead. We also limit the number of TPU iterations to 100 per loop, to allow completion of experiments given limited time and cloud resources. This means the TPU device communicates with the host every 100 training steps. To match the performance of the real models reported in the literature, the real models communicate every 1000 training steps, leading to slightly higher performance than ParaDnn, given similar hyperparameters and datasets. The performance difference is about 2% for ResNet-50.

4.8 Related Work

This chapter presents a performance analysis methodology that includes a tool, ParaDnn. The set of models generated by ParaDnn is not designed to replace other benchmark suites, but to complement existing suites to study the design space more comprehensively, as discussed in Section 4.2. DL suites that are complementary to our methodology include CortexSuite186, TonicSuite84, Sirius85, Fathom15, DAWNBench49, and MLPerf151. Benchmarks have been the driving force for compiler and architecture design for decades; notable examples include the SPEC CPU92 and PARSEC multiprocessor benchmarks32. In the same spirit as parameterized benchmarks, synthetic benchmarks are common, such as BenchMaker110, SYMPO71, AI Matrix203, and others119,189,180,167,166. Our use of DL models to compare up-to-date platforms, Google's TPU v2/v3 and NVIDIA's V100 GPU, distinguishes this chapter from previous cross-platform comparisons172,24,122,42,88.

4.9 Summary

This chapter presents a comprehensive performance analysis methodology and its utility in deep learning. We conduct case studies to analyze and compare two generations of specialized platforms (TPU v2/v3), three heterogeneous architectures (TPU, GPU, and CPU), and two specialized software stacks (TensorFlow and CUDA). The methodology is complementary to traditional performance analysis approaches. This chapter also motivates application of our methodology to other hardware and software systems.

5 Demystifying Bayesian Inference Workloads

In this chapter, we present BayesSuite, a collection of seminal Bayesian inference workloads. We characterize BayesSuite across a variety of current-generation processors and find significant diversity.

Manually tuning and deploying such workloads requires deep understanding of workloads and hardware. To address these challenges, we introduce a scheduling and optimization mechanism that can be plugged into a system scheduler. We also propose a computation elision technique that further improves the performance. Our proposed techniques provide 5.8× speedup on average.

5.1 Introduction

Recent advances in deep learning have captivated the scientific community. Systems based on neural network models have defeated world champion Go players173, surpassed humans at image classification tasks123, and advanced the state of the art for speech recognition16. However, neural networks are not the end-all solution and in many cases are not applicable. Deep learning requires massive datasets for training, is prone to overfitting, and is not conducive to reasoning about causality.

Bayesian inference is another branch of machine learning that complements deep learning in many ways. Bayesian inference thrives when data is limited, and its models are more interpretable, making it possible to understand how and why decisions are made. These benefits stem from the ability to combine prior knowledge with new observations.

Bayesian inference is a popular topic among machine learning researchers. At top machine learning conferences (NIPS, ICML, and KDD), over 200 Bayesian inference papers have been published each year since 2014, and the number is steadily increasing. Notable milestones for Bayesian inference include industrial applications18,108,149, Bayesian program learning for generalization of visual concepts from as few as one example127, and the development of an intuitive physics engine that aids physical scene understanding29,80.

As with deep learning, Bayesian inference models are computationally demanding, requiring attention from the hardware and systems community to improve performance and facilitate innovation. To enable systems research in Bayesian inference and to understand the architectural implications of these models, we present BayesSuite: a collection of seminal, representative Bayesian inference workloads. BayesSuite draws from rich application domains (ranging from economics to biology) in which Bayesian inference has been demonstrated to excel. We rigorously characterize BayesSuite on general-purpose processors found in contemporary datacenter servers. In doing so, we provide an understanding of the computational characteristics of a wide range of Bayesian inference workloads, including performance bottlenecks that are amenable to optimization.

Our analysis leads to two major conclusions. First, while Bayesian inference workloads show no obvious architectural bottlenecks on single-core machines, we find that variations in the Bayesian models reveal higher sensitivity to server architecture on multicore systems. Specifically, workloads with complex probability distributions between the observed data and the underlying features cause contention in the last-level cache (LLC). Workloads with less complicated models result in smaller working set sizes and thus tend to be more compute-bound. Leveraging these observations, we developed a scheduling and optimization mechanism that analyzes Bayesian inference jobs and automatically identifies the server configuration most likely to maximize their performance.

Second, we find that the workloads entail substantial redundant computation in the form of sampling iterations. Thus, eliding unnecessary computation through convergence detection can improve performance without reducing accuracy. We developed an intelligent mechanism that dynamically determines when to terminate a job to reduce latency and save energy without jeopardizing model accuracy.

As Bayesian inference continues to transition from academic to commercial use, proof-of-concept models need to be refined and tuned into industrial-grade code capable of performing at scale. We envision that our characterizations and proposed techniques can facilitate the deployment of Bayesian inference as a generic web service, similar to the "deep learning as a service" paradigm provided by Google Cloud Machine Learning Engine, Microsoft Azure Machine Learning, and Apache MXNet on Amazon Web Services.

This chapter makes the following contributions:

• BayesSuite: a benchmark collection of state-of-the-art Bayesian inference models for research on performance optimization by computer architects and system designers.

• A detailed characterization of the BayesSuite workloads on datacenter server architectures. We identify key bottlenecks to performance scaling and present insights on performance and energy trade-offs.

• Mechanisms that automatically provision hardware resources for specific Bayesian inference jobs in order to optimize performance and power efficiency. Across BayesSuite, we achieve an average speedup of 5.8×.

5.2 Bayesian Inference

In this section, we briefly go over the concept of Bayesian modeling and inference to familiarize readers with the algorithms. This section serves only as a primer; a thorough treatment is beyond the scope of this chapter. For more details on Bayesian inference, readers can refer to72.

5.2.1 Bayesian Inference

Probabilistic models describe the data that could be observed from a system and use probability theory to describe the uncertainty or noise associated with the model. In other branches of machine learning, such as deep learning, models are trained using labeled data and the uncertainty or noise is not explicitly modeled. In a situation where we do not have enough labeled data, or when we are trying to capture an uncertain relationship between observed data of different types, we can construct a Bayesian model and perform inference to learn what we want given some data. This process can be done by using Bayes' theorem, shown as

P(θ|D) = P(D|θ)P(θ) / P(D),    (5.1)

where D is the evidence, or the observed data, and θ is the hypothesis whose probability is updated given the new data D. We refer to P(θ|D) as the posterior probability, the probability of a hypothesis given the observed evidence. P(D|θ) is the probability of observing D given θ and is called the likelihood. P(θ) and P(D) are referred to as the prior probability and marginal likelihood, respectively.

Because P(D) does not depend on θ, the posterior probability of the hypothesis given evidence is proportional to its likelihood and prior probability.
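A toy numeric example of Equation (5.1) (entirely made up, and much simpler than the models studied in this chapter) makes the update concrete: two discrete hypotheses about a coin are re-weighted after observing eight heads in a row.

# Bayes' theorem over two discrete hypotheses; all numbers are illustrative.
prior = {"fair": 0.5, "biased": 0.5}              # P(theta)
likelihood = {"fair": 0.5 ** 8,                   # P(D | theta): 8 heads in a row
              "biased": 0.9 ** 8}

marginal = sum(prior[h] * likelihood[h] for h in prior)              # P(D)
posterior = {h: prior[h] * likelihood[h] / marginal for h in prior}  # P(theta | D)
print(posterior)   # the "biased" hypothesis now dominates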

Bayesian inference uses Equation (5.1) to find the posterior distribution. Analytically computing the conditional probability distribution over the variables of interest becomes intractable as the number of variables increases and the complexity of the model grows. Our work focuses on large and complex models for which exact inference is impractical. Instead, we use approximate inference, which is tractable and still produces satisfactory results.

5.2.2 Inference Algorithm

In this section, we illustrate the Bayesian inference algorithm for computing the posterior distribution. The workloads in this chapter have different models and data, but apply the same inference algorithm. Common approximate Bayesian inference algorithms include sampling and optimization techniques. This chapter focuses on one of the sampling methods: a variant of the Hamiltonian Monte Carlo algorithm (HMC)19, nicknamed the No-U-Turn Sampler (NUTS)95. NUTS auto-tunes the Hamiltonian parameters, including the step size and the number of steps. It is implemented in the Stan72 probabilistic programming framework, the framework used in this chapter. We describe the more intuitive Metropolis-Hastings algorithm to illustrate the important computational characteristics, which are shared with NUTS.

ALGORITHM: Assume we have a model θ and observed data D. The model θ can be replaced with an arbitrary user-defined model. The likelihood of the data given model θ, P(D|θ), and the prior probability of model θ, P(θ), are known. The goal is to estimate the posterior probability P(θ|D). Algorithm 1 shows a naive Metropolis-Hastings algorithm with multiple Markov chains doing the sampling.

A Markov chain consists of a sequence of samples, and the current sample depends on the previous one. q determines the probability of a new sample θ′ given the previous sample θ(t − 1), and u is a random number drawn from a uniform distribution. In line 4 of Algorithm 1, Metropolis-Hastings draws samples from an arbitrary probability distribution, often called the proposal density, which results in random-walk behavior. NUTS explores high-dimensional space by building a set of likely candidate points recursively, which eliminates the random-walk behavior exhibited by the Metropolis-Hastings algorithm. In the NUTS implementation of Stan, the acceptance rate in line 5 is found by averaging the acceptance probability across the entire candidate set. While each iteration of NUTS tends to be more computationally expensive, it explores the target distribution much more efficiently, resulting in faster convergence.

Algorithm 1 Metropolis-Hastings Algorithm. An example of sampling the posterior P(θ|D), given observed data D, prior P(θ), and likelihood P(D|θ).
1: for chain from 1 to nchain do
2:     θ(0) ∼ init
3:     for t from 1 to n do
4:         θ′ ∼ q(θ|θ(t − 1))
5:         r = P(θ′)P(D|θ′) / (P(θ(t − 1))P(D|θ(t − 1)))
6:         u ∼ uniform(0, 1)
7:         if u < min{r, 1} then
8:             θ(t) = θ′
9:             P(D|θ(t)) = P(D|θ′)
10:        else
11:            θ(t) = θ(t − 1)
12:        end if
13:    end for
14:    Collect samples
15: end for
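For concreteness, the sketch below is a minimal NumPy implementation of the accept/reject structure of Algorithm 1 for a toy model (inferring the mean of a normal distribution under a standard-normal prior, with a symmetric Gaussian proposal). It is only an illustration of the Metropolis-Hastings mechanics; it is not the NUTS sampler that Stan actually uses, and the model, prior, and tuning constants are assumptions.

import numpy as np

def log_prior(theta):                       # P(theta): standard normal prior
    return -0.5 * theta ** 2

def log_likelihood(theta, data):            # P(D | theta): N(theta, 1) likelihood
    return -0.5 * np.sum((data - theta) ** 2)

def metropolis_hastings(data, n_samples=5000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0
    log_p = log_prior(theta) + log_likelihood(theta, data)
    samples = np.empty(n_samples)
    for t in range(n_samples):
        proposal = theta + step * rng.normal()            # line 4: theta' ~ q (symmetric)
        log_p_new = log_prior(proposal) + log_likelihood(proposal, data)
        # Lines 5-7 in log space: accept if log(u) < log(min(1, r)).
        if np.log(rng.uniform()) < min(log_p_new - log_p, 0.0):
            theta, log_p = proposal, log_p_new            # line 8: accept the proposal
        samples[t] = theta                                # line 11 otherwise keeps the old theta
    return samples

if __name__ == "__main__":
    data = np.random.default_rng(1).normal(2.0, 1.0, size=50)
    chain = metropolis_hastings(data)
    print("posterior mean estimate:", chain[1000:].mean())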

COMPUTATION: Algorithm 1 has an outer loop over the Markov chains and an inner loop that does the sampling. Each new sample is kept or discarded based on the Metropolis-Hastings rule in lines 7–12. Because the current sample depends on the previous sample, the inner loop is sequential. There are two key computational characteristics. The first is the per-sample work: the sampling in line 4, which is defined by the specific model, and the computation of the acceptance rate in line 5, which involves computations such as the likelihood, iterating over all observed data. The second is the loop structure. The outer loop drives the Markov chains. Its iterations are independent, so the chains can run on different cores in parallel.

OTHER ALGORITHMS: Other popular practical algorithms include variational inference, which approximates probability densities through optimization. However, these techniques do not output posterior distributions as sampling algorithms do, and they are not guaranteed to be asymptotically exact. They are not as robust as sampling algorithms and need carefully crafted models and data types to avoid numerical issues, which are sometimes unavoidable. We selected NUTS as it has been widely used in the Stan community, which gives us access to a rich collection of workloads to study.

We briefly discuss the performance of HMC together with NUTS in Section 5.4.1.

Table 5.1: A summary of BayesSuite workloads.

Name       Model                              Application                                                                    Cite   Data
12cities   Poisson Regression                 Lowering speed limits saves pedestrian lives?                                  23     3
ad         Logistic Regression                Advertising attribution in the movie industry                                  5      131
ode        Friberg-Karlsson Semi-Mechanistic  Solving ordinary differential equations of non-linear systems                  140    27
memory     Hierarchical Bayesian              Modeling memory retrieval in sentence comprehension                            147    147
votes      Hierarchical Gaussian Processes    Forecasting presidential votes                                                 5      4
tickets    Logistic Regression                Do police officers alter the ticket writing to match departmental targets?     22     2
disease    Logistic Regression                Measuring the continually worsening progression of Alzheimer's disease         158    104
racial     Hierarchical Bayesian              Testing for racial bias in vehicle searches                                    175    175,8
butterfly  Hierarchical Bayesian              Estimating butterfly species richness and accumulation                         209    61
survival   Cormack-Jolly-Seber                Estimating animal survival probabilities                                       117    1

5.3 BayesSuite: Bayesian Inference Workloads

In this section, we present BayesSuite*: a Bayesian inference benchmark suite with models and datasets representing real-world use cases. We study Bayesian inference workloads developed in Stan72. Each BayesSuite workload consists of a model and data, both of which are fed to the NUTS inference algorithm implemented in the framework.

Workloads were selected to cover key application domains in which Bayesian inference has excelled, leveraging important models and real datasets, and to have diverse execution behaviors. The workloads are selected from StanCon 20175, StanCon 201810, Knitr209, and BPA117. Below we briefly introduce the workloads. Table 5.1 summarizes the models, applications, sources, and data for each workload. We also list the corresponding publications of the workloads. If the work has not been published, we list the corresponding source.

*Source Code: https://github.com/Emma926/BayesSuite

12cities: Shows that lowering speed limits saves pedestrian lives. Uses Poisson regression on data for 12 cities obtained from FARS3, the Fatality Analysis Reporting System maintained by the National Highway Traffic Safety Administration.

ad: Quantifies the effectiveness of various advertising channels for the movie industry. Survey data combining demographics with chosen advertising channels are fitted into a logistic regression model.

ode: Builds ordinary differential equations (ODE) to quantitatively study how drug compounds circulate in and affect the patient’s body. Margossian et al. applied the Friberg-Karlsson semi-mechanistic model to this nonlinear system140.

memory: Models the human mechanism for memory retrieval in sentence comprehension147. Data was collected via experiments measuring recall accuracy and latency after participants were asked to memorize words or numbers of letters. This workload implements a direct access model based on a content-addressable memory system142.

votes: Forecasts presidential elections in all states of the US from 2020 to 2028 using historical election data from 1976 to 2016. A Gaussian process model is applied to the observed votes. Gaussian processes are very good at modeling observations over a continuous domain such as space or time.

tickets: Investigates whether the New York Police Department manages officers with productivity targets, which contravenes New York state law. A generative model is proposed to describe how officers write traffic tickets. The trained model indicates that the officers alter their ticket writing substantially to match departmental targets.

disease: Models the progression of Alzheimer's disease, which is described by biomarkers and eventual loss of memory and decision-making functions. It is useful for clinical and biological purposes to understand the order of biomarkers' deterioration and their distributions for various stages of the disease. This model uses I-splines to model the monotonically increasing progression and is fitted with real patient data.

racial: Tests for racial bias in vehicle searches by police. Simoiu et al. developed a new statistical test of discrimination, the threshold test175. The test uses a hierarchical latent Bayesian model, and is applied to a dataset of 4.5 million police stops in North Carolina. It is found that when searching minorities, officers apply lower standards of evidence than when searching whites.

butterfly: Uses a hierarchical Bayesian model developed by Dorazio et al. to estimate butterfly species richness and accumulation61. Statistical estimation is necessary due to the difficulty of collecting data in grasslands with small habitat fragments in south-central Sweden, where the study was conducted. Predictions show that sample locations could be reduced by half without affecting the estimation.

survival: Cormack-Jolly-Seber (CJS) models estimate animal survival from capture-recapture data collected by capturing, tagging, and releasing animals in the population being studied. Feeding data on recapture rates into the CJS model allows survival rate to be estimated.

Figure 5.1: Runtime statistics of BayesSuite: (a) IPC, (b) i-cache MPKI, (c) branch MPKI, (d) LLC MPKI, (e) bandwidth (MB/s), and (f) total run time (s).

5.4 Performance Analysis

In this section, to understand the execution behavior of Bayesian inference on general-purpose microprocessors, we conduct system- and architecture-level analysis of BayesSuite. Through single-core performance analysis, we show that most BayesSuite workloads have higher computational efficiency than conventional sequential CPU benchmarks like SPEC CPU2006, except for some outliers suggesting possible computational bottlenecks (Section 5.4.1). By studying multicore performance, we further find that the bottleneck for BayesSuite multicore scaling is last-level cache (LLC) size (Section 5.4.2).

5.4.1 Single Core Performance

In this subsection, we study the single-core performance of BayesSuite. The experiments are conducted on a modern CPU, the Intel® Core™ i7-6700K, which has four physical cores, 4.2 GHz single-thread frequency, an 8 MB last-level cache, and 34 GB/s maximum memory bandwidth. The performance characteristics are collected with performance counters. The sampling does not affect the results because the workloads behave consistently throughout the execution. We use RStan, the R interface of Stan, with version 2.17.39, and the R version is 3.4.4.

WORKLOAD DIVERSITY: The varying complexities of inference models and datasets among the workloads lead to variations in runtime behavior. Figure 5.1 shows several key dynamic characteristics of the workloads, including instructions per cycle (IPC), instruction cache misses per thousand instructions (MPKI), branch misprediction MPKI, last-level cache (LLC) MPKI, average memory bandwidth, and total execution time.

Total execution time and average bandwidth vary significantly across benchmarks, and the workloads also differ at the architecture level. For instance, the IPC of votes is more than 1.7× that of butterfly. Similar variation in the other microarchitecture characteristics demonstrates BayesSuite’s diversity.

BENIGN ARCHITECTURAL BEHAVIOR: Although Bayesian inference workloads differ in absolute behavior, they tend to use the CPU microarchitecture efficiently, as suggested by the IPC values in Figure 5.1a. Instruction-level parallelism is greater in Bayesian inference workloads than in traditional sequential applications like event-driven web servers219 and most SPEC CPU2006 benchmarks. Higher IPC results from efficient instruction supply and data feeding. Specifically, for most workloads the instruction cache MPKI (Figure 5.1b) and branch misprediction MPKI (Figure 5.1c) are several times lower than for SPEC CPU2006105 and datacenter workloads112. Similarly, the LLC miss rate of Bayesian inference workloads is also insignificant, corroborated by the low bandwidth requirement (hundreds of MB/s for most of the workloads shown in Figure 5.1e, compared to tens of GB/s reached in server systems). The low data bandwidth indicates that the working set of Bayesian inference workloads can largely fit in on-chip memory.

COMPUTATION BOTTLENECKS: Although the average performance of BayesSuite is benign, there are some special cases. The i-cache and LLC MPKI of tickets are higher than those of the other BayesSuite workloads. By studying multicore behaviors in Section 5.4.2, we will show more workloads suffering from architectural bottlenecks in detail. The execution times of tickets, memory, disease, and ode are much higher, which is not intrinsic but an algorithmic artifact. We will examine it in Section 5.6.

PERFORMANCE OF HMC: The single-core performance characteristics of HMC are very similar to those of NUTS. As a result, we do not include the HMC data and focus on NUTS in the rest of this chapter. The IPC of HMC ranges from 1.5 to 2.7. The LLC MPKI of tickets is 8.3, and that of the other workloads is below 1 MPKI. The memory bandwidths of ad and tickets are over 12 GB/s and that of memory is close to 10 GB/s. The memory bandwidths of the other workloads are all below 100 MB/s.

ARCHITECTURAL IMPLICATION: The benign architectural behavior of BayesSuite workloads on a modern CPU, together with the CPU’s general-purpose programmability, suggests that the CPU is a good execution target for Bayesian inference. The observed cache bottleneck is the exception, which we will analyze in the next section. We will discuss GPUs and specialized hardware accelerators in Section 5.7.

5.4.2 Architectural Bottleneck

In this subsection, we analyze the performance bottlenecks of BayesSuite. We find that the multicore scalability of BayesSuite is strongly correlated with last-level cache (LLC) size.

PARALLELISM OPPORTUNITY: Sampling algorithms are inherently parallel in that the computations of different chains are independent. The for loop at line 1 of Algorithm 1 can be completely parallelized, providing opportunities for performance improvement using multiple cores. We sweep the number of CPU cores used while keeping 4 Markov chains as suggested in35, and the number of iterations as defined in the original applications.
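As a minimal sketch of this chain-level parallelism (an illustration only, not Stan's implementation), independent chains can be dispatched to separate processes, one per core; run_chain below is a hypothetical stand-in for one complete sampling chain.

import multiprocessing as mp
import random

def run_chain(args):
    seed, n_iterations = args
    rng = random.Random(seed)
    state, samples = 0.0, []
    for _ in range(n_iterations):
        state += rng.gauss(0.0, 1.0)   # placeholder transition standing in for one NUTS iteration
        samples.append(state)
    return samples

if __name__ == "__main__":
    n_chains, n_iterations = 4, 2000            # 4 chains, as suggested in the text
    with mp.Pool(processes=n_chains) as pool:   # one core per chain
        chains = pool.map(run_chain, [(s, n_iterations) for s in range(n_chains)])
    # end-to-end latency is bounded by the slowest chain (see Section 5.6)
    print(len(chains), "chains x", len(chains[0]), "samples")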

PERFORMANCE ANALYSIS: As shown in the previous section, branch misses are minimal, the i-cache is private to each core, and memory bandwidth correlates with LLC miss rates. Therefore, we focus on the LLC miss rate in this section and show the IPC, LLC MPKI, and speedup in Figure 5.2.

The workloads are sorted by the LLC MPKI of 4 cores. We use speedup and IPC as performance metrics.

Figure 5.2: The IPC, LLC miss rates, and speedups of running the workloads on 1 to 4 cores of a Skylake processor.

Speedup is typically the most important performance metric for users. Across BayesSuite, the speedup using four cores is less than 4 because the execution times of the 4 chains are not equal and the 4-core execution time depends on the slowest chain, which will be explained in Section 5.6. We observe that the performance scaling is constrained by LLC misses. This is because when using one core, the four chains are executed sequentially and only one chain needs to fit into the LLC at a time.

On the other hand, when using four cores with each core executing one chain, the global working set becomes 4× larger and sometimes does not fit in the LLC. This causes more frequent accesses to off-chip memory, thereby limiting performance scalability.

In Figure 5.2, when the 4-core LLC MPKI is larger than 1, the speedup does not scale linearly with the number of cores, as with ad, survival, and tickets, whose maximum speedups are less than 2. tickets has especially frequent LLC misses, up to 7.7 MPKI for 1 core and 20 MPKI for 4 cores. The high LLC miss rate leads to up to 25 GB/s memory bandwidth for the three workloads, which is not shown here.

The speedup is affected not only by LLC miss rates but also by the fact that multicore latency is constrained by the slowest chain. Thus, we also compare IPC values, which helps to see how the LLC affects the efficiency of the architecture. As the number of cores in use increases, workloads such as memory, 12cities, ad, survival, and tickets have lower IPC and more LLC misses. Increased working set size leads to more frequent LLC misses because every chain fetches data independently. The resulting performance degradation reduces IPC.

ARCHITECTURAL IMPLICATION: LLC size is the major architectural bottleneck for BayesSuite workloads. Therefore, distinguishing workloads with large LLC needs before execution is valuable for computing resource management.

5.5 Bottleneck Resolution

In the previous section, we showed that BayesSuite workloads are constrained by LLC size, making it important to identify workloads with high LLC needs. To avoid the bottleneck, we first demonstrate our LLC miss prediction, using static indicators extracted from the model and data (Section 5.5.1). We then show that prediction can facilitate a speedup of 1.16× through scheduling of BayesSuite workloads on appropriate platforms (Section 5.5.2).

5.5.1 Last Level Cache Miss Prediction

We find that 4-core LLC miss rates can be predicted using a static feature, the modeled data size. Modeled data are the observed data required for finding a likelihood distribution. More specifically, modeled data are used to compute the acceptance rate in line 5 of Algorithm 1. A larger modeled data size translates to more computation and possibly a larger working set size. Note that the exact sizes of modeled data (on the order of KBs) are not working set sizes (on the order of MBs), but are only proportional to them, because there are more computations and intermediate variables in the inference algorithm, such as the likelihood and the Hamiltonian computation, and the automatic tuning in NUTS.

Figure 5.3 plots 4-core LLC miss rates against corresponding modeled data sizes. Points with labels suffixed -h and -q are for runs using half and a quarter of the original modeled data, respectively.

We find that modeled data sizes are positively correlated with the 4-core LLC miss rates. Particularly for workloads with LLC MPKI larger than 1, modeled data size accurately predicts LLC miss rate.

For the workloads with LLC MPKI less than 1, the correlation is weaker, so the points with y-axis values less than 1 do not form a straight line. That is because when the LLC miss rate is low, it is more affected by factors such as the memory prefetcher, the design of the LLC, including its size and associativity, the structure of the cache hierarchy, and the replacement policy.

Figure 5.3: LLC miss rate prediction (4-core LLC MPKI versus modeled data size in KB). Points with labels suffixed -h and -q are for runs using half and a quarter of the original modeled data, respectively.

ARCHITECTURAL IMPLICATIONS: Based on Figure 5.3, workloads with LLC MPKI larger than 1, including tickets, survival, and ad, can be identified and predicted by setting a proper threshold for modeled data size. Resource management mechanisms can use the prediction to make better use of available computing resources. The threshold can be adjusted accordingly when applied to other machines.
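A minimal sketch of such a threshold predictor is shown below. The per-workload modeled data sizes and the cutoff are illustrative assumptions rather than measured values from this chapter, and the cutoff would be calibrated for each machine.

MODELED_DATA_SIZE_KB = {        # hypothetical static features extracted before execution
    "tickets": 400.0, "survival": 80.0, "ad": 60.0,
    "votes": 8.0, "12cities": 5.0, "butterfly": 3.0,
}
LLC_BOUND_CUTOFF_KB = 50.0      # assumed calibration point for the 4-core Skylake

def is_llc_bound(workload):
    # predict whether a workload will exceed roughly 1 LLC MPKI on 4 cores
    return MODELED_DATA_SIZE_KB[workload] >= LLC_BOUND_CUTOFF_KB

print([w for w in MODELED_DATA_SIZE_KB if is_llc_bound(w)])
# -> ['tickets', 'survival', 'ad'] under the assumed sizes and cutoff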

5.5.2 Evaluation

In this section, we show that choosing proper platforms based on LLC miss prediction achieves 1.16× speedup in BayesSuite compared to using one platform alone.

Experimental Setup

We use two contemporary Intel CPU platforms in our evaluation: Broadwell (E5-2697A v4) and Skylake (i7-6700K), each with distinct specifications. We summarize the specifications in Table 5.2, including microarchitecture, technology, peak frequency, physical core count, LLC size, memory bandwidth, and thermal design power (TDP).

We use the Broadwell server as our baseline because it was launched in 2016, later than the Skylake machine, and because it is a server-class machine while the Skylake machine is desktop-class. This work shows that better performance can be achieved by adding an older desktop machine. The Broadwell processor has a large LLC (40 MB) with only a modest peak CPU frequency (3.6 GHz). In contrast, the Skylake processor has a high CPU frequency but a small LLC. We show that they naturally complement each other for BayesSuite workloads.

Table 5.2: A summary of experiment platforms.

Codename    Processor      Microarch   Tech (nm)   Turbo (GHz)   # Cores   LLC (MB)   Bdw (GB/s)   TDP (W)
Skylake     i7-6700K       Skylake     14          4.2           4         8          34.1         91
Broadwell   E5-2697A v4    Haswell     14          3.6           16        40         78.8         145

Performance Comparison

According to the models presented in Section 5.5.1, the LLC-bound workloads are ad, survival, and tickets. In order to optimize performance, we choose to run them on Broadwell for its large LLC and the other workloads on Skylake. Using Broadwell as the baseline, we achieve a 1.16× speedup by adding a Skylake machine.
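As a minimal sketch of the resulting assignment policy (the set of predicted LLC-bound workloads is taken from the text above; the platform strings are merely labels, not an actual scheduler API):

LLC_BOUND = {"ad", "survival", "tickets"}    # predicted LLC-bound workloads (Section 5.5.1)

def choose_platform(workload):
    # large LLC for LLC-bound workloads, high frequency for the rest
    return "Broadwell (40 MB LLC)" if workload in LLC_BOUND else "Skylake (4.2 GHz)"

for w in ("tickets", "votes", "12cities", "ad"):
    print(w, "->", choose_platform(w))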

In Figure 5.4, we compare the speedup over Broadwell, IPC, and LLC MPKI of the platforms running with 4 cores. Speedup shows the end-to-end performance. IPC shows the throughput and performance regardless of frequency. LLC miss rate is used to explain speedup and IPC differences.

Skylake outperforms Broadwell on all workloads other than ad, survival, and tickets. Broadwell outperforms Skylake in speedup and IPC on those three workloads because its larger (40 MB) LLC leads to lower LLC miss rates. The IPC of memory and 12cities is higher on Broadwell, also due to the much lower LLC miss rates, but Skylake’s high frequency gives it a slightly better overall speedup than Broadwell.

Figure 5.4: Performance comparison of the platforms (speedup over Broadwell, IPC, and LLC MPKI with 4 cores).

ARCHITECTURAL IMPLICATION: Following the predictions of the models in Section 5.5.1, we run ad, survival, and tickets on Broadwell and the other workloads on Skylake. This yields an average speedup of 1.16×. LLC size and frequency are the key factors determining the performance of BayesSuite workloads.

5.6 Algorithm Convergence

Previous sections analyze the performance and architectural bottlenecks of BayesSuite. In this section, we study algorithmic aspects: the convergence and result quality of the BayesSuite benchmarks.

The number of sampling iterations (line 3 in Algorithm 1) is selected by users, and we find that all BayesSuite workloads have redundant iterations. We propose runtime convergence detection for users who want quick inference results with minimal overhead (Section 5.6.1). We then show that with convergence detection, BayesSuite workloads reach better design points and save 70% energy, on average. Overall, we speed up BayesSuite by 5.8× with convergence detection and LLC miss prediction (Section 5.6.2).

5.6.1 Convergence Study

Convergence detection is closely related to the number of chains used. Multiple chains help prevent convergence to local optima, and complex models benefit from more chains. Convergence detection is based on the Gelman-Rubin diagnostic (Rˆ)73, which quantifies sample variation within and between chains. A smaller Rˆ indicates more consistent samples, and when Rˆ reaches 1, the chains have converged completely. As suggested by Brooks et al.35, we typically use 4 chains. Because several iterations are needed to warm up the chains, we only use the second half of the samples for inferring the posterior distribution35. A value of Rˆ less than 1.1 is taken as indicating convergence35.
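For reference, one standard form of the diagnostic (the basic potential scale reduction factor for m chains of n retained draws each; Stan's split-Rˆ implementation differs in details) is

\begin{align*}
  B &= \frac{n}{m-1}\sum_{j=1}^{m}\bigl(\bar{\theta}_{j}-\bar{\theta}\bigr)^{2}, &
  W &= \frac{1}{m}\sum_{j=1}^{m} s_{j}^{2}, \\
  \widehat{\mathrm{var}}^{+} &= \frac{n-1}{n}\,W + \frac{1}{n}\,B, &
  \hat{R} &= \sqrt{\widehat{\mathrm{var}}^{+}/W},
\end{align*}

where $\bar{\theta}_{j}$ and $s_{j}^{2}$ are the mean and sample variance of chain $j$, and $\bar{\theta}$ is the mean over all chains.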

We study the convergence process and confirm that when using multiple chains, once Rˆ is less than 1.1, the results (posterior distributions) have good quality. To judge quality, we estimate the ground truth by running each benchmark with twice as many iterations as were initially specified by the model developer. To evaluate the intermediate results, we compute the KL divergence (a measure of how much one distribution diverges from another93) between intermediate results and the ground truth. A smaller KL divergence indicates that the result is closer to the ground truth.
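For completeness, the KL divergence of a distribution Q from a distribution P, in its generic discrete and continuous forms, is

\[
  D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}
  \qquad\text{or}\qquad
  D_{\mathrm{KL}}(p \,\|\, q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx .
\]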

We conduct a convergence study for BayesSuite and find that on average, the workloads have over 70% excess iterations. As an example, we show the convergence of 12cities in Figure 5.5. The blue line is Rˆ and the green line shows KL divergence. With more iterations, the KL divergence decreases monotonically, showing that the results are getting closer to the ground truth. The trace of Rˆ fluctuates because the four independent chains are exploring different regions of the space. When they are sampling from the same region, Rˆ gets small; when one chain happens to jump out of that region, Rˆ increases. At the 600th iteration (orange dots), Rˆ is less than 1.1 for the first time and the KL divergence is minimal, indicating that the results are close to the ground truth. The original number of iterations of 12cities is 2000. We find it converges after 600 iterations, eliminating 70% of the sampling iterations as unnecessary.

Reducing excess iterations can increase performance, but the iteration reduction percentage does not directly translate to latency savings. That is because the time required to auto-tune Hamiltonian parameters in NUTS may vary depending on the position of the Markov chain. Within a single chain, the latency per iteration is smaller after the chain converges, and different chains have different latencies. When multiple chains run in parallel, the overall latency is constrained by the slowest chain. For example, for 12cities with 4 chains of 2000 iterations, the latency ratio of the slowest to the fastest chain is 1.7. The latency of 2000 iterations is 865 seconds and that of 600 iterations is 401 seconds, thus the latency is reduced by 53%. For the same reason, we should expect the average latency savings to be less than the iteration reductions.
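The 53% figure follows directly from the two latencies, compared with the 70% reduction in iterations:

\[
  \frac{865-401}{865} \approx 53\%, \qquad \frac{2000-600}{2000} = 70\%.
\]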

RUNTIME CONVERGENCE DETECTION: Detecting convergence at runtime can be implemented by dynamically computing Rˆ in the framework, Stan in this case. Instead of executing a preset number of iterations, as in line 3 of Algorithm 1, the workload exits the sampling loop once it is determined to have converged. This detection is optional: statisticians who are developing new models may prefer to choose the number of iterations and test the model themselves, but convergence detection can give quick results for those interested in using existing models with their own data.

Figure 5.5: The convergence process of 12cities in log scale. The blue line is Rˆ, for detecting convergence. The green line is the KL divergence between the current result and ground truth. The orange dots mark the convergence point.
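A minimal sketch of such a runtime check is shown below (an illustration, not Stan's implementation): draw_one_iteration is a hypothetical stand-in for one NUTS transition, and gelman_rubin follows the basic diagnostic given in Section 5.6.1.

import random
import statistics

def draw_one_iteration(state, rng):
    return state + rng.gauss(0.0, 1.0)          # placeholder transition kernel

def gelman_rubin(chains):
    # basic (non-split) R-hat over equal-length chains
    m, n = len(chains), len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)       # between-chain variance
    w = statistics.fmean(statistics.variance(c) for c in chains)   # within-chain variance
    var_plus = (n - 1) / n * w + b / n
    return (var_plus / w) ** 0.5

def run_until_converged(n_chains=4, max_iters=2000, check_every=100, threshold=1.1):
    rngs = [random.Random(seed) for seed in range(n_chains)]
    states = [0.0] * n_chains
    chains = [[] for _ in range(n_chains)]
    for it in range(1, max_iters + 1):
        for j in range(n_chains):
            states[j] = draw_one_iteration(states[j], rngs[j])
            chains[j].append(states[j])
        if it % check_every == 0:
            kept = [c[it // 2:] for c in chains]    # discard the first half as warm-up
            if gelman_rubin(kept) < threshold:
                return it
    return max_iters

print("stopped after", run_until_converged(), "iterations")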

OVERHEAD ANALYSIS: We implement the computation of Rˆ in C++, based on the algorithm in73. We consider the worst case by taking the maximum number of iterations in BayesSuite (2000) and half of the samples for inference (i.e., 1000 data points from each of the 4 chains). That takes 0.06 seconds on a single core of Skylake, which is minimal. In reality, optimizations can be applied to reduce the overhead.

ARCHITECTURAL IMPLICATION: By our analysis, the overhead of convergence detection is minimal, and the potential savings are substantial.

5.6.2 Design Space Exploration

We evaluate the convergence detection mechanism using design-space exploration (DSE) techniques and compare the optimization decision with other (suboptimal) design points in terms of latency and power savings.

Our DSE setup considers three major parameters: the number of CPU cores, the number of chains, and the number of iterations. We show the design space of 4 representative workloads on Skylake as a case study in Figure 5.6. The blue stars are the original user settings. The triangle markers denote the design points that are achievable with convergence detection under 1, 2, and 4 cores. We refer to the design point that has the lowest energy consumption as the energy oracle, denoted by red stars in the figure.

Figure 5.6: Design space exploration examples (power versus latency for ad, ode, survival, and memory).

We find that the energy oracle design points always use only 1 or 2 chains and a small number of iterations, while the user-specified settings always use 4 chains with a much higher number of iterations. Although these points have small KL divergence, without knowing the ground truth a priori, it is infeasible to use only 1 or 2 chains; hence the name oracle.

ENERGY SAVINGS: The triangles that we choose are much closer to the red stars than the original user settings. When applying this technique, a scheduler can be programmed to choose one of the triangles to optimize power or latency. In this section, we choose to use energy as the cost function, considering both latency and power.

We summarize our energy savings for 10 workloads on two platforms in Figure 5.7. The savings are percentages relative to the original user settings. The average energy saving is 70% across the 2 platforms and 10 workloads.

Figure 5.7: A summary of energy savings (in %) of our design points and the energy oracle on Broadwell and Skylake.

ARCHITECTURAL IMPLICATION: BayesSuite workloads reach better design points with convergence detection.

5.6.3 Overall Speedup

We present the overall speedup resulting from the techniques in this chapter: convergence detection in Section 5.6.1 and selecting the best platform in Section 5.5. In Figure 5.8 we show the overall speedups of BayesSuite over the baseline. ad, survival, and tickets are on Broadwell and the rest of the workloads are on Skylake. ode and memory achieve higher speedups than the energy oracle, which can be explained using Figure 5.6. ode and memory use 4 cores (orange triangles) on Skylake, and have lower latency than the red stars, the oracle points. The same explanation applies to disease.

With the techniques proposed, the average speedup of BayesSuite is 5.8×, and the oracle average speedup is 6.2×, over the baseline with no convergence detection running on the Broadwell server.

Figure 5.8: Overall speedups of the techniques proposed in this chapter. Note that the oracle points are with respect to energy, not performance.

5.7 Implications for Future Acceleration

Intelligent scheduling on today’s server processors can readily provide performance improvement for Bayesian inference jobs. Pushing the efficiency a step further requires us to apply hardware specialization. Previous work on hardware specialization only focuses on a specific type of model. In this section, based on the insights that we gained from analyzing and improving workload efficiency on CPUs, we examine opportunities to accelerate Bayesian inference using specialized hardware such as GPUs, FPGAs, and ASICs, in preparation for an accelerator-centric system to further speed up Bayesian inference workloads.

We first discuss the choice of different acceleration approaches. We argue that a programmable SIMD architecture augmented with special functional units is a good accelerator style that matches well with a wide range of Bayesian inference workloads (Section 5.7.1). We then discuss the memory system requirements on the LLC and i-cache (Section 5.7.2).

5.7.1 Hardware Choice

The first and foremost question is which accelerator style, e.g., single instruction multiple data (SIMD) or coarse-grained reconfigurable architecture (CGRA), is a good fit for Bayesian inference workloads. We find that these workloads exhibit both coarse-grained and fine-grained parallelism.

CHAIN-LEVEL PARALLELISM: Both coarse-grained and fine-grained parallelism exist in sampling algorithms. Coarse-grained parallelism typically manifests at the chain level (line 1 in Algorithm 1), which can be captured by a multicore CPU as we discussed in Section 5.4.2. To better exploit the chain-level parallelism, the key is to address the LLC bottleneck inhibiting CPU core scaling shown in Figure 5.2.

COMPUTATION PARALLELISM: Within a chain, there are opportunities for parallelism in the computation. For example, in the acceptance rate computation iterating through a series of observed data (line 5 in Algorithm 1), the computation of each observed data point can be conducted in parallel. At a finer-grained level, BayesSuite workloads contain a diverse collection of vector and matrix operations beyond matrix multiplication, indicating the importance of architectural support for such operations. Therefore, the workloads can benefit from the parallelism of SIMD-style hardware like GPUs.

VARIABLE SAMPLING PARALLELISM: The sampling of variables (line 4 in Algorithm 1) provides parallelism opportunities as well, which can benefit from SIMD hardware or specialized accelerators. When the models are represented as graphs, in which the model variables are nodes and variable dependencies are edges, the variables at the same layer can be sampled in parallel. Previous work on FPGAs and ASICs exploits this parallelism109,69. With multiple sampling units on chip, the sampling latency can be shortened.

It is beneficial to implement the sampling units as accelerators. The current implementation of sample drawing needs the cumulative distribution functions (CDFs) of the corresponding distributions, so the implementation depends on individual distributions. We study the distributions in BayesSuite and find that the most popular distributions are Gaussian and Cauchy. It is worth building accelerators for the most popular distributions. The CDFs use functions with values stored in lookup tables, such as the error function erf (Gaussian) and the arctangent function atan (Cauchy), which introduces overhead to the system and trades off precision for efficiency. Sampling accelerators can help to reduce system overheads by having their own scratchpad memory or private cache.
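As a generic illustration (not the accelerator design or Stan's code) of where such special functions enter sample drawing, inverse-transform sampling maps a uniform draw through the inverse CDF, which involves the inverse error function for the Gaussian and the tangent (inverting the atan-based CDF) for the Cauchy; erfinv here comes from SciPy.

import math
import random
from scipy.special import erfinv     # inverse error function

def sample_gaussian(mu, sigma, u):
    # invert the Gaussian CDF, which is built from erf
    return mu + sigma * math.sqrt(2.0) * erfinv(2.0 * u - 1.0)

def sample_cauchy(x0, gamma, u):
    # invert the Cauchy CDF F(x) = 1/2 + atan((x - x0)/gamma) / pi
    return x0 + gamma * math.tan(math.pi * (u - 0.5))

rng = random.Random(0)
u = rng.random()
print(sample_gaussian(0.0, 1.0, u), sample_cauchy(0.0, 1.0, u))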

PARALLELISM IN OTHER ALGORITHMS: Finally, we note that different inference algorithms exhibit different opportunities for parallelism. For instance, sampling algorithms such as the one studied in this chapter are general but sequential. Exact inference has more parallelism but the complexity is exponential. Some recent work combines sampling and exact inference to get the parallelism of exact inference and the linear complexity of sampling138,60,160. The general idea is to do exact inference with a subset of the data within the MCMC sampling iterations. Such algorithms, exposing ample parallelism, are promising to explore in future work.

NEED FOR PROGRAMMABILITY: As we discussed, the workloads have very diverse models, requiring different combinations of matrix, vector, and scalar operations, as well as different preferences for distributions. Thus, to accelerate Bayesian inference workloads, the hardware must be programmable enough to express these models.

5.7.2 Memory System

To reduce the overhead of transferring data between the LLC and main memory, the LLC should be large enough to contain the whole working set. According to the results in Figure 5.4, an LLC of 2 MB per core (8 MB / 4 cores on Skylake) is large enough for workloads other than ad, survival, and tickets. An LLC smaller than 10 MB per core (40 MB / 4 cores on Broadwell) is enough to hold ad and survival. Workloads like tickets need a larger LLC. However, with larger datasets applied to Bayesian models, simply scaling up the LLC is not the solution. Instead, the inference algorithm should be tuned to subsample the data such that the working set fits in the LLC. Figure 5.3 can be used to estimate the proper sub-sampled data size.

In addition to LLC size, a 32 KB i-cache constrains the performance of tickets, as shown in Figure 5.1, and leads to high LLC miss rates in Figure 5.4. Thus, the hardware needs i-caches larger than 32 KB to better serve workloads like tickets.

5.8 Related Work

In this section, we compare and summarize previous work related to Bayesian models, inference algorithms, hardware advancements for Bayesian inference, probabilistic programming, and workload characterization.

BAYESIAN MODELS: In addition to the BayesSuite workloads, Bayesian inference has been applied to image classification127,165, semantic analysis77, language learning210, program synthesis127,66,159, intuitive physics80,29,28,202, and structure learning90,116,184. A Bayesian approach that is sometimes called Bayesian neural networks (BNN)146 is being applied to deep learning to learn weight distributions. These models are known to achieve better results by using optimization techniques, rather than more general and easy-to-use approaches like NUTS. It is important to note that models can have varying results, depending on the inference algorithm used.

INFERENCE ALGORITHMS: Exact inference is often intractable as it has exponential complexity, while sampling is linear with regard to the number of samples139. This chapter focuses on NUTS, a variant of Hamiltonian Monte Carlo that requires no hand-tuning of step size and number of steps95. It is a turnkey sampling method that can be used in a black-box fashion and has been adopted by Stan as the default inference engine. Other sampling algorithms include the Gibbs sampler, Hamiltonian Monte Carlo, slice sampling, sequential Monte Carlo, and particle Markov chain Monte Carlo19. Variational inference is an optimization algorithm that tends to be fast but has no guarantee of convergence to global optima195.

We discussed the combination of sampling and exact inference in Section 5.7.1138,60,160. Those hybrid algorithms have the potential to benefit from parallel hardware such as GPUs and we plan to explore them in future work.

HARDWARE ADVANCEMENTS: Efforts have also been made to speed up Bayesian inference with specialized hardware. The BAMBI project† proposed hardware implementations of probabilistic computations69,38,50. Mansinghka et al. have built stochastic circuits and evaluated them with Markov Random Fields and the Dirichlet Process Mixture Model139,109. Some of the acceleration work has been done as parallel implementations on GPUs216,70,198 and scalable CPUs144,207. Furthermore, there are FPGA and ASIC implementations accelerating BNNs36 and Markov Random Fields on perceptual applications120,121. Another work196 uses a novel device to implement the sampling procedure.

These projects primarily focus on speeding up a specific type of model or application and often require substantial effort for programming GPUs or designing the hardware. Our work is the first in the literature to introduce Bayesian inference to the architecture community as a key machine learning technique and to reveal computational bottlenecks across a suite of benchmarks on commodity datacenter CPUs.

PROBABILISTIC PROGRAMMING: Some probabilistic programming frameworks focus mostly on language design and expressiveness139, and some provide efficient sampling for a certain subset of models152,94,134. Infer.NET focuses on variational approximation197. TensorFlow Probability11 and Edward188 support distributed and parallel training on top of TensorFlow, and Pyro7 is based on PyTorch. We choose Stan mainly because of its unmatched popularity in the science community.

WORKLOAD CHARACTERIZATION: People develop benchmark suites to facilitate the advancement of architecture92,161,33 or to introduce important workloads15,194,186, as BayesSuite does. We use static workload features to estimate dynamic characteristics, similarly to58, but we differ by focusing on a key bottleneck of Bayesian inference. Therefore, our static analysis has minimal overhead.

†Bottom-up Approaches to Machines dedicated to Bayesian Inference: www.bambi-fet.eu

5.9 Summary

The advances enabled by deep learning have overshadowed other aspects of the vast field of machine learning. Bayesian inference is a particularly important machine learning technique that complements deep learning in many domains. This chapter introduces BayesSuite, a suite of Bayesian inference workloads to help bridge the gap between Bayesian inference researchers and computer architects. Through detailed characterization, we show that these workloads exhibit diverse behaviors that call for a variety of processor architectures. They also entail inherent redundancy that leads to execution inefficiency. We propose a scheduling mechanism and a computation elision technique to automatically speed up Bayesian inference workloads. Experiments and evaluations conducted on real systems show that we speed up BayesSuite workloads by 5.8× on average.

‘Since when,’ he asked, ‘Are the first and last line of any poem Where the poem begins and ends?’

Seamus Heaney

6 Future Directions

Performance analysis has been driving software and hardware advancements for decades. The surge of machine learning applications and specialized software and hardware systems brings extra challenges to performance analysis. This dissertation is one of the first efforts in the architecture and machine learning community to systematically analyze software and hardware designs for machine learning. We first analyzed the repositories of traditional CPU benchmark suites, SPEC and Geekbench, and showed the maximum information available from and limitations of such suites. We then dived into deep learning, thoroughly studied the performance impact of key framework features, and proposed a set of simple guidelines to achieve higher training and inference speedups.

To further assist performance analysis and extract deeper insights, we proposed a systematic analysis methodology that can be applied to various hardware and software systems. We demonstrated the utility of this methodology in deep learning by analyzing specialized platforms (TPU v2/v3), heterogeneous platforms (TPU, GPU, and CPU), and software stacks (TensorFlow and CUDA). As the first step towards understanding emerging applications, we analyzed Bayesian inference workloads and proposed several mechanisms to improve their performance.

Deep learning is a popular domain, so its performance analysis is crucial and needs to be conducted regularly. Meanwhile, architects need to be aware of emerging workloads. Here we summarize several future research directions.

6.1 Deep Learning

For deep learning performance analysis, future research directions can be extended along the workload, software, and hardware dimensions. The three analysis methods presented in this thesis can be applied to those directions to extract different insights.

6.1.1 Workloads

At the workload level, new model types can be studied by releasing and analyzing their state-of-the-art implementations, key operators, and parameterized models. For example, recommendation models with large embedding tables open up many research opportunities. Their importance in production and their memory intensiveness have attracted much attention recently65,51,145,79. Gupta et al.79 profile Facebook’s recommendation models and release a synthetic model with parameterized attributes. More studies can be done to explore the model design space and stress system limits, similar to Chapter 4. There may also be a range of model attributes that work better on a given system than others.

The parameterized analysis in Chapter 4 focuses on training workloads. Inference workloads may provide different insights because batch sizes are usually much smaller and only activations need to fit in on-chip memory. In such cases, latency, rather than throughput, would be the most crucial metric; network latency would be another system knob that researchers can experiment with; overhead may not be a big issue for large-batch training, but it may be important for inference; and memory capacity may be less important for cross-platform comparison.

6.1.2 Software

At the software level, systematic performance analysis should be conducted on more aspects to enable better trade-offs in specific scenarios. For compilation styles, just-in-time (JIT) and ahead-of-time (AOT) compilation are two major choices. JIT compiles and interprets scripts dynamically, which incurs some runtime overhead, while AOT does so statically. Since AOT has information about the whole computational graph, it may enable better optimizations, while JIT may only optimize local subgraphs. On the other hand, when deploying to edge devices, AOT generates one binary file per model, while JIT can interpret all scripts with one interpreter.

As one of the key compilation optimizations, operator fusion is crucial to reduce memory traffic and improve performance. Current fusion techniques are mainly based on heuristics learned from optimizing current workloads, such as ResNet-50. Operator fusion may benefit from the parameterized analysis methodology for systematic studies, with properly constructed parameterized workloads. Systematic analysis can enable more general optimization guidelines.
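As a toy illustration of what fusion buys (conceptual only, not TVM or XLA output), the unfused version of two elementwise operators materializes an intermediate array in memory, while the fused version makes a single pass over the data:

import numpy as np

def unfused(x):
    t = x * 2.0                     # intermediate array materialized in memory
    return np.maximum(t, 0.0)       # second pass reads the intermediate back

def fused(x):
    out = np.empty_like(x)
    for i in range(x.size):         # single pass over the data, no intermediate array
        v = x.flat[i] * 2.0
        out.flat[i] = v if v > 0.0 else 0.0
    return out

x = np.arange(-4.0, 4.0)
assert np.allclose(unfused(x), fused(x))

In a real compiler the fused loop would be emitted as native code; the point here is only the removal of the intermediate memory traffic.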

Frameworks with opaque (already implemented) operators are studied in Chapter 3. Other framework types provide research opportunities, such as those that automatically generate optimized operators, including Tensor Comprehensions192 and TVM45. Such frameworks have unique design features, leading to different performance implications, and therefore they need to be studied in depth.

6.1.3 Hardware

At the system and architecture level, analysis can be performed for more platforms, such as multi-node systems, edge devices, and the latest AI hardware platforms. Frameworks for more hardware platforms can also be studied, such as those for the GPU and TPU, similar to the CPU framework analysis in Chapter 3. For most compute-intensive FC and CNN models, Figure 5.8 shows that the TPU’s FLOPS utilization is lower than 60% and the GPU’s is lower than 20%. This indicates that there are likely large fractions of overhead, possibly originating from data preparation or synchronization to and from the vector/matrix units. Detailed analysis is needed to reveal the root causes and approach solutions. Further, such analysis may be conducted for the software stacks of other novel hardware platforms.

6.2 Bayesian Inference

This dissertation presents the first such analysis research for Bayesian workloads (Chapter 5). There are plenty of research opportunities for performance analysis of Bayesian inference. We discuss them from the workload, software, and hardware perspectives.

6.2.1 Workloads

The models presented in Chapter 5 are small enough to run on one CPU chip. Usually, larger models are used to solve large-scale problems with a cluster of CPUs or GPUs, and they especially need performance analysis and optimization. Thus, the next important step from my perspective is systematic analysis of large-scale Bayesian workloads that solve real problems such as document classification or gene sequencing. To study large-scale workloads, the parameterized methodology can be applied to complement existing workloads and stress system limits, which involves the development of parameterized Bayesian benchmarks.

Latent Dirichlet Allocation (LDA) is one example of such models. It is usually used for document classification with large datasets such as the New York Times and Wikipedia corpora. Because it is an important type of model, many algorithmic and architectural efforts have been made to optimize it. Choosing among LDA algorithms and platforms often involves trade-offs. Here we present the preliminary results of ongoing research in our group to motivate further studies.

Figure 6.1 shows the per-core runtime characteristics of five LDA algorithms on a four-core CPU.

From the earliest publication to the latest, the five algorithms are a simple LDA76, sparse LDA212, alias LDA133, light LDA215, and FTree LDA214. Each algorithm (workload) has two types of threads: sampling threads and updater threads. In Figure 6.1, we use one updater thread, and one and three sampling threads, which correspond to the left two bars and the right four bars for each workload, respectively.

The updater threads are the shaded bars in Figure 6.1(a) and orange bars in Figure 6.1(b) and (c). We conduct top-down analysis213, and show the FLOPS and LLC miss rates.

The top-level analysis in Figure 6.1(a) shows that the workloads on the right have more instruction retiring cycles and fewer backend-bound cycles. The FLOPS in Figure 6.1(b) shows that the workloads with higher instruction retiring ratios have higher FLOPS. The LLC MPKI in Figure 6.1(c) shows that the workloads with lower backend-bound ratios have lower LLC miss rates. To summarize, the results show that more sophisticated algorithms have more architectural bottlenecks. On the other hand, the more sophisticated ones are developed for faster convergence, e.g., FTree claims to sample with logarithmic complexity214. This example of LDA shows an interesting trade-off between convergence speed and architectural bottlenecks.

6.2.2 Software

Bayesian frameworks contain inference engines and support the expression of generative models. In Chapter 5 we focused on Stan because it provides more workloads than other frameworks. With the adoption of Bayesian techniques in deep learning research, more and more Bayesian frameworks are extended from popular deep learning frameworks, such as TensorFlow Probability11 and Pyro7 (from PyTorch). Deep learning frameworks are not at a stable stage yet, and Bayesian frameworks are even further from being stable. Therefore, architects need to stay tuned, while being willing to take risks by devoting time and effort to immature frameworks.

6.2.3 Hardware

To improve the performance of an emerging workload, the most efficient way is to take advantage of existing systems. That is why we started the analysis research on CPUs in Chapter 5. Larger CPU systems, multi-node CPUs, and warehouse-scale computers are the next promising playground.

Another exciting research direction is generic Bayesian frameworks implemented for GPUs. The Stan community has been discussing and implementing Stan for GPUs since 2017*. Implementing specific Bayesian models on GPUs is not a new topic, but providing a generic GPU framework for a variety of workloads is very complicated. It involves many design choices including libraries, floating-point precisions, levels of parallelism, and so on. Similar research and implementations can also be done with other frameworks such as TensorFlow Probability and Pyro.

Accelerators or systems-on-chip (SoCs) can be designed after getting a thorough understanding of Bayesian workloads. Conventional wisdom is to accelerate random number generators and vector and matrix operations. On the other hand, we believe the acceleration of new workloads needs to be conducted across the stack, i.e., we need to pick suitable inference algorithms for specific architectures, just as deep learning took off again because of the ImageNet dataset59, stochastic gradient descent (SGD), and GPUs. As of now, Bayesian inference may be waiting for the best combination of algorithm and hardware platform.

*https://discourse.mc-stan.org/t/stan-on-the-gpu/326/10

Figure 6.1: The runtime characteristics of five LDA algorithms (FTree, light, sparse, alias, and simple), using one and three sampling threads and one updater thread. (a) Cycle breakdown: higher retiring ratio is better; shaded bars are updater threads. (b) Floating-point operations per second: higher is better. (c) LLC misses per kilo instructions: lower is better.

References

[1] (2011). Complete code and data files of the book ”Bayesian population analysis using WinBUGS”. http://www.vogelwarte.ch/de/projekte/publikationen/bpa/complete-code-and-data-files-of-the-book.

[2] (2015). Parking or moving violation tickets in New York City between january 2014 and december 2015. https://raw.githubusercontent.com/jauerbach/ Seventy-Seven-Precincts/master/data/tickets.csv.zip.

[3] (2017). Fatality analysis reporting system. https://www.nhtsa.gov/research-data/ fatality-analysis-reporting-system-fars.

[4] (2017). Historical presidential election data (1976-2016). https://github.com/stan-dev/ stancon_talks/blob/master/2017/Contributed-Talks/08_trangucci/data_pres_ forecast/pres_vote_historical.RDS.

[5] (2017). Stancon 2017. http://mc-stan.org/events/stancon.

[6] (2017). TensorFlow optimizations on modern Intel® architecture. https://software.intel. com/en-us/articles/-optimizations-on-modern-intel-architecture.

[7] (2017). Uber AI Labs open sources Pyro, a deep probabilistic programming language. http://eng.uber.com/pyro/.

[8] (2018). A dataset of 4.5 million stops conducted by the 100 largest local police departments in North Carolina. https://github.com/stan-dev/stancon_talks/blob/master/2018/ Contributed-Talks/11_simoiu/north_carolina.RData.

[9] (2018a). Stan Modeling Language User’s Guide and Reference Manual, Version 2.17.0. http://mc-stan.org/users/documentation/.

[10] (2018b). Stancon 2018. http://mc-stan.org/events/stancon2018/.

[11] (2018). TensorFlow Probability. https://www.tensorflow.org/probability/.

[12] (2018). TensorFlow: Using JIT compilation. https://www.tensorflow.org/xla/jit.

[13] (2019). Folly: Facebook open-source library. https://github.com/facebook/folly.

[14] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., & Zheng, X. (2016). TensorFlow: a system for large-scale machine learning. In OSDI, volume 16 (pp. 265–283).

[15] Adolf, R., Rama, S., Reagen, B., Wei, G.-Y., & Brooks, D. (2016). Fathom: reference workloads for modern deep learning methods. In Workload Characterization (IISWC), 2016 IEEE International Symposium on (pp. 1–10).: IEEE.

[16] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., et al. (2016). Deep speech 2: End-to-end speech recog- nition in English and Mandarin. In International Conference on Machine Learning (pp. 173–182).

[17] Amodei, D. & Hernandez, D. (2018). AI and compute. OpenAI Blog.

[18] Analytics, R. (2014). Google uses R to calculate ROI on advertising campaigns. http://blog.revolutionanalytics.com/2014/09/ google-uses-r-to-calculate-roi-on-advertising-campaigns.html.

[19] Andrieu, C., De Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine learning, 50(1-2), 5–43.

[20] Anju, P. (2018). Tips to improve performance for popular deep learning frameworks on CPUs. Intel Developer Zone.

[21] Ardalani, N., Lestourgeon, C., Sankaralingam, K., & Zhu, X. (2015). Cross-architecture per- formance prediction (XAPP) using CPU code to predict GPU performance. In Proceedings of the 48th International Symposium on Microarchitecture (pp. 725–737).: ACM.

[22] Auerbach, J. (2017). Are New York City drivers more likely to get a ticket at the end of the month? Significance, 14(4), 20–25.

[23] Auerbach, J., Eshleman, C., & Trangucci, R. (2017). A hierarchical model to evaluate policies for reducing vehicle speed in major American cities. arXiv preprint arXiv:1705.10876.

[24] Bahrampour, S., Ramakrishnan, N., Schott, L., & Shah, M. (2016). Comparative study of Caffe, Neon, Theano, and Torch for deep learning. In ICLR.

[25] Banner, R., Hubara, I., Hoffer, E., & Soudry, D. (2018). Scalable methods for 8-bit training of neural networks. arXiv preprint arXiv:1805.11046.

[26] Barham, P. & Isard, M. (2019). Machine learning systems are stuck in a rut. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS 19) (pp. 177–183).: ACM.

[27] Baron, K. T., Gastonguay, M. R., Bioavailability, A., SS, I., & ADDL, M. (2015). Simulation from ODE-based population PK/PD and systems pharmacology models in R with mrgsolve. Omega, 2, 1x1.

[28] Bates, C., Battaglia, P., Yildirim, I., & Tenenbaum, J. B. (2015). Humans predict liquid dy- namics using probabilistic simulation. In CogSci.

[29] Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45), 18327–18332.

[30] Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM review, 59(1), 65–98.

[31] Bhuiyan, A., Abuzaina, M., Hasabnis, N., Ammbashankar, N., Amin, F., Fu, S., & Subra- manian, B. (2019). Improving TensorFlow inference performance on Intel Xeon processors. Intel AI Blog.

[32] Bienia, C., Kumar, S., Singh, J. P., & Li, K. (2008). The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques (pp. 72–81).: ACM.

[33] Bienia, C. & Li, K. (2009). PARSEC 2.0: A new benchmark suite for chip-multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation.

[34] Blog, G. A. (2019). Introducing GPipe, an open source library for efficiently train- ing large-scale neural network models. https://ai.googleblog.com/2019/03/ introducing-gpipe-open-source-library.html.

[35] Brooks, S., Gelman, A., Jones, G., & Meng, X.-L. (2011). Handbook of Markov chain Monte Carlo. CRC press.

[36] Cai, R., Ren, A., Liu, N., Ding, C., Wang, L., Qian, X., Pedram, M., & Wang, Y. (2018). VIBNN: Hardware acceleration of Bayesian neural networks. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 476–488).: ACM.

[37] Campanoni, S., Brownell, K., Kanev, S., Jones, T. M., Wei, G.-Y., & Brooks, D. (2014). Helix- rc: an architecture-compiler co-design for automatic parallelization of irregular programs. ACM SIGARCH Computer Architecture News, 42(3), 217–228.

[38] Canillas, R., Laurent, R., Faix, M., Vaufreydaz, D., & Mazer, E. (2016). Autonomous robot controller using bitwise Gibbs sampling. In The 15th IEEE International Conference on Cognitive Informatics and Cognitive Computing. IEEE.

[39] Case, L. (2018). Volta Tensor Core GPU achieves new AI performance milestones. Nvidia Developer Blog.

[40] Cerebras (2019). Cerebras wafer scale engine: An introduction.

[41] Chao, C. & Saeta, B. (2019). Cloud TPU: Codesigning architecture and infrastructure. Hot Chips.

[42] Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J. W., Lee, S.-H., & Skadron, K. (2009). Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE international symposium on workload characterization (IISWC) (pp. 44–54).: Ieee.

[43] Chen, T., Chen, Y., Duranton, M., Guo, Q., Hashmi, A., Lipasti, M., Nere, A., Qiu, S., Sebag, M., & Temam, O. (2012). BenchNN: On the broad potential application scope of hardware neural network accelerators. In Workload Characterization (IISWC), 2012 IEEE International Symposium on (pp. 36–45).: IEEE.

[44] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., & Zhang, Z. (2015). MXNet: A flexible and efficient machine learning library for heterogeneous dis- tributed systems. arXiv preprint arXiv:1512.01274.

[45] Chen, T., Moreau, T., Jiang, Z., Shen, H., Yan, E., Wang, L., Hu, Y., Ceze, L., Guestrin, C., & Krishnamurthy, A. (2018). TVM: End-to-end optimization stack for deep learning. arXiv preprint arXiv:1802.04799.

[46] Chen, X. E. & Aamodt, T. M. (2011). Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs. ACM Transactions on Architecture and Code Optimization (TACO), 8(3), 10.

[47] Cheng, H.-T. (2016). Wide and deep learning: Better together with Ten- sorFlow. Google AI Blog. https://ai.googleblog.com/2016/06/ wide-deep-learning-better-together-with.html.

[48] Cherkassky, V. & Ma, Y. (2004). Comparison of loss functions for linear regression. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 1 (pp. 395–400).: IEEE.

[49] Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., Ré, C., & Zaharia, M. (2017). DAWNBench: An end-to-end deep learning benchmark and competition. Training, 100(101), 102.

[50] Coninx, A., Bessière, P., & Droulez, J. (2017). Quick and energy-efficient Bayesian computing of binocular disparity using stochastic digital signals. International Journal of Approximate Reasoning, 83, 400–412.

[51] Covington, P., Adams, J., & Sargin, E. (2016). Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM conference on recommender systems (pp. 191–198).: ACM.

[52] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303–314.

[53] Danalis, A., Marin, G., McCurdy, C., Meredith, J. S., Roth, P. C., Spafford, K., Tipparaju, V., & Vetter, J. S. (2010). The scalable heterogeneous computing (SHOC) benchmark suite. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units (pp. 63–74).: ACM.

[54] Dean, J. (2017). Recent advances in and the implications for computer system design. Hot Chips.

[55] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. (2012). Large scale distributed deep networks. In Advances in neural information processing systems (pp. 1223–1231).

[56] Dean, J., Patterson, D., & Young, C. (2018). A new golden age in computer architecture: Empowering the machine-learning revolution. IEEE Micro, 38(2), 21–29.

[57] Delimitrou, C. & Kozyrakis, C. (2013). Paragon: QoS-aware scheduling for heterogeneous datacenters. In Proceedings of the 18th international conference on Architectural support for programming languages and operating systems (ASPLOS), volume 48 (pp. 77–88).: ACM.

[58] Delimitrou, C. & Kozyrakis, C. (2014). Quasar: resource-efficient and QoS-aware cluster management. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems (ASPLOS) (pp. 127–144).: ACM.

[59] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).: IEEE.

[60] Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., & Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. In Advances in neural information processing systems (pp. 3203–3211).

[61] Dorazio, R. M., Royle, J. A., Söderström, B., & Glimskär, A. (2006). Estimating species richness and accumulation by modeling species occurrence and detectability. Ecology, 87(4), 842–854.

[62] Dubitzky, W., Granzow, M., & Berrar, D. P. (2007). Fundamentals of data mining in ge- nomics and proteomics. Springer Science & Business Media.

[63] Eigen (2019). Eigen thread pool. https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/ThreadPool/.

[64] Ein-Dor, P. & Feldmesser, J. (1987). Attributes of the performance of central processing units: a relative performance prediction model. Communications of the ACM, 30(4), 308–317.

[65] Eisenman, A., Naumov, M., Gardner, D., Smelyanskiy, M., Pupyrev, S., Hazelwood, K., Cidon, A., & Katti, S. (2018). Bandana: Using non-volatile memory for storing deep learning models. arXiv preprint arXiv:1811.05922.

[66] Ellis, K., Solar-Lezama, A., & Tenenbaum, J. (2015). Unsupervised learning by program synthesis. In Advances in Neural Information Processing Systems (pp. 973–981).

[67] Eyerman, S., Eeckhout, L., Karkhanis, T., & Smith, J. E. (2009). A mechanistic performance model for superscalar out-of-order processors. ACM Transactions on Computer Systems (TOCS), 27(2), 3.

[68] Eyerman, S., Hoste, K., & Eeckhout, L. (2011). Mechanistic-empirical processor performance modeling for constructing CPI stacks on real hardware. In Performance Analysis of Systems and Software (ISPASS), 2011 IEEE International Symposium on (pp. 216–226).: IEEE.

[69] Faix, M., Laurent, R., Bessière, P., Mazer, E., & Droulez, J. (2016). Design of stochastic ma- chines dedicated to approximate Bayesian inferences. IEEE Transactions on Emerging Topics in Computing.

[70] Ferreira, J., Lanillos, P., & Dias, J. (2015). Fast exact Bayesian inference for high-dimensional models. In Workshop on Unconventional computing for Bayesian inference (UCBI), IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[71] Ganesan, K., Jo, J., Bircher, W. L., Kaseridis, D., Yu, Z., & John, L. K. (2010). System-level max power (SYMPO)-a systematic approach for escalating system-level power consumption using synthetic benchmarks. In Parallel Architectures and Compilation Techniques (PACT), 2010 19th International Conference on (pp. 19–28).: IEEE.

[72] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis, volume 2. Chapman & Hall/CRC Boca Raton, FL, USA.

[73] Gelman, A. & Rubin, D. B. (1992). Inference from iterative simulation using multiple se- quences. Statistical science, (pp. 457–472).

[74] Google (2018). Cloud TPU system architecture. Google Cloud Documentation. https: //cloud.google.com/tpu/docs/system-architecture.

[75] Google (2019). TensorFlow performance guide. https://www.tensorflow.org/guide/ performance/overview. TensorFlow Documentation.

[76] Griffiths, T. L. & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228–5235.

[77] Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological review, 114(2), 211.

[78] Gupta, U., Dai, S., & Zhang, Z. (2015). Rosetta: A realistic benchmark suite for software programmable FPGAs. In Suite of Embedded Applications and Kernels Workshop (SEAK).

[79] Gupta, U., Wang, X., Naumov, M., Wu, C.-J., Reagen, B., Brooks, D., Cottel, B., Hazelwood, K., Jia, B., Lee, H.-H. S., et al. (2019). The architectural implications of facebook’s dnn-based personalized recommendation. arXiv preprint arXiv:1906.03109.

[80] Hamrick, J., Battaglia, P., & Tenenbaum, J. B. (2011). Internal physics models guide proba- bilistic judgments about object dynamics. In Proceedings of the 33rd annual conference of the cognitive science society (pp. 1545–1550).: Cognitive Science Society Austin, TX.

[81] Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neu- ral networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.

[82] Hartstein, A. & Puzak, T. R. (2002). The optimum pipeline depth for a microprocessor. In Proceedings of the 29th annual international symposium on Computer architecture (ISCA) (pp. 7–13).: IEEE Computer Society.

[83] Hasabnis, N. (2018). Auto-tuning TensorFlow threading model for CPU backend. arXiv preprint arXiv:1812.01665.

[84] Hauswald, J., Kang, Y., Laurenzano, M. A., Chen, Q., Li, C., Mudge, T., Dreslinski, R. G., Mars, J., & Tang, L. (2015a). DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers. In ACM SIGARCH Computer Architecture News, volume 43 (pp. 27–40).: ACM.

[85] Hauswald, J., Laurenzano, M. A., Zhang, Y., Li, C., Rovinski, A., Khurana, A., Dreslinski, R. G., Mudge, T., Petrucci, V., Tang, L., et al. (2015b). Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers. In the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), volume 50 (pp. 223–238).: ACM.

[86] Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U., Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A., Law, J., Lee, K., Lu, J., Noordhuis, P., Smelyanskiy, M., Xiong, L., & Wang, X. (2018). Applied machine learning at Facebook: A datacenter infrastructure perspective. In High Performance Computer Architecture (HPCA), 2018 IEEE International Symposium on (pp. 620–629).: IEEE.

[87] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

[88] He, Q., Zhou, S., Kobler, B., Duffy, D., & McGlynn, T. (2010). Case study for running HPC applications in public clouds. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (pp. 395–401).: ACM.

[89] He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web (pp. 173–182).: International World Wide Web Conferences Steering Committee.

[90] Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine learning, 20(3), 197–243.

[91] Hennessy, J. L. & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.

[92] Henning, J. L. (2006). SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4), 1–17.

[93] Hershey, J. R. & Olsen, P. A. (2007). Approximating the Kullback Leibler divergence between Gaussian mixture models. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 4 (pp. IV–317).: IEEE.

[94] Hershey, S., Bernstein, J., Bradley, B., Schweitzer, A., Stein, N., Weber, T., & Vigoda, B. (2012). Accelerating inference: towards a full language, compiler and hardware stack. arXiv preprint arXiv:1212.2991.

[95] Hoffman, M. D. & Gelman, A. (2014). The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.

[96] Hogg, R. V., McKean, J., & Craig, A. T. (2005). Introduction to mathematical statistics. Pearson Education.

[97] Hoste, K., Eeckhout, L., & Blockeel, H. (2007). Analyzing commercial processor performance numbers for predicting performance of applications of interest. In Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems (pp. 375–376).: ACM.

[98] Hoste, K., Phansalkar, A., Eeckhout, L., Georges, A., John, L. K., & De Bosschere, K. (2006). Performance prediction based on inherent program similarity. In Parallel Architectures and Compilation Techniques (PACT), 2006 International Conference on (pp. 114–122).: IEEE.

[99] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

[100] Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In CVPR, volume 1 (p. 3).

[101] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.

[102] Ipek, E., De Supinski, B. R., Schulz, M., & McKee, S. A. (2005). An approach to performance prediction for parallel applications. In European Conference on Parallel Processing (pp. 196–205).: Springer.

[103] Ïpek, E., McKee, S. A., Caruana, R., de Supinski, B. R., & Schulz, M. (2006). Efficiently exploring architectural design spaces via predictive modeling. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS) (pp. 195–206).: ACM.

[104] Jack Jr, C. R., Bernstein, M. A., Fox, N. C., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P. J., L. Whitwell, J., Ward, C., et al. (2008). The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging: An Official Journal of the International Society for Magnetic Resonance in Medicine, 27(4), 685–691.

[105] Jaleel, A. (2010). Memory characterization of workloads using instrumentation-driven simulation. Web copy: http://www.glue.umd.edu/ajaleel/workload.

[106] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia (pp. 675–678).: ACM.

[107] Jia, Z., Zaharia, M., & Aiken, A. (2018). Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358.

[108] Jitwasinkul, B., Hadikusumo, B. H. W., & Memon, A. Q. (2016). A Bayesian belief network model of organizational factors for improving safe work behaviors in Thai construction industry. Safety science, 82, 264–273.

[109] Jonas, E. M. (2014). Stochastic architectures for probabilistic computation. PhD thesis, Massachusetts Institute of Technology.

[110] Joshi, A., Eeckhout, L., & John, L. (2008). The return of synthetic benchmarks. In 2008 SPEC Benchmark Workshop (pp. 1–11).

[111] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. (2017). In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on (pp. 1–12).: IEEE.

[112] Kanev, S., Darago, J. P., Hazelwood, K., Ranganathan, P., Moseley, T., Wei, G.-Y., & Brooks, D. (2015). Profiling a warehouse-scale computer. In Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on (pp. 158–169).: IEEE.

[113] Kanev, S., Darago, J. P., Hazelwood, K., Ranganathan, P., Moseley, T., Wei, G.-Y., & Brooks, D. (2016). Profiling a warehouse-scale computer. ACM SIGARCH Computer Architecture News, 43(3), 158–169.

[114] Kanev, S., Xi, S. L., Wei, G.-Y., & Brooks, D. (2017). Mallacc: Accelerating memory allocation. ACM SIGOPS Operating Systems Review, 51(2), 33–45.

[115] Karkhanis, T. S. & Smith, J. E. (2004). A first-order superscalar processor model. In Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on (pp. 338–349).: IEEE.

[116] Kemp, C. & Tenenbaum, J. B. (2008). The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31), 10687–10692.

[117] Kéry, M. & Schaub, M. (2011). Bayesian population analysis using WinBUGS: a hierarchical perspective. Academic Press.

[118] Ketkar, N. (2017). Introduction to PyTorch. In Deep Learning with Python (pp. 195–208). Springer.

[119] Kim, K., Lee, C., Jung, J. H., & Ro, W. W. (2014). Workload synthesis: Generating benchmark workloads from statistical execution profile. In Workload Characterization (IISWC), 2014 IEEE International Symposium on (pp. 120–129).: IEEE.

[120] Ko, G. G. & Rutenbar, R. A. (2017). A case study of machine learning hardware: Real-time source separation using Markov Random fields via sampling-based inference. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2477–2481).

[121] Ko, G. G. & Rutenbar, R. A. (2018). Real-time and low-power streaming source separation using Markov Random field. ACM Journal on Emerging Technologies in Computing Systems, 14(2), 17:1–17:22.

[122] Kothari, K. (2011). Comparison of several cloud computing providers. Elixir Comp. Sci. & En.

[123] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS) (pp. 1097–1105).

[124] Kung, H.-T. (1982). Why systolic architectures? IEEE computer, 15(1), 37–46.

[125] Labs, P. (2019). GeekBench v4. https://www.geekbench.com/.

[126] Lagar-Cavilla, A., Ahn, J., Souhlal, S., Agarwal, N., Burny, R., Butt, S., Chang, J., Chaugule, A., Deng, N., Shahid, J., et al. (2019). Software-defined far memory in warehouse-scale computers. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (pp. 317–330).: ACM.

[127] Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.

[128] Leary, C. & Wang, T. (2017). XLA: TensorFlow, compiled. TensorFlow Dev Summit.

[129] Lee, B. C. et al. (2006). Accurate and efficient regression modeling for microarchitectural performance and power prediction. In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (pp. 185–194).: ACM.

[130] Lee, K., Rao, V., & Arnold, W. C. (2018). Accelerating Facebook's infrastructure with application-specific hardware.

[131] Lei, V., Sanders, N., & Dawson, A. (2016). Advertising attribution modeling in the movie industry. https://github.com/stan-dev/stancon_talks/blob/master/2017/Contributed-Talks/03_lei/ad_attribution.Rmd.

[132] Li, A., Zong, X., Kandula, S., Yang, X., & Zhang, M. (2011). CloudProphet: towards application performance prediction in cloud. (pp. 426–427).

[133] Li, A. Q., Ahmed, A., Ravi, S., & Smola, A. J. (2014). Reducing the sampling complexity of topic models. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 891–900).: ACM.

[134] Li, L. & Russell, S. J. (2013). The BLOG language reference. Technical Report UCB/EECS-2013-51, EECS Department, University of California, Berkeley.

[135] Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017a). Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.

[136] Lin, Y., Han, S., Mao, H., Wang, Y., & Dally, W. J. (2017b). Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887.

[137] Lomont, C. (2011). Introduction to Intel advanced vector extensions. Intel White Paper, (pp. 1–21).

[138] Maclaurin, D. & Adams, R. P. (2015). Firefly Monte Carlo: Exact MCMC with subsets of data. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

[139] Mansinghka, V. K. (2009). Natively probabilistic computation. PhD thesis, Massachusetts Institute of Technology.

[140] Margossian, C. & Gillespie, W. R. (2016). Stan functions for Bayesian pharmacometric modeling. In Journal of Pharmacokinetics and Pharmacodynamics, volume 43 (pp. S52–S52).: Springer.

[141] marketsandmarkets.com (2018). Artificial intelligence market worth $190.61 billion by 2025 with a growing CAGR of 36.6%.

[142] McElree, B. (2000). Sentence comprehension is mediated by content-addressable memory structures. Journal of psycholinguistic research, 29(2), 111–123.

[143] Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.

[144] Namasivayam, V. K. & Prasanna, V. K. (2006). Scalable parallel implementation of exact inference in Bayesian networks. In Parallel and Distributed Systems, 2006. ICPADS 2006. 12th International Conference on, volume 1 (8 pp.).: IEEE.

[145] Naumov, M., Mudigere, D., Shi, H.-J. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C.-J., Azzolini, A. G., et al. (2019). Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091.

[146] Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.

[147] Nicenboim, B. & Vasishth, S. (2016). Models of retrieval in sentence comprehension: A computational evaluation using Bayesian hierarchical modeling. arXiv preprint arXiv:1612.04174.

[148] Nickolls, J. & Dally, W. J. (2010). The GPU computing era. IEEE Micro, 30(2).

[149] Ohlssen, D. (2016). An industry perspective of the value of Bayesian methods. https://pharmacy.ucsf.edu/sites/pharmacy.ucsf.edu/files/ohlssen.pdf.

[150] Park, J., Naumov, M., Basu, P., Deng, S., Kalaiah, A., Khudia, D., Law, J., Malani, P., Malevich, A., Nadathur, S., Pino, J., Schatz, M., Sidorov, A., Sivakumar, V., Tulloch, A., Wang, X., Wu, Y., Yuen, H., Diril, U., Dzhulgakov, D., Hazelwood, K., Jia, B., Jia, Y., Qiao, L., Rao, V., Rotem, N., Yoo, S., & Smelyanskiy, M. (2018). Deep learning inference in Facebook data centers: Characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886.

[151] Patterson, D. (2018). MLPerf: SPEC for ML. https://rise.cs.berkeley.edu/blog/mlperf-spec-for-ml/.

[152] Pfeffer, A. (2016). Practical Probabilistic Programming. Manning Publications Co.

[153] Phansalkar, A., Joshi, A., Eeckhout, L., & John, L. K. (2005). Measuring program similarity: Experiments with SPEC CPU benchmark suites. In Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE International Symposium on (pp. 10–20).: IEEE.

[154] Phansalkar, A., Joshi, A., & John, L. K. (2007a). Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. In Proceedings of the 34th annual international symposium on Computer architecture (pp. 412–423).: ACM.

[155] Phansalkar, A., Joshi, A., & John, L. K. (2007b). Subsetting the SPEC CPU2006 benchmark suite. ACM SIGARCH Computer Architecture News, 35(1), 69–76.

[156] Phansalkar, A. S. (2007). Measuring program similarity for efficient benchmarking and performance analysis of computer systems. The University of Texas at Austin.

[157] Piccart, B., Georges, A., Blockeel, H., & Eeckhout, L. (2011). Ranking commercial machines through data transposition. In Workload Characterization (IISWC), 2011 IEEE International Symposium on (pp. 3–14).: IEEE.

[158] Pourzanjani, A. A., Bales, B. B., Petzold, L. R., & Harrington, M. (2018). Flexible modeling of Alzheimer’s disease progression with I-splines.

[159] Pu, Y., Miranda, Z., Solar-Lezama, A., & Kaelbling, L. (2018). Selecting representative examples for program synthesis. In International Conference on Machine Learning (pp. 4158–4167).

[160] Quiroz, M., Kohn, R., Villani, M., & Tran, M.-N. (2018). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, (just-accepted), 1–35.

[161] Reagen, B., Adolf, R., Shao, Y. S., Wei, G.-Y., & Brooks, D. (2014). MachSuite: Benchmarks for accelerator design and customized architectures. In 2014 IEEE International Symposium on Workload Characterization (IISWC) (pp. 110–119).: IEEE.

[162] Repository, T. M. (2018). https://github.com/tensorflow/tpu. GitHub.

[163] Research, B. (2017). DeepBench. https://github.com/baidu-research/DeepBench.

[164] Rotem, N., Fix, J., Abdulrasool, S., Deng, S., Dzhabarov, R., Hegeman, J., Levenstein, R., Maher, B., Nadathur, S., Olesen, J., et al. (2018). Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907.

[165] Salakhutdinov, R., Tenenbaum, J., & Torralba, A. (2012). One-shot learning with a hierarchical nonparametric Bayesian model. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning (pp. 195–206).

[166] Saleem, M., Mehmood, Q., & Ngomo, A.-C. N. (2015). Feasible: A feature-based SPARQL benchmark generation framework. In International Semantic Web Conference (pp. 52–69).: Springer.

[167] Schaffter, T., Marbach, D., & Floreano, D. (2011). GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics, 27(16), 2263–2270.

[168] Shallue, C. J., Lee, J., Antognini, J., Sohl-Dickstein, J., Frostig, R., & Dahl, G. E. (2018). Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600.

[169] Sharkawi, S., Desota, D., Panda, R., Indukuru, R., Stevens, S., Taylor, V., & Wu, X. (2009). Performance projection of HPC applications using SPEC CFP2006 benchmarks. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on (pp. 1–12).: IEEE.

[170] Shelepov, D., Saez Alcaide, J. C., Jeffery, S., Fedorova, A., Perez, N., Huang, Z. F., Blagodurov, S., & Kumar, V. (2009). HASS: a scheduler for heterogeneous multicore systems. ACM SIGOPS Operating Systems Review, 43, 66–75.

[171] Shelepov, D. & Fedorova, A. (2008). Scheduling on heterogeneous multicore processors using architectural signatures. In WIOSCA workshop of the 35th annual international symposium on Computer architecture (ISCA).

[172] Shi, S., Wang, Q., Xu, P., & Chu, X. (2016). Benchmarking state-of-the-art deep learning software tools. In Cloud Computing and Big Data (CCBD), 2016 7th International Conference on (pp. 99–104).: IEEE.

[173] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.

[174] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354.

[175] Simoiu, C., Corbett-Davies, S., Goel, S., et al. (2017). The problem of infra-marginality in outcome tests for discrimination. The Annals of Applied Statistics, 11(3), 1193–1216.

[176] Simonyan, K. & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[177] Singh, K., İpek, E., McKee, S. A., de Supinski, B. R., Schulz, M., & Caruana, R. (2007). Predicting parallel application performance via machine learning approaches. Concurrency and Computation: Practice and Experience, 19(17), 2219–2235.

[178] Sriraman, A., Dhanotia, A., & Wenisch, T. F. (2019). SoftSKU: Optimizing server architectures for microservice diversity at scale. In Proceedings of the 46th International Symposium on Computer Architecture (pp. 513–526).: ACM.

[179] Stratton, J. A., Rodrigues, C., Sung, I.-J., Obeid, N., Chang, L.-W., Anssari, N., Liu, G. D., & Hwu, W.-m. W. (2012). Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 127.

[180] Stroobandt, D., Verplaetse, P., & Van Campenhout, J. (2000). Generating synthetic benchmark circuits for evaluating CAD tools. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(9), 1011–1022.

[181] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 1–9).

[182] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 2818–2826).

[183] Tao, J.-H., Du, Z.-D., Guo, Q., Lan, H.-Y., Zhang, L., Zhou, S.-Y., Xu, L.-J., Liu, C., Liu, H.-F., Tang, S., Chen, W., Liu, S.-L., & Chen, Y.-J. (2018). BenchIP: Benchmarking intelligence processors. Journal of Computer Science and Technology, 33(1), 1–23.

[184] Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022), 1279–1285.

[185] Thirey, B. & Hickman, R. (2015). Distribution of Euclidean distances between randomly distributed Gaussian points in n-space. arXiv preprint arXiv:1508.02238.

[186] Thomas, S., Gohkale, C., Tanuwidjaja, E., Chong, T., Lau, D., Garcia, S., & Taylor, M. B. (2014). CortexSuite: A synthetic brain benchmark suite. In Workload Characterization (IISWC), 2014 IEEE International Symposium on (pp. 76–79).: IEEE.

[187] tractica.com (2018). Deep learning chipset market to reach $66.3 billion by 2025.

[188] Tran, D., Kucukelbir, A., Dieng, A. B., Rudolph, M., Liang, D., & Blei, D. M. (2016). Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787.

[189] Turki, M., Mehrez, H., Marrakchi, Z., & Abid, M. (2012). Towards synthetic benchmarks generator for CAD tool evaluation. In Ph.D. Research in Microelectronics and Electronics (PRIME), 2012 8th Conference on (pp. 1–4).: VDE.

[190] Van den Steen, S., De Pestel, S., Mechri, M., Eyerman, S., Carlson, T., Black-Schaffer, D., Hagersten, E., & Eeckhout, L. (2015). Micro-architecture independent analytical processor performance and power modeling. In Performance Analysis of Systems and Software (ISPASS), 2015 IEEE International Symposium on (pp. 32–41).: IEEE.

[191] Van den Steen, S., Eyerman, S., De Pestel, S., Mechri, M., Carlson, T. E., Black-Schaffer, D., Hagersten, E., & Eeckhout, L. (2016). Analytical processor performance and power modeling using micro-architecture independent characteristics. IEEE Transactions on Computers, volume 65.

[192] Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., DeVito, Z., Moses, W. S., Verdoolaege, S., Adams, A., & Cohen, A. (2018). Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions. arXiv preprint arXiv:1802.04730.

[193] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

[194] Venkata, S. K., Ahn, I., Jeon, D., Gupta, A., Louie, C., Garcia, S., Belongie, S., & Taylor, M. B. (2009). SD-VBS: The San Diego vision benchmark suite. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on (pp. 55–64).: IEEE.

[195] Wainwright, M. J. & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2), 1–305.

[196] Wang, S., Zhang, X., Li, Y., Bashizade, R., Yang, S., Dwyer, C., & Lebeck, A. R. (2016a). Accelerating Markov random field inference using molecular optical Gibbs sampling units. In Proceedings of the 43rd International Symposium on Computer Architecture (pp. 558–569).: IEEE Press.

[197] Wang, S. S. J. & Wand, M. P. (2011). Using Infer.NET for statistical analyses. The American Statistician, 65(2), 115–126.

[198] Wang, Y., Qian, W., Zhang, S., Liang, X., & Yuan, B. (2016b). A learning algorithm for Bayesian networks and its efficient implementation on GPUs. IEEE Transactions on Parallel and Distributed Systems, 27(1), 17–30.

[199] Wang, Y. & Wang, D. (2013). Towards scaling up classification-based speech separation. IEEE Transactions on Audio, Speech, and Language Processing, 21(7), 1381–1390.

[200] Wang, Y. E., Wei, G.-Y., & Brooks, D. (2019a). Benchmarking TPU, GPU, and CPU platforms for deep learning. arXiv preprint arXiv:1907.10701.

[201] Wang, Y. E., Zhu, Y., Ko, G. G., Reagen, B., Wei, G.-Y., & Brooks, D. (2019b). Demystifying Bayesian inference workloads. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 177–189).: IEEE.

[202] Watters, N., Zoran, D., Weber, T., Battaglia, P., Pascanu, R., & Tacchetti, A. (2017). Visual interaction networks: Learning a physics simulator from video. In Advances in Neural Information Processing Systems (pp. 4539–4547).

[203] Wei, W., Xu, L., Jin, L., Zhang, W., & Zhang, T. (2018). AI Matrix: synthetic benchmarks for DNN. arXiv preprint arXiv:1812.00886.

[204] Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4), 65–76.

[205] Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S., Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B., et al. (2019). Machine learning at Facebook: Understanding inference at the edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 331–344).: IEEE.

[206] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

[207] Xia, Y. (2010). Exploration of parallelism for probabilistic graphical models. University of Southern California.

[208] Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (pp. 1492–1500).

[209] Xie, Y. (2016). knitr: A General-Purpose Package for Dynamic Report Generation in R. version 1.15.1.

[210] Xu, F. & Tenenbaum, J. B. (2007). Word learning as Bayesian inference. Psychological review, 114(2), 245.

[211] Yang, A. (2019). Deep learning training at scale: Spring Crest deep learning accelerator (Intel Nervana NNP-T).

[212] Yao, L., Mimno, D., & McCallum, A. (2009). Efficient methods for topic model inference on streaming document collections. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 937–946).: ACM.

[213] Yasin, A. (2014). A top-down method for performance analysis and counters architecture. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (pp. 35–44).: IEEE.

[214] Yu, H.-F., Hsieh, C.-J., Yun, H., Vishwanathan, S., & Dhillon, I. S. (2015). A scalable asynchronous distributed algorithm for topic modeling. In Proceedings of the 24th International Conference on World Wide Web (pp. 1340–1350).: International World Wide Web Conferences Steering Committee.

[215] Yuan, J., Gao, F., Ho, Q., Dai, W., Wei, J., Zheng, X., Xing, E. P., Liu, T.-Y., & Ma, W.-Y. (2015). LightLDA: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web (pp. 1351–1361).: International World Wide Web Conferences Steering Committee.

[216] Zheng, L., Mengshoel, O., & Chong, J. (2012). Belief propagation by message passing in junction trees: Computing each message faster using GPU parallelization. arXiv preprint arXiv:1202.3777.

[217] Zheng, X., John, L. K., & Gerstlauer, A. (2016). Accurate phase-level cross-platform power and performance estimation. In Design Automation Conference (DAC), 2016 53rd ACM/EDAC/IEEE (pp. 1–6).: IEEE.

[218] Zheng, X., Vikalo, H., Song, S., John, L. K., & Gerstlauer, A. (2017). Sampling-based binary-level cross-platform performance estimation. In Proceedings of the Conference on Design, Automation & Test in Europe (pp. 1713–1718).: European Design and Automation Association.

[219] Zhu, Y., Richins, D., Halpern, M., & Reddi, V. J. (2015). Microarchitectural implications of event-driven server-side web applications. In Proceedings of International Symposium on Microarchitecture.
