GPU Parallelization, Validation, and Characterization of the Tensor Template Library
Alexander C. Winter

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Engineering

University of Washington, 2019

Committee:
Andrew Lumsdaine (Chair)
Duane Storti
Jeff Lipton

Program Authorized to Offer Degree: Mechanical Engineering

©Copyright 2019 Alexander C. Winter

Abstract

Chair of the Supervisory Committee: Dr. Andrew Lumsdaine, Department of Computer Science and Engineering

Previous work developed the Tensor Template Library (TTL), a tool that uses variadic expression template metaprogramming to express tensor operations clearly, in a form resembling the mathematical abstraction engineers are familiar with, while concealing the cumbersome looping structures and keeping the generated code optimized. The TTL is useful for simulating physical systems in materials science via finite element modelling, and more generally in applications involving large numbers of small, dense tensors. The initial work of this author was to update the TTL to operate on a graphics processing unit (GPU), build a test suite to verify that the updated library compiled and generated correct output in a GPU environment, and then analyze performance within a submodule of a finite element solver, the Parallel Generalized Finite Element Solver (PGFEM). In initial characterization work in a GPU environment, the TTL used inside a submodule of the PGFEM, the Generalized Constitutive Model (GCM), was not as performant as the raw loop implementation, nor even as an MPI distributed-memory solution. To determine where the problem lay within the TTL (if at all), microbenchmark tests were developed to examine distinct TTL tensor operations over varying expression categories and complexities.
The microbenchmark results were contrary to those observed in the GCM and indicated that the TTL was considerably faster than compiler-optimized raw loops. They did, however, isolate a particular class of tensor operation, tensor inner products, as a point of interest for examining the dichotomous TTL behavior. Additional microbenchmarks were developed to examine the assembly code generated by the NVIDIA CUDA Compiler (NVCC). Those microbenchmarks, stripped of any potentially confounding factors that may have cast doubt on the first set of microbenchmarks, validated the previous microbenchmarking results. Analysis of the assembly indicated that, for low-order tensors, near-identical assembly could be generated through manual intervention over the compiler's optimizations. However, it also revealed that the NVCC compilation pipeline was likely to modify template source code in non-optimal ways. Template specialization of these loop structures should resolve the problem and is currently implemented in the TTL.

Table of Contents

Abstract
Foreword & Acknowledgements
1. Introduction
2. Literature Review and Background
  2.1 A Brief History of Parallel Computation and GPUs
  2.2 Mathematical Foundations
    2.2.1 Tensors
    2.2.2 Linear Solvers
  2.3 Computational Hardware
    2.3.1 CPUs and Host Hardware
    2.3.2 Memory
    2.3.3 Busses
    2.3.4 GPUs and CUDA
      2.3.4.1 CUDA Cores and Streaming Multiprocessors
      2.3.4.2 Control Units, Flow, Pipelines, and Branching
      2.3.4.3 GPU Memory
3. The Tensor Template Library
  3.1 Templates
  3.2 Indices
  3.3 Tensors
4. Methodology
  4.1 Hardware
  4.2 Compilers
  4.3 Tools
  4.4 Updating the TTL to the GPU/CUDAfication
    4.4.1 Linear Solver
  4.5 Building Test Programs
    4.5.1 Unit Testing
    4.5.2 Performance Characterization in PGFEM
    4.5.3 Microbenchmarking
    4.5.4 Assembly and Compiler Testing
5. Results
  5.1 Unit Testing
  5.2 PGFEM and GCM Performance Analysis
  5.3 Microbenchmarking
  5.4 Assembly and Compiler Testing
6. Discussion
  6.1 Unit Testing and Porting
  6.2 GCM Performance
  6.3 Microbenchmarking
  6.4 The Assembly and Compiler
7. Conclusion
Works Cited
Glossary
Bibliography