CUDA-level Performance with Python-level Productivity for Gaussian Mixture Model Applications

H. Cook, E. Gonina, S. Kamil, G. Friedland†, D. Patterson, and A. Fox

Parallel Computing Laboratory, Computer Science Division, University of California at Berkeley
†International Computer Science Institute

{hcook, egonina, skamil}@eecs.berkeley.edu, [email protected], {pattrsn, fox}@eecs.berkeley.edu

Abstract

Typically, scientists with computational needs prefer to use high-level languages such as Python or MATLAB; however, large computationally-intensive problems must eventually be recoded in a low-level language such as C or Fortran by expert programmers in order to achieve sufficient performance. In addition, multiple strategies may exist for mapping a problem onto parallel hardware; unless the hardware geometry and problem dimensions are both taken into account, large factors of performance may be left on the table. We show how to preserve the productivity of high-level languages while obtaining the performance of the best low-level language code variant for a given hardware platform and problem size using SEJITS (Selective Embedded Just-in-Time Specialization), a set of techniques that leverages just-in-time code generation and compilation combined with reflection and metaprogramming. As a case study, we demonstrate our technique for Gaussian Mixture Model training using the EM algorithm. With the addition of one line of code to import our framework, a domain programmer using an existing Python GMM library can run her program unmodified on a GPU-equipped computer and achieve performance that meets or beats GPU code hand-crafted by a human expert. We also show that despite the overhead of allowing the domain expert's program to use Python and the overhead of just-in-time code generation and compilation, our approach still results in performance competitive with hand-crafted GPU code.

1 Introduction

Domain experts coding computationally-intensive programs would prefer to work at a high level of abstraction such as that afforded by scripting languages like Python or MATLAB. However, it is widely accepted that recoding compute-intensive "kernels" in lower-level languages to express explicit parallelism can yield performance improvements of one to three orders of magnitude, creating a tension between programmer productivity and high performance.

The advent of multicore CPUs and manycore GPUs aggravates this tension: the best parallel implementation of a particular algorithm now depends on the target hardware and the specific input data, as evidenced by the fact that autotuning, the automated search over possible implementation variants and tuning parameters, often yields code that surpasses the performance of expert-created low-level code [21]. Yet even when tuning is complete, running the tuned code on a different hardware platform or a new problem size may result in unpredictable performance cliffs [10].

The tedious process of code variant selection and parameter tuning works against domain-programmer productivity (if it is within the domain programmer's expertise at all). Autotuning libraries such as OSKI [19] and Spiral [18] attempt to encapsulate multiple code variants and heuristics for choosing among them, as well as heuristics for selecting tuning parameters for the chosen variant; however, this machinery is specific to each library and not generally repurposable.

In this paper we show that the mechanism and policy for variant selection and tuning can be separated from the application logic in a way that increases productivity for both the application programmer and the performance tuning specialist. Our framework allows the programmer to express her application in a highly productive language (Python). Adding a single import statement pulls in a set of just-in-time code generation mechanisms and hides the variant selection logic from the domain expert programmer, synthesizing the "best" variant at runtime and giving performance comparable to or better than hand-coded implementations by a human expert.

Our case study focuses on a computationally-intensive algorithm for training Gaussian Mixture Models (GMMs), a particular class of statistical models used in speech recognition, image segmentation, document classification, and numerous other areas. The iterative and highly data-parallel algorithm is amenable to execution on GPUs; however, depending on the hardware geometry and the dimensionality of the input data (which varies greatly across application domains), different implementations of the algorithm will give the best attainable performance.

We briefly describe our case study problem, present four strategies for parallelizing it onto GPUs, and demonstrate that the selection of the best variant and the optimization parameters to use with that variant is nontrivial. We then describe our framework, ASP, which allows this variant selection and code generation process to be encapsulated in a way that is hidden from the domain expert. Without any variant selection, there is an immediate performance gain of three orders of magnitude for realistic problems compared to executing the computation in pure Python.

With even a simple variant selection algorithm, an average of 32% further performance improvement relative to always using a single baseline parallel code variant is possible, with the best-performing variant surpassing the performance of human-expert-authored C++/CUDA code. From the domain programmer's point of view, a one-line change to any Python program that uses an existing GMM library suffices to get these performance benefits.

2 Background: GMMs and the EM Algorithm

Suppose we are given audio of a conversation that is known to feature M distinct speakers. We could represent each speaker's speech characteristics with a probabilistic model. We could then attempt to model the conversation as a weighted combination of the M models, without knowing in advance which speaker made which utterances. This is the basic idea of a mixture model. To train a mixture model is to determine the parameters of each of the M submodels and the weights corresponding to each model in the mixture, such that we maximize the probability that the observed data (the audio track) corresponds to a prediction of the overall mixture model.

In the case of Gaussian mixture models (GMMs), each submodel is assumed to be represented by a D-dimensional Gaussian distribution with mean µ_i and D × D covariance matrix Σ_i. Given N observed data points, each a D-dimensional feature vector, we need to learn the parameters µ_i, Σ_i for each submodel and the weight parameters π_i for combining them into the overall mixture model. A common way to learn these parameters is to use the Expectation-Maximization (EM) algorithm [7]: given an initial estimate of the parameters, the E-step computes the expectation of the log-likelihood of the observations given those parameters, and the M-step in turn computes the parameters that maximize the expected log-likelihood of the observation data. These two steps repeat until a convergence criterion is reached.

Our specializer emits parallelized code for all substeps of the EM algorithm. We apply additional effort to the most compute-intensive part of the algorithm, which occurs in the M-step when we compute the D × D covariance matrix Σ_i for each of the M clusters. As described in [17], the covariance matrix is the sum of the outer products of the difference between the observation vectors and the cluster's mean vector computed in this iteration:

\Sigma_i^{(k+1)} = \frac{\sum_{j=1}^{N} p_{i,j}\,(x_j - \mu_i^{(k+1)})(x_j - \mu_i^{(k+1)})^T}{\sum_{j=1}^{N} p_{i,j}}    (1)

where p_{i,j} is the probability of point j belonging to cluster i and x_j is the observation vector. On our GPUs, the covariance matrix computation accounted for 50–60% of the overall runtime of the EM algorithm. There are multiple ways that this computation can be mapped onto parallel hardware.

3 Benefits of Code Variant Selection

The covariance computation exhibits a large amount of parallelism due to the mutual independence of each cluster's covariance matrix, each cell in a covariance matrix, and each observation's contribution to a cell in a covariance matrix. These three possible degrees of freedom in the computation suggest different strategies for parallelizing the algorithm on manylane hardware. Indeed, the optimal strategy depends on the problem parameters (N, D, M) as well as certain hardware parameters (e.g. number of cores, SIMD vector width, local memory size). Figure 1 summarizes the four code variants described below for a problem size of M = 2, D = 4 and N = 7.

We use the platform-neutral OpenCL [13] terminology to describe our strategies, which are implemented in NVIDIA's CUDA language [15]. There are two levels of parallelism: workgroups are parallelized across cores on the chip, and a workgroup's work-items are executed on a single core, potentially utilizing that core's SIMD vector unit. Each core has a scratchpad memory, referred to as a local memory.

Code Variant 1 (V1, baseline): The EM on CUDA implementation from Pangborn [17]. Launches M × D × D/2 workgroups. Each workgroup is responsible for one cell in the covariance matrix for one cluster. Work-items correspond to events (N). The mean vector is stored in local memory; however, only two values are used (corresponding to the row and column of the cell the group is computing).

Code Variant 2 (V2): Modifies V1 by assigning each workgroup to compute one cell for all clusters. Work-items correspond to events as in V1.

Code Variant 3 (V3): Makes better use of per-core memory by assigning each workgroup to compute the entire covariance matrix for one cluster (M). Each work-item in the workgroup is responsible for one cell in the covariance matrix (D × D/2 items). Each work-item loops through all events sequentially.

Code Variant 4 (V4-BXX): Improves upon V3 by making it more resilient to small M through blocking across the N dimension. Launches M × B workgroups, where B is a blocking factor, i.e. the number of desired event blocks. Each workgroup computes the contribution to its entire covariance matrix for its block of events (N/B). In this paper we use two values of B (32, 128).

We test all the variants on NVIDIA GTX285 and GTX480 GPUs using a regular sampling of problem sizes to gain an understanding of the tradeoffs amongst the variants. The GTX285 has more cores, whereas the GTX480 has longer SIMD vectors and better atomic primitives. Figure 2 shows some example results.
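For reference, the serial computation that all four variants decompose is small. The following NumPy sketch of the per-cluster covariance update in Eq. (1) is purely illustrative; it is not the specializer's generated code, and the function and argument names are not part of any library API.

```python
import numpy as np

def m_step_covariances(X, resp, means):
    # Serial reference for Eq. (1).
    #   X:     (N, D) observation vectors x_j
    #   resp:  (N, M) responsibilities p_{i,j} from the E-step
    #   means: (M, D) cluster means mu_i from the current M-step
    # Returns an (M, D, D) array of covariance matrices Sigma_i.
    N, D = X.shape
    M = means.shape[0]
    covars = np.empty((M, D, D))
    for i in range(M):
        diff = X - means[i]                  # x_j - mu_i for every observation j
        weighted = resp[:, i, None] * diff   # scale each row by p_{i,j}
        covars[i] = weighted.T @ diff / resp[:, i].sum()
    return covars
```

Each of V1 through V4 parallelizes a different combination of the loops implicit here: over the M clusters, over the cells of each D × D matrix, and over the N observations.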

Figure 1: Four code variants for computing the covariance matrix during the M-step. c* refers to different work-groups; darker shading refers to work-items. Each row of the table indicates the units of work (total number of such units) exploited by each level of hardware parallelism for each variant. The last row indicates what objects should be placed in each core's local memory for the variant to be effective.

Overall, for the space of problem sizes we examined (1 ≤ D ≤ 36, 1 ≤ M ≤ 128, 10,000 ≤ N ≤ 150,000), the best-performing code variant for a given problem instance gave a 32% average performance improvement in covariance matrix calculation time compared to always running the baseline code variant V1. This performance gap increases further with larger problem sizes; e.g., for (D = 36, M = 128, N = 500,000) the difference grows to 75.6%.

Figure 2 plots a slice through the 3D space of possible input parameters, allowing the average runtimes of different implementations of the covariance computation to be compared. V1, V3 and V4 with different B parameters are mutually competitive and show trade-offs in performance when run on various problem sizes and on the two GPU architectures. V2 shows consistently poor performance compared to the other code variants. The tradeoff points are different on the two GPUs. While there are general trends leading to separable areas where one variant dominates the others (e.g. V1 is best with small D values), we had difficulty formulating a hierarchy of rules to predetermine the optimal variant because each hardware feature affects each variant's performance differently. This finding suggests that variant selection cannot necessarily be reduced to a compact set of simple rules, even for a specific problem in a restricted domain of problem sizes.

4 SEJITS for GMMs

We have shown that the EM algorithm can be implemented in multiple ways, and that the best implementation depends on both the specifics of the target hardware and the problem dimensions. What is needed is a way to encapsulate these variations and dependencies such that the domain application developer does not have to reason about them, and, equally importantly, so that the code variant selection (how to do the computation) can be kept separate from the actual application (what to do), allowing the application to be performance-portable [6].

We chose Selective Embedded Just-in-Time Specialization (SEJITS) [8] as the mechanism to accomplish this separation of concerns. In the SEJITS approach, the domain programmer expresses her application entirely in Python using libraries of domain-appropriate abstractions, in this case objects representing GMMs and functions that can be invoked on them, e.g. training via the EM algorithm. However, when these specialized functions are called, they do not execute the computations directly; instead, the ASP framework interposes itself and generates CUDA source code, which is then JIT-compiled, cached, dynamically linked, and executed via the GPU's foreign-function interface, with the results eventually returned to Python. From the Python programmer's view, this experience is like calling a Python library, except that performance is potentially several orders of magnitude faster than calling a pure Python library implementation of the same computation.

ASP [1] is a particular implementation of the SEJITS approach for Python (the name is a recursive acronym: ASP is SEJITS for Python). ASP contains facilities to automate the process of determining the best variant to use, emit source code corresponding to that variant, compile and call the optimized code, and pass the results of the computation back to the Python interpreter. A specializer is a piece of Python code that uses these facilities in combination with code templates expressing the different code variants, hand-coded by a human expert, in our case a CUDA programmer who serves a similar role to a library developer. Our specializer implements the EM algorithm described previously, but ASP's facilities are general and could be used for very different specializers; indeed, ASP has previously been used to specialize stencil/structured-grid codes on multicore CPUs [1, 8].

Once ASP has been imported into a Python program, it transparently and selectively redirects calls to certain functions (in our case, GMM functions in the scikit [5] library API) to the appropriate specializer. From the domain expert's point of view, a one-line change to any of her Python programs that use the existing GMM library suffices to get the performance benefits we reported in the previous section.
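To illustrate what this looks like from the application side, the sketch below follows the scikits.learn-style usage just described. The module name gmm_specializer, the input file name, and the constructor signature are hypothetical placeholders rather than the actual API; only the idea of the single added import is taken from the text.

```python
import numpy as np
import gmm_specializer as gmm     # hypothetical name: the one added import line

X = np.loadtxt("features.txt")    # placeholder input: an N x D observation matrix
model = gmm.GMM(n_components=16)  # assumed scikits.learn-style constructor
model.fit(X)                      # at this call, ASP generates, JIT-compiles,
                                  # caches, and runs the best CUDA variant
```

Everything after the import line is ordinary Python; the specialization machinery stays invisible to the domain programmer.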

Figure 2: Average runtimes in seconds of covariance computation variants on GTX480 (left) and GTX285 (right) across D, for N = 90000 and M = 5 or M = 100. Different markers signify different variants' runtimes. The best variant depends on both the problem size and the underlying machine. V1 and V2 have a large amount of work-item parallelism, while V2 has limited work-group parallelism if D is small. V3 can be limited in both work-group and work-item parallelism if M and D are small. V1 underutilizes the local memory and requires many streaming reads, whereas V2 and V3 utilize local memory more efficiently, and V2 requires a factor of M fewer event data streaming reads than V1. V4 mitigates V3's limited work-group parallelism by blocking over N, but requires a global reduction across B work-groups.
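To make V4's blocking concrete, here is an illustrative NumPy analogue (a sketch, not the CUDA template) of the block-and-reduce scheme for a single cluster: each of B event blocks accumulates a partial weighted outer-product sum, and a final reduction combines them, mirroring V4's global reduction across B work-groups.

```python
import numpy as np

def covariance_v4_style(X, resp_i, mean_i, B=32):
    # X: (N, D) observations; resp_i: (N,) responsibilities p_{i,j} for one
    # cluster i; mean_i: (D,) its mean. B plays the role of V4's blocking factor.
    blocks = np.array_split(np.arange(X.shape[0]), B)
    partial = np.zeros((B, X.shape[1], X.shape[1]))
    weights = np.zeros(B)
    for b, idx in enumerate(blocks):   # in V4, one workgroup per (cluster, block)
        diff = X[idx] - mean_i
        partial[b] = (resp_i[idx][:, None] * diff).T @ diff
        weights[b] = resp_i[idx].sum()
    return partial.sum(axis=0) / weights.sum()   # global reduction over B blocks
```

With B blocks per cluster, V4 exposes M × B workgroups of parallelism even when M alone is small, at the cost of this final reduction.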

Our GMM specializer actually emits two kinds of lower-level source code: C++ host code that implements the outer (convergence) loop of EM training, and CUDA GPU code that implements the different parallel kernels used during each EM substep. Code variant selection occurs for one particular kernel of the training process: covariance matrix creation in the M-step. We hand-coded templates for the variants described previously; ASP transforms these templates into syntactically correct source code via the templating library Mako [3], and compiles, caches and links them using the Python extension CodePy [2]. The templates contain placeholders that are filled in by our specializer, such as blocking factors and limits on the number of main loop iterations. The specializer therefore has two jobs: variant selection and intra-variant code optimizations.

Variant selection. Part of our specializer is the variant selection logic. As explained earlier, the best variant to use depends on properties of the input problem data (the sizes of M, N, and D). We accomplish this by telling ASP to examine the values of the parameters passed to the specialized function, and to treat functions with different parameter values as different functions. The current ASP implementation tries a different variant every time a particular function/problem size is specialized, until all variants have been tested. Thereafter, the specializer remembers what the best runtime was for that particular function/problem size, and always reuses the associated variant's cached binary. This variant selection method is naive; future work will attempt to use performance models or machine learning to choose a code version without exhaustively searching all options. However, the important observation for the present work is that the mechanism and policy for variant selection are well-encapsulated and can be replaced without touching the original application source code or the code variant templates themselves.
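The selection policy just described is simple enough to capture in a few lines. The following Python sketch of a memoizing try-each-variant-once policy is illustrative only; the class and attribute names do not correspond to ASP's actual interfaces.

```python
import time

class NaiveVariantSelector:
    # Sketch of the policy described above: time each variant once per
    # problem-size key, then always reuse the fastest one.
    def __init__(self, variants):
        self.variants = variants   # dict: variant name -> compiled callable
        self.timings = {}          # (key, name) -> measured runtime in seconds
        self.best = {}             # key -> name of the fastest variant

    def __call__(self, key, *args):
        # key identifies the problem instance, e.g. the tuple (M, D, N)
        if key in self.best:
            return self.variants[self.best[key]](*args)
        untried = [n for n in self.variants if (key, n) not in self.timings]
        name = untried[0]
        start = time.time()
        result = self.variants[name](*args)
        self.timings[(key, name)] = time.time() - start
        if len(untried) == 1:      # that was the last untried variant
            self.best[key] = min(self.variants,
                                 key=lambda n: self.timings[(key, n)])
        return result
```

In ASP the cached artifact is the variant's compiled binary rather than a Python callable, but the policy is the same: exhaustive trial per problem size, followed by memoized reuse.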
Intra-variant performance optimizations. In modern systems, GPUs and CPUs have separate memories, and the programmer is responsible for copying any necessary data and results back and forth between them. Since a common use case is to use the same dataset to train many GMMs (each with a different number of mixtures M), flushing and recopying that data (hundreds of megabytes) would be wasteful. Therefore, our specializer tracks whether the data stored on the GPU by one GMM module is being reused by a second, and only lazily flushes the data and replaces it. This is another example of an optimization that the application programmer need not concern herself with. A similar optimization is the use of the PyUBLAS Python extension [4], which allows numerical data structures to be shared by reference between C++ and Python, eliminating unnecessary value copies.

Note that the logic implementing the above optimizations is itself written in Python. It is not only kept separate from the low-level computation kernels and the high-level application code, but also easier to modify and experiment with, since it can exploit the full facilities of the Python language.

5 Results

There are two metrics by which our framework should be evaluated: how productive it is compared to a pure Python library, and how efficient it is compared to a pure C++/CUDA implementation. We measure the former by implementing applications from the scikit.learn [5] library's example code package on top of our specializer: because we implement the same API as the existing library, porting these applications to use our specializer requires only a single-line code change.

To measure the overhead relative to a native C++/CUDA application, we implement Agglomerative Hierarchical Clustering (AHC), an unsupervised training algorithm for speaker clustering in a speaker diarization application that uses GMMs to represent different speakers. AHC iteratively performs GMM training many times, using a different number of target clusters each time and measuring which clustering is the best. The number of clusters in the "best" GMM corresponds to the number of speakers in the meeting.
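To make the structure of this benchmark concrete, here is a heavily simplified Python sketch of an AHC-style outer loop over candidate cluster counts. The factory and scoring callables are hypothetical stand-ins; the diarization application's actual code and its model-selection criterion are not shown in the paper.

```python
def ahc_speaker_count(features, make_gmm, score, max_speakers=16):
    # Train a GMM for each candidate number of clusters, score each
    # clustering, and return the count that scores best.
    # make_gmm(m) returns an untrained m-component GMM (specialized EM
    # training runs inside fit); score(model, features) is a placeholder
    # clustering criterion.
    best_score, best_m = float('-inf'), None
    for m in range(max_speakers, 0, -1):   # e.g. M = 16 down to 1
        model = make_gmm(m)
        model.fit(features)
        s = score(model, features)
        if s > best_score:
            best_score, best_m = s, m
    return best_m                          # estimated number of speakers
```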

We compare the performance overhead of implementing the AHC application loop in Python with specialized GMM training, as compared to implementing AHC in C++/CUDA. We determine that for a speaker diarization data set (D = 19, N = 150,000, M = 16 down to 1), while there is initially an 81% overhead due to compiler invocations, future runs of the application using the automatically determined optimal code variant actually achieve a 17% performance improvement over the original GPU implementation (V1).

Because ASP currently has a naive code variant selection strategy, it invokes the compiler repeatedly to test the performance of different variants on a particular problem size. However, the results of this process are cached, and future calls to the specializer do not have to invoke the compiler.

Figure 3: Runtime in seconds of the original C++/CUDA implementation of the AHC application compared to our Python/SEJITS version at various stages of the specialization process.

Figure 3 compares the performance of the original C++/CUDA code to our Python/SEJITS implementation. We measure both the runtime of the SEJITS version when the best version is not cached (which results in compiler invocations) and subsequent calls (which do not). Once ASP has selected the optimal version of the algorithm to use (V4-B32 in this case), its performance actually surpasses that of the C++/CUDA code. The benefit of using the best code variant outweighs the overhead of implementing the AHC application in Python.

6 Related Work

Our code kernels build directly on previous efforts to accelerate the EM algorithm using GPUs [14, 17], but we view our contribution as the methodology and framework for separating such concerns from the application programmer. The general concept of providing portable, high-performance abstractions for this purpose was pioneered by autotuning libraries such as FFTW [11], Spiral [18], and OSKI [19]. These libraries specialize individual function calls, but they are not embedded into a host language, and their auto-tuning machinery does not generalize to other use cases. In contrast, ASP allows the domain programmer to stay in the host language (Python), and variant selection can occur at runtime using a general set of just-in-time code generation techniques applicable to a wide variety of computational kernels.

SEJITS inherits standard advantages of JIT compilation, such as the ability to tailor generated code for particular argument values and sizes. While conventional JIT compilers such as HotSpot [16] also make runtime decisions about what to specialize, SEJITS does not require any additional mechanism for dealing with non-specialized code, which is simply executed in Python. Even specialized functions can fall back on a Python-only implementation if the specializer targets different hardware than is available at runtime. Another example of the SEJITS approach is Delite [9], which automatically maps Domain-Specific Embedded Languages [12] implemented in Scala onto parallel hardware platforms.

Cython [20] translates annotated Python code into more efficient C code, but it has no support for platform retargeting and requires changes to application logic. In contrast, ASP requires only a one-line change to a Python application to "hijack" certain Python function calls and redirect them to an appropriate specializer.

7 Conclusion & Future Work

The ASP framework encapsulates and facilitates reuse of two important kinds of resources: handcrafted code templates embodying particular parallel execution strategies, and variant-selection heuristics for choosing the best variant at runtime based on hardware configuration and problem instance characteristics. Both kinds of resources are often beyond the expertise of domain-expert programmers to create; encapsulating and hiding the mechanisms necessary to reuse them allows domain experts to focus on productivity without sacrificing the performance advantages of expert-tuned code. In particular, changing a single line in the domain expert's Python code triggers our specialization framework. This small "productivity footprint" exploits metaprogramming and reflection in Python, and could just as well be adapted to other modern languages possessing these facilities, such as Ruby.

Our current prototype uses a naive variant selection process, yet it already provides performance competitive with hand-crafted CUDA code. Indeed, we found that the performance benefit of specialization outweighed the overhead of Python and the JIT process, allowing amortization of the selection overhead.
Since we have separated out variant selection as a first-class concern, the same high-level program can be transparently retargeted to other platforms such as multicore CPUs or the cloud, making the source code performance-portable. We are working on code templates for these cases. We also intend to develop a globally-accessible history database that "remembers" the performance associated with particular problem instances on a given hardware platform and suggests code variant choices and tuning parameters for problems similar to ones previously solved on similar hardware.

As hardware platforms become more varied and the space of tuning parameters becomes more complex, productive programming will require segregating application development, code variant selection, and tuning parameter selection, so that experts in the respective areas can address each of the concerns independently. Our ASP framework, as a concrete implementation of SEJITS (Selective Embedded Just-in-Time Specialization), is a first step in that direction.

References

[1] ASP: A SEJITS implementation for Python. http://aspsejits.pbworks.com/.
[2] CodePy: a C/C++ metaprogramming toolkit for Python. http://mathema.tician.de/software/codepy.
[3] Mako templates for Python. http://www.makotemplates.org/.
[4] PyUBLAS: a seamless glue layer between numpy and boost.ublas. http://mathema.tician.de/software/pyublas.
[5] scikits.learn: machine learning in Python. http://scikit-learn.sourceforge.net/index.html.
[6] B. Alpern, L. Carter, and J. Ferrante. Space-limited procedures: a methodology for portable high-performance. In Proceedings of the Conference on Programming Models for Massively Parallel Computers (PMMP '95), pages 10–, Washington, DC, USA, 1995. IEEE Computer Society.
[7] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford Univ. Press, New York, 1995.
[8] B. Catanzaro, S. Kamil, Y. Lee, K. Asanović, J. Demmel, K. Keutzer, J. Shalf, K. Yelick, and A. Fox. SEJITS: Getting productivity and performance with selective embedded JIT specialization. In Workshop on Programming Models for Emerging Architectures (PMEA 2009), Raleigh, NC, October 2009.
[9] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. In Proceedings of the 16th Annual Symposium on Principles and Practice of Parallel Programming, February 2011.
[10] J. Demmel, J. Dongarra, A. Fox, and S. Williams. Accelerating time-to-solution for computational science and engineering. SciDAC Review, (15), 2009.
[11] M. Frigo and S. G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216–231, 2005. Special issue on "Program Generation, Optimization, and Platform Adaptation".
[12] P. Hudak. Building domain-specific embedded languages. ACM Comput. Surv., 28, December 1996.
[13] Khronos Group. OpenCL 1.1 Specification, September 2010. Version 1.1.
[14] N. S. L. P. Kumar, S. Satoor, and I. Buck. Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA. In Proceedings of the 2009 11th IEEE International Conference on High Performance Computing and Communications, pages 103–109, Washington, DC, USA, 2009. IEEE Computer Society.
[15] NVIDIA Corporation. NVIDIA CUDA Programming Guide, March 2010. Version 3.1.
[16] M. Paleczny, C. Vick, and C. Click. The Java HotSpot server compiler. In JVM'01: Proceedings of the 2001 Symposium on Java Virtual Machine Research and Technology Symposium, pages 1–1, Berkeley, CA, USA, 2001. USENIX Association.
[17] A. D. Pangborn. Scalable data clustering using GPUs. Master's thesis, Rochester Institute of Technology, 2010.
[18] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):232–275, 2005.
[19] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16(1):521+, 2005.
[20] I. Wilbers, H. P. Langtangen, and Å. Ødegård. Using Cython to speed up numerical Python programs. In B. Skallerud and H. I. Andersson, editors, Proceedings of MekIT'09, pages 495–512. NTNU, Tapir, 2009.
[21] S. W. Williams. Auto-tuning Performance on Multicore Computers. PhD thesis, EECS Department, University of California, Berkeley, Dec 2008.
