
cudaBayesreg: Bayesian Computation in CUDA

by Adelino Ferreira da Silva

Abstract: Graphical processing units are rapidly gaining maturity as powerful general parallel computing devices. The package cudaBayesreg uses GPU–oriented procedures to improve the performance of Bayesian computations. The paper motivates the need for devising high-performance computing strategies in the context of fMRI data analysis. Some features of the package for Bayesian analysis of brain fMRI data are illustrated. Comparative computing performance figures between sequential and parallel implementations are presented as well.

A functional magnetic resonance imaging (fMRI) data set consists of time series of volume data in 4D space. Typically, volumes are collected as slices of 64 x 64 voxels. The most commonly used functional imaging technique relies on the blood oxygenation level dependent (BOLD) phenomenon (Sardy, 2007). By analyzing the information provided by the BOLD signals in 4D space, it is possible to make inferences about activation patterns in the human brain. The statistical analysis of fMRI experiments usually involves the formation and assessment of a statistic image, commonly referred to as a Statistical Parametric Map (SPM). The SPM summarizes a statistic indicating evidence of the underlying neuronal activations for a particular task. The most common approach to SPM computation involves a univariate analysis of the time series associated with each voxel. Univariate analysis techniques can be described within the framework of the general linear model (GLM) (Sardy, 2007). The GLM procedure used in fMRI data analysis is often said to be “massively univariate”, since data for each voxel are independently fitted with the same model. Bayesian methodologies provide enhanced estimation accuracy (Friston et al., 2002). However, since (non-variational) Bayesian models draw on Markov Chain Monte Carlo (MCMC) simulations, Bayesian estimates involve a heavy computational burden.

The programmable Graphic Processing Unit (GPU) has evolved into a highly parallel processor with tremendous computational power and very high memory bandwidth (NVIDIA Corporation, 2010b). Modern GPUs are built around a scalable array of multithreaded streaming multiprocessors (SMs). Current GPU implementations enable scheduling thousands of concurrently executing threads. The Compute Unified Device Architecture (CUDA) (NVIDIA Corporation, 2010b) is a software platform for massively parallel high-performance computing on NVIDIA manycore GPUs. The CUDA programming model follows the standard single-program multiple-data (SPMD) model. CUDA greatly simplifies the task of parallel programming by providing thread management tools that work as extensions of conventional C/C++ constructions. Automatic thread management removes the burden of handling the scheduling of thousands of lightweight threads, and enables straightforward programming of the GPU cores.

The package cudaBayesreg (Ferreira da Silva, 2010a) implements a Bayesian multilevel model for the analysis of brain fMRI data in the CUDA environment. The statistical framework in cudaBayesreg is built around a Gibbs sampler for multilevel/hierarchical linear models with a normal prior (Ferreira da Silva, 2010c). Multilevel modeling may be regarded as a generalization of regression methods in which regression coefficients are themselves given a model with parameters estimated from data (Gelman, 2006). As in SPM, the Bayesian model fits a linear regression model at each voxel, but uses multivariate statistics for parameter estimation at each iteration of the MCMC simulation. The Bayesian model used in cudaBayesreg follows a two–stage Bayes prior approach to relate voxel regression equations through correlations between the regression coefficient vectors (Ferreira da Silva, 2010c). This model closely follows the Bayesian multilevel model proposed by Rossi, Allenby and McCulloch (Rossi et al., 2005), and implemented in bayesm (Rossi and McCulloch, 2008). This approach overcomes several limitations of the classical SPM methodology. The SPM methodology traditionally used in fMRI has several important limitations, mainly because it relies on classical hypothesis tests and p–values to make statistical inferences in neuroimaging (Friston et al., 2002; Berger and Sellke, 1987; Vul et al., 2009). However, as is often the case with MCMC simulations, the implementation of this Bayesian model in a sequential computer entails significant time complexity. The CUDA implementation of the Bayesian model proposed here has been able to reduce significantly the runtime processing of the MCMC simulations. The main contribution for the increased performance comes from the use of separate threads for fitting the linear regression model at each voxel in parallel.

Bayesian multilevel modeling

We are interested in the following Bayesian multilevel model, which has been analyzed by Rossi et al. (2005), and has been implemented as rhierLinearModel in bayesm.

Start out with a general linear model, and fit a set of m voxels as,

\[ y_i = X_i \beta_i + e_i, \qquad e_i \overset{iid}{\sim} N\!\left(0, \sigma_i^2 I_{n_i}\right), \qquad i = 1, \dots, m. \tag{1} \]

In order to tie together the voxels’ regression equations, assume that the {β_i} have a common prior distribution. To build the Bayesian regression model we need to specify a prior on the {β_i} coefficients, and a prior on the regression error variances {σ_i²}. Following Ferreira da Silva (2010c), specify a normal regression prior with mean ∆′ z_i for each β,

\[ \beta_i = \Delta' z_i + \nu_i, \qquad \nu_i \overset{iid}{\sim} N\!\left(0, V_\beta\right), \tag{2} \]

where z_i is a vector of n_z elements, representing characteristics of each of the m regression equations. The prior (2) can be written using the matrix form of the multivariate regression model for k regression coefficients,

\[ B = Z \Delta + V \tag{3} \]

where B and V are m × k matrices, Z is a m × n_z matrix, and ∆ is a n_z × k matrix. Interestingly, the prior (3) assumes the form of a second–stage regression, where each column of ∆ has coefficients which describe how the mean of the k regression coefficients varies as a function of the variables in z. In (3), Z assumes the role of a prior design matrix.

The proposed Bayesian model can be written down as a sequence of conditional distributions (Ferreira da Silva, 2010c),

\[
\begin{aligned}
& y_i \mid X_i, \beta_i, \sigma_i^2 \\
& \beta_i \mid z_i, \Delta, V_\beta \\
& \sigma_i^2 \mid \nu_i, s_i^2 \\
& V_\beta \mid \nu, V \\
& \Delta \mid V_\beta, \bar{\Delta}, A.
\end{aligned} \tag{4}
\]

Running MCMC simulations on the set of full conditional posterior distributions (4), the full posterior for all the parameters of interest may then be derived.
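
For reference, one common form of the voxel-level updates implied by (1)–(2) is the standard conditionally conjugate normal / scaled-inverse-chi-square pair below (cf. Rossi et al., 2005); this is a sketch for orientation, and the exact parameterization used in the package may differ slightly:

\[
\begin{aligned}
\beta_i \mid \sigma_i^2, y_i, \Delta, V_\beta
 &\sim N\!\left(\tilde{\beta}_i,\; \left(X_i' X_i/\sigma_i^2 + V_\beta^{-1}\right)^{-1}\right),
 \qquad
 \tilde{\beta}_i = \left(X_i' X_i/\sigma_i^2 + V_\beta^{-1}\right)^{-1}
 \left(X_i' y_i/\sigma_i^2 + V_\beta^{-1}\,\Delta' z_i\right), \\[4pt]
\sigma_i^2 \mid \beta_i, y_i
 &\sim \frac{\nu_i\, s_i^2 + (y_i - X_i\beta_i)'(y_i - X_i\beta_i)}{\chi^2_{\nu_i + n_i}}.
\end{aligned}
\]
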
GPU computation

In this section, we describe some of the main design considerations underlying the code implementation in cudaBayesreg, and the options taken for processing fMRI data in parallel. Ideally, the GPU is best suited for computations that can be run on numerous data elements simultaneously in parallel (NVIDIA Corporation, 2010b). Typical textbook applications for the GPU involve arithmetic on large matrices, where the same operation is performed across thousands of elements at the same time. Since the Bayesian model of computation outlined in the previous section does not fit well in this framework, some design options had to be assumed in order to properly balance optimization, memory constraints, and implementation complexity, while maintaining numerical correctness. Some design requirements for good performance on CUDA are as follows (NVIDIA Corporation, 2010a): (i) the software should use a large number of threads; (ii) different execution paths within the same thread block (warp) should be avoided; (iii) inter-thread communication should be minimized; (iv) data should be kept on the device as long as possible; (v) global memory accesses should be coalesced whenever possible; (vi) the use of shared memory should be preferred to the use of global memory.

We detail below how well these requirements have been met in the code implementation. The first requirement is easily met by cudaBayesreg. On the one hand, fMRI applications typically deal with thousands of voxels. On the other hand, the package uses three constants which can be modified to suit the available device memory, and the computational power of the GPU. Specifically, REGDIM specifies the maximum number of regressions (voxels), OBSDIM specifies the maximum length of the time series observations, and XDIM specifies the maximum number of regression coefficients. Requirements (ii) and (iii) are satisfied by cudaBayesreg as well. Each thread executes the same code independently, and no inter-thread communication is required. Requirement (iv) is optimized in cudaBayesreg by using as much constant memory as permitted by the GPU. Data that do not change between MCMC iterations are kept in constant memory. Thus, we reduce expensive memory data transfers between host and device. For instance, the matrix of voxel predictors X (see (1)) is kept in constant memory. Requirement (v) is insufficiently met in cudaBayesreg. For maximum performance, memory accesses to global memory must be coalesced. However, different fMRI data sets and parameterizations may generate data structures with highly variable dimensions, thus rendering coalescence difficult to implement in a robust manner. Moreover, GPU devices of Compute Capability 1.x impose hard memory coalescing restrictions. Fortunately, GPU devices of Compute Capability 2.x are expected to lift some of the most taxing memory coalescing constraints. Finally, requirement (vi) is not met by the available code. The current kernel implementation does not use shared memory. The relative complexity of the Bayesian computation performed by each thread compared to the conventional arithmetic load assigned to the thread has prevented us from exploiting shared memory operations. The task assigned to the kernel may be subdivided to reduce kernel complexity. However, this option may easily compromise other optimization requirements, namely thread independence. As detailed in the next paragraph, our option has been to keep the computational design simple, by assigning the whole of the univariate regression to the kernel.

The computational model has been specified as a grid of thread blocks of dimension 64, in which a separate thread is used for fitting a linear regression model at each voxel in parallel. Maximum efficiency is expected to be achieved when the total number of required threads to execute in parallel equals the number of voxels in the fMRI data set, after appropriate masking has been done. However, this approach typically calls for the parallel execution of several thousands of threads. To keep computational resources low, while maintaining significantly high efficiency, it is generally preferable to process fMRI data slice-by-slice. In this approach, slices are processed in sequence. Voxels in slices are processed in parallel. Thus, for slices of dimension 64 × 64, the required number of parallel executing threads does not exceed 4096 at a time. The main computational bottleneck in sequential code comes from the necessity of performing Gibbs sampling using a (temporal) univariate regression model for all voxel time series. We coded this part of the MCMC computation as device code, i.e. a kernel to be executed by the CUDA threads. CUDA threads execute on the GPU device that operates as a coprocessor to the host running the MCMC simulation. Following the proposed Bayesian model (4), each thread implements a Gibbs sampler to draw from the posterior of a univariate regression with a conditionally conjugate prior. The host code is responsible for controlling the MCMC simulation. At each iteration, the threads perform one Gibbs iteration for all voxels in parallel, to draw the threads’ estimators for the regression coefficients β_i as specified in (4). In turn, the host, based on the simulated β_i values, draws from the posterior of a multivariate regression model to estimate V_β and ∆. These values are then used to drive the next iteration.
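
To make the per-thread workload concrete, the following sequential R sketch mimics one such voxel-level Gibbs update, using standard conditionally conjugate draws. The function and variable names are illustrative only; they do not mirror the package's internal device code, whose parameterization may differ.

gibbs_voxel_step <- function(y, X, z, Delta, Vbeta_inv, sigma2, nu_e, ssq) {
  ## draw beta | sigma2: normal prior with mean Delta' z and covariance V_beta
  prior_mean <- crossprod(Delta, z)                 # Delta' z
  prec <- crossprod(X) / sigma2 + Vbeta_inv         # posterior precision, prec = U'U
  U <- chol(prec)
  rhs <- crossprod(X, y) / sigma2 + Vbeta_inv %*% prior_mean
  bhat <- backsolve(U, forwardsolve(t(U), rhs))     # posterior mean
  beta <- bhat + backsolve(U, rnorm(ncol(X)))       # draw from N(bhat, prec^-1)
  ## draw sigma2 | beta: scaled inverse chi-square
  e <- y - X %*% beta
  sigma2 <- (nu_e * ssq + sum(e^2)) / rchisq(1, df = nu_e + length(y))
  list(beta = drop(beta), sigma2 = sigma2)
}

One call of this kind corresponds, roughly, to the work of a single CUDA thread in one MCMC iteration; the host-level draws of V_β and ∆ remain as described above.
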
The bulk of the MCMC simulations for Bayesian data analysis is implemented in the kernel (device code). Most currently available RNGs for the GPU tend to have too high time- and space-complexity for our purposes. Therefore, we implemented and tested three different random number generators (RNGs) in device code, by porting three well-known RNGs to device code. Marsaglia’s multicarry RNG (Marsaglia, 2003) follows the R implementation, is the fastest one, and is used by default; Brent’s RNG (Brent, 2007) has higher quality but is not so fast; Matsumoto’s Mersenne Twister (Matsumoto and Nishimura, 1998) is slower than the others. In addition, we had to ensure that different threads receive different random seeds. We generated random seeds for the threads by combining random seeds generated by the host with the threads’ unique identification numbers. Random deviates from the normal (Gaussian) distribution and chi-squared distribution had to be implemented in device code as well. Random deviates from the normal distribution were generated using the Box-Muller method. In a similar vein, random deviates from the chi-squared distribution with ν degrees of freedom, χ²(ν), were generated from gamma deviates, Γ(ν/2, 1/2), following the method of Marsaglia and Tsang specified in (Press et al., 2007).
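
As a host-side sanity check of these distributional recipes (Box–Muller for normal deviates, and χ²(ν) obtained as Γ(ν/2, 1/2) gamma deviates), one may compare them against R's built-in generators; this is an illustrative check only and is not part of the package code.

set.seed(1)
n <- 1e5
u1 <- runif(n); u2 <- runif(n)
z <- sqrt(-2 * log(u1)) * cos(2 * pi * u2)     # Box-Muller normal deviates
nu <- 3
x <- rgamma(n, shape = nu / 2, rate = 1 / 2)   # Gamma(nu/2, 1/2) = chi-squared(nu)
c(mean(z), var(z))               # close to 0 and 1
c(mean(x), var(x))               # close to nu and 2 * nu
ks.test(x, pchisq, df = nu)$p.value   # no evidence against the chi-squared target
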

The next sections provide details on how to use cudaBayesreg (Ferreira da Silva, 2010b) for fMRI data analysis. Two data sets, which are included and documented in the complementary package cudaBayesregData (Ferreira da Silva, 2010b), have been used in testing: the ‘fmri’ and the ‘swrfM’ data sets. We performed MCMC simulations on these data sets using three types of code implementations for the Bayesian multilevel model specified before: a (sequential) R-language version, a (sequential) C-language version, and a CUDA implementation. Comparative runtimes for 3000 iterations in these three situations, for the data sets ‘fmri’ and ‘swrfM’, are as follows.

Runtimes in seconds for 3000 iterations:

          slice   R-code   C-code   CUDA
  fmri        3     1327      224     22
  swrfM      21     2534      309     41

Speed-up factors between the sequential versions and the parallel CUDA implementation are summarized next.

Comparative speedup factors:

          C-vs-R   CUDA-vs-C   CUDA-vs-R
  fmri       6.0        10.0        60.0
  swrfM      8.2         7.5        61.8

In these tests, the C implementation provided, approximately, a 7.6× mean speedup factor relative to the equivalent R implementation. The CUDA implementation provided an 8.8× mean speedup factor relative to the equivalent C implementation. Overall, the CUDA implementation yielded a significant 60× speedup factor. The tests were performed on a notebook equipped with a (low-end) graphics card: a ‘GeForce 8400M GS’ NVIDIA device. This GPU device has just 2 multiprocessors, Compute Capability 1.1, and delivers single-precision performance. The compiler flags used in compiling the source code are detailed in the package’s Makefile. In particular, the optimization flag -O3 is set there. It is worth noting that the CUDA implementation in cudaBayesreg affords much higher speedups. First, the CUDA implementation may easily be modified to process all voxels of an fMRI volume in parallel, instead of processing data on a slice-by-slice basis. Second, GPUs with 16 multiprocessors and 512 CUDA cores and Compute Capability 2.0 are now available.

Experiments using the fmri test dataset

The data set ‘fmri.nii.gz’ is available from the FMRIB/FSL site (www.fmrib.ox.ac.uk/fsl). This data set is from an auditory–visual experiment. Auditory stimulation was applied as an alternating “boxcar” with 45s-on-45s-off and visual stimulation was applied as an alternating “boxcar” with 30s-on-30s-off. The data set includes just 45 time points and 5 slices from the original 4D data. The file ‘fmri_filtered_func_data.nii’ included in cudaBayesregData was obtained from ‘fmri.nii.gz’ by applying FSL/FEAT pre-processing tools. For input/output of NIFTI formatted fMRI data sets cudaBayesreg depends on the R package oro.nifti (Whitcher et al., 2010). The following code runs the MCMC simulation for slice 3 of the fmri dataset, and saves the result.

> require("cudaBayesreg")
> slicedata <- read.fmrislice(fbase = "fmri",
+     slice = 3, swap = TRUE)
> ymaskdata <- premask(slicedata)
> fsave <- "/tmp/simultest1.sav"
> out <- cudaMultireg.slice(slicedata,
+     ymaskdata, R = 3000, keep = 5,
+     nu.e = 3, fsave = fsave, zprior = FALSE)
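
For orientation, the stimulation paradigm described above corresponds to boxcar regressors of the following kind. This is a hypothetical sketch only: it assumes a repetition time of 3 s and an "on" phase at the start of each cycle, and cudaBayesreg builds its design matrix internally via read.fmrislice, so the code below is not used by the package.

TR <- 3                            # assumed repetition time in seconds
n  <- 45                           # number of time points in the example data
times <- (seq_len(n) - 1) * TR     # scan acquisition times
aud <- as.numeric((times %% 90) < 45)   # 45 s on / 45 s off auditory boxcar
vis <- as.numeric((times %% 60) < 30)   # 30 s on / 30 s off visual boxcar
X <- cbind(intercept = 1, aud = aud, vis = vis)
head(X)
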

We may extract the posterior probability (PPM) images for the visual (vreg=2) and auditory (vreg=4) stimulation as follows (see Figures 1 and 2).

> post.ppm(out = out, slicedata = slicedata,
+     ymaskdata = ymaskdata, vreg = 2,
+     col = heat.colors(256))

Figure 1: PPM images for the visual stimulation

> post.ppm(out = out, slicedata = slicedata,
+     ymaskdata = ymaskdata, vreg = 4,
+     col = heat.colors(256))

Figure 2: PPM images for the auditory stimulation

To show the fitted time series for a (random) active voxel, as depicted in Figure 3, we use the code:

> post.tseries(out = out, slicedata = slicedata,
+     ymaskdata = ymaskdata, vreg = 2)

range pm2: -1.409497 1.661774

Figure 3: Fitted time–series for an active voxel

Summary statistics for the posterior mean values of regression coefficient vreg=2 are presented next. The same function plots the histogram of the posterior distribution for vreg=2, as represented in Figure 4.

> post.simul.hist(out = out, vreg = 2)


Call:
	density.default(x = pm2)

Data: pm2 (1525 obs.);	Bandwidth 'bw' = 0.07947

       x                  y
 Min.   :-1.6479   Min.   :0.0000372
 1st Qu.:-0.7609   1st Qu.:0.0324416
 Median : 0.1261   Median :0.1057555
 Mean   : 0.1261   Mean   :0.2815673
 3rd Qu.: 1.0132   3rd Qu.:0.4666134
 Max.   : 1.9002   Max.   :1.0588599
[1] "active range:"
[1] 0.9286707 1.6617739
[1] "non-active range:"
[1] -1.4094965 0.9218208
hpd (95%)= -0.9300451 0.9286707

Figure 4: Histogram of the posterior distribution of the regression coefficient β2 (slice 3).

An important feature of the Bayesian model used in cudaBayesreg is the shrinkage induced by the hyperprior ν on the estimated parameters. We may assess the adaptive shrinkage properties of the Bayesian multilevel model for two different values of ν as follows.

> nu2 <- 45
> fsave2 <- "/tmp/simultest2.sav"
> out2 <- cudaMultireg.slice(slicedata,
+     ymaskdata, R = 3000, keep = 5,
+     nu.e = nu2, fsave = fsave2,
+     zprior = F)
> vreg <- 2
> x1 <- post.shrinkage.mean(out = out,
+     slicedata$X, vreg = vreg,
+     plot = F)
> x2 <- post.shrinkage.mean(out = out2,
+     slicedata$X, vreg = vreg,
+     plot = F)
> par(mfrow = c(1, 2), mar = c(4,
+     4, 1, 1) + 0.1)
> xlim = range(c(x1$beta, x2$beta))
> ylim = range(c(x1$yrecmean, x2$yrecmean))
> plot(x1$beta, x1$yrecmean, type = "p",
+     pch = "+", col = "violet",
+     ylim = ylim, xlim = xlim,
+     xlab = expression(beta), ylab = "y")
> legend("topright", expression(paste(nu,
+     "=3")), bg = "seashell")
> plot(x2$beta, x2$yrecmean, type = "p",
+     pch = "+", col = "blue", ylim = ylim,
+     xlim = xlim, xlab = expression(beta),
+     ylab = "y")
> legend("topright", expression(paste(nu,
+     "=45")), bg = "seashell")
> par(mfrow = c(1, 1))

Figure 5: Shrinkage assessment: variability of mean predictive values for ν = 3 and ν = 45.

Experiments using the SPM auditory dataset

In this section, we exemplify the analysis of the random effects distribution ∆, following the specification of cross-sectional units (group information) in the Z matrix of the statistical model. The Bayesian multilevel statistical model allows for the analysis of random effects through the specification of the Z matrix for the prior in (2). The data set with prefix swrfM (argument fbase="swrfM") in the package’s data directory includes mask files associated with the partition of the fMRI dataset ‘swrfM’ into 3 classes: cerebrospinal fluid (CSF), gray matter (GRY) and white matter (WHT). As before, we begin by loading the data and running the simulation. This time, however, we call cudaMultireg.slice with the argument zprior=TRUE. This argument will launch read.Zsegslice, which reads the segmented images (CSF/GRY/WHT) to build the Z matrix.


> fbase <- "swrfM"
> slice <- 21
> slicedata <- read.fmrislice(fbase = fbase,
+     slice = slice, swap = TRUE)
> ymaskdata <- premask(slicedata)
> fsave3 <- "/tmp/simultest3.sav"
> out <- cudaMultireg.slice(slicedata,
+     ymaskdata, R = 3000, keep = 5,
+     nu.e = 3, fsave = fsave3,
+     zprior = TRUE)
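
To fix ideas, a prior design matrix Z built from a CSF/GRY/WHT segmentation might look like the toy sketch below, with one row per active voxel and indicator columns for tissue class. This is a hypothetical illustration only; the actual matrix is built internally by read.Zsegslice, and its exact coding may differ.

seg <- factor(c("CSF", "GRY", "WHT", "GRY", "WHT"))   # toy voxel labels
Z <- cbind(intercept = 1,
           GRY = as.numeric(seg == "GRY"),
           WHT = as.numeric(seg == "WHT"))            # CSF as baseline class
Z
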

We confirm that areas of auditory activation have been effectively selected by displaying the PPM image for regression variable vreg=2.

> post.ppm(out = out, slicedata = slicedata,
+     ymaskdata = ymaskdata, vreg = 2,
+     col = heat.colors(256))

Figure 6: PPM images for the auditory stimulation

Plots of the draws of the mean of the random effects distribution are presented in Figure 7.

> post.randeff(out)

Figure 7: Draws of the mean of the random effects distribution

Random effects plots for each of the 3 classes are obtained by calling,

> post.randeff(out, classnames = c("CSF",
+     "GRY", "WHT"))

Plots of the random effects associated with the 3 classes are depicted in Figures 8–10.

Figure 8: Draws of the random effects distribution for class CSF


Figure 9: Draws of the random effects distribution for class GRY

Figure 10: Draws of the random effects distribution for class WHT

Conclusion

The CUDA implementation of the Bayesian model in cudaBayesreg has been able to reduce significantly the runtime processing of MCMC simulations. For the applications described in this paper, we have obtained speedup factors of 60× compared to the equivalent sequential R code. The results point out the enormous potential of parallel high-performance computing strategies on manycore GPUs.

Bibliography

J. Berger and T. Sellke. Testing a point null hypothesis: the irreconcilability of p values and evidence. Journal of the American Statistical Association, 82:112–122, 1987.

R. P. Brent. Some long-period random number generators using shifts and xors. ANZIAM Journal, 48 (CTAC2006):C188–C202, 2007. URL http://wwwmaths.anu.edu.au/~brent.

A. R. Ferreira da Silva. cudaBayesreg: CUDA Parallel Implementation of a Bayesian Multilevel Model for fMRI Data Analysis, 2010a. URL http://CRAN.R-project.org/package=cudaBayesreg. R package version 0.3-8.

A. R. Ferreira da Silva. cudaBayesregData: Data sets for the examples used in the package cudaBayesreg, 2010b. URL http://CRAN.R-project.org/package=cudaBayesreg. R package version 0.3-8.

A. R. Ferreira da Silva. A Bayesian multilevel model for fMRI data analysis. Comput. Methods Programs Biomed., in press, 2010c.

K. J. Friston, W. Penny, C. Phillips, S. Kiebel, G. Hinton, and J. Ashburner. Classical and Bayesian Inference in Neuroimaging: Theory. NeuroImage, 16:465–483, 2002.

A. Gelman. Multilevel (hierarchical) modeling: What it can and cannot do. Technometrics, 48(3):432–435, Aug. 2006.

G. Marsaglia. Random number generators. Journal of Modern Applied Statistical Methods, 2, May 2003.

M. Matsumoto and T. Nishimura. Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Transactions on Modeling and Computer Simulation, 8:3–30, Jan. 1998.

NVIDIA Corporation. CUDA C Best Practices Guide Version 3.2, Aug. 2010a. URL http://www.nvidia.com/CUDA.

NVIDIA Corporation. NVIDIA CUDA Programming Guide, Version 3.2, Aug. 2010b. URL http://www.nvidia.com/CUDA.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes, The Art of Scientific Computing. CUP, third edition, 2007.

P. Rossi and R. McCulloch. bayesm: Bayesian Inference for Marketing/Micro-econometrics, 2008. URL http://faculty.chicagogsb.edu/peter.rossi/research/bsm.html. R package version 2.2-2.

P. E. Rossi, G. Allenby, and R. McCulloch. Bayesian Statistics and Marketing. John Wiley and Sons, 2005.


G. E. Sardy. Computing Brain Activity Maps from fMRI Time-Series Images. Cambridge University Press, 2007.

E. Vul, C. Harris, P. Winkielman, and H. Pashler. Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3):274–290, 2009.

B. Whitcher, V. Schmid, and A. Thornton. oro.nifti: Rigorous - NIfTI Input / Output, 2010. URL http://CRAN.R-project.org/package=oro.nifti. R package version 0.1.4.

Adelino Ferreira da Silva
Universidade Nova de Lisboa
Faculdade de Ciências e Tecnologia
Portugal
[email protected]
