VIBNN: Hardware Acceleration of Bayesian Neural Networks

Ruizhe Cai*, Ao Ren*, Ning Liu, Caiwen Ding
Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, New York
{rcai100,aren,nliu03,cading}@syr.edu

Luhao Wang, Xuehai Qian, Massoud Pedram
Department of Electrical Engineering, University of Southern California, Los Angeles, California
{luhaowan,xuehai.qian,pedram}@usc.edu

Yanzhi Wang
Department of Electrical Engineering and Computer Science, Syracuse University, Syracuse, New York
[email protected]

*Ruizhe Cai and Ao Ren contributed equally to this work.

ABSTRACT

Bayesian Neural Networks (BNNs) have been proposed to address the problem of model uncertainty in training and inference. By introducing weights associated with conditioned probability distributions, BNNs are capable of resolving the overfitting issue commonly seen in conventional neural networks and allow for small-data training through the variational inference process. Frequent usage of Gaussian random variables in this process requires a properly optimized Gaussian Random Number Generator (GRNG). The high hardware cost of conventional GRNGs makes the hardware implementation of BNNs challenging.

In this paper, we propose VIBNN, an FPGA-based hardware accelerator design for variational inference on BNNs. We explore the design space for the massive amount of Gaussian variable sampling tasks in BNNs. Specifically, we introduce two high-performance Gaussian (pseudo) random number generators: 1) the RAM-based Linear Feedback Gaussian Random Number Generator (RLF-GRNG), which is inspired by the properties of the binomial distribution and linear feedback logics; and 2) the Bayesian Neural Network-oriented Wallace Gaussian Random Number Generator. To achieve high scalability and efficient memory access, we propose a deep pipelined accelerator architecture with fast execution and good hardware utilization. Experimental results demonstrate that the proposed VIBNN implementations on an FPGA can achieve a throughput of 321,543.4 Images/s and an energy efficiency up to 52,694.8 Images/J while maintaining similar accuracy as its software counterpart.

KEYWORDS

Bayesian Neural Network, Neural Network, FPGA

ACM Reference Format:
Ruizhe Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, and Yanzhi Wang. 2018. VIBNN: Hardware Acceleration of Bayesian Neural Networks. In ASPLOS '18: 2018 Architectural Support for Programming Languages and Operating Systems, March 24–28, 2018, Williamsburg, VA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3173162.3173212

1 INTRODUCTION

As a key branch of machine learning and artificial intelligence techniques, Artificial Neural Networks (ANNs) have been introduced to create machines that can learn and infer [22]. Many different types and models of ANNs have been developed for a variety of applications and higher performance, including Convolutional Neural Networks (CNNs), Multi-Layer Perceptron Networks (MLPs), Recurrent Neural Networks (RNNs), etc. [44]. With the development and broad application of deep learning algorithms, neural networks have recently achieved tremendous success in various fields, such as image classification, object recognition, natural language processing, autonomous driving, cancer detection, etc. [1, 14, 45, 47].

With the success of deep learning, a rising number of recent works have studied the highly parallel computing paradigm and the hardware implementations of neural networks [2, 12, 13, 24, 29, 40, 42, 46]. These hardware approaches typically accelerate the inference process of neural networks and have shown promising performance in terms of speed, energy efficiency, and accuracy, making them ideal for embedded and IoT systems. Despite the significant progress of neural network acceleration, it is well known that conventional neural networks are prone to overfitting, i.e., situations where the model fails to generalize well from the training data to the test data [20]. The fundamental reason is that traditional neural network models fail to provide estimates with uncertainty information [9].
This missing characteristic is crucial to avoid making over-confident decisions, especially for many supervised learning applications with missing or noisy training data. To solve this issue, the ensemble model has been introduced [17, 21] to combine the results from multiple neural network models, so that degraded generalization performance can be avoided. As a key example, Bayesian Neural Networks (BNNs) are capable of forming ensemble models while maintaining limited memory space overhead [32, 58]. In addition, unlike conventional neural networks that rely on huge amounts of data for training, BNNs can easily learn from small datasets, with the ability to offer uncertainty estimates and the robustness to mitigate over-fitting issues [20]. Moreover, the overall accuracy can be improved as well.

Specifically, BNNs apply Bayesian inference to provide principled uncertainty estimates. In contrast to traditional neural networks whose weights are fixed values, each weight in BNNs is a random number following a posterior probability distribution, which is conditioned on a prior probability and its observed data. Unfortunately, exact Bayesian inference is in general an intractable problem, and obtaining closed-form solutions requires either the assumption of special families of models [11] or the availability of probability distributions [21]. Therefore, an approximation method of Bayesian inference is generally used to ensure low computational complexity and a high degree of generality [8, 48]. Among various approximation techniques, variational approximation, or variational inference, tends to be faster and easier to implement. Besides, the variational inference method offers better scalability with large models, especially for large-scale neural networks in deep learning applications [26], compared with other commonly adopted Bayesian inference methods such as Markov Chain Monte Carlo (MCMC) [4].

In addition to faster computation, the variational inference method can efficiently represent weights (probability distributions) with a limited number of parameters. The Bayes-by-Backprop algorithm proposed by Blundell et al. [9], for instance, only doubles the parameters compared to ANNs while achieving an infinitely large ensemble of models.
The BNNs we focus on in this work belong to the category of feed-forward neural networks (FNNs), which have achieved great successes in many important fields, such as the HIGGS challenge, the Merck Molecular Activity challenge, and the Tox21 Data challenge [30]. Despite the tremendous attention on CNNs and RNNs, accelerators for FNN models are imperative as well, which is noted in the recent Google paper [29] and the very recently invented SeLU technique [30]. With the recent shift in various fields towards the deployment of BNNs [9, 19, 50], hardware acceleration for BNNs becomes critical and has not been well considered in prior works. However, hardware realizations of BNNs pose a fundamental challenge compared to traditional ANNs: the frequent operations on probability distributions require additional logic circuits designed and optimized for (Gaussian) random number generation.

In this paper, we propose VIBNN, an FPGA-based hardware accelerator design for variational inference on BNNs. In VIBNN, posterior distributions of network weights are approximated as Gaussian distributions associated with variational parameters (trainable mean values and variances). As a common practice, network training is performed on CPU/GPU clusters before the trained parameters of the weight distributions are deployed to the accelerator. We explore the design space for the massive amount of Gaussian variable sampling tasks in BNNs. Specifically, we introduce two high-performance Gaussian (pseudo) random number generators: 1) the RAM-based Linear Feedback Gaussian Random Number Generator (RLF-GRNG), which is inspired by the properties of the binomial distribution and linear feedback logics; and 2) the Bayesian Neural Network-oriented Wallace Gaussian Random Number Generator. To achieve high scalability and efficient memory access, we propose a deep pipelined accelerator architecture with fast execution and good hardware utilization. It is important to note that BNN is a mathematical model, instead of a specific type of neural network structure. Therefore, the design principles of VIBNN are orthogonal to the optimization techniques on convolutional layers in previous works [16, 27, 36], and can be applied to CNNs and RNNs as well.

Experimental results suggest that the proposed VIBNN can achieve similar accuracy as its software counterpart (BNNs) with a very high energy efficiency of 52,694.9 images/J thanks to the proposed GRNG structure.

2 BNNS USING VARIATIONAL INFERENCE

2.1 Bayesian Model, Variational Inference, and Gaussian Approximation

For a general Bayesian model, the latent variables are w and the observed data points are D. From the Bayes rule, the posterior probability can be calculated as:

    P(w|D) = P(D|w)P(w) / P(D)    (1)

where P(w) is called the prior probability, which indicates the probability of the latent variables w before any data observations. P(D|w) is called the likelihood, which is the probability of the data D based on the latent variable observations w. The denominator P(D) is calculated as the integral over all possible latent variables, i.e., P(D) = ∫ P(D|w)P(w)dw.

For most applications of interest, this integral is intractable; therefore effective approaches are needed to approximately estimate/evaluate the posterior probability. Variational inference [28, 52] is a machine learning method for approximating the posterior probability densities in Bayesian inference models, with a higher convergence rate (compared to the MCMC method) and scalability to large problems. As shown in [21], the variational inference method posits a family of probability distributions q(w;θ) with variational parameters θ to approximate the posterior distribution p(w|D).
For simplicity, one can assume that the variational posterior distribution is a Gaussian distribution; then the variational parameters (vector) θ can be specified as θ = (µ, ρ), where µ is the vector of mean values and ρ is used to further produce the non-negative vector of standard deviations, i.e., σ = ln(1 + exp(ρ)). Therefore, a sample of w can be obtained by shifting and scaling unit Gaussian variables:

    w = µ + ϵ ◦ ln(1 + exp(ρ))    (2)

where ϵ ∼ N(0, I) is a vector of independent unit Gaussian variables, and ◦ denotes the element-wise multiplication operation. Therefore, generating unit Gaussian variables is the key step in generating samples of w.
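For illustration only, the following Python/NumPy sketch shows a software realization of the sampling rule in equation (2); the function and argument names are placeholders of our own choosing and are not part of the VIBNN implementation.

    import numpy as np

    def sample_weights(mu, rho, rng=None):
        # Draw one weight sample w = mu + eps * ln(1 + exp(rho)), as in equation (2).
        if rng is None:
            rng = np.random.default_rng()
        sigma = np.log1p(np.exp(rho))        # softplus: guarantees a non-negative standard deviation
        eps = rng.standard_normal(mu.shape)  # vector of independent unit Gaussian variables
        return mu + eps * sigma              # element-wise shift and scale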

2.2 Gaussian Variational Inference with a BNN

Bayesian Neural Networks (BNNs) were first suggested in the 1990s and have been studied extensively since then [33, 39]. Compared to conventional neural networks with fixed-value weights representing deterministic models, BNNs offer ensembles of models by incorporating probability distributions over the models' weights, as shown in Figure 1. As a result, BNNs achieve higher overall accuracy and are more robust to over-fitting. In addition, unlike conventional neural networks that require large data sets to achieve good accuracy, BNNs can easily learn from small data with a much improved convergence rate.

[Figure 1: Illustration of the model structure of BNNs.]

This paper considers Gaussian variational inference for BNNs. The goal is to perform inference using a BNN with a learned posterior distribution p(w|D) of network weights. To fully utilize the posterior distribution, the network output should be derived by averaging over the outputs produced according to the posterior distribution. In other words, the output of the network, y, is calculated as:

    y = E_{p(w|D)}[g(x_0, w)]    (3)

where g(x_0, w) is the network function and x_0 is the network input. With the variational inference method, the network output can be represented as:

    y ≈ E_{q(w;θ)}[g(x_0, w)]    (4)

Using Monte Carlo sampling, the output can be further approximated as:

    y ≈ (1/N) Σ_{i=1}^{N} g(x_0, w^i)    (5)

where N is the total sample count and w^i is the i-th Monte Carlo sample drawn from the variational posterior distribution q(w;θ), which can be obtained using equation (2). Then the output is estimated as:

    y ≈ (1/N) Σ_{i=1}^{N} g(x_0, µ + σ ◦ ϵ^i)    (6)

where ϵ^i is the i-th Monte Carlo sample of unit Gaussian variables. As shown in Figure 1, each weight requires sampling a unit Gaussian random variable. Then a BNN can perform inference using the same method as a normal ANN.

Since our target implementation platforms are low-power, embedded systems based on FPGAs, the network is trained offline, as a widely adopted practice in hardware deep learning research [2, 12, 18, 41, 42, 56], using high-performance computing platforms such as GPUs. Afterward, the trained variational parameters (vectors) µ and σ are migrated to the memory of the target FPGA platform.
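For reference, a minimal NumPy sketch of the Monte Carlo estimate in equations (4)-(6) is given below; the network function g, the argument names, and the default sample count are assumptions of our own rather than part of the VIBNN design.

    import numpy as np

    def bnn_predict(g, x0, mu, rho, num_samples=100, rng=None):
        # Monte Carlo estimate of y = E_q[g(x0, w)], following equations (4)-(6).
        if rng is None:
            rng = np.random.default_rng()
        sigma = np.log1p(np.exp(rho))
        total = 0.0
        for _ in range(num_samples):
            eps = rng.standard_normal(mu.shape)     # fresh unit Gaussian draw for every sample
            total = total + g(x0, mu + sigma * eps) # one ordinary feed-forward pass per sample
        return total / num_samples                  # y ~= (1/N) * sum_i g(x0, mu + sigma o eps^i)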

2.3 Gaussian Random Number Generators (GRNGs)

As introduced before, weights and biases in the BNNs of interest are drawn from Gaussian distributions N(µ, σ^2). Therefore, it is crucial to rapidly generate high-quality Gaussian random numbers.

The methods of generating Gaussian Random Numbers (GRNs) can be classified into four categories: 1) the cumulative density function (CDF) inversion method, which obtains GRNs simply by inverting the CDFs, such as [7, 37]; 2) the transformation method, which obtains GRNs through a series of operations on uniform distributions, according to the Central Limit Theorem (CLT), such as [38, 49]; 3) the rejection method, which generates GRNs based on the transformation method, but with one additional rejection step that conditionally rejects some of the transformed numbers, such as the Ziggurat algorithm [35]; 4) the recursion method, which generates GRNs by combining previously generated GRNs in a linear manner, such as the Wallace method [54].

For hardware implementations, not all algorithms that are successful in software are appropriate, because of restrictions such as hardware resources, power/energy constraints, and implementation complexity. Therefore, the appropriate selection of hardware-based GRNGs is typically application-specific and needs to be carefully investigated. In this paper, we consider the CLT-based methods and the Wallace method to be the most appropriate choices for hardware neural network acceleration. The main consideration is their lower computation overhead, which facilitates the efficient hardware implementation of BNNs.

One of the major advantages of the CLT-based GRNGs is that uniform random numbers can be easily generated with linear-feedback shift registers (LFSRs). However, the generation quality is affected by various factors such as the number of stages in the LFSRs, the bit-width, etc. [34]. Being another method with low computation overhead, the Wallace algorithm [34] guarantees correctness by the fact that a linear combination of Gaussian random numbers is still a Gaussian random number [54]. Nevertheless, the Wallace algorithm has two main drawbacks: first, an initial pool of Gaussian random numbers is needed, thereby adding memory storage requirements; second, the generated random numbers have correlations to some degree. We propose novel hardware implementations of the CLT-based and Wallace methods, to sufficiently exploit the advantages of these GRNG methods while mitigating their drawbacks. Moreover, a series of optimizations is performed to further reduce the hardware cost and meanwhile guarantee the high quality of the generated GRNs.
3 VIBNN ARCHITECTURE OVERVIEW

In this section, we present the overview of our VIBNN architecture. As shown in Figure 2, the architecture of the VIBNN accelerator consists of an external memory, on-chip memories, a computation block, a weight generator, and a global controller. The external memory initially stores the input features to the input layer and the weight parameters (µ, σ). These inputs and weights are then loaded into the on-chip memories for online inference. There are two types of on-chip memories: one stores the input features, the intermediate results, and the inference results; the other stores the weight parameters. As discussed in Section 2.1, the weights that actually participate in the variational inference process are calculated by equation (2). Hence, a weight generator that comprises a GRNG and a weight updater is needed: the GRNG generates the random numbers ϵ's, and the weight updater is responsible for implementing the weight updating equation. The computation block comprises groups of processing elements (PEs), each group consisting of a number of PEs. To improve memory-access efficiency, the memory distributor in the computation block collects the outputs of the PEs and performs the necessary operations before writing back to the on-chip memories. The PEs work in a time-multiplexed manner to perform all computations in the whole neural network. To correctly implement the overall functionality of VIBNN, the operations of all components are controlled by a global controller. Overall, the VIBNN architecture has three key benefits: scalability, portability, and memory-access efficiency.

[Figure 2: Overall architecture of VIBNN.]

4 BNN-ORIENTED GRNGS

In this section, we explore the design of GRNGs suitable for BNN hardware implementation. Specifically, we propose two parallel GRNG designs tailored for BNNs: 1) the RAM-based Linear Feedback GRNG (RLF-GRNG), which is based on a CLT-based method inspired by the binomial distribution. This method is capable of approximating a Gaussian distribution [10] when the sample size is large enough. Importantly, it can be effectively implemented using RAM with compact additional logics for indexing and computing; in addition, the control module can be shared among parallel GRNGs. 2) The BNN-oriented Wallace GRNG, which is based on the Wallace method. The proposed design substantially mitigates the method's two main drawbacks, i.e., a large initial pool requiring a large memory block, and multi-loop transformations that result in long latency, making the Wallace algorithm suitable for hardware BNNs.

4.1 RLF-GRNG Design

4.1.1 The Binomial Distribution Approximation Method. A binomial distribution, denoted as X ∼ B(n, p), is a discrete probability distribution of the number of successes in a sequence of n independent experiments, each with a boolean-valued outcome with probability of success p. The probability of getting exactly k successes in n trials is given by the following probability mass function:

    f(k; n, p) = Pr(X = k) = (n choose k) p^k (1 − p)^(n−k)    (7)

for k = 0, 1, 2, ..., n, where (n choose k) = n! / (k!(n−k)!).

If n is large enough, B(n, p) can reasonably approximate a normal distribution N(np, np(1 − p)). More specifically, n is considered to be large enough [53] if:

    n > 9 · (1 − p)/p   and   n > 9 · p/(1 − p)    (8)

As a simple example, the summation of n individual bits should follow a binomial distribution B(n, 0.5) if each bit has equal chances of being 0 or 1. According to equation (8), if n is greater than 18, this distribution can approximate a normal distribution N(0.5n, 0.25n). We adopt this method in the design of a GRNG. For example, an implementation can use a 128-bit Linear Feedback Shift Register (LFSR) for random bit generation and a Parallel Counter (PC) to convert the number of 1's in the LFSR to a binary number [3].
As a simple method to generate uniformly distributed (pseudo) random numbers, an LFSR [15] implements a well-chosen linear feedback function, allowing it to produce all binary combinations in a random order. When the depth of the LFSR (i.e., the number of registers) is large enough, e.g., 128, it takes 3000 years to cycle through when clocked at 80 MHz [3]. We believe that the degree of randomness is high enough when the number of registers is large. The LFSR implements the linear feedback function by shifting and updating the values in the taps using XOR gates. An 8-bit LFSR is shown in Figure 3(a). The taps for the 8-bit linear feedback function are 4, 5, and 6. The LFSR uses a fixed head location (register 1) and shifts its contents in each cycle. Registers at tap locations are updated with the XOR result of their left neighbor and the head. For each tap t:

    R(t) ← R(t + 1) XOR R(1)    (9)

where R(t) denotes the value of the t-th register in the LFSR, and R(1) is the value of the head.

[Figure 3: Examples of LFSR and RAM-based Linear Feedback: (a) 8-bit LFSR (register 1 is the head; registers 4, 5, and 6 are taps); (b) equivalent 8-bit RLF logic.]

By using a PC, the output of the LFSR can be converted to a binary number representing the number of 1's in the LFSR. The PC can be implemented using adders in a tree structure. However, the hardware cost for larger inputs can be huge, as a 127-input PC requires 120 full adders. This motivates the following RAM-based implementation.
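Before turning to the RAM-based implementation, a short software model of the plain LFSR-plus-counter approximation above may help; it is our own illustration, with a NumPy bit generator standing in for the LFSR state.

    import numpy as np

    def clt_gaussian_sample(n=128, rng=None):
        # Count the 1's in n equiprobable bits; the count follows B(n, 0.5),
        # which approximates N(0.5n, 0.25n) once n > 18 (equation (8)).
        if rng is None:
            rng = np.random.default_rng()
        bits = rng.integers(0, 2, size=n)            # software stand-in for an n-bit LFSR state
        ones = int(bits.sum())                       # what the parallel counter (PC) computes
        return (ones - 0.5 * n) / (0.25 * n) ** 0.5  # standardized, approximately unit Gaussian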

4.1.2 RAM-Based Linear Feedback Function. Despite being conceptually straightforward, the binomial distribution approximation method using an LFSR and a PC is not suitable for parallel implementation due to its high usage of registers and adders. The LFSR requires huge register resources. In addition, even though only the taps are updated in each iteration (the number of taps is always 3 for 4-bit to 2048-bit LFSRs [55]), the PC needs to accumulate all bits of the LFSR. Such a large PC not only requires a huge amount of hardware resources, but also leads to extra computation latency.

To overcome this challenge, we propose RLF, a compact RAM-based Linear Feedback function implementation, which incurs lower cost but achieves the same functionality as the LFSR. The design requires a much smaller PC that only calculates the summation of the tap values.

Figure 3(b) illustrates the operations using an 8-bit RAM-based Linear Feedback (RLF) logic. The RLF logic stores the seed in a fixed location while using a self-accumulating indexer, which changes in each cycle, to track the head and the taps to produce a pseudo shift operation. The updated results are fed back to the same locations. Compared to equation (9), for each tap t in the linear feedback function, the corresponding operation on the RLF logic is:

    x(h + t) ← x(h + t) XOR x(h)    (10)

where h is the current head location, and x(i) is the value stored in the i-th entry of the vector (seed memory).

As shown in Figure 4, the proposed RLF logic consists of four components: the seed memory (SeMem), the indexer, the updater, and the buffer register. The SeMem stores the seeds using several blocks of 2-port RAM. The length of the SeMem is the size of the seeds, and the word width of the SeMem is the number of parallel RLF logics. The buffer register caches the values of the taps and the head. The updater performs the XOR-based operations. The indexer stores the locations of the taps and the head, i.e., h and h + t for all taps in equation (10), and increments them every cycle. This organization allows: 1) efficient hardware utilization for parallel operations, because only one indexer is needed regardless of the number of RLF logics running in parallel; and 2) a very small PC to calculate the seed summation, because only the values of the taps, instead of all seeds, are output.

[Figure 4: Block diagram of the RLF logic.]

Based on the basic structure of the proposed RLF logic, we perform optimizations in two aspects: 1) the quality of the random numbers generated from the seeds, and 2) the seed storage scheme. To explain the ideas, we consider a 255-bit RLF logic for an 8-bit GRNG (each GRN uses an 8-bit representation). The taps for the 255-bit linear feedback function are 250, 252, and 253. According to equation (9), they need to be updated according to the following operations:

    x(h + 250) ← x(h + 250) XOR x(h)    (11a)
    x(h + 252) ← x(h + 252) XOR x(h)    (11b)
    x(h + 253) ← x(h + 253) XOR x(h)    (11c)

The index i is always less than or equal to 255, i.e., i ← i − 255 if i > 255. As the number of taps of the 255-bit linear feedback function is 3, the absolute difference between the output summations in two consecutive cycles cannot exceed three. This could affect the quality of the random numbers produced, consequently undermining the performance of the BNN. To address this issue, the implemented linear feedback is modified to increase the number of taps by combining two consecutive updates to create operations involving more bits, which are:

    x(h + 250) ← x(h + 250) XOR x(h)                    (12a)
    x(h + 251) ← x(h + 251) XOR x(h + 1)                (12b)
    x(h + 252) ← x(h + 252) XOR x(h)                    (12c)
    x(h + 253) ← (x(h + 253) XOR x(h)) XOR x(h + 1)     (12d)
    x(h + 254) ← x(h + 254) XOR x(h + 1)                (12e)

By combining two consecutive updates into a single one, the maximum absolute difference between the outputs in two cycles is increased to five. Accordingly, the index increment per cycle is increased from one to two. The taps for the updated linear feedback function run from 250 to 254. Therefore, each update needs to read seven entries (the head, the second head, and all taps) and write five entries (the taps). The read/write bandwidth can be greatly reduced by optimizing the buffer register.

As illustrated in Figure 5, the buffer register and the updater are designed to minimize the RAM access bandwidth. As the combined linear feedback function has five taps and two head locations, the buffer register is 7-bit. Since the index is increased by 2 in each iteration/cycle, the left 5 bits, which represent the taps, all shift left by 2 bits when being updated, while the right 2 bits are read from the RAM. The updated values of the left 2 bits are then written back to the RAM. Given this circulant operation, the updated value of the leftmost bit (h + 254) is the bit next to the head in the following iteration (as mod(h + 254 + 2, 255) = mod(h + 1, 255)). Therefore, aided by the buffer register and 2-port RAM, each iteration only requires 3 entries to read and 2 entries to write:

    Read: x(h), x(h + 250), and x(h + 251).
    Write: x(h + 253) and x(h + 254).

[Figure 5: Buffer register and updater for the 8-bit RLF-GRNG.]

To allow the above operations to be performed in one cycle on 2-port RAM, we propose a 3-block RAM storing scheme for the 255-bit seeds. As shown in Figure 6, the 255-bit seeds are separated evenly into three blocks based on the modulo-three results of their locations. Therefore, the five read/write operations on the RAM can be performed in one cycle on three 2-port RAM blocks. The indexer, which stores and updates the locations for each tap and the head, can be implemented as a simple state machine, shown in Figure 7(a), to track the block number (Block) and the relative position (Pos).

[Figure 6: A 3-block RAM storing scheme for the 255-bit seed: (a) seed storing scheme; (b) RAM operations.]

[Figure 7: (a) FSM for the indexer; (b) data flow graph of the RAM-based Linear Feedback GRNG (RLF-GRNG).]
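To make the combined updates concrete, the following simplified functional model (our own sketch, assuming the head starts at position 0) applies the updates of equations (12a)-(12e) to a 255-bit seed vector; the hardware tracks the popcount incrementally from the change in the tap summation, whereas this model recomputes the full popcount for clarity.

    import numpy as np

    def rlf_grng(seed_bits, num_outputs):
        # Functional model of the 255-bit RLF logic with the combined tap updates.
        x = np.array(seed_bits, dtype=np.uint8) & 1
        L = len(x)                                         # 255 for the 8-bit GRNG described above
        h = 0                                              # head location
        outputs = []
        for _ in range(num_outputs):
            x[(h + 250) % L] ^= x[h]                       # (12a)
            x[(h + 251) % L] ^= x[(h + 1) % L]             # (12b)
            x[(h + 252) % L] ^= x[h]                       # (12c)
            x[(h + 253) % L] ^= x[h] ^ x[(h + 1) % L]      # (12d)
            x[(h + 254) % L] ^= x[(h + 1) % L]             # (12e)
            h = (h + 2) % L                                # index increment of two per cycle
            outputs.append(int(x.sum()))                   # popcount, approximately N(L/2, L/4)
        return outputs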

4.1.3 Overall RLF-GRNG Design. The data flow graph of the proposed RAM-based Linear Feedback GRNG (RLF-GRNG) is shown in Figure 7(b). The RLF logic, as discussed before, contains a SeMem for seed storing, a buffer register for caching taps, an LF updater for updating taps, and an indexer to track tap locations. The RLF logic outputs a stream of updated taps to the PC, which accumulates the output stream. The previous tap summation is stored in the tap register, and their difference is obtained from the subtractor. Finally, the difference is accumulated onto the previous random number, which is stored in the result register, to produce a new output. Please note that the initial value stored in the result register is the summation of the initial values stored in the SeMem; therefore, this value can be pre-calculated and stored in memory.

Figure 8 shows the block diagram of the parallel RLF-GRNG, designed for the efficient implementation of BNNs. The Initialization ROM stores the initial summation results of the seeds for the GRNG. The SeMem stores all seeds for random number generation. To produce m n-bit random numbers in parallel, the size of the SeMem is 2^n − 1 words, where each word is m-bit. Each bit of the word read from the RAM is propagated to an LF-updater for tap update and random number calculation. The updated taps are collected from all LF-updaters and formed into one word to be written back to the SeMem. The results generated by every four LF-updaters are selected sequentially onto four outputs, in different orders through a multiplexer, for enhanced randomness. All select signals are shared and generated by the controller. The controller also produces indices and memory access signals for the SeMem, as well as command signals for all LF-updaters. Finally, a set of individual Gaussian random numbers is produced in parallel.

[Figure 8: Hardware block diagram of the 2-stage RLF-GRNG.]

4.2 BNN-Oriented Wallace GRNGs

4.2.1 The Wallace Method. The Wallace method relies on the property that linear combinations of Gaussian random numbers are still Gaussian distributed. The linear combinations are achieved by multiplying a vector of Gaussian random numbers x by a Hadamard matrix H: x' = H × x. Below is a typical Hadamard matrix:

        [ -1   1   1   1 ]
    H = [  1  -1   1   1 ]
        [ -1  -1   1  -1 ]
        [ -1  -1  -1   1 ]

Thus, the transformed random numbers are:

    x'[1] = t − x[1];   x'[2] = t − x[2];
    x'[3] = x[3] − t;   x'[4] = x[4] − t;    (13)

where x'[1]−x'[4] are the newly generated (pseudo) random numbers, x[1]−x[4] are the original random numbers fetched from the pool, and t = (1/2)(x[1] + x[2] + x[3] + x[4]). This approach does not require multiplication operations. In the Wallace algorithm, the original elements in x are randomly chosen from the pool, and after multiple loops of transformations, the newly generated random numbers are written back to random positions in the pool to replace the original elements. In this way, the size of the pool stays constant. However, the exact implementation in hardware costs another random number generator to produce the random positions, as well as extra delay to perform the multi-loop transformations. Besides, the initial pool should be large enough to ensure the randomness of the generated random numbers and the stability of (µ, σ). In Section 4.2.2, we propose a sharing and shifting scheme that is proven to be capable of overcoming the downsides of the algorithm.
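A minimal software reference for one 4-element Hadamard transformation and one pool update is sketched below; it is our simplification of the full algorithm (the actual Wallace method writes results back to random positions over multiple transformation loops), and the names are our own.

    import numpy as np

    def wallace_unit(x1, x2, x3, x4):
        # One 4-element Hadamard transformation, equation (13); in hardware the
        # division by 2 is a one-bit right shift.
        t = (x1 + x2 + x3 + x4) / 2
        return t - x1, t - x2, x3 - t, x4 - t

    def wallace_pool_step(pool, rng=None):
        # Transform four randomly chosen pool entries and write the results back in place.
        if rng is None:
            rng = np.random.default_rng()
        idx = rng.choice(len(pool), size=4, replace=False)
        pool[idx] = wallace_unit(*pool[idx])
        return pool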

4.2.2 Hardware Design and Optimization of the Wallace Method. Figure 9 illustrates the hardware design for performing one Hadamard transformation. The Pool Memory stores the pre-generated Gaussian random numbers, and four random numbers x[1]−x[4] are fetched from the pool in every clock cycle. A summation is then performed on the four random numbers, followed by a one-bit right shift to achieve the division by 2. The output of the shifter is the t mentioned in equation (13); t then participates in four parallel subtractions to generate the new random numbers x'[1]−x'[4]. This module is named the Wallace Unit, and it is the basic component in our proposed BNN-oriented Wallace GRNG (BNNWallace-GRNG).

[Figure 9: Architecture of the Wallace Unit.]

Even though a larger Hadamard matrix can be composed of smaller Hadamard matrices as H' = [H H; H −H], using the larger one implies that each new random number x'[i] is generated by computations involving all original random numbers. For instance, if we perform a Hadamard transformation using an 8 × 8 matrix, x'[1] = Σ_{i=1}^{n} x[i] − (x[1] + x[5]). Therefore, more hardware resources and a longer clock cycle are needed. Based on the above analysis, we implement the Wallace Unit using the 4 × 4 Hadamard matrix.

As introduced before, a large initial pool and multi-loop transforms are required to guarantee that the generated random numbers are highly random, less correlated, and stably distributed. These requirements, or downsides, of the Wallace method are substantially magnified when it is realized in hardware due to the limited resources. We introduce a sharing and shifting-based optimization to overcome these downsides. The key insight is to make the small pools of the Wallace Units work as a whole, so that the memory requirement on each pool is roughly divided by N if there are N Wallace Units. This is achieved by shifting the generated random numbers by one number before they are written back to the pool; Figure 10 illustrates the scheme. Due to the shifting, the written-back random numbers are partially generated by other Wallace Units. This strategy increases the randomness and decreases the correlations among the random numbers. Besides, by shifting, the generated random numbers flow through all Wallace Units, so all the small pools constitute one large pool. As a result, the stability of (µ, σ) is guaranteed. The experimental results in Section 6.1 demonstrate the effectiveness of BNNWallace-GRNG.

[Figure 10: Architecture of the BNNWallace-GRNG.]
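The sketch below gives one possible software reading of the sharing-and-shifting scheme; it is our interpretation under the assumption that the concatenated outputs of all units are rotated by one position before write-back, and it is not the RTL of BNNWallace-GRNG.

    import numpy as np

    def shared_wallace_step(pools, rng=None):
        # Every unit transforms four numbers from its own small pool; the concatenated
        # outputs are rotated by one before write-back, so each pool also receives a
        # value produced by a neighboring unit.
        if rng is None:
            rng = np.random.default_rng()
        picks, outs = [], []
        for pool in pools:                                  # one small pool per Wallace Unit
            idx = rng.choice(len(pool), size=4, replace=False)
            x1, x2, x3, x4 = pool[idx]
            t = (x1 + x2 + x3 + x4) / 2                     # Wallace Unit, equation (13)
            outs.extend([t - x1, t - x2, x3 - t, x4 - t])
            picks.append(idx)
        outs = np.roll(np.array(outs), 1)                   # shift the generated numbers by one
        for k, (pool, idx) in enumerate(zip(pools, picks)):
            pool[idx] = outs[4 * k: 4 * k + 4]              # write back partially "foreign" numbers
        return pools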
5 FPGA-BASED IMPLEMENTATION OF VIBNN

To implement the VIBNN architecture on resource-limited FPGAs, we explore a series of optimization strategies at three levels: the arithmetic-unit level, the PE level, and the system level, in addition to the optimized GRNGs.

5.1 PE Design

In VIBNN, a PE works as a neuron, with the architecture illustrated in Figure 11(a). The Multiplication and Accumulation (MAC) unit, with the architecture shown in Figure 11(b), calculates the dot-product of its input features and weights through multipliers and an adder tree. Typically, a neuron in FNNs has hundreds of inputs, but supporting that many inputs is infeasible for a single PE due to the limited resources on FPGAs. A large number of inputs per PE would also decrease the number of PEs that can be implemented on an FPGA, leading to a rigid system implementation and limited optimization space. Therefore, we choose to implement a PE with a reasonable number of inputs and use it in a time-multiplexed manner. Inside a PE, we use an accumulator to accumulate the partial dot-products. After some iterations, the accumulated value is sent to an adder to add the bias. Next, the biased dot-product is fed into the Rectified Linear Unit (ReLU) so that the activation result is obtained.

[Figure 11: (a) PE architecture; (b) MAC architecture.]
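The following behavioral sketch (ours, not the RTL; the 8-input width is an assumption taken from the configuration in Section 6.4) summarizes the time-multiplexed accumulate, bias, and ReLU sequence of one PE.

    import numpy as np

    def pe_neuron(inputs, weights, bias, pe_width=8):
        # Consume the inputs pe_width at a time (time multiplexing), accumulate the
        # partial dot-products, add the bias, and apply ReLU.
        acc = 0.0
        for start in range(0, len(inputs), pe_width):
            x = np.asarray(inputs[start:start + pe_width])
            w = np.asarray(weights[start:start + pe_width])
            acc += float(np.dot(x, w))          # what the multipliers and adder tree compute
        return max(acc + bias, 0.0)             # bias addition followed by ReLU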

5.2 Bit-Length Optimization

Since the arithmetic units are implemented in look-up tables (LUTs) on FPGAs, there is not much optimization space in the structures and implementation principles of those units. The most straightforward and feasible method is to optimize the bit-length of operands, e.g., the input and output sizes of the arithmetic units. Although previous works such as [23, 25, 51] leverage this technique, it is not trivial to investigate and apply it in VIBNN, because the introduced random noise may lead to a different conclusion. We select 16-bit as the starting point, which is widely adopted in hardware neural network implementations; then a binary search is used to figure out the smallest required bit-length that can guarantee acceptable accuracy. Optimization results are reported in Section 6.3.

5.3 Weight Generator

As shown in Figure 12, the Weight Generator consists of a GRNG, a Weight Updater, and an on-chip Weight Parameter Memory. The Weight Updater receives random numbers sampled by the GRNG, then generates a weight sample according to the corresponding variational parameters read from the memory. It applies the variational parameters (µ and σ) to the Gaussian numbers using multipliers and adders.

[Figure 12: Architecture of the Weight Generator.]

5.4 Joint Optimization of PE Size/Number and Memory Access

The throughput of a neural network accelerator is mainly determined by computation and communication [57]. In FNNs, the computations can be reduced by applying optimizations that reduce the links between neurons, such as weight pruning [25]. In our work, we focus on a different aspect. Specifically, from the computation aspect, the throughput can be improved by increasing computation parallelism, in the form of increasing the PE size or increasing the number of PEs. From the communication aspect, the throughput improvements come from reducing the memory traffic. The key problem is that the computation parallelism and the memory traffic are not completely independent factors, so we need to jointly consider the parallelism and the memory traffic to obtain the best throughput performance.

On-chip memories are divided into two categories according to their usage purposes: 1) IFMem (input feature memory), the memory that holds the input features as well as the activation outputs; and 2) WPMem (weight parameter memory), the memory that holds the weight parameters (µ, σ). The optimization techniques for reducing memory accesses to IFMems and WPMems need to be considered separately.

[Figure 13: Joint optimization architecture.]

5.4.1 Reducing IFMem Accesses. First, since our hardware-realized BNNs are aimed at FNN models and one PE corresponds to one neuron, any two PEs will have the same input features in a given cycle. Therefore, the word size of the IFMem should be B × N to avoid extra delays: all required input features can be sent to the PEs at each access to the IFMem. As illustrated in Figure 13, in our design, the whole network is constructed on the basis of PE-sets, and each set has S = N PEs. Thus the outputs of a PE set correspond to a word to be written back to the IFMem, and all the outputs from different PE sets are buffered in the Memory Distributor. To illustrate how the model works and its advantages, we define three additional parameters: 1) T, the number of PE sets; 2) MinIn, the minimum input size of a neuron in the target neural network to be realized; and 3) MaxWS, the maximum allowable word size of an on-chip memory. Their relations are derived as:

    T × S < ceil(MinIn / N)    (14a)
    B × N <= MaxWS             (14b)
    S = N                      (14c)
    M = T × S                  (14d)

where the function ceil(x) takes the ceiling of variable x, B is the bit-length of an operand, N is the input size of a PE, and M is the total number of PEs.

The advantages of the design are three-fold: 1) scalability: we can easily change the PE size N and the number of PE sets T to adapt to neural networks of various sizes, according to equation (14a); 2) portability: we can change the PE size N, the number of PE sets T, and even the bit-length B to deploy the design on various FPGA platforms, according to equations (14a) and (14b); 3) access efficiency: equations (14a)-(14d) guarantee that all the required inputs of the PEs can be read in one access to the IFMem, and the buffered activation results will be written back to the IFMem before the next set of results comes into the Memory Distributor.

Besides, we use two IFMems alternately to avoid any latent read and write conflicts. For instance, at a certain layer, if IFMem1 is used to access the input features, the activation results, which are also the input features for the next layer, will be written into IFMem2. At the next layer, we then switch the roles of IFMem1 and IFMem2.
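A minimal sketch of this double-buffering scheme is given below (our own illustration; the buffer and layer representations are placeholders, not the on-chip memory interface).

    def forward_all_layers(layers, input_features):
        # Two input-feature buffers alternate between serving reads for the current
        # layer and collecting that layer's activation outputs; roles swap every layer.
        if_mem = [list(input_features), []]      # stand-ins for IFMem1 and IFMem2
        src = 0                                  # index of the buffer currently holding the inputs
        for layer in layers:                     # each 'layer' is a callable producing activations
            if_mem[1 - src] = layer(if_mem[src]) # results land in the other buffer
            src = 1 - src                        # switch roles for the next layer
        return if_mem[src]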

5.4.2 Reducing WPMem Accesses. Since each input in the whole network has an individual weight, to read all weight parameters required at a certain clock cycle, the word size should be B × N × S × T. Here, we only consider the µ's for simplicity; the WPMems for storing the parameters σ are simply duplicated. However, the maximum allowable word size MaxWS restricts N, S, and T to dramatically small values, and the computation parallelism is hence restricted. Therefore, the optimization schemes for reducing IFMem accesses are not appropriate for WPMems. In VIBNN, we propose to use multiple distributed WPMems, so that each WPMem corresponds to a PE set, for maintaining a structural architecture and ease of control. In this way, the bandwidth for a WPMem is B × N × S. Therefore, equations (14a)-(14d) need to be rewritten as (15):

    T × S < ceil(MinIn / N)    (15a)
    B × N × S <= MaxWS         (15b)
    S = N                      (15c)
    M = T × S                  (15d)

where the function ceil(x) takes the ceiling of variable x, T is the number of PE sets, S is the number of PEs in a set, B is the bit-length of an operand, N is the input size of a PE, and M is the total number of PEs.

5.5 Deep Pipelining

As shown in Figure 14, the proposed design implements a two-tier pipeline structure. The first tier is placed between the weight generator and the PE, and holds the sampled weights. The second tier is inserted inside the weight updater and the PE to further reduce the system clock period. We insert DFFs between the GRNG and the weight updater. Each PE unit is optimized as a three-stage pipeline: the first stage calculates all multiplications; the next stage accumulates all internal products from the previous stage; and the ReLU is performed at the final stage.

[Figure 14: Pipeline structure of the proposed BNN.]

6 EXPERIMENTAL RESULTS AND ANALYSIS

In this section, we present the experimental results regarding the following aspects: (i) the effectiveness and hardware cost of the proposed RLF-GRNG and BNNWallace-GRNG; (ii) the training characteristics of FNNs and BNNs when they are trained with small data; (iii) the bit-length optimization; (iv) the hardware resource utilization of the proposed VIBNNs; and (v) the accuracy comparison between FNNs (software), BNNs (software), and our VIBNN (hardware) when they are applied to classification tasks using the MNIST dataset [31], the Parkinson Speech Dataset [43], the Diabetics Retinopathy Debrecen Dataset [5], the Thoracic Surgery Dataset [59], and the TOX21 Dataset [6].

6.1 Performance of Proposed GRNGs

The performance and hardware cost of the proposed GRNGs are tested and presented in this section. In addition, their advantages and disadvantages are listed and compared according to the experimental results.

First, we implement an RLF-GRNG with a 255-bit SeMem and a BNNWallace-GRNG with 8 Wallace Units, each of which has a pool of 256 initial random numbers. Their performance in terms of randomness and stability is then tested and compared with the software-based Wallace method with various initial pool sizes. In addition, to illustrate the effectiveness of the sharing and shifting scheme of the proposed BNNWallace-GRNG, a hardware-realized Wallace method with neither the sharing and shifting scheme nor multi-loop transformations (Wallace-NSS) is implemented and compared.

The results regarding stability are shown in Table 1. The distribution of the initial pool is sampled from the standard normal distribution, and the µ errors and σ errors are the absolute errors between the generated distributions and the standard normal distribution. From the table, it can be observed that the size of the initial pool significantly affects the stability of the generated distributions. The results suggest that the proposed BNNWallace-GRNG and RLF-GRNG are comparable with the software method with pool size 4096. Furthermore, both methods can generate multiple random numbers in parallel, which is a big improvement compared to the software method that can only generate 4 random numbers at a time. The memory savings of BNNWallace-GRNG is 2X, which can be further improved by sharing more Wallace Units, because the pool size of each Wallace Unit can be further reduced when more units participate in the sharing and shifting.

Table 1: Stability errors to (µ, σ) = (0, 1) of various Wallace designs

GRNG Design              | µ Error | σ Error
Software, 256 pool size  | 0.0012  | 0.3050
Software, 1024 pool size | 0.0010  | 0.0850
Software, 4096 pool size | 0.0004  | 0.0145
Hardware Wallace NSS     | 0.0013  | 0.4660
BNNWallace-GRNG          | 0.0006  | 0.0038
RLF-GRNG                 | 0.0006  | 0.0074

In addition to the stability test, we also perform a randomness test on the RLF-GRNG and the various Wallace methods; the results are shown in Figure 15. Each method is used to generate 100,000 random numbers, which are tested using the runstest function in MATLAB. The same test is repeated 1000 times and the rates of passed tests are collected. It can be concluded that the pool size has an insignificant impact on the randomness of the newly generated random numbers. It is worth noting, however, that our proposed BNNWallace-GRNG is comparable with the software-based methods, while the hardware version that has neither the sharing and shifting scheme nor multi-loop transforms fails to pass any randomness test. This result again verifies the effectiveness of our proposed GRNGs.

[Figure 15: Randomness test pass rates of the software-based Wallace method and BNNWallace with various pool sizes.]

Table 2 presents the hardware performance comparison of the RLF-GRNG and the BNNWallace-GRNG. Both yield highly efficient hardware usage for the 64-parallel Gaussian random number generation task. However, the RLF-GRNG utilizes less memory, while the BNNWallace-GRNG requires fewer ALM resources. Given its shallow PC to accumulate taps and its less complex computations, the RLF-GRNG is capable of operating at a higher frequency.

Table 2: Hardware utilization and performance comparison between RLF-GRNG and the Wallace-based GRNG for the 64-parallel Gaussian random number generation task

Metric                  | RLF-GRNG                 | BNNWallace-GRNG
Total ALMs              | 831/113,560 (<1%)        | 401/113,560 (<1%)
Total registers         | 1780                     | 1166
Total block memory bits | 16,384/12,492,800 (<1%)  | 1,048,576/12,492,800 (8%)
Total RAM blocks        | 3/1220 (<1%)             | 103/1220 (8%)
Power (mW)              | 528.69                   | 560.25
Clock frequency (MHz)   | 212.95                   | 117.63

As presented in Table 3, both the RLF-GRNG and the BNNWallace-GRNG have certain limitations. The RLF-GRNG offers relatively limited generality, as different sizes require individual optimization. In addition, the fact that the RAM width required for the RLF-GRNG is exponential in the bit-length restricts its application to high-bit-length Gaussian variable sampling. However, for 8-bit applications, the RLF-GRNG is a highly memory-efficient and fast solution. The RLF-GRNG is also much more power efficient than the BNNWallace-GRNG, as its power consumption is about the same while it operates at a much higher frequency. For parallel random number generation, the RLF-GRNG performs better in terms of hardware resource utilization, given the limited SeMem size needed to implement one extra output.

Table 3: RLF-GRNG and BNNWallace-GRNG comparison

              | RLF-GRNG                                                 | BNNWallace-GRNG
Advantages    | low memory usage; high frequency; high power efficiency  | adjustable distribution; high scalability; low ALM and register usage
Disadvantages | low scalability                                          | high latency

6.2 Small Data

To demonstrate the performance on small data, we test both the BNN and the FNN on the MNIST dataset [31]. The MNIST dataset contains 60K images for training and 10K images for testing. Each image is a 28 × 28-pixel gray-scale bitmap of a handwritten digit. The task is to classify the images into ten classes (from "0" to "9"). The implemented BNN has 784 inputs, 10 outputs, and 2 hidden layers, each with 200 neurons. All layers are fully connected (784−200−200−10). We randomly choose a fraction of the training data to train a BNN and an FNN with the same structure and test their performance. Figure 16 shows the accuracy comparison between the FNN and the BNN when training with a fraction of the data, ranging from 1/256 of the training set to the entire training set, and suggests that the BNN performs much better as the training data size shrinks. Detailed convergence performance is shown in Figure 17.

[Figure 16: Accuracy comparison between FNN and BNN with a fraction of the training data.]

[Figure 17: Training accuracy comparison between FNN and BNN with a fraction of the training data.]

6.3 Bit-Length Optimization Results

According to our experiments in software, the BNNs can achieve 98.1% test accuracy on the image classification task using the MNIST dataset, so we set 97.5% as the threshold accuracy in selecting a proper bit-length. Figure 18 shows the test accuracy when the bit-length is set to different values, and we can observe that 8-bit is the smallest bit-length that can reach the accuracy threshold.

[Figure 18: Bit-length vs. test accuracy.]

6.4 Hardware Resources Usage

We implement the proposed design of VIBNN on an Altera Cyclone V FPGA (Model Number 5CGTFD9E5F35C7). We opt for a 16-PE-set parallel design, each set with eight 8-input PEs. The resource utilization results are shown in Table 4. It can be observed that both VIBNNs have high throughput and energy efficiency; the RLF-based VIBNN provides even higher energy efficiency, while the BNNWallace-based VIBNN has higher design flexibility, as explained in Section 6.1.

To demonstrate the efficiency of the proposed design, we compare the proposed implementation against software implementations on a GPU (Nvidia GTX 1070) and a CPU (Intel i7-6700k) on the MNIST dataset. As shown in Table 5, the proposed design is 283× more energy efficient than the Nvidia GTX 1070 and 458× more energy efficient than the Intel i7-6700k processor.

Table 4: Summary of FPGA resource utilization

Type                    | RLF-based Network            | BNNWallace-based Network
Total ALMs              | 98,006/113,560 (86.3%)       | 91,126/113,560 (80.2%)
Total DSPs              | 342/342 (100%)               | 342/342 (100%)
Total registers         | 88,720                       | 78,800
Total block memory bits | 4,572,928/12,492,800 (36.6%) | 4,880,128/12,492,800 (39.1%)

Table 5: Performance comparison on the MNIST dataset

Configuration                        | Throughput (Images/s) | Energy Efficiency (Images/J)
Intel i7-6700k                       | 10,478.1              | 115.1
Nvidia GTX 1070                      | 27,988.1              | 186.6
RLF-based FPGA implementation        | 321,543.4             | 52,694.8
BNNWallace-based FPGA implementation | 321,543.4             | 37,722.1

6.5 Accuracy: Classification Tasks

In this work, we test the classification performance of the BNNs with several classification tasks on both software and hardware platforms, comparing with the performance of FNNs. The first task is the image classification on the MNIST dataset. As discussed in the previous subsection, the network we use has two hidden layers of neurons apart from the 784 (28 × 28) inputs and 10 outputs. Compared with the 8-bit fixed-point representation used in hardware, the BNN implementation in software uses 32-bit floating point for all operations. Results of the accuracy testing are reported in Table 6. The BNN implemented in software has higher classification accuracy than the FNN with dropout applied. The proposed implementation of the BNN on the FPGA degrades accuracy by only 0.29% compared to its software model.

Table 6: Accuracy comparison on the MNIST dataset

Model                  | Testing Accuracy
FNN+Dropout (Software) | 97.50%
BNN (Software)         | 98.10%
VIBNN (Hardware)       | 97.81%

To further evaluate accuracy performance, we test on several disease diagnosis datasets, including the Parkinson Speech Dataset [43], the Diabetics Retinopathy Debrecen Dataset [5], the Thoracic Surgery Dataset [59], and part of the TOX21 dataset [6], in which multiple features of chemicals are given to detect toxic compounds. For the Parkinson Speech Dataset, we relocate part of the training data for testing to create a small-data training scenario (shown as Parkinson Speech Dataset (Modified) in Table 7). As shown in Table 7, BNNs can achieve similar or better accuracy compared to FNNs, especially for small datasets. The accuracy of VIBNN degrades very little compared to the software implementation of the BNN.

Table 7: Accuracy comparison on classification tasks

Dataset                                | FNN (Software) | BNN (Software) | VIBNN (Hardware)
Parkinson Speech Dataset (Modified)    | 60.28%         | 95.68%         | 95.33%
Parkinson Speech Dataset (Original)    | 85.71%         | 95.23%         | 94.67%
Diabetics Retinopathy Debrecen Dataset | 70.56%         | 75.76%         | 75.21%
Thoracic Surgery Dataset               | 76.69%         | 82.98%         | 82.54%
TOX21: NR.AhR                          | 91.10%         | 90.42%         | 90.11%
TOX21: SR.ARE                          | 83.41%         | 83.24%         | 83.01%
TOX21: SR.ATAD5                        | 93.36%         | 94.05%         | 93.67%
TOX21: SR.MMP                          | 89.69%         | 88.76%         | 88.43%
TOX21: SR.P53                          | 91.88%         | 93.33%         | 92.87%

7 CONCLUSION

In this paper, we propose VIBNN, an FPGA-based hardware accelerator design for variational inference on BNNs. We explore the design space for the massive amount of Gaussian variable sampling tasks in BNNs. Specifically, we introduce two high-performance parallel Gaussian random number generators: 1) the RAM-based Linear Feedback Gaussian Random Number Generator (RLF-GRNG), which is inspired by the properties of the binomial distribution and linear feedback logics; and 2) the Bayesian Neural Network-oriented Wallace Gaussian Random Number Generator. To achieve high scalability and efficient memory access, we propose a deep pipelined accelerator architecture with fast execution and good hardware utilization. The proposed VIBNN achieves a high throughput of 321,543.4 Images/s and a high energy efficiency of up to 52,694.8 Images/J. Experimental results suggest that the proposed VIBNN can achieve similar accuracy performance as software-implemented BNNs.

ACKNOWLEDGMENTS

This paper is in part supported by National Science Foundation CNS-1704662, CCF-1717754, CCF-1750656, and the Defense Advanced Research Projects Agency (DARPA) SAGA program.
REFERENCES

[1] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. 2014. Analyzing the performance of multilayer neural networks for object recognition. In European Conference on Computer Vision. Springer, 329–344.
[2] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, and Gi-Joon Nam. 2015. TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 34, 10 (2015), 1537–1557.
[3] R. Andraka and R. Phelps. 1998. An FPGA based processor yields a real time high fidelity radar environment simulator. In Military and Aerospace Applications of Programmable Devices and Technologies Conference. 220–224.
[4] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. 2003. An introduction to MCMC for machine learning. Machine Learning 50, 1-2 (2003), 5–43.
[5] Bálint Antal and András Hajdu. 2014. An ensemble-based system for automatic screening of diabetic retinopathy. Knowledge-Based Systems 60 (April 2014), 20–27. https://doi.org/10.1016/j.knosys.2013.12.023
[6] Matias S. Attene-Ramos, Nicole Miller, Ruili Huang, Sam Michael, Misha Itkin, Robert J. Kavlock, Christopher P. Austin, Paul Shinn, Anton Simeonov, and Raymond R. Tice. 2013. The Tox21 robotic platform for the assessment of environmental chemicals – from vision to reality. Drug Discovery Today 18, 15 (2013), 716–723.
[7] J. D. Beasley and S. G. Springer. 1985. The percentage points of the normal distribution. Algorithm AS 111 (1985).
[8] David M. Blei, Thomas L. Griffiths, and Michael I. Jordan. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM) 57, 2 (2010), 7.
[9] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424 (2015).
[10] George E. P. Box, William Gordon Hunter, and J. Stuart Hunter. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Vol. 1. JSTOR.
[11] Michael Braun and Jon McAuliffe. 2010. Variational inference for large-scale models of discrete choice. J. Amer. Statist. Assoc. 105, 489 (2010), 324–335.
[12] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, and Ninghui Sun. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 609–622.
[13] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52, 1 (2017), 127–138.
[14] Dan C. Cireşan, Alessandro Giusti, Luca M. Gambardella, and Jürgen Schmidhuber. 2013. Mitosis detection in breast cancer histology images with deep neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 411–418.
[15] R. David. 1980. Testing by feedback shift register. IEEE Trans. Comput. 29, 7 (July 1980), 668–673. https://doi.org/10.1109/TC.1980.1675641
[16] Misha Denil, Babak Shakibi, Laurent Dinh, and Nando de Freitas. 2013. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems. 2148–2156.
[17] Thomas G. Dietterich. 2000. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems. Springer, 1–15.
[18] Clément Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello, and Yann LeCun. 2011. NeuFlow: A runtime reconfigurable dataflow processor for vision. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on. IEEE, 109–116.
[19] Meire Fortunato, Charles Blundell, and Oriol Vinyals. 2017. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798 (2017).
[20] Yarin Gal and Zoubin Ghahramani. 2015. Bayesian convolutional neural networks with Bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158 (2015).
[21] Zoubin Ghahramani and Matthew J. Beal. 2001. Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems. 507–513.
[22] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[23] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. 2016. Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168 (2016).
[24] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 243–254.
[25] Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[26] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. 2016. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems. 1109–1117.
[27] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. 2014. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866 (2014).
[28] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning 37, 2 (1999), 183–233.
[29] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, and Al Borchers. 2017. In-datacenter performance analysis of a tensor processing unit. arXiv preprint arXiv:1704.04760 (2017).
[30] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-normalizing neural networks. arXiv preprint arXiv:1706.02515 (2017).
[31] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. 2010. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010).
[32] Yong Liu and Xin Yao. 1999. Ensemble learning via negative correlation. Neural Networks 12, 10 (1999), 1399–1404.
[33] David J. C. MacKay. 1992. Bayesian methods for adaptive models. Ph.D. Dissertation. California Institute of Technology.
[34] Jamshaid Sarwar Malik and Ahmed Hemani. 2016. Gaussian random number generation: A survey on hardware architectures. ACM Computing Surveys (CSUR) 49, 3 (2016), 53.
[35] George Marsaglia and Wai Wan Tsang. 1984. A fast, easily implemented method for sampling from decreasing or symmetric unimodal density functions. SIAM Journal on Scientific and Statistical Computing 5, 2 (1984), 349–359.
[36] Michael Mathieu, Mikael Henaff, and Yann LeCun. 2013. Fast training of convolutional networks through FFTs. arXiv preprint arXiv:1312.5851 (2013).
[37] Mervin E. Muller. 1958. An inverse method for the generation of random normal deviates on large-scale computers. Mathematical Tables and Other Aids to Computation 12, 63 (1958), 167–174.
[38] Mervin E. Muller. 1959. A comparison of methods for generating normal deviates on digital computers. Journal of the ACM (JACM) 6, 3 (1959), 376–383.
[39] Radford M. Neal. 2012. Bayesian Learning for Neural Networks. Vol. 118. Springer Science & Business Media.
[40] Eriko Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang, Jason Ong Gee Hock, Yeong Tat Liew, Krishnan Srivatsan, Duncan Moss, and Suchit Subhaschandra. 2017. Can FPGAs beat GPUs in accelerating next-generation deep neural networks?. In FPGA. 5–14.
[41] Kalin Ovtcharov, Olatunji Ruwase, Joo-Young Kim, Jeremy Fowers, Karin Strauss, and Eric S. Chung. 2015. Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper 2, 11 (2015).
[42] Ao Ren, Zhe Li, Caiwen Ding, Qinru Qiu, Yanzhi Wang, Ji Li, Xuehai Qian, and Bo Yuan. 2017. SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 405–418.
[43] B. E. Sakar, M. E. Isenkul, C. O. Sakar, A. Sertbas, F. Gurgen, S. Delil, H. Apaydin, and O. Kursun. 2013. Collection and analysis of a Parkinson speech dataset with multiple types of sound recordings. IEEE Journal of Biomedical and Health Informatics 17, 4 (July 2013), 828–834. https://doi.org/10.1109/JBHI.2013.2245674
[44] Jürgen Schmidhuber. 2015. Deep learning in neural networks: An overview. Neural Networks 61 (2015), 85–117.
[45] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[46] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 16–25.
[47] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[48] Yee Whye Teh and Michael I. Jordan. 2010. Hierarchical Bayesian nonparametric models with applications. Bayesian Nonparametrics 1 (2010).
[49] Daniel Teichroew. 1953. Distribution sampling with high speed computers. Ph.D. Dissertation. North Carolina State College.
[50] Jonathan L. Ticknor. 2013. A Bayesian regularized artificial neural network for stock market forecasting. Expert Systems with Applications 40, 14 (2013), 5501–5506.
[51] Ganesh Venkatesh, Eriko Nurvitadhi, and Debbie Marr. 2017. Accelerating deep convolutional networks using low-precision and sparsity. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2861–2865.
[52] Martin J. Wainwright and Michael I. Jordan. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1, 1-2 (2008), 1–305.
[53] Helen M. Walker. 1985. De Moivre on the law of normal probability.
Smith, David [57] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. Eugene. A Source Book in Mathematics. Dover. ISBN 0-486-64690-4 (1985). 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural [54] Chris S Wallace. 1996. Fast pseudorandom generators for normal and exponential Networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on variates. ACM Transactions on Mathematical Software (TOMS) 22, 1 (1996), 119– Field-Programmable Gate Arrays. ACM, 161–170. 127. [58] Cha Zhang and Yunqian Ma. 2012. Ensemble machine learning: methods and [55] Roy Ward and Tim Molteno. 2007. Table of linear feedback shift registers. applications. Springer. Datasheet, Department of Physics, University of Otago (2007). [59] Maciej Zikeba, Jakub M Tomczak, Marek Lubicz, and Jerzy ’Swikatek. 2013. [56] Wei Wen, Chunpeng Wu, Yandan Wang, Kent Nixon, Qing Wu, Mark Barnell, Boosted SVM for extracting rules from imbalanced data in application to predic- Hai Li, and Yiran Chen. 2016. A new learning method for inference accuracy, tion of the post-operative life expectancy in the lung cancer patients. Applied core occupation, and performance co-optimization on TrueNorth chip. In Design Soft Computing (2013). https://doi.org/[WebLink]