A Bias-Variance Decomposition for Bayesian Deep Learning
James Brofos* (Yale University) · Rui Shu (Stanford University) · Roy R. Lederman (Yale University)

Abstract

We exhibit a decomposition of the Kullback-Leibler divergence into terms corresponding to bias, variance, and irreducible error. Our particular focus in this work is Bayesian deep learning, and in this domain we illustrate the application of this decomposition to adversarial example identification, to image segmentation, and to malware detection. We empirically demonstrate qualitative similarities between the variance decomposition and mutual information.

1 Introduction

Bias-variance decompositions play an important role in the theory and practice of statistics. They can be used to understand the behavior of estimators, to deduce optimality criteria, and to diagnose systemic abnormalities arising from under- and over-fitting. Bias-variance decompositions can also be leveraged to characterize unusual observations based on their uncertainty characterization. Perhaps the most well-known bias-variance decomposition is the following.

Example 1.1 (Mean Squared Error Decomposition). Let $y = \bar{y} + \epsilon$ where $\mathbb{E}\,\epsilon = 0$, $\mathrm{Var}(\epsilon) = \sigma^2$, and $\bar{y} \in \mathbb{R}$. Let $\hat{y}_\theta$ be a predictor of $y$, with the parameters of the predictor distributed as $\theta \sim \pi$. We note that this characterization comprises a broad class of inference problems, including the case where $y$ and $\hat{y}_\theta$ depend additionally on a set of features, as in regression problems. In that case, $\theta$ can be the weights of a neural network and the predictor is also a function of the network's input. The mean squared error of $\hat{y}_\theta$ and $y$ then decomposes as
$$\mathbb{E}_{\theta}\,\mathbb{E}\,(y - \hat{y}_\theta)^2 = (\bar{y} - \mathbb{E}_{\theta}\,\hat{y}_\theta)^2 + \mathrm{Var}_{\theta}(\hat{y}_\theta) + \sigma^2 \tag{1}$$
The first term is the squared "bias" between the true mean and the predictive mean, the second is the variance of the prediction, and the third is the irreducible error (noise inherent to the generative process for $y$ that could never be eliminated).

Notice that in this decomposition there is a natural sense in which $\tilde{y} = \mathbb{E}_{\theta}\,\hat{y}_\theta$ is a "centroid" of the random variable $\hat{y}_\theta$ under the squared error loss function; it is well known that the expectation of a random variable minimizes the expected squared distance to a random draw. The bias therefore captures the squared distance from the true mean to the centroid, while the variance captures the average squared distance of a random $\hat{y}_\theta$ to $\tilde{y}$. This notion of a centroid will be relevant to our decomposition.

We consider the intersection of the bias-variance decomposition with the field of Bayesian deep learning. Although the bias-variance decomposition contained in this work is not specific to Bayesian deep learning, we believe there is an opportunity to exhibit the benefits of a bias-variance decomposition in applications involving Bayesian neural networks. In contrast to point-estimate neural network models, the promise of Bayesian deep learning is to provide uncertainty estimates for parameters. The principal difficulty in utilizing Bayesian methods in deep learning is the intractability of the posterior; we therefore leverage a form of variational inference in order to efficiently generate approximate samples from the neural network posterior.

*Approved for Public Release; Distribution Unlimited. Public Release Case Number 19-2139. © 2019 The authors and The MITRE Corporation.

4th Workshop on Bayesian Deep Learning (NeurIPS 2019), Vancouver, Canada.
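Equation (1) is straightforward to check by simulation. The following minimal Python sketch is our own illustration rather than part of the paper; the Gaussian predictor $\hat{y}_\theta = \theta$ and all numerical constants are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
y_bar, sigma = 2.0, 0.5      # true mean and noise scale (assumed)
n = 200_000                  # Monte Carlo sample size

# Hypothetical predictor y_hat_theta = theta with theta ~ N(1.8, 0.3^2),
# so the predictor is biased (E y_hat = 1.8 != y_bar) with variance 0.09.
theta = rng.normal(1.8, 0.3, size=n)
y = y_bar + rng.normal(0.0, sigma, size=n)

mse = np.mean((y - theta) ** 2)           # left-hand side of (1)
bias_sq = (y_bar - theta.mean()) ** 2     # squared bias term
variance = theta.var()                    # predictor variance term
# The two printed values should agree up to Monte Carlo error (about 0.38).
print(mse, bias_sq + variance + sigma ** 2)
```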
In this work, we seek to propagate this uncertainty characterization through a decision-theoretic lens in order to obtain a general bias-variance decomposition for Bayesian deep learning. An objective in the empirical assessment portion of this work is to assess the usefulness of the bias-variance decomposition on a variety of compelling, real-world applications.

The organization of the paper is as follows. In section 3 we first establish the objective and methods of Bayesian deep learning. Thereafter, we review the pertinent theory and describe, using information-theoretic techniques, a bias-variance characterization that is suitable for Bayesian deep learning, with particular focus on classification tasks. This section includes multiple examples that demonstrate the character of the proposed decomposition. Section 2 examines related research on general bias-variance decompositions. In section 4 we examine the application of the bias-variance decomposition in several domains: (i) we demonstrate the usefulness of the decomposition for characterizing unusual and adversarial inputs; (ii) we apply the bias-variance decomposition to a per-pixel image segmentation task and show the effect of contraction of the posterior on the decomposition; (iii) we utilize the bias-variance decomposition to examine software executables and illustrate the regimes of high-uncertainty predictions and evasive malware. In section 5 we conclude and offer some perspective on future work.

2 Related Work

The most general treatment of bias-variance decompositions is [8], which considers Bregman divergences. Such a general approach is appealing, but suffers from the immediate drawback that a notion of a "centroid" for the Bregman divergence may not have a convenient form. [5] also develops a bias-variance decomposition for general loss functions, but breaks with the tradition that loss should equal the sum of bias and variance terms, and specializes to classification tasks where the loss is computed as a function of classification decisions (in contrast, we adopt a distribution-theoretic approach by seeking to estimate a distribution over classes). The bias-variance decomposition we derive in corollary 3.2 arises as a special case of theorem 3.1; however, corollary 3.2 has been derived directly in [2]. We see an opportunity for a rigorous empirical evaluation of this bias-variance decomposition, which is the primary purpose of this work.

3 Theory

In this section we introduce theory for Bayesian deep learning and the information-theoretic treatment of the bias-variance decomposition. In section 4 we combine these frameworks in applications to several domains.

3.1 Bayesian Deep Learning

Bayesian deep learning offers a probabilistic alternative to neural networks estimated via maximum a posteriori or maximum likelihood. Let $\theta$ denote the parameters of a neural network; given a prior $\pi(\theta)$, Bayesian deep learning infers a posterior
$$\pi(\theta \mid \mathbf{Y}, \mathbf{X}) \propto L(\mathbf{Y} \mid \mathbf{X}, \theta)\, \pi(\theta) \tag{2}$$
where $L$ is the likelihood function. This work focuses on classification tasks and takes the log-likelihood to be the negative categorical cross-entropy. For regression tasks a natural choice of likelihood function is the exponentiated negative mean squared error. With the Bayesian neural network posterior we can compute the posterior predictive distribution of $y^*$ given $x^*$:
$$f(y^* \mid x^*) = \int f_\theta(y^* \mid x^*)\, \pi(\theta \mid \mathbf{Y}, \mathbf{X})\, d\theta \tag{3}$$
This has the effect of integrating over uncertainty in the network weights.
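In practice the integral in equation (3) is approximated by averaging the predictive distributions obtained from posterior samples of the weights. The sketch below is our own illustration, not code from the paper; `posterior_predictive`, `predict_probs`, the toy softmax model, and `theta_samples` are hypothetical stand-ins for a real Bayesian network and its (approximate) posterior draws.

```python
import numpy as np

def posterior_predictive(x, theta_samples, predict_probs):
    """Monte Carlo estimate of the posterior predictive (3):
    f(y*|x*) ~= (1/S) sum_s f_{theta_s}(y*|x*), with theta_s drawn from
    the (approximate) posterior, e.g. via Monte Carlo dropout."""
    probs = np.stack([predict_probs(theta, x) for theta in theta_samples])
    return probs.mean(axis=0)

# Toy stand-in "network": three classes whose logits are x * theta.
def predict_probs(theta, x):
    logits = x * theta                   # shape (3,)
    z = np.exp(logits - logits.max())
    return z / z.sum()                   # softmax over classes

rng = np.random.default_rng(0)
theta_samples = rng.normal(0.0, 1.0, size=(50, 3))  # 50 posterior draws
print(posterior_predictive(1.5, theta_samples, predict_probs))
```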
The principal difficulty in Bayesian deep learning, as in other Bayesian methods, is that the normalizing constant of posterior (2) is intractable, requiring a $|\theta|$-dimensional integral over the parameter space. In the numerical examples in this work, we adopt an approximate inference procedure called Monte Carlo dropout. This procedure approximates the true posterior by multiplying the per-layer weight matrices by diagonal matrices consisting of i.i.d. Bernoulli random variables with failure probability $p$.

3.2 A Bias-Variance Decomposition

In this work we will use $D(P \| Q)$ to denote the KL-divergence between probability measures $P$ and $Q$. We use the notation $f_\theta$ to denote a probability distribution depending on (possibly random) parameters $\theta$.

Theorem 3.1 (Bias-Variance-Irreducible Error Decomposition of the KL-Divergence). Let $\theta \sim \pi$ and $\omega \sim \pi'$, and let $f_\theta$ be a distribution with the same support as a random probability distribution $P_\omega$ for all $\theta$ and $\omega$. Then, if $Y \sim \bar{P}$,
$$\mathbb{E}_{\omega \sim \pi'}\,\mathbb{E}_{\theta \sim \pi}\, D(P_\omega \| f_\theta) = \underbrace{D(\bar{P} \| \tilde{P})}_{\text{``squared bias''}} + \underbrace{\mathbb{E}_{\theta \sim \pi}\, D(\tilde{P} \| f_\theta)}_{\text{``variance''}} + \underbrace{I(Y; \omega)}_{\text{``irreducible error''}} \tag{4}$$
where $\bar{P}$ is the marginal
$$\bar{P}(\cdot) = \mathbb{E}_{\omega \sim \pi'}\, P_\omega(\cdot) \tag{5}$$
and where
$$\tilde{P}(\cdot) \propto \exp\left( \mathbb{E}_{\theta \sim \pi} \log f_\theta(\cdot) \right) \tag{6}$$
is the "geometric mixture" of $f_\theta$ over $\theta \sim \pi$.

The proof of theorem 3.1 relies on the "forward" and "backward" compensation identities in information theory; see appendix B. The first term of the decomposition is naturally regarded as the (squared) bias because it measures the closeness, in terms of KL-divergence, of the target distribution $\bar{P}$ to the geometric mixture; theorem 3.4 shows why the geometric mixture has a meaningful interpretation as a centroid of the posterior. The second term in the decomposition has an interpretation as the variance because it measures the average divergence of a random $f_\theta$ from the geometric mixture centroid. Under this interpretation, both the bias and the variance are measured in units of the logarithmic base, which in these experiments are natural units (nats). An important special case of theorem 3.1 arises when $P \equiv P_\omega$, so that there is no randomness in the first argument of the KL-divergence on the left-hand side.

Corollary 3.2 (Brinda, Klusowski, and Yang [2]). Suppose, in addition, that $P \equiv P_\omega$. Then there is no irreducible error and the bias-variance decomposition becomes
$$\mathbb{E}_{\theta \sim \pi}\, D(P \| f_\theta) = D(P \| \tilde{P}) + \mathbb{E}_{\theta \sim \pi}\, D(\tilde{P} \| f_\theta) \tag{7}$$
Corollary 3.2 can be derived directly from the reverse compensation identity.

3.3 The Geometric Mixture Centroid

The geometric mixture $\tilde{P}$ does not feature prominently in machine learning, so it is worth recalling some of the properties it exhibits. We begin by offering some results that motivate why one should consider the geometric mixture at all. The following results are due to [2, 3] and a proof is contained in appendix B.

Theorem 3.3 (The Geometric Mixture is a Bayes Rule). Consider the KL-divergence loss $L(\theta; P(\Lambda)) = D(P(\Lambda) \| f_\theta)$, where $\Lambda$ is data and $P(\Lambda)$ is an estimator of $f_\theta$.
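Returning to corollary 3.2, the identity in equation (7) can be verified numerically for categorical distributions. The sketch below is an illustration under our own assumptions, not the paper's code: `geometric_mixture` is our name for the normalized form of equation (6), and the posterior predictive draws are simulated as Dirichlet samples rather than produced by a Bayesian network.

```python
import numpy as np

def kl(p, q):
    """KL-divergence D(p || q) for strictly positive probability vectors."""
    return np.sum(p * (np.log(p) - np.log(q)))

def geometric_mixture(F):
    """Equation (6) for categorical f_theta: each row of F is a probability
    vector drawn from the posterior; normalize exp(mean of log f_theta)."""
    log_tilde = np.log(F).mean(axis=0)
    tilde = np.exp(log_tilde - log_tilde.max())   # stabilized exponentiation
    return tilde / tilde.sum()

rng = np.random.default_rng(0)
F = rng.dirichlet(np.ones(5), size=100)   # 100 simulated predictive draws
P = rng.dirichlet(np.ones(5))             # a fixed target distribution
P_tilde = geometric_mixture(F)

lhs = np.mean([kl(P, f) for f in F])            # E_theta D(P || f_theta)
bias = kl(P, P_tilde)                           # D(P || P~)
var = np.mean([kl(P_tilde, f) for f in F])      # E_theta D(P~ || f_theta)
print(np.allclose(lhs, bias + var))             # True: equation (7) holds
```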