
Back-Propagation: From Fully Connected to Convolutional Layers

Shuiwang Ji
Texas A&M University
College Station, TX 77843
[email protected]

Yaochen Xie
Texas A&M University
College Station, TX 77843
[email protected]

1 Introduction

This note provides an introduction to back-propagation (BP) for convolutional layers. We derive the back-propagation for the convolution from the general case and further show that the convolutional layer is a particular case of the fully-connected layer. This document is based on lecture notes by Shuiwang Ji at Texas A&M University and can be used for undergraduate and graduate level classes.

2 Back-Propagation in Fully Connected Layers

2.1 Forward-Propagation and Back-Propagation in General

In a general $L$-layer stacked neural network, we let $\theta_k \in \mathbb{R}^{\tilde d_k}$ denote the parameters (formatted as column vectors) at layer $k$, and $x^{(k)} \in \mathbb{R}^{d_k}$ denote the outputs (formatted as column vectors) of layer $k$. The relationship between the input $x =: x^{(0)}$ and the final output (hypothesis) $h(x) = x^{(L)}$ can be represented by a composition of a series of functions

$$
f_{\theta_1}: \mathbb{R}^{d_0} \to \mathbb{R}^{d_1}, \quad f_{\theta_2}: \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}, \quad \cdots, \quad f_{\theta_L}: \mathbb{R}^{d_{L-1}} \to \mathbb{R}^{d_L},
$$
as

$$
h(x; \theta_1, \theta_2, \cdots, \theta_L) = f_{\theta_L}(\cdots(f_{\theta_2}(f_{\theta_1}(x)))) =: f_{\theta_L} \circ \cdots \circ f_{\theta_2} \circ f_{\theta_1}(x), \qquad (1)
$$
where the parameters $\theta_1, \theta_2, \cdots, \theta_L$ are within the functions.

The in-sample error $e$ is yet another function of $h(x; \theta_1, \theta_2, \cdots)$. When applying gradient descent to $e$, we take the partial derivatives of $e$ with respect to the parameters $\theta$. Naturally, the chain rule turns out to be useful when calculating these derivatives, i.e.,

$$
\frac{\partial e}{\partial \theta_k}
= \frac{\partial e}{\partial x^{(L)}} \frac{\partial x^{(L)}}{\partial \theta_k}
= \frac{\partial e}{\partial x^{(L)}} \frac{\partial x^{(L)}}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial \theta_k} \qquad (2)
$$
$$
= \frac{\partial e}{\partial x^{(L)}} \frac{\partial x^{(L)}}{\partial x^{(L-1)}} \frac{\partial x^{(L-1)}}{\partial x^{(L-2)}} \cdots \frac{\partial x^{(k+1)}}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial \theta_k}, \qquad k = 1, \cdots, L \qquad (3)
$$

where $x^{(k)} = f_{\theta_k}(\cdots(f_{\theta_2}(f_{\theta_1}(x))))$ is the output of the $k$-th layer.
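To make the recursion concrete, the following minimal numpy sketch (our own illustration, not part of the original derivation; the two-layer network, the squared error $e$, and all shapes are assumptions) evaluates the chain rule of Eq. (3) for a toy composition $h(x) = W_2 \tanh(W_1 x)$ and checks the resulting gradient of $e$ with respect to $W_1$ against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer composition h(x) = f_{theta2}(f_{theta1}(x)), as in Eq. (1),
# with f_{theta1}(x) = tanh(W1 x) and f_{theta2}(u) = W2 u (shapes made up).
W1 = rng.standard_normal((4, 3))   # theta_1, maps R^3 -> R^4
W2 = rng.standard_normal((2, 4))   # theta_2, maps R^4 -> R^2
x = rng.standard_normal(3)
y = rng.standard_normal(2)

def forward(W1, W2, x):
    x1 = np.tanh(W1 @ x)           # x^(1)
    x2 = W2 @ x1                   # x^(2) = h(x)
    return x1, x2

def error(W1, W2, x, y):
    _, h = forward(W1, W2, x)
    return 0.5 * np.sum((h - y) ** 2)   # in-sample error e (our assumption)

# Chain rule, Eq. (3): de/dW1 = (de/dx^(2)) (dx^(2)/dx^(1)) (dx^(1)/dW1).
x1, x2 = forward(W1, W2, x)
de_dx2 = x2 - y                          # de/dx^(2)
de_dx1 = W2.T @ de_dx2                   # de/dx^(1), through the linear layer
de_ds1 = de_dx1 * (1.0 - x1 ** 2)        # through tanh: theta'(s) = 1 - tanh(s)^2
grad_W1 = np.outer(de_ds1, x)            # de/dW1

# Check against central finite differences.
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        num[i, j] = (error(Wp, W2, x, y) - error(Wm, W2, x, y)) / (2 * eps)

print(np.max(np.abs(grad_W1 - num)))     # prints a value near zero (~1e-9)
```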

2.2 Forward-Propagation and Back-Propagation in Fully Connected Layers

We follow the notations in [1] below. Namely, we let $W_k \in \mathbb{R}^{(d_{k-1}+1) \times d_k}$ denote the weight matrix, where the bias terms are contained in an additional dimension of layer $k-1$, let $s^{(k)}$ denote the incoming signal of layer $k$, and let $\theta$ denote the activation function. In a fully-connected layer in particular, the output of layer $k$ w.r.t. the output of layer $k-1$, given by the function $f_{W_k}: \mathbb{R}^{d_{k-1}+1} \to \mathbb{R}^{d_k+1}$, is

$$
x^{(k)} = f_{W_k}(x^{(k-1)}) = \begin{bmatrix} 1 \\ \theta(s^{(k)}) \end{bmatrix} \qquad (4)
$$

$$
s^{(k)} = (W_k)^T x^{(k-1)} \qquad (5)
$$
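As a concrete illustration of Eqs. (4)-(5), the sketch below (a minimal numpy version, assuming $\theta = \tanh$ and made-up dimensions; the helper name fc_forward is ours) augments the previous layer's output with the constant $1$ and applies the transposed weight matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d_prev, d_k = 5, 3

# W_k has shape (d_{k-1}+1) x d_k; row 0 holds the bias terms, as in [1].
W_k = rng.standard_normal((d_prev + 1, d_k))

def fc_forward(x_prev, W_k, theta=np.tanh):
    """Eqs. (4)-(5): prepend the constant 1, form s^(k) = W_k^T x^(k-1),
    and return x^(k) = [1, theta(s^(k))] together with s^(k)."""
    x_prev = np.concatenate(([1.0], x_prev))            # add the bias coordinate
    s_k = W_k.T @ x_prev                                # Eq. (5)
    return np.concatenate(([1.0], theta(s_k))), s_k     # Eq. (4)

x_prev = rng.standard_normal(d_prev)
x_k, s_k = fc_forward(x_prev, W_k)
print(x_k.shape)   # (d_k + 1,)
```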

Following the chain rule described in Eq. (2), the derivative of $e$ with respect to $W_k$ in the fully-connected case can be derived from

$$
\frac{\partial e}{\partial w_{ij}^{(k)}} = \frac{\partial s_j^{(k)}}{\partial w_{ij}^{(k)}} \frac{\partial e}{\partial s_j^{(k)}} := x_i^{(k-1)} \delta_j^{(k)}, \qquad (6)
$$

with the fact that for a single weight $w_{ij}^{(k)}$, a change in $w_{ij}^{(k)}$ only affects $s_j^{(k)}$. Therefore we have

$$
\frac{\partial e}{\partial W_k} = x^{(k-1)} (\delta^{(k)})^T, \qquad (7)
$$

where $\delta^{(k)} = \frac{\partial e}{\partial s^{(k)}}$ is called the sensitivity vector for layer $k$. Based on the chain rule, the sensitivity vector can be further computed as

$$
\delta^{(k)} = \theta'(s^{(k)}) \otimes \left[ W^{(k+1)} \delta^{(k+1)} \right]_1^{d^{(k)}} \qquad (8)
$$

where $\otimes$ denotes the element-wise product and $\left[ W^{(k+1)} \delta^{(k+1)} \right]_1^{d^{(k)}}$ excludes the first element (the one corresponding to the bias) of $W^{(k+1)} \delta^{(k+1)}$. Then we are able to recursively compute the derivatives from layer $L$ down to layer $1$.

The weights $W_k$ are used both in the forward-propagation, when we compute the output of each layer from the output of the previous layer, as shown in Eq. (5), and in the back-propagation, when we compute the sensitivity vector of the current layer from the sensitivity vector of the next layer, as shown in Eq. (8). As you may have noticed, the weight matrix is transposed in the forward-propagation Eq. (5) but not transposed in the back-propagation Eq. (8). We will find the convolutional case similar, but with a notable difference.
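A minimal numpy sketch of Eqs. (6)-(8) follows (again assuming $\theta = \tanh$, so $\theta'(s) = 1 - \theta(s)^2$; the function and variable names are ours). It forms the gradient in Eq. (7) as an outer product and implements the recursion in Eq. (8), dropping the bias row of $W_{k+1}$ before the multiplication by $\delta^{(k+1)}$.

```python
import numpy as np

def fc_backward(x_prev_aug, s_k, W_next, delta_next,
                dtheta=lambda s: 1.0 - np.tanh(s) ** 2):
    """x_prev_aug: x^(k-1) with the leading 1 (length d_{k-1}+1).
    s_k: incoming signal of layer k (length d_k).
    W_next: W_{k+1}, shape (d_k+1, d_{k+1}); delta_next: delta^(k+1)."""
    # Eq. (8): drop the bias row of W_{k+1}, then multiply element-wise by theta'(s^(k)).
    delta_k = dtheta(s_k) * (W_next[1:, :] @ delta_next)
    # Eq. (7): de/dW_k = x^(k-1) (delta^(k))^T, an outer product.
    grad_Wk = np.outer(x_prev_aug, delta_k)
    return delta_k, grad_Wk

# Toy usage with made-up shapes.
rng = np.random.default_rng(2)
d_prev, d_k, d_next = 5, 3, 2
x_prev_aug = np.concatenate(([1.0], rng.standard_normal(d_prev)))
s_k = rng.standard_normal(d_k)
W_next = rng.standard_normal((d_k + 1, d_next))
delta_next = rng.standard_normal(d_next)
delta_k, grad_Wk = fc_backward(x_prev_aug, s_k, W_next, delta_next)
print(delta_k.shape, grad_Wk.shape)   # (3,) (6, 3)
```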

3 Back-Propagation in Convolutional Layers

In this section, we will first introduce the forward-propagation and back-propagation of the convolution in the single-channel case. Then we will discuss how the convolution can be viewed as a particular case of the fully connected layer. At the end of this section, we will further discuss how the convolution works with multiple input and output channels.

In this section, we refer to the outputs $x^{(k)}$ at the current layer as the outputs of layer $k$, and the outputs $x^{(k-1)}$ at the previous layer as the inputs of layer $k$. For simplicity, we ignore the bias terms, i.e., $w_{ij}^{(k)} \in \mathbb{R}^{d_k}$ instead of $\mathbb{R}^{d_k+1}$, where $d_k$ denotes the number of channels of the inputs to layer $k$; this makes Eq. (9) different from Eq. (4).

3.1 Forward-Propagation with Single Channel

Let $s_{pq}^{(k)}$ denote the location $(p, q)$ of the incoming signal $s^{(k)}$ at layer $k$ and $x_{pq}^{(k)}$ denote the location $(p, q)$ of the outputs $x^{(k)}$ at layer $k$. Then in a forward propagation we have:

$$
x_{pq}^{(k)} = \theta(s_{pq}^{(k)}), \qquad (9)
$$

$$
s_{pq}^{(k+1)} = \sum_{i \in K} \sum_{j \in K} x_{p+i,q+j}^{(k)} w_{ij} =: \sum_{i \in K, j \in K} x_{p+i,q+j}^{(k)} w_{ij}, \qquad (10)
$$
for all $(p, q)$, where $w_{ij}$ denotes the weight at location $(i, j)$ in the filter $F$, and $K$ denotes a set of neighboring offsets. For instance, in a $3 \times 3$ convolution, $K = \{-1, 0, 1\}$; and in a $5 \times 5$ convolution, $K = \{-2, -1, 0, 1, 2\}$.
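The following deliberately naive numpy sketch (our own; zero padding at the boundary is an assumption, since the note does not fix a boundary convention) is a direct transcription of Eq. (10), looping over the offsets in $K$.

```python
import numpy as np

def conv_forward(x, w, K=(-1, 0, 1)):
    """Eq. (10): s_{pq} = sum_{i,j in K} x_{p+i,q+j} * w_{ij}.
    x: 2D input feature map; w: len(K) x len(K) filter indexed by the offsets in K.
    Out-of-range inputs are treated as zero (zero padding, 'same' output size)."""
    H, W = x.shape
    off = max(abs(k) for k in K)          # index shift into the padded array
    xp = np.pad(x, off)                   # zero padding on both spatial dims
    s = np.zeros_like(x, dtype=float)
    for p in range(H):
        for q in range(W):
            for a, i in enumerate(K):
                for b, j in enumerate(K):
                    s[p, q] += xp[p + off + i, q + off + j] * w[a, b]
    return s

rng = np.random.default_rng(3)
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))           # 3x3 filter, K = {-1, 0, 1}
print(conv_forward(x, w).shape)           # (6, 6)
```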

3.2 Back-Propagation with Single Channel

During the back-propagation in a convolutional layer, we have

$$
\frac{\partial e}{\partial x_{pq}^{(k)}}
= \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p-i,q-j}^{(k+1)}} \cdot \frac{\partial s_{p-i,q-j}^{(k+1)}}{\partial x_{pq}^{(k)}} \qquad (11)
$$
$$
= \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p-i,q-j}^{(k+1)}} \cdot \frac{\partial s_{p,q}^{(k+1)}}{\partial x_{p+i,q+j}^{(k)}} \qquad (12)
$$
$$
= \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p-i,q-j}^{(k+1)}} \cdot w_{ij} \qquad (13)
$$
$$
= \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p+i,q+j}^{(k+1)}} \cdot w_{-i,-j} \qquad (14)
$$
Therefore, the sensitivity vector is computed as

$$
\delta_{pq}^{(k)} = \frac{\partial e}{\partial s_{pq}^{(k)}} = \theta'(s_{pq}^{(k)}) \cdot \frac{\partial e}{\partial x_{pq}^{(k)}} \qquad (15)
$$
$$
= \theta'(s_{pq}^{(k)}) \cdot \left[ \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p+i,q+j}^{(k+1)}} \cdot w_{-i,-j} \right] \qquad (16)
$$
$$
= \theta'(s_{pq}^{(k)}) \cdot \left[ \sum_{i \in K, j \in K} \delta_{p+i,q+j}^{(k+1)} \cdot w_{-i,-j} \right] \qquad (17)
$$

We can see that the indices of $w$ are opposite in FP and BP, which shows that BP uses the inverted (not transposed) version of the filter used in FP. This differs from what we observed and concluded at the end of Section 2.2. In the following subsection, we will discuss why this happens. This was also discussed in [2].
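To make the flipped filter concrete, the following sketch (our own, reusing the zero-padding convention of the forward sketch and again assuming $\theta = \tanh$) computes $\delta^{(k)}$ as in Eq. (17): it accumulates $\delta^{(k+1)}_{p+i,q+j} \, w_{-i,-j}$, i.e. it applies the 180-degree rotated kernel, and then multiplies element-wise by $\theta'(s^{(k)})$.

```python
import numpy as np

def conv_backward_delta(delta_next, s_k, w, K=(-1, 0, 1),
                        dtheta=lambda s: 1.0 - np.tanh(s) ** 2):
    """Eq. (17): delta^(k)_{pq} = theta'(s^(k)_{pq}) *
    sum_{i,j in K} delta^(k+1)_{p+i,q+j} * w_{-i,-j}."""
    H, W = s_k.shape
    off = max(abs(k) for k in K)
    dpad = np.pad(delta_next, off)                # zero padding, as in the forward sketch
    delta_k = np.zeros_like(s_k, dtype=float)
    for p in range(H):
        for q in range(W):
            acc = 0.0
            for a, i in enumerate(K):
                for b, j in enumerate(K):
                    # w is indexed by the (symmetric) offsets in K,
                    # so w_{-i,-j} sits at the flipped array position.
                    acc += dpad[p + off + i, q + off + j] * w[len(K) - 1 - a, len(K) - 1 - b]
            delta_k[p, q] = dtheta(s_k[p, q]) * acc
    return delta_k

rng = np.random.default_rng(4)
delta_next = rng.standard_normal((6, 6))
s_k = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))
print(conv_backward_delta(delta_next, s_k, w).shape)   # (6, 6)
```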

3.3 Connection between Convolution and the Fully Connected

Long story short, the convolutional layer is a particular case of the fully connected layer whose weights are shared and non-zero only within local neighborhoods. We illustrate this in a 1D-convolution (with padding) case in Figure 1a. From a fully connected point of view, the weight matrix $\tilde W$ of the convolution can be obtained by sliding the kernel from one side to the other, with the weights being zero outside the kernel. In this case, we re-write the forward-propagation as follows

$$
s_p = \sum_{i \in K} x_{p+i} w_i \qquad (18)
$$

$$
= x_{p-1} w_{-1} + x_p w_0 + x_{p+1} w_1 \qquad (19)
$$
$$
= x_{p-1} \tilde w_{p,p-1} + x_p \tilde w_{p,p} + x_{p+1} \tilde w_{p,p+1} \qquad (20)
$$
$$
= \cdots + x_{p-2} \cdot 0 + x_{p-1} \tilde w_{p,p-1} + x_p \tilde w_{p,p} + x_{p+1} \tilde w_{p,p+1} + x_{p+2} \cdot 0 + \cdots \qquad (21)
$$
$$
=: (\tilde W^T)_{p\cdot} \, x \qquad (22)
$$

Figure 1: Convolution as a fully connected layer ((a) Forward-Propagation, (b) Back-Propagation): here $w_i$ denotes the weight at location $i$ in a kernel and $\tilde w_{i,j}$ denotes the weight at location $(i, j)$ in the weight matrix $\tilde W$ in terms of the fully connected perspective. The green vector denotes the (padded) input $x^{(k-1)}$ and the blue vector denotes the incoming signal $s^{(k)}$ at layer $k$. Note that, to better show the correspondence between locations, the matrices shown in both figures are transposed relative to those used in Eq. (23) and Eq. (28).

where $w_i$ denotes the weights in the kernel, $\tilde w_{j,k}$ denotes the weights in the fully-connected weight matrix $\tilde W$, and $(\tilde W^T)_{p\cdot}$ denotes the $p$-th row of the transposed weight matrix $\tilde W^T$, i.e. the $p$-th column of the matrix $\tilde W$. Thus we have

$$
s = \tilde W^T x \qquad (23)
$$
Similarly, we can re-write the back-propagation process as

$$
\frac{\partial e}{\partial x_p} = \sum_{i \in K} \frac{\partial e}{\partial s_{p+i}} \cdot w_{-i} \qquad (24)
$$
$$
= \frac{\partial e}{\partial s_{p+1}} \cdot w_{-1} + \frac{\partial e}{\partial s_p} \cdot w_0 + \frac{\partial e}{\partial s_{p-1}} \cdot w_1 \qquad (25)
$$
$$
= \frac{\partial e}{\partial s_{p+1}} \cdot \tilde w_{p+1,p} + \frac{\partial e}{\partial s_p} \cdot \tilde w_{p,p} + \frac{\partial e}{\partial s_{p-1}} \cdot \tilde w_{p-1,p} \qquad (26)
$$
$$
=: \tilde W_{p\cdot} \, \delta \qquad (27)
$$

where $\tilde W_{p\cdot}$ denotes the $p$-th row of the weight matrix $\tilde W$, which implies

$$
\delta^{(k)} = \theta'(s^{(k)}) \otimes (\tilde W^{(k+1)} \delta^{(k+1)}) \qquad (28)
$$

Now we can see that Eq. (23) is consistent with Eq. (5), and Eq. (28) is a simplified version (without bias) of Eq. (8).
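The equivalence can also be checked numerically. The sketch below (our own, for a length-8 input, a 3-tap kernel, and zero padding, all assumptions) builds the banded matrix $\tilde W$ of Eqs. (20)-(21), then verifies that $\tilde W^T x$ reproduces the convolution in Eq. (18) and that $\tilde W \delta$ reproduces the flipped-kernel sum in Eq. (24).

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 8, (-1, 0, 1)
x = rng.standard_normal(n)
delta = rng.standard_normal(n)
w = {i: rng.standard_normal() for i in K}   # 3-tap kernel: w_{-1}, w_0, w_1

# Build W~ following the convention s = W~^T x of Eq. (23):
# rows index the input, columns the output, so W~[p+i, p] = w_i (zero elsewhere).
W_tilde = np.zeros((n, n))
for p in range(n):
    for i in K:
        if 0 <= p + i < n:
            W_tilde[p + i, p] = w[i]

# Forward: Eq. (18) with zero padding vs. Eq. (23), s = W~^T x.
s_direct = np.array([sum(w[i] * x[p + i] for i in K if 0 <= p + i < n)
                     for p in range(n)])
print(np.allclose(W_tilde.T @ x, s_direct))          # True

# Backward: Eq. (24) (flipped kernel) vs. Eq. (27), de/dx = W~ delta.
g_direct = np.array([sum(w[-i] * delta[p + i] for i in K if 0 <= p + i < n)
                     for p in range(n)])
print(np.allclose(W_tilde @ delta, g_direct))        # True
```

The same (non-transposed) matrix $\tilde W$ that implements the flipped-kernel correlation in BP implements the original kernel, transposed, in FP, which is exactly why the filter appears inverted rather than transposed in Section 3.2.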

3.4 Multiple Input and Output Feature Maps

The case of multiple input and output feature maps is described in [2]. In particular, for M input channels and N output channels, the forward-propagation becomes

$$
x_{pq}^{(k)} = \theta(s_{pq}^{(k)}), \qquad (29)
$$

Figure 2: Convolution with multiple input and output feature maps ((a) Multi-channel input and single-channel output, (b) Multi-channel input and output): here $M$ denotes the number of input channels, $W$ and $D$ denote the spatial shape of the input feature maps, $K$ denotes the size of the kernels, and $W'$ and $D'$ denote the spatial shape of the output feature maps, which depends on the padding. In (a), a single kernel of shape $K \times K \times M$ maps a $W \times D \times M$ input to a $W' \times D'$ output map; in (b), $N$ kernels of total shape $K \times K \times M \times N$ produce a $W' \times D' \times N$ output.

$$
s_{pqn}^{(k+1)} = \sum_{i \in K} \sum_{j \in K} w_{ijn}^T x_{p+i,q+j}^{(k)}, \qquad n = 1, \cdots, N, \qquad (30)
$$

where $x_{pq}^{(k)}, s_{pq}^{(k)} \in \mathbb{R}^M$ and $s_{pq}^{(k+1)} \in \mathbb{R}^N$ are the multi-channel pixels, and $s_{pqn}^{(k+1)}$ denotes the $n$-th channel of the pixel at $(p, q)$. The multiplication between the scalars $x_{p+i,q+j}^{(k)}$ and $w_{ij}$ in Eq. (10) becomes the dot product between the column vectors $x_{p+i,q+j}^{(k)}$ and $w_{ijn}$. Here each $n$ corresponds to one output feature map. The forward-propagation process with multiple channels is illustrated in Figures 2a and 2b. When it comes to the back-propagation, we have

$$
\delta_{pq}^{(k)} = \frac{\partial e}{\partial s_{pq}^{(k)}} = \theta'(s_{pq}^{(k)}) \otimes \frac{\partial e}{\partial x_{pq}^{(k)}} \qquad (31)
$$
$$
= \theta'(s_{pq}^{(k)}) \otimes \left[ \sum_{n=1}^{N} \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p+i,q+j,n}^{(k+1)}} \cdot \frac{\partial s_{p+i,q+j,n}^{(k+1)}}{\partial x_{pq}^{(k)}} \right] \qquad (32)
$$
$$
= \theta'(s_{pq}^{(k)}) \otimes \left[ \sum_{n=1}^{N} \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p+i,q+j,n}^{(k+1)}} \cdot w_{-i,-j,n} \right] \qquad (33)
$$
$$
= \theta'(s_{pq}^{(k)}) \otimes \left[ \sum_{i \in K, j \in K} \sum_{n=1}^{N} \delta_{p+i,q+j,n}^{(k+1)} \cdot w_{-i,-j,n} \right] \qquad (34)
$$
$$
= \theta'(s_{pq}^{(k)}) \otimes \left[ \sum_{i \in K, j \in K} W_{-i,-j} \, \delta_{p+i,q+j}^{(k+1)} \right] \qquad (35)
$$
where $W_{-i,-j} = [w_{-i,-j,1}, w_{-i,-j,2}, \cdots, w_{-i,-j,N}]$. From Eq. (32) to Eq. (33), we substitute $\partial s_{p+i,q+j,n}^{(k+1)} / \partial x_{pq}^{(k)} = w_{-i,-j,n}$, which follows from Eq. (30). From Eq. (33) to Eq. (34), we exchange the order of the two summations. From Eq. (34) to Eq. (35), we re-write the linear combination of vectors as a multiplication between a matrix (consisting of the vectors) and a vector. The formulas are very similar to the ones in the fully connected case and imply another connection between the convolution and the fully connected layer, namely, the channels in two consecutive layers are fully connected to each other.

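As before, a naive numpy sketch (our own, with zero padding, channels-last arrays, and made-up shapes) of the multi-channel forward pass in Eq. (30) and the sensitivity computation in Eq. (35): each output pixel is a dot product across the $M$ input channels for every offset and output channel, and each sensitivity pixel sums the matrices $W_{-i,-j}$ applied to the neighboring sensitivities.

```python
import numpy as np

def conv_forward_mc(x, w, K=(-1, 0, 1)):
    """Eq. (30): s_{pqn} = sum_{i,j in K} w_{ijn}^T x_{p+i,q+j}.
    x: (H, W, M) input; w: (len(K), len(K), M, N) kernel indexed by the offsets in K."""
    H, Wd, M = x.shape
    N = w.shape[-1]
    off = max(abs(k) for k in K)
    xp = np.pad(x, ((off, off), (off, off), (0, 0)))      # zero-pad spatial dims only
    s = np.zeros((H, Wd, N))
    for p in range(H):
        for q in range(Wd):
            for a, i in enumerate(K):
                for b, j in enumerate(K):
                    # dot product across M input channels, for all N outputs at once
                    s[p, q, :] += xp[p + off + i, q + off + j, :] @ w[a, b, :, :]
    return s

def conv_backward_delta_mc(delta_next, s_k, w, K=(-1, 0, 1),
                           dtheta=lambda s: 1.0 - np.tanh(s) ** 2):
    """Eq. (35): delta^(k)_{pq} = theta'(s^(k)_{pq}) *
    sum_{i,j in K} W_{-i,-j} delta^(k+1)_{p+i,q+j}, with W_{-i,-j} of shape (M, N)."""
    H, Wd, M = s_k.shape
    off = max(abs(k) for k in K)
    dpad = np.pad(delta_next, ((off, off), (off, off), (0, 0)))
    delta_k = np.zeros((H, Wd, M))
    for p in range(H):
        for q in range(Wd):
            acc = np.zeros(M)
            for a, i in enumerate(K):
                for b, j in enumerate(K):
                    W_flip = w[len(K) - 1 - a, len(K) - 1 - b, :, :]   # W_{-i,-j}
                    acc += W_flip @ dpad[p + off + i, q + off + j, :]
            delta_k[p, q, :] = dtheta(s_k[p, q, :]) * acc
    return delta_k

rng = np.random.default_rng(6)
x = rng.standard_normal((5, 5, 2))          # M = 2 input channels
w = rng.standard_normal((3, 3, 2, 4))       # 3x3 kernel, N = 4 output channels
s_next = conv_forward_mc(x, w)
delta_k = conv_backward_delta_mc(rng.standard_normal((5, 5, 4)),
                                 rng.standard_normal((5, 5, 2)), w)
print(s_next.shape, delta_k.shape)          # (5, 5, 4) (5, 5, 2)
```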
Acknowledgements

We thank Zhengyang Wang for proof-reading. This work was supported in part by National Science Foundation grants IIS-1908220, IIS-1908198, IIS-1908166, DBI-1147134, DBI-1922969, DBI-1661289, CHE-1738305, National Institutes of Health grant 1R21NS102828, and Defense Advanced Research Projects Agency grant N66001-17-2-4031.

References

[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from Data. AMLBook, New York, NY, USA, 2012.

[2] Charu C. Aggarwal. Neural Networks and Deep Learning. Springer, 2018.
