
Back-Propagation: From Fully Connected to Convolutional Layers

Shuiwang Ji
Texas A&M University
College Station, TX 77843
[email protected]

Yaochen Xie
Texas A&M University
College Station, TX 77843
[email protected]

1 Introduction

This note provides an introduction to back-propagation (BP) for convolutional layers. We derive the back-propagation for the convolution from the general case and further show that the convolutional layer is a particular case of the fully-connected layer. This document is based on lecture notes by Shuiwang Ji at Texas A&M University and can be used for undergraduate and graduate level classes.

2 Back-Propagation in Fully Connected Layers

2.1 Forward-Propagation and Back-Propagation in General

In a general $L$-layer stacked neural network, we let $\theta_k \in \mathbb{R}^{\tilde d_k}$ denote the parameters (formatted as column vectors) at layer $k$, and $x^{(k)} \in \mathbb{R}^{d_k}$ denote the outputs (formatted as column vectors) of layer $k$. The relationship between the input $x =: x^{(0)}$ and the final output (hypothesis) $h(x) = x^{(L)}$ can be represented by a composition of a series of functions

$$
f_{\theta_1}: \mathbb{R}^{d_0} \to \mathbb{R}^{d_1}, \quad f_{\theta_2}: \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}, \quad \cdots, \quad f_{\theta_L}: \mathbb{R}^{d_{L-1}} \to \mathbb{R}^{d_L},
$$
as

$$
h(x; \theta_1, \theta_2, \cdots, \theta_L) = f_{\theta_L}(\cdots(f_{\theta_2}(f_{\theta_1}(x)))) =: f_{\theta_L} \circ \cdots \circ f_{\theta_2} \circ f_{\theta_1}(x), \qquad (1)
$$
where the parameters $\theta_1, \theta_2, \cdots, \theta_L$ are within the functions.

The in-sample error $e$ is yet another function of $h(x; \theta_1, \theta_2, \cdots)$. When applying gradient descent to $e$, we take the partial derivatives of $e$ with respect to the parameters $\theta$. Naturally, the chain rule turns out to be useful when calculating these derivatives, i.e.,

$$
\frac{\partial e}{\partial \theta_k}
= \frac{\partial e}{\partial x^{(L)}} \frac{\partial x^{(L)}}{\partial \theta_k}
= \frac{\partial e}{\partial x^{(L)}} \frac{\partial x^{(L)}}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial \theta_k} \qquad (2)
$$
$$
= \frac{\partial e}{\partial x^{(L)}} \frac{\partial x^{(L)}}{\partial x^{(L-1)}} \frac{\partial x^{(L-1)}}{\partial x^{(L-2)}} \cdots \frac{\partial x^{(k+1)}}{\partial x^{(k)}} \frac{\partial x^{(k)}}{\partial \theta_k}, \qquad k = 1, \cdots, L \qquad (3)
$$

where $x^{(k)} = f_{\theta_k}(\cdots(f_{\theta_2}(f_{\theta_1}(x))))$ is the output of the $k$-th layer.
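To make the recursion concrete, the following minimal numpy sketch (our own illustration, not part of the original derivation; the two-layer network, the squared error $e$, and all shapes are assumptions) evaluates the chain rule of Eq. (3) for a toy composition $h(x) = W_2 \tanh(W_1 x)$ and checks the resulting gradient of $e$ with respect to $W_1$ against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer composition h(x) = f_{theta2}(f_{theta1}(x)), as in Eq. (1),
# with f_{theta1}(x) = tanh(W1 x) and f_{theta2}(u) = W2 u (shapes made up).
W1 = rng.standard_normal((4, 3))   # theta_1, maps R^3 -> R^4
W2 = rng.standard_normal((2, 4))   # theta_2, maps R^4 -> R^2
x = rng.standard_normal(3)
y = rng.standard_normal(2)

def forward(W1, W2, x):
    x1 = np.tanh(W1 @ x)           # x^(1)
    x2 = W2 @ x1                   # x^(2) = h(x)
    return x1, x2

def error(W1, W2, x, y):
    _, h = forward(W1, W2, x)
    return 0.5 * np.sum((h - y) ** 2)   # in-sample error e (our assumption)

# Chain rule, Eq. (3): de/dW1 = (de/dx^(2)) (dx^(2)/dx^(1)) (dx^(1)/dW1).
x1, x2 = forward(W1, W2, x)
de_dx2 = x2 - y                          # de/dx^(2)
de_dx1 = W2.T @ de_dx2                   # de/dx^(1), through the linear layer
de_ds1 = de_dx1 * (1.0 - x1 ** 2)        # through tanh: theta'(s) = 1 - tanh(s)^2
grad_W1 = np.outer(de_ds1, x)            # de/dW1

# Check against central finite differences.
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp = W1.copy(); Wp[i, j] += eps
        Wm = W1.copy(); Wm[i, j] -= eps
        num[i, j] = (error(Wp, W2, x, y) - error(Wm, W2, x, y)) / (2 * eps)

print(np.max(np.abs(grad_W1 - num)))     # prints a value near zero (~1e-9)
```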

2.2 Forward-Propagation and Back-Propagation in Fully Connected Layers

We follow the notations in [1] below. Namely, we let $W_k \in \mathbb{R}^{(d_{k-1}+1) \times d_k}$ denote the weight matrix, where the bias terms are contained in an additional dimension of layer $k-1$, let $s^{(k)}$ denote the incoming signal of layer $k$, and let $\theta$ denote the activation function. In a fully-connected layer in particular, the output of layer $k$ w.r.t. the output of layer $k-1$, given by the function $f_{W_k}: \mathbb{R}^{d_{k-1}+1} \to \mathbb{R}^{d_k+1}$, is

$$
x^{(k)} = f_{W_k}(x^{(k-1)}) = \begin{bmatrix} 1 \\ \theta(s^{(k)}) \end{bmatrix} \qquad (4)
$$

$$
s^{(k)} = (W_k)^T x^{(k-1)} \qquad (5)
$$
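As a concrete illustration of Eqs. (4)-(5), the sketch below (a minimal numpy version, assuming $\theta = \tanh$ and made-up dimensions; the helper name fc_forward is ours) augments the previous layer's output with the constant $1$ and applies the transposed weight matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d_prev, d_k = 5, 3

# W_k has shape (d_{k-1}+1) x d_k; row 0 holds the bias terms, as in [1].
W_k = rng.standard_normal((d_prev + 1, d_k))

def fc_forward(x_prev, W_k, theta=np.tanh):
    """Eqs. (4)-(5): prepend the constant 1, form s^(k) = W_k^T x^(k-1),
    and return x^(k) = [1, theta(s^(k))] together with s^(k)."""
    x_prev = np.concatenate(([1.0], x_prev))            # add the bias coordinate
    s_k = W_k.T @ x_prev                                # Eq. (5)
    return np.concatenate(([1.0], theta(s_k))), s_k     # Eq. (4)

x_prev = rng.standard_normal(d_prev)
x_k, s_k = fc_forward(x_prev, W_k)
print(x_k.shape)   # (d_k + 1,)
```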

Following the chain rule described in Eq. (2), the derivative of $e$ with respect to $W_k$ in the fully-connected case can be derived from

$$
\frac{\partial e}{\partial w_{ij}^{(k)}} = \frac{\partial s_j^{(k)}}{\partial w_{ij}^{(k)}} \frac{\partial e}{\partial s_j^{(k)}} := x_i^{(k-1)} \delta_j^{(k)}, \qquad (6)
$$

with the fact that for a single weight $w_{ij}^{(k)}$, a change in $w_{ij}^{(k)}$ only affects $s_j^{(k)}$. Therefore we have

$$
\frac{\partial e}{\partial W_k} = x^{(k-1)} (\delta^{(k)})^T, \qquad (7)
$$

where $\delta^{(k)} = \frac{\partial e}{\partial s^{(k)}}$ is called the sensitivity vector for layer $k$. Based on the chain rule, the sensitivity vector can be further computed as

$$
\delta^{(k)} = \theta'(s^{(k)}) \otimes \left[ W^{(k+1)} \delta^{(k+1)} \right]_1^{d^{(k)}} \qquad (8)
$$

where $\otimes$ denotes the element-wise product and $\left[ W^{(k+1)} \delta^{(k+1)} \right]_1^{d^{(k)}}$ excludes the first element (the one corresponding to the bias) of $W^{(k+1)} \delta^{(k+1)}$. Then we are able to recursively compute the derivatives from layer $L$ down to layer $1$.

The weights $W_k$ are used both in the forward-propagation, when we compute the output of each layer from the output of the previous layer, as shown in Eq. (5), and in the back-propagation, when we compute the sensitivity vector of the current layer from the sensitivity vector of the next layer, as shown in Eq. (8). As you may have noticed, the weight matrix is transposed in the forward-propagation Eq. (5) but not transposed in the back-propagation Eq. (8). We will find the convolutional case similar, but with a notable difference.
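A minimal numpy sketch of Eqs. (6)-(8) follows (again assuming $\theta = \tanh$, so $\theta'(s) = 1 - \theta(s)^2$; the function and variable names are ours). It forms the gradient in Eq. (7) as an outer product and implements the recursion in Eq. (8), dropping the bias row of $W_{k+1}$ before the multiplication by $\delta^{(k+1)}$.

```python
import numpy as np

def fc_backward(x_prev_aug, s_k, W_next, delta_next,
                dtheta=lambda s: 1.0 - np.tanh(s) ** 2):
    """x_prev_aug: x^(k-1) with the leading 1 (length d_{k-1}+1).
    s_k: incoming signal of layer k (length d_k).
    W_next: W_{k+1}, shape (d_k+1, d_{k+1}); delta_next: delta^(k+1)."""
    # Eq. (8): drop the bias row of W_{k+1}, then multiply element-wise by theta'(s^(k)).
    delta_k = dtheta(s_k) * (W_next[1:, :] @ delta_next)
    # Eq. (7): de/dW_k = x^(k-1) (delta^(k))^T, an outer product.
    grad_Wk = np.outer(x_prev_aug, delta_k)
    return delta_k, grad_Wk

# Toy usage with made-up shapes.
rng = np.random.default_rng(2)
d_prev, d_k, d_next = 5, 3, 2
x_prev_aug = np.concatenate(([1.0], rng.standard_normal(d_prev)))
s_k = rng.standard_normal(d_k)
W_next = rng.standard_normal((d_k + 1, d_next))
delta_next = rng.standard_normal(d_next)
delta_k, grad_Wk = fc_backward(x_prev_aug, s_k, W_next, delta_next)
print(delta_k.shape, grad_Wk.shape)   # (3,) (6, 3)
```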

3 Back-Propagation in Convolutional Layers

In this section, we will first introduce the forward-propagation and back-propagation of the convolution in the single-channel case. Then we will discuss how the convolution can be viewed as a particular case of the fully connected layer. At the end of this section, we will further discuss how the convolution works with multiple input and output channels.

In this section, we refer to the outputs $x^{(k)}$ at the current layer as the outputs of layer $k$, and the outputs $x^{(k-1)}$ at the previous layer as the inputs of layer $k$. For simplicity, we ignore the bias terms, i.e., $w_{ij}^{(k)} \in \mathbb{R}^{d_k}$ instead of $\mathbb{R}^{d_k+1}$, where $d_k$ denotes the number of channels of the inputs to layer $k$; this makes Eq. (9) different from Eq. (4).

3.1 Forward-Propagation with Single Channel

Let $s_{pq}^{(k)}$ denote the location $(p, q)$ of the incoming signal $s^{(k)}$ at layer $k$ and $x_{pq}^{(k)}$ denote the location $(p, q)$ of the outputs $x^{(k)}$ at layer $k$. Then in a forward propagation we have:

$$
x_{pq}^{(k)} = \theta(s_{pq}^{(k)}), \qquad (9)
$$

$$
s_{pq}^{(k+1)} = \sum_{i \in K} \sum_{j \in K} x_{p+i,q+j}^{(k)} w_{ij} =: \sum_{i \in K, j \in K} x_{p+i,q+j}^{(k)} w_{ij}, \qquad (10)
$$
for all $(p, q)$, where $w_{ij}$ denotes the weight at location $(i, j)$ in the filter $F$, and $K$ denotes a set of neighboring offsets. For instance, in a $3 \times 3$ convolution, $K = \{-1, 0, 1\}$; and in a $5 \times 5$ convolution, $K = \{-2, -1, 0, 1, 2\}$.
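The following deliberately naive numpy sketch (our own; zero padding at the boundary is an assumption, since the note does not fix a boundary convention) is a direct transcription of Eq. (10), looping over the offsets in $K$.

```python
import numpy as np

def conv_forward(x, w, K=(-1, 0, 1)):
    """Eq. (10): s_{pq} = sum_{i,j in K} x_{p+i,q+j} * w_{ij}.
    x: 2D input feature map; w: len(K) x len(K) filter indexed by the offsets in K.
    Out-of-range inputs are treated as zero (zero padding, 'same' output size)."""
    H, W = x.shape
    off = max(abs(k) for k in K)          # index shift into the padded array
    xp = np.pad(x, off)                   # zero padding on both spatial dims
    s = np.zeros_like(x, dtype=float)
    for p in range(H):
        for q in range(W):
            for a, i in enumerate(K):
                for b, j in enumerate(K):
                    s[p, q] += xp[p + off + i, q + off + j] * w[a, b]
    return s

rng = np.random.default_rng(3)
x = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))           # 3x3 filter, K = {-1, 0, 1}
print(conv_forward(x, w).shape)           # (6, 6)
```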

3.2 Back-Propagation with Single Channel

During the back-propagation in a convolutional layer, we have

$$
\frac{\partial e}{\partial x_{pq}^{(k)}}
= \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p-i,q-j}^{(k+1)}} \cdot \frac{\partial s_{p-i,q-j}^{(k+1)}}{\partial x_{pq}^{(k)}} \qquad (11)
$$
$$
= \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p-i,q-j}^{(k+1)}} \cdot \frac{\partial s_{p,q}^{(k+1)}}{\partial x_{p+i,q+j}^{(k)}} \qquad (12)
$$
$$
= \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p-i,q-j}^{(k+1)}} \cdot w_{ij} \qquad (13)
$$
$$
= \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p+i,q+j}^{(k+1)}} \cdot w_{-i,-j} \qquad (14)
$$
Therefore, the sensitivity vector is computed as

$$
\delta_{pq}^{(k)} = \frac{\partial e}{\partial s_{pq}^{(k)}} = \theta'(s_{pq}^{(k)}) \cdot \frac{\partial e}{\partial x_{pq}^{(k)}} \qquad (15)
$$
$$
= \theta'(s_{pq}^{(k)}) \cdot \left[ \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p+i,q+j}^{(k+1)}} \cdot w_{-i,-j} \right] \qquad (16)
$$
$$
= \theta'(s_{pq}^{(k)}) \cdot \left[ \sum_{i \in K, j \in K} \delta_{p+i,q+j}^{(k+1)} \cdot w_{-i,-j} \right] \qquad (17)
$$

We can see that the indices of $w$ are opposite in FP and BP, which shows that BP uses the inverted (not transposed) version of the filter used in FP. This differs from what we observed and concluded at the end of Section 2.2. In the following subsection, we will discuss why this happens. This was also discussed in [2].
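To make the flipped filter concrete, the following sketch (our own, reusing the zero-padding convention of the forward sketch and again assuming $\theta = \tanh$) computes $\delta^{(k)}$ as in Eq. (17): it accumulates $\delta^{(k+1)}_{p+i,q+j} \, w_{-i,-j}$, i.e. it applies the 180-degree rotated kernel, and then multiplies element-wise by $\theta'(s^{(k)})$.

```python
import numpy as np

def conv_backward_delta(delta_next, s_k, w, K=(-1, 0, 1),
                        dtheta=lambda s: 1.0 - np.tanh(s) ** 2):
    """Eq. (17): delta^(k)_{pq} = theta'(s^(k)_{pq}) *
    sum_{i,j in K} delta^(k+1)_{p+i,q+j} * w_{-i,-j}."""
    H, W = s_k.shape
    off = max(abs(k) for k in K)
    dpad = np.pad(delta_next, off)                # zero padding, as in the forward sketch
    delta_k = np.zeros_like(s_k, dtype=float)
    for p in range(H):
        for q in range(W):
            acc = 0.0
            for a, i in enumerate(K):
                for b, j in enumerate(K):
                    # w is indexed by the (symmetric) offsets in K,
                    # so w_{-i,-j} sits at the flipped array position.
                    acc += dpad[p + off + i, q + off + j] * w[len(K) - 1 - a, len(K) - 1 - b]
            delta_k[p, q] = dtheta(s_k[p, q]) * acc
    return delta_k

rng = np.random.default_rng(4)
delta_next = rng.standard_normal((6, 6))
s_k = rng.standard_normal((6, 6))
w = rng.standard_normal((3, 3))
print(conv_backward_delta(delta_next, s_k, w).shape)   # (6, 6)
```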

3.3 Connection between Convolution and the Fully Connected

Long story short, the convolutional layer is a particular case of the fully connected layer whose weights are shared and non-zero only within local neighborhoods. We illustrate this in a 1D-convolution (with padding) case in Figure 1a. From a fully connected point of view, the weight matrix $\tilde W$ of the convolution can be obtained by sliding the kernel from one side to the other, with the weights being zero outside the kernel. In this case, we re-write the forward-propagation as follows

$$
s_p = \sum_{i \in K} x_{p+i} w_i \qquad (18)
$$

$$
= x_{p-1} w_{-1} + x_p w_0 + x_{p+1} w_1 \qquad (19)
$$
$$
= x_{p-1} \tilde w_{p,p-1} + x_p \tilde w_{p,p} + x_{p+1} \tilde w_{p,p+1} \qquad (20)
$$
$$
= \cdots + x_{p-2} \cdot 0 + x_{p-1} \tilde w_{p,p-1} + x_p \tilde w_{p,p} + x_{p+1} \tilde w_{p,p+1} + x_{p+2} \cdot 0 + \cdots \qquad (21)
$$
$$
=: (\tilde W^T)_{p\cdot} \, x \qquad (22)
$$

Figure 1: Convolution as a fully connected layer ((a) Forward-Propagation, (b) Back-Propagation): here $w_i$ denotes the weight at location $i$ in a kernel and $\tilde w_{i,j}$ denotes the weight at location $(i, j)$ in the weight matrix $\tilde W$ in terms of the fully connected perspective. The green vector denotes the (padded) input $x^{(k-1)}$ and the blue vector denotes the incoming signal $s^{(k)}$ at layer $k$. Note that, to better show the correspondence between locations, the matrices shown in both figures are transposed relative to those used in Eq. (23) and Eq. (28).

where $w_i$ denotes the weights in the kernel, $\tilde w_{j,k}$ denotes the weights in the fully-connected weight matrix $\tilde W$, and $(\tilde W^T)_{p\cdot}$ denotes the $p$-th row of the transposed weight matrix $\tilde W^T$, i.e. the $p$-th column of the matrix $\tilde W$. Thus we have

$$
s = \tilde W^T x \qquad (23)
$$
Similarly, we can re-write the back-propagation process as

$$
\frac{\partial e}{\partial x_p} = \sum_{i \in K} \frac{\partial e}{\partial s_{p+i}} \cdot w_{-i} \qquad (24)
$$
$$
= \frac{\partial e}{\partial s_{p+1}} \cdot w_{-1} + \frac{\partial e}{\partial s_p} \cdot w_0 + \frac{\partial e}{\partial s_{p-1}} \cdot w_1 \qquad (25)
$$
$$
= \frac{\partial e}{\partial s_{p+1}} \cdot \tilde w_{p+1,p} + \frac{\partial e}{\partial s_p} \cdot \tilde w_{p,p} + \frac{\partial e}{\partial s_{p-1}} \cdot \tilde w_{p-1,p} \qquad (26)
$$
$$
=: \tilde W_{p\cdot} \, \delta \qquad (27)
$$

where $\tilde W_{p\cdot}$ denotes the $p$-th row of the weight matrix $\tilde W$, which implies

$$
\delta^{(k)} = \theta'(s^{(k)}) \otimes (\tilde W^{(k+1)} \delta^{(k+1)}) \qquad (28)
$$

Now we can see that Eq. (23) is consistent with Eq. (5), and Eq. (28) is a simplified version (without bias) of Eq. (8).
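The equivalence can also be checked numerically. The sketch below (our own, for a length-8 input, a 3-tap kernel, and zero padding, all assumptions) builds the banded matrix $\tilde W$ of Eqs. (20)-(21), then verifies that $\tilde W^T x$ reproduces the convolution in Eq. (18) and that $\tilde W \delta$ reproduces the flipped-kernel sum in Eq. (24).

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 8, (-1, 0, 1)
x = rng.standard_normal(n)
delta = rng.standard_normal(n)
w = {i: rng.standard_normal() for i in K}   # 3-tap kernel: w_{-1}, w_0, w_1

# Build W~ following the convention s = W~^T x of Eq. (23):
# rows index the input, columns the output, so W~[p+i, p] = w_i (zero elsewhere).
W_tilde = np.zeros((n, n))
for p in range(n):
    for i in K:
        if 0 <= p + i < n:
            W_tilde[p + i, p] = w[i]

# Forward: Eq. (18) with zero padding vs. Eq. (23), s = W~^T x.
s_direct = np.array([sum(w[i] * x[p + i] for i in K if 0 <= p + i < n)
                     for p in range(n)])
print(np.allclose(W_tilde.T @ x, s_direct))          # True

# Backward: Eq. (24) (flipped kernel) vs. Eq. (27), de/dx = W~ delta.
g_direct = np.array([sum(w[-i] * delta[p + i] for i in K if 0 <= p + i < n)
                     for p in range(n)])
print(np.allclose(W_tilde @ delta, g_direct))        # True
```

The same (non-transposed) matrix $\tilde W$ that implements the flipped-kernel correlation in BP implements the original kernel, transposed, in FP, which is exactly why the filter appears inverted rather than transposed in Section 3.2.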

3.4 Multiple Input and Output Feature Maps

The case of multiple input and output feature maps is described in [2]. In particular, for M input channels and N output channels, the forward-propagation becomes

$$
x_{pq}^{(k)} = \theta(s_{pq}^{(k)}), \qquad (29)
$$

Figure 2: Convolution with multiple input and output feature maps ((a) Multi-channel input and single-channel output, (b) Multi-channel input and output): here $M$ denotes the number of input channels, $W$ and $D$ denote the spatial shape of the input feature maps, $K$ denotes the size of the kernels, and $W'$ and $D'$ denote the spatial shape of the output feature maps, which depends on the padding. In (a), a single kernel of shape $K \times K \times M$ maps a $W \times D \times M$ input to a $W' \times D'$ output map; in (b), $N$ kernels of total shape $K \times K \times M \times N$ produce a $W' \times D' \times N$ output.

$$
s_{pqn}^{(k+1)} = \sum_{i \in K} \sum_{j \in K} w_{ijn}^T x_{p+i,q+j}^{(k)}, \qquad n = 1, \cdots, N, \qquad (30)
$$

where $x_{pq}^{(k)}, s_{pq}^{(k)} \in \mathbb{R}^M$ and $s_{pq}^{(k+1)} \in \mathbb{R}^N$ are the multi-channel pixels, and $s_{pqn}^{(k+1)}$ denotes the $n$-th channel of the pixel at $(p, q)$. The multiplication between the scalars $x_{p+i,q+j}^{(k)}$ and $w_{ij}$ in Eq. (10) becomes the dot product between the column vectors $x_{p+i,q+j}^{(k)}$ and $w_{ijn}$. Here each $n$ corresponds to one output feature map. The forward-propagation process with multiple channels is illustrated in Figures 2a and 2b. When it comes to the back-propagation, we have

$$
\delta_{pq}^{(k)} = \frac{\partial e}{\partial s_{pq}^{(k)}} = \theta'(s_{pq}^{(k)}) \otimes \frac{\partial e}{\partial x_{pq}^{(k)}} \qquad (31)
$$
$$
= \theta'(s_{pq}^{(k)}) \otimes \left[ \sum_{n=1}^{N} \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p+i,q+j,n}^{(k+1)}} \cdot \frac{\partial s_{p+i,q+j,n}^{(k+1)}}{\partial x_{pq}^{(k)}} \right] \qquad (32)
$$
$$
= \theta'(s_{pq}^{(k)}) \otimes \left[ \sum_{n=1}^{N} \sum_{i \in K, j \in K} \frac{\partial e}{\partial s_{p+i,q+j,n}^{(k+1)}} \cdot w_{-i,-j,n} \right] \qquad (33)
$$
$$
= \theta'(s_{pq}^{(k)}) \otimes \left[ \sum_{i \in K, j \in K} \sum_{n=1}^{N} \delta_{p+i,q+j,n}^{(k+1)} \cdot w_{-i,-j,n} \right] \qquad (34)
$$
$$
= \theta'(s_{pq}^{(k)}) \otimes \left[ \sum_{i \in K, j \in K} W_{-i,-j} \, \delta_{p+i,q+j}^{(k+1)} \right] \qquad (35)
$$
where $W_{-i,-j} = [w_{-i,-j,1}, w_{-i,-j,2}, \cdots, w_{-i,-j,N}]$. From Eq. (32) to Eq. (33), we substitute $\partial s_{p+i,q+j,n}^{(k+1)} / \partial x_{pq}^{(k)} = w_{-i,-j,n}$, which follows from Eq. (30). From Eq. (33) to Eq. (34), we exchange the order of the two summations. From Eq. (34) to Eq. (35), we re-write the linear combination of vectors as a multiplication between a matrix (consisting of the vectors) and a vector. The formulas are very similar to the ones in the fully connected case and imply another connection between the convolution and the fully connected layer, namely, the channels in two consecutive layers are fully connected to each other.

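As before, a naive numpy sketch (our own, with zero padding, channels-last arrays, and made-up shapes) of the multi-channel forward pass in Eq. (30) and the sensitivity computation in Eq. (35): each output pixel is a dot product across the $M$ input channels for every offset and output channel, and each sensitivity pixel sums the matrices $W_{-i,-j}$ applied to the neighboring sensitivities.

```python
import numpy as np

def conv_forward_mc(x, w, K=(-1, 0, 1)):
    """Eq. (30): s_{pqn} = sum_{i,j in K} w_{ijn}^T x_{p+i,q+j}.
    x: (H, W, M) input; w: (len(K), len(K), M, N) kernel indexed by the offsets in K."""
    H, Wd, M = x.shape
    N = w.shape[-1]
    off = max(abs(k) for k in K)
    xp = np.pad(x, ((off, off), (off, off), (0, 0)))      # zero-pad spatial dims only
    s = np.zeros((H, Wd, N))
    for p in range(H):
        for q in range(Wd):
            for a, i in enumerate(K):
                for b, j in enumerate(K):
                    # dot product across M input channels, for all N outputs at once
                    s[p, q, :] += xp[p + off + i, q + off + j, :] @ w[a, b, :, :]
    return s

def conv_backward_delta_mc(delta_next, s_k, w, K=(-1, 0, 1),
                           dtheta=lambda s: 1.0 - np.tanh(s) ** 2):
    """Eq. (35): delta^(k)_{pq} = theta'(s^(k)_{pq}) *
    sum_{i,j in K} W_{-i,-j} delta^(k+1)_{p+i,q+j}, with W_{-i,-j} of shape (M, N)."""
    H, Wd, M = s_k.shape
    off = max(abs(k) for k in K)
    dpad = np.pad(delta_next, ((off, off), (off, off), (0, 0)))
    delta_k = np.zeros((H, Wd, M))
    for p in range(H):
        for q in range(Wd):
            acc = np.zeros(M)
            for a, i in enumerate(K):
                for b, j in enumerate(K):
                    W_flip = w[len(K) - 1 - a, len(K) - 1 - b, :, :]   # W_{-i,-j}
                    acc += W_flip @ dpad[p + off + i, q + off + j, :]
            delta_k[p, q, :] = dtheta(s_k[p, q, :]) * acc
    return delta_k

rng = np.random.default_rng(6)
x = rng.standard_normal((5, 5, 2))          # M = 2 input channels
w = rng.standard_normal((3, 3, 2, 4))       # 3x3 kernel, N = 4 output channels
s_next = conv_forward_mc(x, w)
delta_k = conv_backward_delta_mc(rng.standard_normal((5, 5, 4)),
                                 rng.standard_normal((5, 5, 2)), w)
print(s_next.shape, delta_k.shape)          # (5, 5, 4) (5, 5, 2)
```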
Acknowledgements

We thank Zhengyang Wang for proof-reading. This work was supported in part by National Science Foundation grants IIS-1908220, IIS-1908198, IIS-1908166, DBI-1147134, DBI-1922969, DBI-1661289, CHE-1738305, National Institutes of Health grant 1R21NS102828, and Defense Advanced Research Projects Agency grant N66001-17-2-4031.

References

[1] Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from Data. AMLBook, New York, NY, USA, 2012.

[2] Charu C. Aggarwal. Neural Networks and Deep Learning. Springer, 2018.
