
Training Neural Networks: Batch Normalization
M. Soleymani, Sharif University of Technology, Spring 2020
Most slides have been adapted from Bhiksha Raj, 11-785, CMU 2019, and some from Fei-Fei Li et al., cs231n, Stanford 2019.

Before going on to generalization
• We now look at an important technique that speeds up training while also acting as a form of regularization.

Covariate shift
• Shifting input distributions (covariate shift)
• Change in the distributions of internal nodes of a deep network (internal covariate shift)
  – Batch norm reduces internal covariate shift

The problem of covariate shifts
• Training assumes the training data are all similarly distributed
  – Do mini-batches have similar distributions?
• In practice, each mini-batch may have a different distribution
  – A "covariate shift"
• Covariate shifts can be large!
  – All of these covariate shifts can affect training badly

Solution: Move all subgroups to a "standard" location
• "Move" all batches to have a mean of 0 and unit standard deviation
  – Eliminates covariate shift between batches
  – Then move the entire collection to the appropriate location and give it an appropriate variance

Why is this a problem in neural networks?
(figure: a deep network x -> a^{[1]} -> a^{[2]} -> a^{[3]} -> a^{[4]})
• From the viewpoint of a later layer, the distribution of its input changes with every training step, because the weights of the earlier layers keep being updated
• Normalization limits the amount to which updating weights in earlier layers can affect the distribution of values that subsequent layers see; the values later layers stand on become more stable
[Andrew Ng, Deep Learning Specialization, 2017]

Normalizing inputs
• On the training set, compute the mean and variance of each input feature:
  – \mu_i = \frac{1}{N} \sum_{n=1}^{N} x_i^{(n)}
  – \sigma_i^2 = \frac{1}{N} \sum_{n=1}^{N} \left( x_i^{(n)} - \mu_i \right)^2
• Remove the mean of the corresponding feature from each input:
  – x_i \leftarrow x_i - \mu_i
• Normalize the variance:
  – x_i \leftarrow \frac{x_i}{\sigma_i}
• Network training converges faster if the inputs are normalized to have zero mean and unit variance (a NumPy sketch of this standardization follows below)
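Below is a minimal NumPy sketch of the input standardization described above. The function name standardize_inputs, the eps guard against zero-variance features, and the synthetic data are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def standardize_inputs(X_train, eps=1e-8):
    """Zero-mean, unit-variance normalization of each input feature.

    X_train: (N, d) array of N training examples with d features.
    Returns the standardized data plus the per-feature statistics,
    so the same shift and scale can be reused on new data.
    """
    mu = X_train.mean(axis=0)                 # mu_i = (1/N) sum_n x_i^(n)
    sigma = X_train.std(axis=0)               # sigma_i = sqrt((1/N) sum_n (x_i^(n) - mu_i)^2)
    X_norm = (X_train - mu) / (sigma + eps)   # x_i <- (x_i - mu_i) / sigma_i
    return X_norm, mu, sigma

# Usage on synthetic data: every feature ends up with roughly zero mean and unit variance.
X_train = 5.0 * np.random.randn(1000, 20) + 3.0
X_norm, mu, sigma = standardize_inputs(X_train)
print(X_norm.mean(axis=0).round(3), X_norm.std(axis=0).round(3))
```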
Batch normalization
• "You want unit Gaussian activations? Just make them so."
• A layer computes z^{[l]} = W^{[l]} a^{[l-1]} and a^{[l]} = g^{[l]}(z^{[l]})
• Can we normalize a^{[l]} too? Normalizing z^{[l]} is much more common
• Makes neural networks more robust to the range of hyperparameters
  – Makes it much easier to train a very deep network

Batch normalization
• Batch normalization is a covariate adjustment unit that sits after the weighted addition of inputs but before the application of the activation
  – It is applied independently to each unit, to simplify computation
• Training: the adjustment is computed over individual mini-batches

Batch normalization [Ioffe and Szegedy, 2015]
• For each unit of layer l, over a mini-batch of size B:
  – \mu^{[l]} = \frac{1}{B} \sum_{i} z^{(i)[l]}
  – \sigma^{2[l]} = \frac{1}{B} \sum_{i} \left( z^{(i)[l]} - \mu^{[l]} \right)^2
  – z_{norm}^{(i)[l]} = \frac{z^{(i)[l]} - \mu^{[l]}}{\sqrt{\sigma^{2[l]} + \epsilon}}
• Problem: do we necessarily want a unit Gaussian input to an activation layer?
  – \hat{z}^{(i)[l]} = \gamma^{[l]} z_{norm}^{(i)[l]} + \beta^{[l]}
  – \gamma^{[l]} and \beta^{[l]} are learnable parameters of the model

Batch normalization [Ioffe and Szegedy, 2015]
• z_{norm}^{(i)[l]} keeps the unit's input in the (roughly linear) central region of tanh
• The scale and shift allow the network to squash or move the range if it wants to:
  – \hat{z}^{(i)[l]} = \gamma^{[l]} z_{norm}^{(i)[l]} + \beta^{[l]}
• The network can learn to recover the identity:
  – \gamma^{[l]} = \sqrt{\sigma^{2[l]} + \epsilon} and \beta^{[l]} = \mu^{[l]} \;\Rightarrow\; \hat{z}^{(i)[l]} = z^{(i)[l]}

Batch normalization
• For a unit k with pre-activation z_k = \sum_i w_{ik} y_i + b_k, computed over a mini-batch of size B:
  – Mini-batch mean: \mu_k = \frac{1}{B} \sum_{b=1}^{B} z_k^{(b)}
  – Mini-batch variance: \sigma_k^2 = \frac{1}{B} \sum_{b=1}^{B} \left( z_k^{(b)} - \mu_k \right)^2
  – Covariate shift to the standard position: u_k^{(b)} = \frac{z_k^{(b)} - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}}
  – Shift to the right position: \hat{z}_k^{(b)} = \gamma_k u_k^{(b)} + \beta_k
• BN aggregates the statistics over a mini-batch and normalizes the batch by them
• Normalized instances are then "shifted" to a unit-specific location: \gamma_k and \beta_k are neuron-specific learnable terms, while \mu_k and \sigma_k^2 are recomputed for each batch
• (A NumPy sketch of this forward pass follows below)
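Here is a minimal NumPy sketch of the training-time forward pass defined above. The function name batchnorm_forward, the array layout (mini-batch along the first axis), and the default eps value are illustrative assumptions, not specifics from the slides.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize each unit's pre-activations over the mini-batch, then scale and shift.

    Z:     (B, d) pre-activations z^(b) for a mini-batch of size B and d units.
    gamma: (d,) learnable per-unit scale.
    beta:  (d,) learnable per-unit shift.
    Returns z_hat and a cache of intermediates needed for backprop.
    """
    mu = Z.mean(axis=0)            # mu_B = (1/B) sum_b z^(b), per unit
    var = Z.var(axis=0)            # sigma_B^2 = (1/B) sum_b (z^(b) - mu_B)^2
    std = np.sqrt(var + eps)
    U = (Z - mu) / std             # u^(b): zero mean, unit variance over the batch
    Z_hat = gamma * U + beta       # z_hat^(b) = gamma * u^(b) + beta
    cache = (Z, U, mu, var, std, gamma, eps)
    return Z_hat, cache

# Setting gamma = sqrt(sigma_B^2 + eps) and beta = mu_B recovers the identity, as on the slide:
Z = 2.0 * np.random.randn(32, 4) + 1.0
Z_hat, _ = batchnorm_forward(Z, np.sqrt(Z.var(axis=0) + 1e-5), Z.mean(axis=0))
print(np.allclose(Z_hat, Z))  # True
```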
Batchnorm is a vector function over the mini-batch
• Batch normalization is really a vector function applied over all the inputs from a mini-batch: it maps (z^{(1)}, \ldots, z^{(B)}) to (\hat{z}^{(1)}, \ldots, \hat{z}^{(B)})
• To compute the derivative of the loss with respect to any z^{(b)}, we must consider all the \hat{z}'s in the batch

Batchnorm
(figure: the complete dependency graph of batch norm — every z^{(b)} feeds \mu_B and \sigma_B^2, which in turn feed every u^{(b)} and \hat{z}^{(b)})
• You can use vector-function differentiation rules to compute the derivatives

Batch normalization: Backpropagation
• For a single unit (dropping the subscript k), the BN unit computes
  – u^{(b)} = \frac{z^{(b)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad \hat{z}^{(b)} = \gamma u^{(b)} + \beta
  and the activation output is y^{(b)} = f(\hat{z}^{(b)})
• Backpropagate through the activation first:
  – \frac{\partial Loss}{\partial \hat{z}^{(b)}} = f'(\hat{z}^{(b)}) \, \frac{\partial Loss}{\partial y^{(b)}}

Batch normalization: Backpropagation
• \gamma and \beta are parameters to be learned:
  – \frac{\partial Loss}{\partial \beta} = \sum_{b=1}^{B} \frac{\partial Loss}{\partial \hat{z}^{(b)}}
  – \frac{\partial Loss}{\partial \gamma} = \sum_{b=1}^{B} u^{(b)} \, \frac{\partial Loss}{\partial \hat{z}^{(b)}}
• Backpropagate into the normalized value:
  – \frac{\partial Loss}{\partial u^{(b)}} = \gamma \, \frac{\partial Loss}{\partial \hat{z}^{(b)}}
• Final step of backprop: compute \frac{\partial Loss}{\partial z^{(b)}}

Batch normalization: Backpropagation
• Because \mu_B and \sigma_B^2 depend on every z^{(b)} in the mini-batch, the chain rule has three terms:
  \frac{\partial Loss}{\partial z^{(b)}} = \frac{\partial Loss}{\partial u^{(b)}} \cdot \frac{\partial u^{(b)}}{\partial z^{(b)}} + \frac{\partial Loss}{\partial \sigma_B^2} \cdot \frac{\partial \sigma_B^2}{\partial z^{(b)}} + \frac{\partial Loss}{\partial \mu_B} \cdot \frac{\partial \mu_B}{\partial z^{(b)}}
• The local derivatives follow from the forward-pass definitions:
  – u^{(b)} = \frac{z^{(b)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \;\Rightarrow\; \frac{\partial u^{(b)}}{\partial z^{(b)}} = \frac{1}{\sqrt{\sigma_B^2 + \epsilon}}
  – \sigma_B^2 = \frac{1}{B} \sum_{b=1}^{B} (z^{(b)} - \mu_B)^2 \;\Rightarrow\; \frac{\partial \sigma_B^2}{\partial z^{(b)}} = \frac{2 (z^{(b)} - \mu_B)}{B}
  – \mu_B = \frac{1}{B} \sum_{b=1}^{B} z^{(b)} \;\Rightarrow\; \frac{\partial \mu_B}{\partial z^{(b)}} = \frac{1}{B}

Batch normalization: Backpropagation
• Gradient with respect to the mini-batch variance:
  \frac{\partial Loss}{\partial \sigma_B^2} = -\frac{1}{2} (\sigma_B^2 + \epsilon)^{-3/2} \sum_{b=1}^{B} (z^{(b)} - \mu_B) \, \frac{\partial Loss}{\partial u^{(b)}}
• Gradient with respect to the mini-batch mean:
  \frac{\partial Loss}{\partial \mu_B} = \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} \sum_{b=1}^{B} \frac{\partial Loss}{\partial u^{(b)}} + \frac{\partial Loss}{\partial \sigma_B^2} \cdot \frac{\sum_{b=1}^{B} -2 (z^{(b)} - \mu_B)}{B}
  – The second term goes to 0, because \sum_{b=1}^{B} (z^{(b)} - \mu_B) = 0

Batch normalization: Backpropagation
• Putting the pieces together:
  \frac{\partial Loss}{\partial z^{(b)}} = \frac{\partial Loss}{\partial u^{(b)}} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial Loss}{\partial \sigma_B^2} \cdot \frac{2 (z^{(b)} - \mu_B)}{B} + \frac{\partial Loss}{\partial \mu_B} \cdot \frac{1}{B}
• with the forward pass
  – \mu_B = \frac{1}{B} \sum_{b=1}^{B} z^{(b)}, \quad \sigma_B^2 = \frac{1}{B} \sum_{b=1}^{B} (z^{(b)} - \mu_B)^2
  – u^{(b)} = \frac{z^{(b)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad \hat{z}^{(b)} = \gamma u^{(b)} + \beta
• (A NumPy sketch of this backward pass follows below)
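To close the loop, here is a minimal NumPy sketch that implements the backward-pass equations above. It consumes the cache produced by the batchnorm_forward sketch shown earlier; the function name batchnorm_backward and the cache layout are illustrative assumptions.

```python
import numpy as np

def batchnorm_backward(dZ_hat, cache):
    """Backward pass for one batch-norm layer (vectorized over d units).

    dZ_hat: (B, d) upstream gradients dLoss/dz_hat^(b).
    cache:  intermediates saved by batchnorm_forward.
    Returns dLoss/dz^(b), dLoss/dgamma, dLoss/dbeta.
    """
    Z, U, mu, var, std, gamma, eps = cache
    B = Z.shape[0]

    # Parameter gradients: dLoss/dgamma = sum_b u^(b) dLoss/dz_hat^(b),
    #                      dLoss/dbeta  = sum_b dLoss/dz_hat^(b)
    dgamma = np.sum(dZ_hat * U, axis=0)
    dbeta = np.sum(dZ_hat, axis=0)

    # dLoss/du^(b) = gamma * dLoss/dz_hat^(b)
    dU = dZ_hat * gamma

    # dLoss/dsigma_B^2 = -(1/2)(sigma_B^2 + eps)^(-3/2) sum_b (z^(b) - mu_B) dLoss/du^(b)
    dvar = -0.5 * (var + eps) ** (-1.5) * np.sum(dU * (Z - mu), axis=0)

    # dLoss/dmu_B: only the first term survives, since sum_b (z^(b) - mu_B) = 0
    dmu = -np.sum(dU, axis=0) / std

    # dLoss/dz^(b): the three-term chain rule assembled on the last slide
    dZ = dU / std + dvar * 2.0 * (Z - mu) / B + dmu / B
    return dZ, dgamma, dbeta
```

A quick way to sanity-check such a sketch is to compare dZ against a finite-difference estimate of the loss gradient on a small random batch.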