Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

How does Batch Normalization work?


Batch Normalization normalizes each scalar feature independently, giving it zero mean and unit variance. The normalized values are then scaled and shifted. During training, the Batch Normalization layer works as follows:


Input: values of $x$ over a mini-batch $\mathcal{B} = \{x_{1 \ldots m}\}$; output: $y_i = \operatorname{BatchNorm}_{\gamma, \beta}(x_i)$.

$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i \qquad \text{(mini-batch mean)}$$

$$\sigma^2_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2 \qquad \text{(mini-batch variance)}$$

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma^2_{\mathcal{B}} + \epsilon}} \qquad \text{(normalize)}$$

$$y_i = \gamma \hat{x}_i + \beta \qquad \text{(scale and shift)}$$


Here $\epsilon$ is a constant added for numerical stability, and $\gamma$ and $\beta$ are learned parameters. During inference, the normalization uses population statistics rather than mini-batch statistics:

$$\hat{x}_i = \frac{x_i - \operatorname{E}[x_i]}{\sqrt{\operatorname{Var}[x_i] + \epsilon}}$$
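As a minimal NumPy sketch (my own illustration, not code from the paper), the following implements both modes for fully connected activations. The `momentum`-based running-average update of the population statistics is an assumption borrowed from common framework conventions, not part of the original algorithm:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, eps=1e-5, momentum=0.1):
    """Batch Normalization for x of shape (m, num_features)."""
    if training:
        # Mini-batch statistics, one per feature.
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Track population estimates for later use at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        # Inference: normalize with the population statistics instead.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    y = gamma * x_hat + beta               # scale and shift
    return y, running_mean, running_var
```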

What is different for using Batch Normalization in Convolutions compared to fully connected layers?


For convolutional layers, we additionally want the normalization to obey the convolutional property, so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations: for a mini-batch of size $m$ and feature maps of size $p \times q$, the statistics are computed over an effective mini-batch of size $m' = m \cdot pq$. In other words, in convolutions we normalize (and scale/shift) per feature map (channel), rather than per individual activation.
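A sketch of the per-channel variant for activations of shape $(m, C, H, W)$ (again my own illustration; the function and parameter names are assumptions):

```python
import numpy as np

def batchnorm_conv_forward(x, gamma, beta, eps=1e-5):
    """Per-channel BN for conv activations x of shape (m, C, H, W).

    Statistics are computed jointly over the batch and all spatial
    locations (an effective mini-batch of m * H * W values per channel),
    so every element of a feature map is normalized the same way.
    gamma and beta hold one scalar per channel.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    g = gamma.reshape(1, -1, 1, 1)
    b = beta.reshape(1, -1, 1, 1)
    return g * x_hat + b
```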

Why does Batch Normalization have learned scale and shift parameters?


The learned scale and shift parameters $\gamma$ and $\beta$ in $y_k = \gamma_k \hat{x}_k + \beta_k$ preserve the full representation power of the neural network: simply normalizing each input of a layer could otherwise change what the layer can represent (for instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity).

By setting $\gamma_k = \sqrt{\operatorname{Var}[x_k]}$ and $\beta_k = \operatorname{E}[x_k]$, we could recover the original values, if that were the optimal thing to do.
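A quick numerical check of this identity-recovery property (my own example; the data is arbitrary). Including $\epsilon$ inside the square root of $\gamma$ makes the inverse exact:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(256, 4))
eps = 1e-5

mu, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

# Choosing gamma = sqrt(Var[x] + eps) and beta = E[x] undoes the
# normalization, so the layer can represent the identity transform.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

assert np.allclose(y, x)
```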
