Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
How does Batch Normalization work?
Batch Normalization normalizes each scalar feature independently, giving it zero mean and unit variance. The normalized values are then scaled and shifted by learned parameters. During training, the batchnorm layer works as follows:
Input: Values of $x$ over a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$; parameters to be learned: $\gamma, \beta$. Output: $\{y_i = \text{BN}_{\gamma,\beta}(x_i)\}$.

$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i$ (mini-batch mean)

$\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ (mini-batch variance)

$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$ (normalize)

$y_i = \gamma \hat{x}_i + \beta \equiv \text{BN}_{\gamma,\beta}(x_i)$ (scale and shift)
Here $\epsilon$ is a constant added to the mini-batch variance for numeric stability, and $\gamma$ and $\beta$ are learned parameters. During inference, the normalization uses population statistics (typically running averages of the mini-batch means and variances collected during training) rather than the statistics of a mini-batch.
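The training and inference behavior above can be sketched in NumPy. This is a minimal illustration, not the paper's reference implementation; the function names and the choice to pass population statistics explicitly (rather than maintaining running averages inside the layer) are my own.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training mode: normalize with statistics of the current mini-batch.

    x: (batch, features); gamma, beta: (features,) learned parameters.
    Returns the output plus the batch statistics (which a real layer
    would fold into running averages for use at inference time).
    """
    mu = x.mean(axis=0)                      # mini-batch mean per feature
    var = x.var(axis=0)                      # mini-batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta, mu, var     # scale and shift

def batchnorm_infer(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference mode: normalize with fixed population statistics,
    so the output for a given input does not depend on the mini-batch."""
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta
```

With `gamma = 1` and `beta = 0`, the training-mode output of each feature has (approximately) zero mean and unit variance over the mini-batch.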
How does Batch Normalization differ for convolutional layers compared to fully connected layers?
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. In other words, in convolutions we normalize (and scale/shift) per feature map (channel), rather than per individual value.
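Per-channel normalization over a batch of feature maps can be sketched as follows. This is a minimal NumPy illustration assuming an `(N, C, H, W)` layout; the function name is my own.

```python
import numpy as np

def batchnorm_conv(x, gamma, beta, eps=1e-5):
    """Batch normalization for convolutional activations.

    x: (N, C, H, W). One mean and variance per channel, computed jointly
    over the batch and all spatial locations, so every element of the
    same feature map is normalized in the same way.
    gamma, beta: (C,), one learned scale and shift per channel.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel mean over N, H, W
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-channel variance over N, H, W
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Broadcast the per-channel parameters over batch and spatial dims.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```

Note the contrast with the fully connected case: there, statistics are computed per feature over the batch axis only; here they are pooled over `N`, `H`, and `W`, and $\gamma, \beta$ are one pair of scalars per channel.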
Why does Batch Normalization have learned scale and shift parameters?
The learned scale and shift parameters $\gamma$ and $\beta$ in $y_i = \gamma \hat{x}_i + \beta$ preserve the full representational power of the network: normalization alone could change what a layer can represent (for example, normalizing the inputs of a sigmoid would constrain them to its roughly linear regime).
By setting $\gamma = \sqrt{\sigma_\mathcal{B}^2 + \epsilon}$ and $\beta = \mu_\mathcal{B}$, the layer could recover the original activations, if that were the optimal thing to do.
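This identity-recovery property can be checked numerically. A small sketch (variable names are my own):

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(1)
x = rng.normal(5.0, 3.0, size=(128, 8))   # activations with nonzero mean, nonunit variance

mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)     # normalized values

# Choose gamma = sqrt(var + eps) and beta = mu: the scale-and-shift
# step exactly undoes the normalization, giving the identity transform.
y = np.sqrt(var + eps) * x_hat + mu
```

Here `y` matches `x` up to floating-point error, confirming that the scale and shift parameters let the layer fall back to a no-op when normalization is not helpful.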