Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
How does Batch Normalization work?
Batch Normalization normalizes each scalar feature independently, giving it zero mean and unit variance. The normalized values are then scaled and shifted by learned parameters. During training, the batchnorm layer works as follows:
Input: Values of $x$ over a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$; parameters to be learned: $\gamma, \beta$. Output: $\{y_i = \text{BN}_{\gamma,\beta}(x_i)\}$.

$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i$ (mini-batch mean)

$\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ (mini-batch variance)

$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$ (normalize)

$y_i = \gamma \hat{x}_i + \beta \equiv \text{BN}_{\gamma,\beta}(x_i)$ (scale and shift)
Here $\epsilon$ is a constant added to the mini-batch variance for numeric stability, and $\gamma$ and $\beta$ are learned parameters. During inference, the normalization uses population statistics (typically running averages of the mini-batch means and variances collected during training) rather than the statistics of a mini-batch.
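The training and inference behavior above can be sketched in NumPy. This is a minimal illustration, not the paper's reference implementation; the function names and the choice to pass population statistics explicitly (rather than maintaining running averages inside the layer) are my own.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    """Training mode: normalize with statistics of the current mini-batch.

    x: (batch, features); gamma, beta: (features,) learned parameters.
    Returns the output plus the batch statistics (which a real layer
    would fold into running averages for use at inference time).
    """
    mu = x.mean(axis=0)                      # mini-batch mean per feature
    var = x.var(axis=0)                      # mini-batch variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to ~zero mean, unit variance
    return gamma * x_hat + beta, mu, var     # scale and shift

def batchnorm_infer(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference mode: normalize with fixed population statistics,
    so the output for a given input does not depend on the mini-batch."""
    x_hat = (x - pop_mean) / np.sqrt(pop_var + eps)
    return gamma * x_hat + beta
```

With `gamma = 1` and `beta = 0`, the training-mode output of each feature has (approximately) zero mean and unit variance over the mini-batch.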
How does Batch Normalization differ for convolutional layers compared to fully connected layers?
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a minibatch, over all locations. In other words, in convolutions we normalize (and scale/shift) per feature map (channel), rather than per individual value.
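Per-channel normalization over a batch of feature maps can be sketched as follows. This is a minimal NumPy illustration assuming an `(N, C, H, W)` layout; the function name is my own.

```python
import numpy as np

def batchnorm_conv(x, gamma, beta, eps=1e-5):
    """Batch normalization for convolutional activations.

    x: (N, C, H, W). One mean and variance per channel, computed jointly
    over the batch and all spatial locations, so every element of the
    same feature map is normalized in the same way.
    gamma, beta: (C,), one learned scale and shift per channel.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # per-channel mean over N, H, W
    var = x.var(axis=(0, 2, 3), keepdims=True)   # per-channel variance over N, H, W
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Broadcast the per-channel parameters over batch and spatial dims.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```

Note the contrast with the fully connected case: there, statistics are computed per feature over the batch axis only; here they are pooled over `N`, `H`, and `W`, and $\gamma, \beta$ are one pair of scalars per channel.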
Why does Batch Normalization have learned scale and shift parameters?
The learned scale and shift parameters $\gamma$ and $\beta$ in $y_i = \gamma \hat{x}_i + \beta$ preserve the full representational power of the network: normalization alone could change what a layer can represent (for example, normalizing the inputs of a sigmoid would constrain them to its roughly linear regime).
By setting $\gamma = \sqrt{\sigma_\mathcal{B}^2 + \epsilon}$ and $\beta = \mu_\mathcal{B}$, the layer could recover the original activations, if that were the optimal thing to do.
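This identity-recovery property can be checked numerically. A small sketch (variable names are my own):

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(1)
x = rng.normal(5.0, 3.0, size=(128, 8))   # activations with nonzero mean, nonunit variance

mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)     # normalized values

# Choose gamma = sqrt(var + eps) and beta = mu: the scale-and-shift
# step exactly undoes the normalization, giving the identity transform.
y = np.sqrt(var + eps) * x_hat + mu
```

Here `y` matches `x` up to floating-point error, confirming that the scale and shift parameters let the layer fall back to a no-op when normalization is not helpful.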