A Simple Framework for Contrastive Learning of Visual Representations

Give a schematic of the contrastive learning framework used in SimCLR.


SimCLR.png

Framework for contrastive learning of visual representations. Two separate data augmentation operators are sampled from the same family of augmentations ($t \sim \mathcal{T}$ and $t' \sim \mathcal{T}$) and applied to each data example to obtain two correlated views. A base encoder network $f(\cdot)$ and a projection head $g(\cdot)$ are trained to maximize agreement using a contrastive loss. After training is completed, the projection head $g(\cdot)$ is thrown away, and the encoder $f(\cdot)$ and representation $\mathbf{h}$ are used for downstream tasks.

Which similarity metric is used in SimCLR?


Cosine similarity. It can be written as a dot product scaled by the magnitudes of the two vectors: $s(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^\top \mathbf{v}}{\|\mathbf{u}\| \|\mathbf{v}\|}$
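
As a minimal sketch of the formula above, the following uses only the Python standard library. The function name `cosine_similarity` and the choice to raise an error on zero vectors are illustrative assumptions, not something taken from the paper.

```python
import math

def cosine_similarity(u, v):
    """Return u.v / (||u|| ||v||) for two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat the undefined zero-vector case as an error.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Parallel vectors score 1, orthogonal vectors 0, opposite vectors -1.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])
```

Because the result depends only on direction, rescaling either vector leaves the similarity unchanged.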

Which loss function is used in SimCLR?


The loss function for a positive pair of examples $(i, j)$ is defined as: $\mathcal{L}_\text{SimCLR}^{(i,j)} = - \log\frac{\exp(s(\mathbf{z}_i, \mathbf{z}_j) / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(s(\mathbf{z}_i, \mathbf{z}_k) / \tau)}$ where $s(\cdot, \cdot)$ is the similarity metric (usually cosine similarity), $\tau$ is the temperature constant, and the indicator $\mathbb{1}_{[k \neq i]}$ is 1 if $k \neq i$ and 0 otherwise. The final loss is computed across all positive pairs, both $(i, j)$ and $(j, i)$.

This loss can be called the normalized temperature-scaled cross entropy loss (NT-Xent). It has been used in prior work.
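
The per-pair loss can be sketched in plain Python as below. This is an illustration under assumptions rather than a reference implementation: the names `nt_xent_pair` and `nt_xent` are hypothetical, indices are 0-based, and the two views of example $k$ are assumed to sit at positions $2k$ and $2k+1$ of the list `z`. The identity $-\log(a / b) = \log b - \log a$ is used so the denominator can be computed as a numerically stable log-sum-exp.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent_pair(z, i, j, tau):
    """Loss for the positive pair (i, j); z is a list of 2N embedding vectors."""
    # Scaled similarities of z[i] to every other embedding (k != i).
    logits = [cosine_similarity(z[i], z[k]) / tau for k in range(len(z)) if k != i]
    # log of the denominator, with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - cosine_similarity(z[i], z[j]) / tau

def nt_xent(z, tau):
    """Average over all positive pairs, counting both (i, j) and (j, i)."""
    total = 0.0
    for k in range(0, len(z), 2):  # assumed layout: views of one example are adjacent
        total += nt_xent_pair(z, k, k + 1, tau) + nt_xent_pair(z, k + 1, k, tau)
    return total / len(z)

# Toy embeddings: positives aligned vs. positives orthogonal.
z_aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
z_misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent(z_aligned, tau=0.5)       # tau=0.5 is an arbitrary choice
loss_misaligned = nt_xent(z_misaligned, tau=0.5)
```

The loss is lower when each embedding is closer to its positive than to the other embeddings in the batch, which is the behavior the contrastive objective rewards.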

Give the training algorithm for SimCLR.


input: batch size $N$, temperature constant $\tau$, encoder $f$, projection head $g$, augmentation family $\mathcal{T}$.
- for sampled minibatch $\{\mathbf{x}_k\}_{k=1}^N$ do:
  - for all $k \in \{1, \dots, N\}$ do:
    - sample two augmentation functions $t \sim \mathcal{T}$, $t' \sim \mathcal{T}$
    - $\tilde{\mathbf{x}}_{2k-1} = t(\mathbf{x}_k)$
    - $\tilde{\mathbf{x}}_{2k} = t'(\mathbf{x}_k)$
    - $\mathbf{h}_{2k-1} = f(\tilde{\mathbf{x}}_{2k-1})$
    - $\mathbf{h}_{2k} = f(\tilde{\mathbf{x}}_{2k})$
    - $\mathbf{z}_{2k-1} = g(\mathbf{h}_{2k-1})$
    - $\mathbf{z}_{2k} = g(\mathbf{h}_{2k})$
  - for all $i \in \{1, \dots, 2N\}$ and $j \in \{1, \dots, 2N\}$ do:
    - $s_{i,j} = \frac{\mathbf{z}_i^\top \mathbf{z}_j}{\|\mathbf{z}_i\| \|\mathbf{z}_j\|}$
  - define $\mathcal{L}^{(i,j)} = - \log\frac{\exp(s_{i,j} / \tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(s_{i,k} / \tau)}$
  - $\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \left[\mathcal{L}^{(2k-1,2k)} + \mathcal{L}^{(2k,2k-1)}\right]$
  - update networks $f$ and $g$ to minimize $\mathcal{L}$
- return encoder $f$ and throw away $g$
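
The forward pass of one iteration might look like the sketch below, written in plain Python. Everything concrete here is an assumption made for illustration: `augment` is a hypothetical stand-in that adds Gaussian noise in place of real image augmentations, `f` and `g` are toy random linear maps instead of trained networks, `tau=0.5` is arbitrary, and indices are 0-based (views of example $k$ at positions $2k$ and $2k+1$). The parameter update is omitted because it needs an automatic differentiation framework.

```python
import math
import random

def augment(x, rng):
    """Hypothetical stand-in for t ~ T: add small Gaussian noise to each feature."""
    return [xi + rng.gauss(0.0, 0.1) for xi in x]

def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pair_loss(z, i, j, tau):
    """-log( exp(s_ij / tau) / sum_{k != i} exp(s_ik / tau) ), via log-sum-exp."""
    logits = [cosine(z[i], z[k]) / tau for k in range(len(z)) if k != i]
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - cosine(z[i], z[j]) / tau

def simclr_batch_loss(batch, f, g, tau, rng):
    """Build 2N projected views from N examples and average the pair losses."""
    z = []
    for x in batch:
        for _ in range(2):           # two independently sampled augmentations
            h = f(augment(x, rng))   # representation kept for downstream tasks
            z.append(g(h))           # projection used only inside the loss
    n = len(batch)
    total = sum(
        pair_loss(z, 2 * k, 2 * k + 1, tau) + pair_loss(z, 2 * k + 1, 2 * k, tau)
        for k in range(n)
    )
    return total / (2 * n)

rng = random.Random(0)
W_f = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # toy encoder, 4 -> 3
W_g = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # toy head, 3 -> 2
f = lambda x: linear(W_f, x)
g = lambda h: linear(W_g, h)
batch = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]  # N = 8 toy examples
loss = simclr_batch_loss(batch, f, g, tau=0.5, rng=rng)
```

In a real training loop, `loss` would be differentiated with respect to the parameters of both `f` and `g`, and only `f` would be kept afterwards.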

In contrastive frameworks such as SimCLR, why is the similarity optimized on a separate projection head $g$?


This is likely because the contrastive objective pushes the representation it is applied to toward invariance to many data transformations, so information such as color is removed from that representation even though it may be useful for downstream tasks. With an additional projection head, $g$ can remove this information in order to maximize the contrastive similarity, while the encoder output $\mathbf{h}$ can retain it. However, all of this is found empirically.

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian.