Learning Transferable Visual Models From Natural Language Supervision

Give a summary of the approach used in CLIP.

Given a batch of $N$ (image, text) pairs, CLIP learns a multi-model embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the $N$ correct pairs while minimizing the cosine similarity of the $N^2 - N$ incorrect pairs. (It is optimized using a symmetric cross entropy loss over these similarity scores)

How many (image, text) pairs were collected in the dataset used to train CLIP?

400 million