Learning Transferable Visual Models From Natural Language Supervision

Give a summary of the approach used in CLIP.


Given a batch of N (image, text) pairs, CLIP learns a multi-modal embedding space by jointly training an image encoder and a text encoder to maximize the cosine similarity of the image and text embeddings of the N correct pairs while minimizing the cosine similarity of the N^2 - N incorrect pairings. It is optimized with a symmetric cross-entropy loss over these similarity scores.
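The contrastive objective above can be sketched in NumPy. This is a minimal illustration, not the paper's actual implementation (which trains the encoders end to end); the function name and the temperature value are assumptions for the sketch.

```python
import numpy as np

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    image_emb, text_emb: [N, D] arrays, row i of each is a matched pair.
    temperature: assumed scaling constant (learned in the real model).
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity of image i with text j; diagonal = correct pairs
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # the correct text for image i is text i

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    loss_images = cross_entropy(logits)    # each image classifies the N texts
    loss_texts = cross_entropy(logits.T)   # each text classifies the N images
    return (loss_images + loss_texts) / 2
```

Pushing the diagonal (correct pairs) up and the off-diagonal (incorrect pairs) down is what the cross entropy does: maximizing the softmax probability of the diagonal entry in each row necessarily suppresses the other N - 1 similarities in that row.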

How many (image, text) pairs were collected in the dataset used to train CLIP?


400 million, collected from publicly available sources on the internet (the paper calls this dataset WebImageText, or WIT).

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian.