DETR: End-to-End Object Detection with Transformers

Draw the architecture of DETR.


(architecture figure missing: a CNN backbone extracts image features, positional encodings are added, a transformer encoder-decoder processes them together with $N$ object queries, and shared feed-forward heads predict a class and a box for each query)

While DETR matches the AP of previous CNN-based object detectors, in which aspects is it worse than those models?


DETR performs worse on small objects, and it requires a much longer training schedule.

Why is DETR called an End-to-End detector?


DETR predicts all objects at once, without an intermediate step such as non-maximum suppression.

This is done using a set loss function which performs bipartite matching between predicted and ground-truth objects.

How does DETR match predictions with ground-truth?


DETR uses bipartite matching between predicted and ground truth objects.

Let us denote by $y$ the ground truth set of objects, and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions.

Assuming $N$ is larger than the number of objects in the image,

we consider $y$ also as a set of size $N$ padded with $\emptyset$ (no object).

To find a bipartite matching between these two sets we search for a permutation of $N$ elements $\sigma \in \Sigma_N$ with the lowest cost:

$$\hat{\sigma} = \text{argmin}_{\sigma \in \Sigma_N} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)}),$$

where $\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ is a pair-wise matching cost between ground truth $y_i$ and a prediction with index $\sigma(i)$. This optimal assignment is computed efficiently with the Hungarian algorithm.

The matching cost takes into account both the class prediction and the similarity of predicted and ground truth boxes. Each element $i$ of the ground truth set can be seen as $y_i = (c_i, b_i)$, where $c_i$ is the target class label (which may be $\emptyset$) and $b_i \in [0, 1]^4$ is a vector that defines the ground truth box center coordinates and its height and width relative to the image size. For the prediction with index $\sigma(i)$ we define the probability of class $c_i$ as $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box as $\hat{b}_{\sigma(i)}$. With these notations we define

$\mathcal{L}_{\text{match}}(y_i, \hat{y}_{\sigma(i)})$ as $-\mathbb{1}_{\{c_i \neq \emptyset\}}\,\hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}}\,\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\sigma(i)})$.
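The matching step above can be sketched with SciPy's Hungarian solver. This is a minimal illustration, not DETR's actual implementation: the box cost here uses only the L1 term (the real matching cost also includes a generalized-IoU term), and all shapes and the toy numbers are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(pred_probs, pred_boxes, gt_classes, gt_boxes):
    """pred_probs: (N, C+1) class probabilities (last column = "no object").
    pred_boxes: (N, 4); gt_classes: (M,); gt_boxes: (M, 4), with M <= N."""
    # Cost for each (ground truth i, prediction j) pair with c_i != emptyset:
    #   -p_j(c_i) + ||b_i - b_j||_1
    class_cost = -pred_probs[:, gt_classes].T                        # (M, N)
    box_cost = np.abs(gt_boxes[:, None] - pred_boxes[None]).sum(-1)  # (M, N)
    cost = class_cost + box_cost
    rows, cols = linear_sum_assignment(cost)  # optimal bipartite assignment
    return cols  # cols[i] = index of the prediction matched to ground truth i

# Toy example: 2 ground-truth objects, 3 predictions, 2 classes + "no object".
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
boxes = np.array([[0.5, 0.5, 0.2, 0.2],
                  [0.3, 0.3, 0.1, 0.1],
                  [0.7, 0.7, 0.4, 0.4]])
assigned = match(probs, boxes, np.array([1, 0]),
                 np.array([[0.3, 0.3, 0.1, 0.1],
                           [0.5, 0.5, 0.2, 0.2]]))
# assigned[i] gives the prediction index for ground truth i
```

Here the second prediction (confident in class 1, box identical to the first ground truth) is matched to ground truth 0, and the first prediction to ground truth 1.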

Which loss function is used in DETR?


The Hungarian loss.

It is a linear combination of a negative log-likelihood for class prediction and a box loss:

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \emptyset\}}\,\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

where $\mathcal{L}_{\text{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) = \lambda_{\text{iou}}\,\mathcal{L}_{\text{iou}}(b_i, \hat{b}_{\hat{\sigma}(i)}) + \lambda_{\text{L1}}\,\|b_i - \hat{b}_{\hat{\sigma}(i)}\|_1$, with $\mathcal{L}_{\text{iou}}$ the generalized IoU loss and $\hat{\sigma}$ the optimal assignment computed with the Hungarian algorithm.
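A sketch of this box loss for a single matched pair, with boxes in normalized $(c_x, c_y, w, h)$ coordinates. The $\lambda$ defaults below are the paper's values ($\lambda_{\text{iou}} = 2$, $\lambda_{\text{L1}} = 5$); the helper names and single-pair formulation are illustrative assumptions.

```python
import numpy as np

def to_xyxy(b):
    """Convert (cx, cy, w, h) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = b
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

def giou_loss(b1, b2):
    """Generalized IoU loss: 1 - (IoU - |hull \\ union| / |hull|)."""
    a, b = to_xyxy(b1), to_xyxy(b2)
    # Intersection and union areas
    lt, rb = np.maximum(a[:2], b[:2]), np.minimum(a[2:], b[2:])
    inter = np.prod(np.clip(rb - lt, 0, None))
    union = np.prod(a[2:] - a[:2]) + np.prod(b[2:] - b[:2]) - inter
    iou = inter / union
    # Smallest enclosing (hull) box
    lt_c, rb_c = np.minimum(a[:2], b[:2]), np.maximum(a[2:], b[2:])
    hull = np.prod(rb_c - lt_c)
    return 1.0 - (iou - (hull - union) / hull)

def box_loss(b_gt, b_pred, lam_iou=2.0, lam_l1=5.0):
    return lam_iou * giou_loss(b_gt, b_pred) + lam_l1 * np.abs(b_gt - b_pred).sum()

# Identical boxes give zero loss; disjoint boxes are penalized.
b = np.array([0.5, 0.5, 0.2, 0.2])
```

Combining the two terms is deliberate: the L1 loss alone is scale-sensitive (large boxes dominate), while the generalized IoU term is scale-invariant.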

How does **DETR** produce $N$ predictions?


The predictions come from the transformer decoder.

The decoder follows the standard architecture of the transformer, transforming $N$ embeddings of size $d$ using multi-headed self- and encoder-decoder attention mechanisms. The difference with the original transformer is that DETR decodes the $N$ objects in *parallel* at each decoder layer.

Since the decoder is also permutation-invariant, the $N$ input embeddings must be different to produce different results. These input embeddings are learnt positional encodings referred to as *object queries*, and similarly to the encoder, they are added to the input of each attention layer.

The $N$ object queries are transformed into output embeddings by the decoder. They are then *independently* decoded into box coordinates and class labels by a feed-forward network, resulting in $N$ final predictions. Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pair-wise relations between them, while being able to use the whole image as context.
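A shape-level NumPy sketch of this flow, under heavy simplifying assumptions: a single unweighted cross-attention step stands in for the stacked multi-head decoder layers, the heads are untrained random matrices, and all dimension values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, num_classes, hw = 100, 256, 91, 850  # queries, model dim, classes, encoder tokens

object_queries = rng.normal(size=(N, d))   # learned positional encodings
encoder_memory = rng.normal(size=(hw, d))  # flattened CNN + encoder features

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# One simplified cross-attention step: every query attends over the whole image.
attn = softmax(object_queries @ encoder_memory.T / np.sqrt(d))  # (N, hw)
output_embeddings = attn @ encoder_memory                        # (N, d)

# Each output embedding is decoded *independently* by shared heads.
W_cls = rng.normal(size=(d, num_classes + 1))  # +1 for the "no object" class
W_box = rng.normal(size=(d, 4))
class_logits = output_embeddings @ W_cls                      # (N, num_classes + 1)
pred_boxes = 1 / (1 + np.exp(-(output_embeddings @ W_box)))   # (N, 4), in [0, 1]
```

The key point the shapes make explicit: all $N$ predictions are produced in one parallel pass, with no autoregressive decoding and no post-processing step.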

How many FLOPs and parameters does the DETR model have? And how accurate is it on COCO?


86 GFLOPs and 41M parameters, with AP 42.0 on COCO.

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian. Edit MLRF on GitHub.