Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Draw the architecture of the SEgmentation TRansformer (SETR).


setr-1.png Schematic illustration of the proposed SEgmentation TRansformer (SETR) (a). We first split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. To perform pixel-wise segmentation, we introduce different decoder designs: (b) progressive upsampling (resulting in a variant called SETR-PUP); and (c) multi-level feature aggregation (a variant called SETR-MLA).

What is the PUP decoder in SETR?


Progressive UPsampling (PUP) figure1_b.png

PUP uses a progressive upsampling strategy that alternates conv layers and upsampling operations (upsampling is restricted to 2×2 \times). Hence a total of 4 operations are needed to reach the full resolution from patches with size H16W16\frac{H}{16} \frac{W}{16}. When using this decoder, the models is denoted as SETR-PUP.

What is the MLA decoder in SETR?


Multi-Level feature Aggregation (MLA) figure1_c.png

MLA takes as input feature representations {Zm}(m{LeM,2LeM,...,MLeM})\{Z^m\} (m \in \{\frac{L_e}{M}, 2\frac{L_e}{M}, ..., M\frac{L_e}{M}\}) from MM layers uniformly distributed across the layers with step LeM\frac{L_e}{M} to the decoder. MM streams are then deployed, with each focusing on one specific selected layer. A 3-layer network is applied with the feature channels halved at the first and third layers respectively, and the spatial resolution upscaled 4×4 \times by bilinear operation after the third layer. To enhance the interactions across different streams, we introduce a top-down aggregation design via element-wise addition after the first layer. An additional 3×33 \times 3 conv is applied after the element-wise additioned feature. After the third layer, we obtain the fused feature from all the streams via channel-wise concatenation which is then bilinearly upsampled 4×4 \times to the full resolution. When using this decoder, we denote our model as SETR-MLA.

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian. Edit MLRF on GitHub.