Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Draw the architecture of the SEgmentation TRansformer (SETR).

Schematic illustration of the proposed SEgmentation TRansformer (SETR) (a). We first split an image into fixed-size patches, linearly embed each of them, add position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder. To perform pixel-wise segmentation, we introduce different decoder designs: (b) progressive upsampling (resulting in a variant called SETR-PUP); and (c) multi-level feature aggregation (a variant called SETR-MLA).

What is the PUP decoder in SETR?

Progressive UPsampling (PUP)

PUP uses a progressive upsampling strategy that alternates conv layers and upsampling operations (upsampling is restricted to $2 \times$ ). Hence a total of 4 operations are needed to reach the full resolution from patches with size $\frac{H}{16} \frac{W}{16}$ . When using this decoder, the models is denoted as SETR-PUP.

What is the MLA decoder in SETR?

Multi-Level feature Aggregation (MLA)

MLA takes as input feature representations $\{Z^m\} (m \in \{\frac{L_e}{M}, 2\frac{L_e}{M}, ..., M\frac{L_e}{M}\})$ from $M$ layers uniformly distributed across the layers with step $\frac{L_e}{M}$ to the decoder. $M$ streams are then deployed, with each focusing on one specific selected layer. A 3-layer network is applied with the feature channels halved at the first and third layers respectively, and the spatial resolution upscaled $4 \times$ by bilinear operation after the third layer. To enhance the interactions across different streams, we introduce a top-down aggregation design via element-wise addition after the first layer. An additional $3 \times 3$ conv is applied after the element-wise additioned feature. After the third layer, we obtain the fused feature from all the streams via channel-wise concatenation which is then bilinearly upsampled $4 \times$ to the full resolution. When using this decoder, we denote our model as SETR-MLA.