High-Resolution Image Synthesis with Latent Diffusion Models

Give an overview of the Latent Diffusion Model (Stable Diffusion architecture).


![Latent diffusion model architecture](latent-diffusion-arch.png)

An encoder $\mathcal{E}$ compresses the input image $\mathbf{x} \in \mathbb{R}^{H \times W \times 3}$ to a smaller 2D latent representation $\mathbf{z} = \mathcal{E}(\mathbf{x}) \in \mathbb{R}^{h \times w \times c}$, where the downsampling rate is $f = H/h = W/w = 2^m$, $m \in \mathbb{N}$. A decoder $\mathcal{D}$ reconstructs the image from the latent: $\tilde{\mathbf{x}} = \mathcal{D}(\mathbf{z})$. The diffusion and denoising processes operate on the latent $\mathbf{z}$. The denoising model is a time-conditioned U-Net, augmented with a cross-attention mechanism to handle flexible conditioning information for image generation. To process conditioning input $y$ from various modalities, a domain-specific encoder $\tau_\theta$ projects $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$.
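As a quick sanity check on the shapes, the relation $f = H/h = W/w = 2^m$ can be sketched as follows (the values $f = 8$, $c = 4$ match the Stable Diffusion configuration; the helper name is illustrative):

```python
# Latent shape for a downsampling factor f = 2**m (sketch, not library code).
def latent_shape(H, W, f, c):
    """Shape (h, w, c) of z = E(x) for an input of shape (H, W, 3)."""
    assert H % f == 0 and W % f == 0, "spatial dims must be divisible by f"
    return (H // f, W // f, c)

# Stable Diffusion: 512x512 RGB image -> 64x64x4 latent (f=8, c=4).
print(latent_shape(512, 512, 8, 4))  # (64, 64, 4)
```

Diffusing over a $64 \times 64 \times 4$ latent instead of a $512 \times 512 \times 3$ image is what makes training and sampling tractable.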

The cross-attention mechanism is defined as:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d}}\Big) \cdot \mathbf{V}$$

where the projections are:

$$\mathbf{Q} = \mathbf{W}^{(i)}_Q \cdot \varphi_i(\mathbf{z}_i), \quad \mathbf{K} = \mathbf{W}^{(i)}_K \cdot \tau_\theta(y), \quad \mathbf{V} = \mathbf{W}^{(i)}_V \cdot \tau_\theta(y)$$

with dimensions:

$$\mathbf{W}^{(i)}_Q \in \mathbb{R}^{d \times d^i_\epsilon}, \quad \mathbf{W}^{(i)}_K, \mathbf{W}^{(i)}_V \in \mathbb{R}^{d \times d_\tau}, \quad \varphi_i(\mathbf{z}_i) \in \mathbb{R}^{N \times d^i_\epsilon}, \quad \tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$$

where $\varphi_i(\mathbf{z}_i) \in \mathbb{R}^{N \times d^i_\epsilon}$ denotes a (flattened) intermediate representation of the U-Net implementing $\epsilon_\theta$.
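A minimal NumPy sketch of this cross-attention computation, with small illustrative shapes (all dimension values below are placeholders, not the paper's actual sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Illustrative shapes: N flattened U-Net tokens of dim d_eps,
# M conditioning tokens of dim d_tau, shared attention dim d.
N, d_eps = 16, 32
M, d_tau = 4, 24
d = 8

phi = rng.standard_normal((N, d_eps))   # phi_i(z_i) in R^{N x d_eps}
tau = rng.standard_normal((M, d_tau))   # tau_theta(y) in R^{M x d_tau}

W_Q = rng.standard_normal((d, d_eps))   # W_Q^(i) in R^{d x d_eps}
W_K = rng.standard_normal((d, d_tau))   # W_K^(i) in R^{d x d_tau}
W_V = rng.standard_normal((d, d_tau))   # W_V^(i) in R^{d x d_tau}

Q = phi @ W_Q.T   # queries from U-Net features, shape (N, d)
K = tau @ W_K.T   # keys from conditioning, shape (M, d)
V = tau @ W_V.T   # values from conditioning, shape (M, d)

out = softmax(Q @ K.T / np.sqrt(d)) @ V   # shape (N, d)
print(out.shape)  # (16, 8)
```

The key point is that queries come from the U-Net's spatial features while keys and values come from the conditioning encoder, so each latent position attends over the $M$ conditioning tokens.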

How is the latent diffusion model trained?


The compression part (i.e. the encoder $\mathcal{E}$ and decoder $\mathcal{D}$) and the diffusion part are trained in separate phases: first the compression model, then the diffusion model.

Once trained, the compression model is frozen and can be reused across multiple diffusion models. The conditioning encoder $\tau_\theta$, by contrast, is often taken from a pretrained network: Stable Diffusion uses a pretrained CLIP text encoder for text conditioning.

How does the latent diffusion model avoid arbitrarily high-variance latent spaces?


The paper proposes two regularization variants to deal with this:

  1. KL-reg: a small Kullback-Leibler penalty is imposed towards a standard normal distribution over the learned latent, similar to a VAE.
  2. VQ-reg: a vector quantization layer is used within the decoder, similar to VQ-VAE, but with the quantization layer absorbed into the decoder.
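For the KL-reg variant, the penalty has the standard VAE closed form $\mathrm{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2}\sum_j \big(\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2\big)$, which can be sketched as (function name and inputs are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)

# Example: a 2-dim latent with unit variance and a small mean offset.
mu = np.array([0.0, 0.5])
logvar = np.array([0.0, 0.0])  # sigma = 1 in both dimensions
print(kl_to_standard_normal(mu, logvar))  # 0.125
```

In LDM training this term is weighted very lightly, so the latent stays roughly unit-variance without being forced into a fully Gaussian posterior.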

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian.