Denoising Diffusion Probabilistic Models

Give the forward diffusion process.


Given a data point sampled from a real data distribution $\mathbf{x}_0 \sim q(\mathbf{x})$, let us define a forward diffusion process in which we add small amounts of Gaussian noise to the sample in $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. The step sizes are controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^T$:

$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \qquad q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$$

As $T \to \infty$, $\mathbf{x}_T$ becomes equivalent to an isotropic Gaussian distribution.

![Forward diffusion process. Image modified from Ho et al. 2020](forward-diffusion.png)
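One forward step can be sketched as follows (a hypothetical NumPy sketch with a toy linear $\beta$ schedule; `forward_step` and the schedule values are illustrative assumptions, not from the paper):

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One forward step: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                    # toy data point x_0
for beta_t in np.linspace(1e-4, 0.02, 10):    # toy linear variance schedule
    x = forward_step(x, beta_t, rng)          # x gradually becomes noisier
```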

Why is the mean of the forward diffusion step scaled by $\sqrt{1 - \beta_t}$, where $\beta_t$ is the variance added at step $t$?

$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I})$$


The scaling factor keeps the variance of $\mathbf{x}_t$ from growing at each step. Without it, each step would add variance $\beta_t$ on top of the previous variance, so after $T$ steps the variance of $\mathbf{x}_T$ would keep growing. Scaling the mean by $\sqrt{1 - \beta_t}$ makes the process variance-preserving: if $\mathrm{Var}(\mathbf{x}_{t-1}) = 1$, then $\mathrm{Var}(\mathbf{x}_t) = (1 - \beta_t) \cdot 1 + \beta_t = 1$.
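A quick numerical check (a NumPy sketch with an arbitrary $\beta = 0.1$) shows that the scaled update is variance-preserving while the unscaled one is not:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.1
n = 200_000
x = rng.standard_normal(n)  # unit-variance input, standing in for x_{t-1}

# With the sqrt(1 - beta) scaling, the variance stays ~1.
x_scaled = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(n)
# Without scaling, the variance grows to ~1 + beta.
x_unscaled = x + np.sqrt(beta) * rng.standard_normal(n)
```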

In the forward diffusion process, how can we go from $\mathbf{x}_0$ to $\mathbf{x}_T$ in a single step? Recall that

$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \qquad q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$$


Use the reparameterization trick: if $\mathbf{z} \sim \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}, \sigma^2\mathbf{I})$, then $\mathbf{z} = \boldsymbol{\mu} + \sigma \odot \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (see Reparameterization trick).

If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$, then by recursively applying this trick:

$$\begin{aligned}
\mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\boldsymbol{\epsilon}_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\boldsymbol{\epsilon}}_{t-2} \\
&= \dots \\
&= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}
\end{aligned}$$

Therefore:

$$q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I})$$

Note: when merging two Gaussians $\mathcal{N}(\mathbf{0}, \sigma_1^2\mathbf{I})$ and $\mathcal{N}(\mathbf{0}, \sigma_2^2\mathbf{I})$, the result is $\mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})$. Here the merged standard deviation is $\sqrt{(1 - \alpha_t) + \alpha_t (1-\alpha_{t-1})} = \sqrt{1 - \alpha_t\alpha_{t-1}}$.
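The closed form means $\mathbf{x}_t$ can be sampled directly from $\mathbf{x}_0$ without iterating. A minimal sketch (hypothetical NumPy code; `q_sample` and the toy linear schedule are illustrative assumptions):

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot via the closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

betas = np.linspace(1e-4, 0.02, 1000)   # toy linear variance schedule
alpha_bar = np.cumprod(1.0 - betas)     # \bar{alpha}_t, decreasing in t

rng = np.random.default_rng(0)
x0 = rng.standard_normal(4)             # toy data point
x_t = q_sample(x0, 500, alpha_bar, rng) # jump straight to step t = 500
```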

Give the simplified objective function (loss) for diffusion models.


$$L_\text{simple} = L_t^\text{simple} + C$$

where $C$ is a constant that does not depend on $\theta$.

The time-dependent loss is:

$$L_t^\text{simple} = \mathbb{E}_{t \sim [1, T], \mathbf{x}_0, \boldsymbol{\epsilon}_t} \Big[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\|^2 \Big]$$

Substituting $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t$:

$$L_t^\text{simple} = \mathbb{E}_{t \sim [1, T], \mathbf{x}_0, \boldsymbol{\epsilon}_t} \Big[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2 \Big]$$
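A single-sample estimate of this loss can be sketched in NumPy (`eps_model` stands in for the noise-prediction network $\boldsymbol{\epsilon}_\theta$; the dummy model and toy schedule are illustrative assumptions):

```python
import numpy as np

def simple_loss(eps_model, x0, t, alpha_bar, rng):
    """Monte Carlo estimate of L_t^simple for one (x0, t) pair."""
    eps = rng.standard_normal(x0.shape)                              # true noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.sum((eps - eps_model(x_t, t)) ** 2)                    # ||eps - eps_theta||^2

betas = np.linspace(1e-4, 0.02, 1000)        # toy variance schedule
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
dummy_model = lambda x, t: np.zeros_like(x)  # placeholder network
loss = simple_loss(dummy_model, rng.standard_normal(4), 500, alpha_bar, rng)
```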

Give the training algorithm for Denoising Diffusion Probabilistic Models.


Repeat until convergence:

1. $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
2. $t \sim \operatorname{Uniform}(\{1, \dots, T\})$
3. $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
4. Take a gradient descent step on $\nabla_\theta \|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}, t)\|^2$
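The loop above can be sketched in NumPy. As a simplifying assumption, the "network" here is just a constant vector $\theta$ so the gradient is available in closed form; a real implementation would use a neural network and autodiff:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.02, T)          # toy variance schedule
alpha_bar = np.cumprod(1.0 - betas)

theta = np.zeros(4)                         # toy "model": eps_theta(x, t) = theta
lr = 0.01
for step in range(500):
    x0 = rng.standard_normal(4)             # x_0 ~ q(x_0) (toy data distribution)
    t = rng.integers(0, T)                  # t ~ Uniform({1, ..., T}), 0-indexed
    eps = rng.standard_normal(4)            # eps ~ N(0, I)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    grad = -2.0 * (eps - theta)             # gradient of ||eps - theta||^2 w.r.t. theta
    theta -= lr * grad                      # gradient descent step
```

Since the true noise has zero mean, $\theta$ should hover near zero after training.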

Give the inference algorithm for Denoising Diffusion Probabilistic Models.


1. $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2. For $t = T, \dots, 1$ do:
   - $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else $\mathbf{z} = \mathbf{0}$
   - $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big) + \sigma_t \mathbf{z}$
3. Return $\mathbf{x}_0$
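The sampling loop can be sketched in NumPy (a hypothetical sketch: $\sigma_t^2 = \beta_t$ is one common choice, and `eps_model` stands in for the trained network):

```python
import numpy as np

def ddpm_sample(eps_model, betas, shape, rng):
    """DDPM ancestral sampling, using sigma_t^2 = beta_t."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                     # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):            # t = T, ..., 1 (0-indexed)
        z = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        mean = (x - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bar[t])
                * eps_model(x, t)) / np.sqrt(alphas[t])
        x = mean + np.sqrt(betas[t]) * z               # x_{t-1}
    return x                                           # x_0

rng = np.random.default_rng(0)
dummy_model = lambda x, t: np.zeros_like(x)            # placeholder network
sample = ddpm_sample(dummy_model, np.linspace(1e-4, 0.02, 100), (4,), rng)
```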

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian.