Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

Draw and describe the Diffusion Policy architecture, and explain the observation ($T_o$), prediction ($T_p$) and action ($T_a$) horizons.


dp_policy_input_output.png
At environment time step $t$ the policy:

  1. Takes the last $T_o$ steps of observation, $\mathbf{O}_t$ (images + robot pose).
  2. Starts from a pure-noise action sequence $\mathbf{A}_t^K \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ of length $T_p$.
  3. Runs $K$ denoising iterations with the noise-prediction network $\boldsymbol{\epsilon}_\theta(\mathbf{O}_t,\mathbf{A}_t^k,k)$, producing $\mathbf{A}_t^0$, the predicted $T_p$-step action sequence.
  4. Executes only the first $T_a$ steps of $\mathbf{A}_t^0$ open-loop, then re-plans (receding-horizon control).

Notation convention: superscript $k \in \{K,\dots,0\}$ indexes the diffusion iteration (since subscript $t$ is already taken by environment time). Typical values: $T_o{=}2$, $T_p{=}16$, $T_a{=}8$, $K{=}100$ training / $10$ inference via DDIM.
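The loop above can be sketched in a few lines of pure Python. Everything model-specific is faked here: actions are scalars, $\boldsymbol{\epsilon}_\theta$ is a hand-written stand-in that pulls actions toward a fictitious demonstrated value, and the noise schedule is made up; only the control flow (observe, denoise $K$ times, execute $T_a$ of $T_p$ steps) mirrors the card.

```python
import random

random.seed(0)

T_o, T_p, T_a, K = 2, 16, 8, 20   # toy horizons and denoising-step count

def eps_theta(obs, actions, k):
    """Stand-in for the noise-prediction network: it pulls every action
    toward a fictitious demonstrated value derived from the observation."""
    target = sum(obs) / len(obs)
    return [a - target for a in actions]

def denoise(obs):
    """K denoising iterations starting from pure Gaussian noise."""
    actions = [random.gauss(0.0, 1.0) for _ in range(T_p)]   # A^K ~ N(0, I)
    for k in range(K, 0, -1):
        pred = eps_theta(obs, actions, k)
        sigma = 0.1 * (k - 1) / K          # made-up schedule, annealed to zero
        actions = [a - 0.5 * e + random.gauss(0.0, sigma)
                   for a, e in zip(actions, pred)]
    return actions                          # A^0: predicted T_p-step sequence

obs_history = [1.0] * T_o     # last T_o observations (scalars for simplicity)
plan = denoise(obs_history)
executed = plan[:T_a]         # execute only the first T_a steps, then re-plan
```

In a real rollout this whole block runs once per control step, with fresh observations each time.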

Write the conditional denoising update used at inference in Diffusion Policy and explain every symbol.


$$\mathbf{A}_t^{k-1} = \alpha\Big(\mathbf{A}_t^k - \gamma\,\boldsymbol{\epsilon}_\theta(\mathbf{O}_t,\mathbf{A}_t^k,k) + \mathcal{N}\bigl(\mathbf{0},\sigma^2 \mathbf{I}\bigr)\Big)$$

Starting from $\mathbf{A}_t^K \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, this runs for $k = K, K{-}1, \dots, 1$ to produce the clean action sequence $\mathbf{A}_t^0$.

  • $\mathbf{A}_t^k$ : action sequence of length $T_p$ at diffusion iteration $k$ (noisy for large $k$, clean at $k{=}0$).
  • $\mathbf{O}_t$ : conditioning observation. It is only fed in, never denoised, so the vision encoder runs once per control step regardless of $K$.
  • $\boldsymbol{\epsilon}_\theta(\mathbf{O}_t,\mathbf{A}_t^k,k)$ : noise-prediction network; predicts the noise currently contaminating $\mathbf{A}_t^k$.
  • $\gamma$ : step size on the predicted noise (analogous to a learning rate; see gradient-descent card).
  • $\sigma$ : std of the Gaussian noise re-injected to keep the process stochastic (Langevin).
  • $\alpha$ : overall rescale, typically slightly $<1$ to improve stability (Ho et al. 2020).

The triple $(\alpha,\gamma,\sigma)$ is a function of $k$ and constitutes the noise schedule (DP uses the square-cosine schedule from iDDPM). It plays the role of learning-rate scheduling.

The standard DDPM update $\mathbf{x}_{t-1}=\tfrac{1}{\sqrt{\alpha_t}}\bigl(\mathbf{x}_t-\tfrac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\bigr)+\sigma_t\mathbf{z}$ is the same equation with $\alpha\!=\!1/\sqrt{\alpha_t}$ and $\gamma\!=\!(1-\alpha_t)/\sqrt{1-\bar{\alpha}_t}$ folded into the schedule.
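To make the mapping concrete, here is a sketch that recovers the card's $(\alpha,\gamma,\sigma)$ from a square-cosine $\bar{\alpha}_k$ schedule. The $\beta$ clipping and the choice $\sigma_k=\sqrt{\beta_k}$ follow common DDPM practice and are assumptions here, not details taken from the paper.

```python
import math

K = 100  # training diffusion steps

def alpha_bar(k, s=0.008):
    """Square-cosine cumulative schedule (iDDPM convention)."""
    return math.cos((k / K + s) / (1 + s) * math.pi / 2) ** 2

def schedule(k):
    """Recover the card's (alpha, gamma, sigma) at iteration k from the
    standard DDPM quantities alpha_k and alpha_bar_k."""
    a_bar = alpha_bar(k)
    alpha_k = max(alpha_bar(k) / alpha_bar(k - 1), 0.001)  # clip beta_k <= 0.999
    scale = 1.0 / math.sqrt(alpha_k)                 # card's alpha
    step = (1.0 - alpha_k) / math.sqrt(1.0 - a_bar)  # card's gamma
    sigma = math.sqrt(1.0 - alpha_k)                 # one common sigma_k choice
    return scale, step, sigma

early, late = schedule(K), schedule(1)  # noisy end vs. nearly-clean end
```

The rescale is large at the noisy end of the chain and essentially $1$ near $k{=}1$, which is the learning-rate-scheduling behaviour the card describes.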

Give the training loss for Diffusion Policy, and explain how the noisy input $\mathbf{A}_t^0 + \boldsymbol{\epsilon}^k$ is constructed.


$$\mathcal{L} = \mathrm{MSE}\Big(\boldsymbol{\epsilon}^k,\; \boldsymbol{\epsilon}_\theta(\mathbf{O}_t,\, \mathbf{A}_t^0 + \boldsymbol{\epsilon}^k,\, k)\Big)$$

Per training step:

  1. Sample $(\mathbf{O}_t, \mathbf{A}_t^0)$ from the demonstration dataset.
  2. Sample a diffusion iteration $k \sim \operatorname{Uniform}\{1,\dots,K\}$.
  3. Sample noise $\boldsymbol{\epsilon}^k \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ scaled by the schedule for step $k$.
  4. Form the noisy action $\mathbf{A}_t^0 + \boldsymbol{\epsilon}^k$ and let $\boldsymbol{\epsilon}_\theta$ predict the noise that was added.

Why this works. This is exactly the DDPM $\boldsymbol{\epsilon}$-matching loss applied to action sequences conditioned on $\mathbf{O}_t$. The paper's compact notation $\mathbf{A}_t^0 + \boldsymbol{\epsilon}^k$ is a shorthand: in the DDPM notation it would be written $\sqrt{\bar{\alpha}_k}\,\mathbf{A}_t^0 + \sqrt{1-\bar{\alpha}_k}\,\boldsymbol{\epsilon}$, where the variance of $\boldsymbol{\epsilon}^k$ is set by the schedule at step $k$. It has been shown that minimising this simple MSE also minimises the variational lower bound on $\mathrm{KL}\!\left[\,p_{\text{data}}\,\|\,p_\theta\right]$, so we get proper density modelling with a plain regression loss.

Because $\mathbf{O}_t$ is only conditioning (never noised), gradients flow through it and the vision encoder is trained end-to-end with $\boldsymbol{\epsilon}_\theta$.
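The four steps can be sketched end-to-end on a toy scalar problem. The one-weight linear "network" and its hand-derived gradient are stand-ins for illustration, not the paper's architecture; the forward-noising line uses the DDPM form of the shorthand above.

```python
import math
import random

random.seed(0)
K = 100   # diffusion steps

def alpha_bar(k, s=0.008):
    """Square-cosine schedule (iDDPM convention)."""
    return math.cos((k / K + s) / (1 + s) * math.pi / 2) ** 2

def training_step(a_clean, theta, lr=1e-2):
    """One epsilon-matching step on a scalar action. The 'network'
    eps_theta(noisy) = theta * noisy is a one-weight stand-in."""
    k = random.randint(1, K)                       # 2. sample iteration k
    eps = random.gauss(0.0, 1.0)                   # 3. unit Gaussian noise
    a_bar = alpha_bar(k)
    noisy = math.sqrt(a_bar) * a_clean + math.sqrt(1 - a_bar) * eps  # 4. corrupt
    eps_pred = theta * noisy
    loss = (eps - eps_pred) ** 2                   # MSE against the true noise
    grad = -2.0 * (eps - eps_pred) * noisy         # d(loss)/d(theta) by hand
    return loss, theta - lr * grad

theta, losses = 0.0, []
for _ in range(2000):                              # 1. one fixed demo pair
    loss, theta = training_step(a_clean=0.5, theta=theta)
    losses.append(loss)
```

Even this degenerate regression settles to a stable loss: nothing in the objective requires negatives or a partition function, which is the point of the next card.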

Why can we interpret one step of the diffusion denoising update as noisy gradient descent on an energy landscape, and what does $\boldsymbol{\epsilon}_\theta$ represent in this view?


Strip the bias/scale from the update:
$$\mathbf{x}^{k-1} \;\approx\; \mathbf{x}^k - \gamma\,\boldsymbol{\epsilon}_\theta(\mathbf{x}^k,k) + \text{noise}$$
Compare to gradient descent on some scalar energy $E(\mathbf{x})$:
$$\mathbf{x}' = \mathbf{x} - \gamma\,\nabla E(\mathbf{x})$$
So $\boldsymbol{\epsilon}_\theta$ is effectively predicting the gradient field $\nabla E(\mathbf{x})$, and one denoising iteration is one step of (noisy) gradient descent toward a local minimum of $E$.

Running $K$ such steps with added Gaussian noise is Stochastic Langevin Dynamics; it samples from $p(\mathbf{x}) \propto e^{-E(\mathbf{x})}$ rather than greedily descending to a single point. Noise lets trajectories hop between basins, which is exactly what lets the policy express multiple action modes instead of collapsing to the mean of the demonstrations.
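A minimal illustration, assuming a hand-made double-well energy in place of the learned one: Langevin updates populate both minima instead of averaging them, whereas a mean-regression policy would output a value near $0$.

```python
import math
import random

random.seed(1)

def grad_E(x):
    """Gradient of the double-well energy E(x) = (x^2 - 1)^2 / 4,
    whose two minima at x = -1 and x = +1 play the role of action modes."""
    return x * (x * x - 1.0)

def langevin_sample(steps=500, gamma=0.05, sigma=0.15):
    x = random.gauss(0.0, 1.0)                   # stochastic initialisation
    for _ in range(steps):
        # noisy gradient-descent step: x - gamma*grad + injected Gaussian noise
        x += -gamma * grad_E(x) + sigma * math.sqrt(gamma) * random.gauss(0, 1)
    return x

samples = [langevin_sample() for _ in range(200)]
left = sum(1 for s in samples if s < 0)          # samples committed to x = -1
right = len(samples) - left                      # samples committed to x = +1
```

With this noise level both basins end up populated, and every sample sits near one of the two minima rather than between them.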

dp_DP_teaser.png
(c) Diffusion Policy denoises noise into actions by following a learned gradient field.

Why is Diffusion Policy more stable to train than Implicit Behavioral Cloning (IBC)? Derive the key observation about the normalisation constant.


IBC represents the policy as an Energy-Based Model:
$$p_\theta(\mathbf{a}\mid\mathbf{o}) = \frac{e^{-E_\theta(\mathbf{o},\mathbf{a})}}{Z(\mathbf{o},\theta)}, \qquad Z(\mathbf{o},\theta)=\int e^{-E_\theta(\mathbf{o},\mathbf{a})}\,d\mathbf{a}$$
$Z(\mathbf{o},\theta)$, the integral of $e^{-E}$ over the whole action space, is intractable. IBC estimates it with an InfoNCE-style loss using $N_{\text{neg}}$ negative action samples $\{\tilde{\mathbf{a}}^j\}$:
$$\mathcal{L}_{\text{InfoNCE}} = -\log\frac{e^{-E_\theta(\mathbf{o},\mathbf{a})}}{e^{-E_\theta(\mathbf{o},\mathbf{a})}+\sum_{j=1}^{N_{\text{neg}}} e^{-E_\theta(\mathbf{o},\tilde{\mathbf{a}}^j)}}$$
Poor negatives → bad $Z$ estimate → training instability. Empirically, IBC's train MSE and eval success both oscillate.

dp_ibc_stability_figure.png

Diffusion Policy sidesteps $Z$ entirely by modelling the score function $\nabla_{\mathbf{a}} \log p(\mathbf{a}\mid\mathbf{o})$ instead of $p$:
$$\nabla_{\mathbf{a}}\log p(\mathbf{a}\mid\mathbf{o}) = -\nabla_{\mathbf{a}} E_\theta(\mathbf{a},\mathbf{o}) - \underbrace{\nabla_{\mathbf{a}} \log Z(\mathbf{o},\theta)}_{=\,0\ \text{since }Z\text{ doesn't depend on }\mathbf{a}} \;\approx\; -\boldsymbol{\epsilon}_\theta(\mathbf{a},\mathbf{o})$$
The $\log Z$ term vanishes under $\nabla_{\mathbf{a}}$ because it is constant w.r.t. $\mathbf{a}$. So neither training (MSE on noise) nor inference (Langevin steps) ever touches $Z$, and training is stable.
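A quick numerical check of this cancellation on a toy 1-D energy (the Gaussian example is chosen purely because its $Z$ has a closed form):

```python
import math

def E(a):
    """Toy energy whose normalised density is a Gaussian centred at 2."""
    return (a - 2.0) ** 2

def log_p_unnorm(a):
    return -E(a)                       # ignores Z entirely

def log_p_norm(a):
    Z = math.sqrt(math.pi)             # closed-form Z = integral of e^{-(a-2)^2}
    return -E(a) - math.log(Z)

def num_grad(f, a, h=1e-5):
    """Central finite difference."""
    return (f(a + h) - f(a - h)) / (2 * h)

a = 0.7
score_unnorm = num_grad(log_p_unnorm, a)
score_norm = num_grad(log_p_norm, a)
# The scores match: Z only shifts log p by a constant, so its gradient is 0.
```

The two gradients agree to numerical precision, which is exactly why a score-based learner never needs to estimate $Z$.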

How does Diffusion Policy end up expressing multimodal action distributions, and where does the multimodality come from?


Multimodality arises from two stochastic sources in the Langevin sampler:

  1. Stochastic initialisation: each rollout starts from a fresh $\mathbf{A}_t^K \sim \mathcal{N}(\mathbf{0},\mathbf{I})$. Different initial points land in different convergence basins of the (implicit) energy $E$.
  2. Injected Gaussian noise per iteration: the $\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})$ term in the update lets samples hop between basins during the $K$ denoising steps rather than deterministically rolling into the nearest one.

Because $\boldsymbol{\epsilon}_\theta$ learns a gradient field over the whole action space (not a single-mode parametric distribution like a Gaussian or GMM), Stochastic Langevin Dynamics can, in principle, sample any normalisable $p(\mathbf{A}_t\mid\mathbf{O}_t)$. Combined with action-sequence prediction, this also gives temporal consistency: the whole $T_p$-step chunk is sampled jointly from one mode, so consecutive actions don't alternate between "go left" and "go right".

dp_multimodal_sim.png
Pushing the T-block into the target: either left or right around it is valid. Diffusion Policy commits cleanly to one mode per rollout; LSTM-GMM/IBC are biased, BET jitters between modes.

Compare the CNN-based and Transformer-based Diffusion Policy backbones: how is the observation injected, and when would you use each?


dp_policy_input_output.png

CNN-based (default): a 1-D temporal U-Net over the action sequence. $\mathbf{O}_t$ and the diffusion step $k$ are injected via FiLM (Feature-wise Linear Modulation): per-channel affine $\mathbf{h} \leftarrow \boldsymbol{\gamma}(\mathbf{O}_t,k)\odot\mathbf{h} + \boldsymbol{\beta}(\mathbf{O}_t,k)$ applied at every conv layer. Works out-of-the-box on most tasks with little tuning. Weakness: temporal conv has a low-frequency inductive bias, so it over-smooths fast-changing actions (e.g. velocity control).
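A dependency-free sketch of the FiLM modulation itself (toy sizes and made-up weights; a real U-Net block would predict $\boldsymbol{\gamma},\boldsymbol{\beta}$ with a learned MLP over the observation and step embedding):

```python
def linear(x, W, b):
    """Plain affine map: one output per row of W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def film(h, cond, W_g, b_g, W_b, b_b):
    """FiLM conditioning: a per-channel scale gamma and shift beta are
    predicted from `cond` (an embedding of O_t and the diffusion step k)
    and applied to every timestep of the feature map h[channel][time]."""
    gammas = linear(cond, W_g, b_g)
    betas = linear(cond, W_b, b_b)
    return [[g * x + b for x in channel]
            for channel, g, b in zip(h, gammas, betas)]

# Toy feature map: 2 channels x 4 timesteps; conditioning vector of size 3.
h = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
cond = [0.1, 0.2, 0.3]
W_g = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]; b_g = [1.0, 1.0]  # made-up weights
W_b = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]; b_b = [0.0, 0.0]
out = film(h, cond, W_g, b_g, W_b, b_b)
```

Note that the same $(\gamma, \beta)$ pair is shared across all timesteps of a channel: the observation modulates *what* each feature means, not each timestep individually.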

Transformer-based (time-series diffusion transformer): noisy actions $\mathbf{A}_t^k$ are the input tokens of a minGPT-style decoder; a sinusoidal embedding of $k$ is prepended as the first token; an MLP-encoded $\mathbf{O}_t$ is fed via cross-attention in each decoder block; causal self-attention within actions. Output tokens predict $\boldsymbol{\epsilon}_\theta(\mathbf{O}_t,\mathbf{A}_t^k,k)$. Better on high-frequency / velocity-control tasks but more hyperparameter-sensitive.

Recommendation: start with CNN; switch to the transformer only if the task has rapid, sharp action changes.

What are the key design decisions that make Diffusion Policy practical on a real robot (action space, execution, inference speed)?


  • Position control > velocity control. Surprising, because most BC baselines use velocity control. Reasons: (i) position actions are more multimodal, which Diffusion Policy handles well and baselines (GMM, k-means) don't; (ii) position control suffers less from compounding error over long action chunks.
    dp_pos_vs_vel_figure.png
  • Receding-horizon action chunking. Predict $T_p$ steps, execute only $T_a$, then replan. Balances temporal consistency (large $T_a$) vs. reactivity (small $T_a$). Ablation shows an interior sweet spot around $T_a{\approx}8$.
    dp_ablation_figure.png
  • Visual conditioning, not joint modelling. Model $p(\mathbf{A}_t\mid\mathbf{O}_t)$ instead of $p(\mathbf{A}_t,\mathbf{O}_t)$ (Diffuser-style) → the vision encoder runs once per control step regardless of the $K$ denoising iterations, and it can be trained end-to-end with $\boldsymbol{\epsilon}_\theta$.
  • End-to-end ResNet-18 with two tweaks: spatial-softmax pooling (preserves spatial info) and GroupNorm instead of BatchNorm (stable with EMA weights, which DDPMs use).
  • DDIM for fast inference. DDIM decouples training and inference iteration counts. DP uses $K{=}100$ training iterations and $10$ inference iterations → ~0.1 s per forward pass on a 3080, enough for real-time closed-loop control.
  • Action normalisation to $[-1,1]$. DDPMs clip predictions to $[-1,1]$ each step, so zero-mean/unit-variance normalisation would make part of the action space unreachable.
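A sketch of the min-max normaliser implied by the last point; the per-dimension bounds would come from the demonstration dataset, and the helper names are made up for illustration:

```python
def make_normalizer(actions_min, actions_max):
    """Min-max normalisation of each action dimension to [-1, 1], matching
    DDPM's per-step clipping range. Bounds come from the demo data."""
    def to_model(a):
        return [2.0 * (x - lo) / (hi - lo) - 1.0
                for x, lo, hi in zip(a, actions_min, actions_max)]

    def to_robot(a):
        return [(x + 1.0) / 2.0 * (hi - lo) + lo
                for x, lo, hi in zip(a, actions_min, actions_max)]

    return to_model, to_robot

# Hypothetical 2-D action space: gripper x in [0, 0.3] m, y in [-0.5, 0.5] m.
to_model, to_robot = make_normalizer([0.0, -0.5], [0.3, 0.5])
round_trip = to_robot(to_model([0.15, 0.25]))   # back to the original action
```

Every reachable robot action now maps inside the clipping box, so the denoiser's per-step clipping can never cut off part of the workspace.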

What is the control-theory sanity check for Diffusion Policy on a linear dynamical system with linear feedback demonstrations, and what does it reveal about the general case?


Take an LTI plant with LQR demonstrations:
$$\mathbf{s}_{t+1} = \mathbf{A}\mathbf{s}_t + \mathbf{B}\mathbf{a}_t + \mathbf{w}_t, \qquad \mathbf{a}_t = -\mathbf{K}\mathbf{s}_t$$

Single-step prediction ($T_p{=}1$). The MSE-optimal denoiser for $\mathcal{L}=\mathrm{MSE}\bigl(\boldsymbol{\epsilon}^k,\;\boldsymbol{\epsilon}_\theta(\mathbf{s}_t,\,-\mathbf{K}\mathbf{s}_t+\boldsymbol{\epsilon}^k,\,k)\bigr)$ has the closed form
$$\boldsymbol{\epsilon}_\theta(\mathbf{s},\mathbf{a},k) = \tfrac{1}{\sigma_k}\bigl[\mathbf{a} + \mathbf{K}\mathbf{s}\bigr].$$
Plugged into the DDIM update, the sampling iteration converges to the unique global minimum $\mathbf{a}=-\mathbf{K}\mathbf{s}$. ✓
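A scalar sanity check of this convergence, with the $1/\sigma_k$ factor absorbed into a fixed step size and a deterministic update standing in for the full sampler:

```python
# Scalar instance of the single-step LTI check: demonstrations come from
# a_t = -gain * s_t, so the optimal denoiser is proportional to a + gain * s
# (the 1/sigma_k factor is absorbed into the 0.2 step size below).
K_steps, gain, s_t = 50, 0.6, 1.0

def eps_opt(s, a):
    """MSE-optimal denoiser for linear-feedback demonstrations."""
    return a + gain * s

a = 2.5                                   # arbitrary noisy initial action
for k in range(K_steps, 0, -1):
    a = a - 0.2 * eps_opt(s_t, a)         # deterministic (DDIM-like) update
# a approaches the demonstrated controller output -gain * s_t = -0.6
```

Each update is a contraction toward $-\mathbf{K}\mathbf{s}$, so the iteration recovers the demonstrated controller exactly, as the card claims for the single-step case.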

Multi-step prediction ($T_p{>}1$). The optimal denoiser gives $\mathbf{a}_{t+t'} = -\mathbf{K}(\mathbf{A}-\mathbf{B}\mathbf{K})^{t'}\mathbf{s}_t$, i.e. to predict future actions the policy implicitly learns a (task-relevant) dynamics model by unrolling the closed-loop system.

Takeaway. Even in the simple LTI case, action-sequence prediction forces the network to encode dynamics; in the nonlinear case this becomes harder and inherently multimodal — which is exactly the regime where the diffusion formulation pays off.

LTI = Linear Time-Invariant. A dynamical system whose next state is a linear function of the current state and input, with matrices that do not change over time:
$$\mathbf{s}_{t+1} = \mathbf{A}\mathbf{s}_t + \mathbf{B}\mathbf{a}_t + \mathbf{w}_t$$

  • $\mathbf{A}$ (state transition) and $\mathbf{B}$ (input matrix) are constant.
  • $\mathbf{w}_t \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_w)$ is process noise.

LQR = Linear Quadratic Regulator. The optimal controller for an LTI system under a quadratic running cost
$$J = \sum_t \bigl(\mathbf{s}_t^\top \mathbf{Q}\mathbf{s}_t + \mathbf{a}_t^\top \mathbf{R}\mathbf{a}_t\bigr)$$
where $\mathbf{Q}\succeq 0$ penalises state error and $\mathbf{R}\succ 0$ penalises control effort. Minimising $J$ yields a linear state-feedback law
$$\mathbf{a}_t = -\mathbf{K}\mathbf{s}_t$$
with gain $\mathbf{K}$ obtained by solving the discrete-time Riccati equation.
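For the scalar case the Riccati recursion and the gain fit in a few lines (toy coefficients; with $A{=}B{=}Q{=}R{=}1$ the Riccati fixed point happens to be the golden ratio):

```python
# Scalar discrete-time LQR: value-iterate the Riccati recursion to its
# fixed point (the DARE solution), then form the gain K for a_t = -K s_t.
A, B, Q, R = 1.0, 1.0, 1.0, 1.0   # toy system and cost (all scalars)

P = Q
for _ in range(200):               # value iteration converges to the DARE solution
    P = Q + A * P * A - (A * P * B) ** 2 / (R + B * P * B)

K_gain = (B * P * A) / (R + B * P * B)   # optimal feedback gain
closed_loop = A - B * K_gain             # stable iff |A - B*K| < 1
```

The resulting closed-loop coefficient $A - BK \approx 0.38$ is inside the unit circle, confirming the controller stabilises the plant.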

Why the paper uses this setting. LTI + LQR is the classic "textbook" controllable case: known linear dynamics + quadratic cost → closed-form optimal linear policy. Because the ground-truth policy is simple and known, the optimal denoiser $\boldsymbol{\epsilon}_\theta$ can be derived analytically, giving a clean sanity check that Diffusion Policy recovers the right controller in the limit.

Machine Learning Research Flashcards is a collection of flashcards associated with scientific research papers in the field of machine learning. Best used with Anki or Obsidian.