DDPM

Diffusion model

Reverse process (reverse from \(X_T\) to \(X_0\))

Starting from \(p(X_T)=N(X_T;\textbf{0},\textbf{I})\), which is a standard normal distribution

\(p_{\theta}(X_{0:T}):=p(X_T)\prod_{t=1}^Tp_{\theta}(X_{t-1}|X_t)\), where \(p_{\theta}(X_{t-1}|X_t):=N(X_{t-1};\mu_{\theta}(X_t,t),\Sigma_{\theta}(X_t,t))\)

Each reverse step is a Gaussian distribution conditioned on the previous (noisier) state \(X_t\), whose mean and covariance are functions of \(X_t\) and the timestep \(t\).

Forward process (diffusion process)

\(q(X_{1:T}|X_0):=\prod_{t=1}^Tq(X_t|X_{t-1})\), where \(q(X_t|X_{t-1}):=N(X_t;\sqrt{1-\beta_t}X_{t-1},\beta_t\textbf{I})\)

This process gradually adds Gaussian noise to \(X_0\) according to a variance schedule \(\beta_1,\dots,\beta_T\).

Property: closed-form marginal \[ q(X_t|X_0)=N(X_t;\sqrt{\bar{\alpha}_t}X_0,(1-\bar{\alpha}_t)\textbf{I}) \] where \(\alpha_t=1-\beta_t,\bar{\alpha}_t=\prod_{s=1}^t\alpha_s\)

Proof: writing each forward step with independent noises \(\varepsilon_s\sim N(\textbf{0},\textbf{I})\), \[ X_t=\sqrt{1-\beta_t}X_{t-1}+\sqrt{\beta_t}\varepsilon_t=\sqrt{\alpha_t}X_{t-1}+\sqrt{1-\alpha_t}\varepsilon_t\\ X_t=\sqrt{\alpha_t\alpha_{t-1}}X_{t-2}+\sqrt{1-\alpha_t}\varepsilon_t+\sqrt{\alpha_t(1-\alpha_{t-1})}\varepsilon_{t-1}\\ X_t=\sqrt{\textstyle\prod_{s=1}^t\alpha_s}\,X_{0}+V\varepsilon,\quad\varepsilon\sim N(\textbf{0},\textbf{I})\\ V^2=\sum_{s=1}^t(1-\alpha_s)\prod_{k=s+1}^t\alpha_k=1-\prod_{s=1}^t\alpha_s \] where the variances of the independent Gaussian noises add, and the last sum telescopes.

\(\square\)
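
As a quick numerical sanity check of this identity, a minimal NumPy sketch (the linear \(\beta\) schedule below is an illustrative assumption, not from the text):

```python
import numpy as np

# Illustrative linear noise schedule (an assumption for this check, not from the text).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500  # arbitrary 1-indexed timestep; the arrays below are 0-indexed
# V^2 = sum_{s=1}^t (1 - alpha_s) * prod_{k=s+1}^t alpha_k
V2 = sum((1.0 - alphas[s]) * np.prod(alphas[s + 1:t]) for s in range(t))
print(V2, 1.0 - alpha_bars[t - 1])  # the two values agree up to floating-point error
```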

Loss function

The variational bound on the negative log likelihood: \[ -\text{log}\ p_{\theta}(X_0)=-\text{log} \mathbb{E}_{q}\left[\frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_0)}\right]\leq\mathbb{E}_{q}\left[-\text{log} \frac{p_{\theta}(X_{0:T})}{q(X_{1:T}|X_0)}\right]=\mathbb{E}_{q}\left[-\text{log}\ p(X_T)-\sum_{t\geq1}\text{log}\frac{p_{\theta}(X_{t-1}|X_t)}{q(X_t|X_{t-1})}\right]:=L \] where the inequality comes from Jensen's inequality, and the last equality comes from substituting the definitions of \(p_{\theta}(X_{0:T})\) and \(q(X_{1:T}|X_0)\).

This can be rewritten as \[ L=\mathbb{E}_q\left[D_{KL}(q(X_T|X_0)||p(X_T))+\sum_{t>1}D_{KL}(q(X_{t-1}|X_t,X_0)||p_{\theta}(X_{t-1}|X_t))-\text{log}\ p_{\theta}(X_0|X_1)\right] \] and we define \[ L_T:=D_{KL}(q(X_T|X_0)||p(X_T))\\ L_{t-1}:=D_{KL}(q(X_{t-1}|X_t,X_0)||p_{\theta}(X_{t-1}|X_t))\\ L_0:=-\text{log}\ p_{\theta}(X_0|X_1) \] Proof: \[ \begin{aligned} L&=\mathbb{E}_{q}\left[-\text{log}\ p(X_T)-\sum_{t\geq1}\text{log}\frac{p_{\theta}(X_{t-1}|X_t)}{q(X_t|X_{t-1})}\right]\\ &=\mathbb{E}_{q}\left[-\text{log}\ p(X_T)-\sum_{t>1}\text{log}\frac{p_{\theta}(X_{t-1}|X_t)}{q(X_t|X_{t-1})}-\text{log}\ \frac{p_{\theta}(X_{0}|X_1)}{q(X_1|X_{0})}\right]\\ &=\mathbb{E}_{q}\left[-\text{log}\ p(X_T)-\sum_{t>1}\text{log}\left(\frac{p_{\theta}(X_{t-1}|X_t)}{q(X_{t-1}|X_{t},X_0)}\frac{q(X_{t-1}|X_0)}{q(X_t|X_0)}\right)-\text{log}\ \frac{p_{\theta}(X_{0}|X_1)}{q(X_1|X_{0})}\right]\\ &=\mathbb{E}_{q}\left[-\text{log}\ p(X_T)-\sum_{t>1}\text{log}\frac{p_{\theta}(X_{t-1}|X_t)}{q(X_{t-1}|X_{t},X_0)}-\sum_{t>1}\text{log}\frac{q(X_{t-1}|X_0)}{q(X_t|X_0)}-\text{log}\ \frac{p_{\theta}(X_{0}|X_1)}{q(X_1|X_{0})}\right]\\ &=\mathbb{E}_{q}\left[-\text{log}\ p(X_T)-\sum_{t>1}\text{log}\frac{p_{\theta}(X_{t-1}|X_t)}{q(X_{t-1}|X_{t},X_0)}-\text{log}\frac{q(X_{1}|X_0)}{q(X_T|X_0)}-\text{log}\ \frac{p_{\theta}(X_{0}|X_1)}{q(X_1|X_{0})}\right]\\ &=\mathbb{E}_{q}\left[-\text{log}\ \frac{p(X_T)}{q(X_T|X_0)}-\sum_{t>1}\text{log}\frac{p_{\theta}(X_{t-1}|X_t)}{q(X_{t-1}|X_{t},X_0)}-\text{log}\ p_{\theta}(X_{0}|X_1)\right]\\ \end{aligned} \] where the third equality uses the Markov property \(q(X_t|X_{t-1})=q(X_t|X_{t-1},X_0)\) together with Bayes' rule, \(q(X_t|X_{t-1},X_0)=\frac{q(X_{t-1}|X_t,X_0)q(X_t|X_0)}{q(X_{t-1}|X_0)}\), and the fifth equality telescopes the sum. Applying the definition of KL divergence to each term then gives the result.

\(\square\)

Here, \[ q(X_{t-1}|X_t,X_0)=N(X_{t-1};\tilde{\mu}_t(X_t,X_0),\tilde{\beta}_t\textbf{I})\\ \tilde{\mu}_t(X_t,X_0):=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}X_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}X_t,\quad\tilde{\beta}_t:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t \] Proof: \[ q(X_{t-1}|X_t,X_0)=\frac{q(X_{t-1},X_t|X_0)}{q(X_{t}|X_0)}=\frac{q(X_t|X_{t-1},X_0)q(X_{t-1}|X_0)}{q(X_{t}|X_0)}\propto q(X_t|X_{t-1},X_0)q(X_{t-1}|X_0)=q(X_t|X_{t-1})q(X_{t-1}|X_0) \] where the last equality uses the Markov property of the forward process. Then \[ q(X_t|X_{t-1})=N(X_t;\sqrt{1-\beta_t}X_{t-1},\beta_t\textbf{I})\\ q(X_{t-1}|X_0)=N(X_{t-1};\sqrt{\bar{\alpha}_{t-1}}X_0,(1-\bar{\alpha}_{t-1})\textbf{I}) \] So \[ \text{log}\ q(X_t|X_{t-1})=-\frac{1}{2\beta_t} ||X_t-\sqrt{\alpha_t}X_{t-1}||^2+c_1\\ \text{log}\ q(X_{t-1}|X_0)=-\frac{1}{2(1-\bar{\alpha}_{t-1})} ||X_{t-1}-\sqrt{\bar{\alpha}_{t-1}}X_0||^2+c_2 \] We thus get, up to an additive constant in \(X_{t-1}\), \[ -2\,\text{log}\ q(X_{t-1}|X_t,X_0)= \frac{||X_t-\sqrt{\alpha_t}X_{t-1}||^2}{\beta_t}+\frac{||X_{t-1}-\sqrt{\bar{\alpha}_{t-1}}X_0||^2}{1-\bar{\alpha}_{t-1}}+\text{const}\\ = \frac{||X_t||^2+{\alpha_t}||X_{t-1}||^2-2\sqrt{\alpha_t}X_t^\top X_{t-1}}{\beta_t}+\frac{||X_{t-1}||^2-2\sqrt{\bar{\alpha}_{t-1}}X_{t-1}^\top X_0+\bar{\alpha}_{t-1}||X_0||^2}{1-\bar{\alpha}_{t-1}}+\text{const} \] Collecting terms in \(X_{t-1}\), the quadratic term is \(\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)||X_{t-1}||^2\) and the linear term is \(-2X_{t-1}^\top\left(\frac{\sqrt{\alpha_t}}{\beta_t}X_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}X_0\right)\); a density whose log is a quadratic form in \(X_{t-1}\) is a normal distribution, so completing the square gives its variance and mean.

As a result, \[ \tilde{\beta}_t=\left(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar{\alpha}_{t-1}}\right)^{-1}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t\\ \tilde{\mu}_t(X_t,X_0)=\tilde{\beta}_t\left(\frac{\sqrt{\alpha_t}}{\beta_t}X_t+\frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}X_0\right)=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}X_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}X_t \] \(\square\)
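The closed forms above can also be checked numerically; a minimal NumPy sketch (reusing the same kind of illustrative linear schedule as before, which is an assumption):

```python
import numpy as np

# Illustrative linear schedule (an assumption).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500  # arbitrary timestep with t > 1 (0-indexed arrays)
a_t, b_t = alphas[t], betas[t]
abar_t, abar_prev = alpha_bars[t], alpha_bars[t - 1]

# tilde{beta}_t computed two ways: inverse precision vs. the simplified closed form
beta_tilde_a = 1.0 / (a_t / b_t + 1.0 / (1.0 - abar_prev))
beta_tilde_b = (1.0 - abar_prev) / (1.0 - abar_t) * b_t
print(np.isclose(beta_tilde_a, beta_tilde_b))  # True

# tilde{mu}_t computed two ways on random X_t, X_0
x_t, x_0 = np.random.randn(3), np.random.randn(3)
mu_a = beta_tilde_a * (np.sqrt(a_t) / b_t * x_t + np.sqrt(abar_prev) / (1.0 - abar_prev) * x_0)
mu_b = (np.sqrt(abar_prev) * b_t / (1.0 - abar_t) * x_0
        + np.sqrt(a_t) * (1.0 - abar_prev) / (1.0 - abar_t) * x_t)
print(np.allclose(mu_a, mu_b))  # True
```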

Then every KL divergence in \(L\) is a KL divergence between two Gaussian distributions, so \(L\) can be computed in closed form rather than with a high-variance Monte Carlo estimate, which improves training.

Reparameterization and Simplification

\(L_T\)

Fix \(\beta_t\) to constants; then \(q\) has no learnable parameters, and \(L_T\) is a fixed constant that can be ignored during training and gradient computation.

\(L_{1:T-1}\)

For \(p_{\theta}(X_{t-1}|X_t):=N(X_{t-1};\mu_{\theta}(X_t,t),\Sigma_{\theta}(X_t,t))\), set \(\Sigma_{\theta}(X_t,t)=\sigma_t^2\textbf{I}\) to untrained, time-dependent constants fixed before training. Experimentally, \(\sigma_t^2=\beta_t\) and \(\sigma_t^2=\tilde{\beta}_t\) give similar results. (Both choices have theoretical justifications; see Section 3.2 of the paper.)

Using the KL divergence between two Gaussians, we can rewrite \[ L_{t-1}=D_{KL}(q(X_{t-1}|X_t,X_0)||p_{\theta}(X_{t-1}|X_t))=\mathbb{E}_q\left[\frac{1}{2\sigma_t^2}||\tilde{\mu}_t(X_t,X_0)-\mu_{\theta}(X_t,t)||^2\right]+c_3 \] where \(c_3\) is a constant independent of \(\theta\).
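
For reference, this follows from the general KL divergence between two \(d\)-dimensional Gaussians, \[ D_{KL}(N(\mu_1,\Sigma_1)\,||\,N(\mu_2,\Sigma_2))=\frac{1}{2}\left[\text{tr}(\Sigma_2^{-1}\Sigma_1)+(\mu_2-\mu_1)^\top\Sigma_2^{-1}(\mu_2-\mu_1)-d+\text{log}\frac{\text{det}\ \Sigma_2}{\text{det}\ \Sigma_1}\right] \] with \(\Sigma_1=\tilde{\beta}_t\textbf{I}\), \(\Sigma_2=\sigma_t^2\textbf{I}\), \(\mu_1=\tilde{\mu}_t(X_t,X_0)\), \(\mu_2=\mu_{\theta}(X_t,t)\): the mean term gives \(\frac{1}{2\sigma_t^2}\Vert\tilde{\mu}_t(X_t,X_0)-\mu_{\theta}(X_t,t)\Vert^2\), and since \(\tilde{\beta}_t\) and \(\sigma_t^2\) do not depend on \(\theta\), every other term is absorbed into \(c_3\).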

Reparameterizing \(q(X_t|X_0)=N(X_t;\sqrt{\bar{\alpha}_t}X_0,(1-\bar{\alpha}_t)\textbf{I})\) as \[ X_t(X_0, \varepsilon)=\sqrt{\bar{\alpha}_t}X_0 +\sqrt{1-\bar{\alpha}_t}\varepsilon,\quad \varepsilon\sim N(0,\textbf{I}) \] we get \[ \begin{aligned} L_{t-1}-c_3 &= \mathbb{E}_{X_0,\varepsilon}\left[ \frac{1}{2\sigma_t^2} \left\Vert\tilde{\mu}_t\left(X_t(X_0, \varepsilon),\frac{1}{\sqrt{\bar{\alpha}_t}}(X_t(X_0, \varepsilon)-\sqrt{1-\bar{\alpha}_t}\varepsilon)\right) -\mu_{\theta}(X_t(X_0, \varepsilon),t)\right\Vert^2 \right]\\ &= \mathbb{E}_{X_0,\varepsilon}\left[ \frac{1}{2\sigma_t^2} \left\Vert\frac{1}{\sqrt{\alpha_t}}\left(X_t(X_0,\varepsilon)-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon\right) -\mu_{\theta}(X_t(X_0, \varepsilon),t)\right\Vert^2 \right]\\ \end{aligned} \] So \(\mu_{\theta}(X_t(X_0, \varepsilon),t)\) has to predict \(\frac{1}{\sqrt{\alpha_t}}\left(X_t(X_0,\varepsilon)-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon\right)\) given \(X_t\), which is the model's input during the reverse process.

Thus, we can choose \[ \mu_{\theta}(X_t,t)=\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon_{\theta}(X_t,t)\right) \] so that, instead of predicting \(\mu_{\theta}(X_t,t)\) directly, the network \(\varepsilon_{\theta}(X_t,t)\) predicts the noise from \(X_t\).

Then sampling \(X_{t-1}\) from \(X_t\) is \[ X_{t-1} =\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon_{\theta}(X_t,t)\right)+\sigma_t\textbf{z},\quad\textbf{z}\sim N(0,\textbf{I}) \]
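
Putting this together, here is a minimal PyTorch-style sketch of the full reverse (sampling) loop, assuming a trained noise-prediction network `eps_theta(x, t)` (the function name and signature are my own); as in the paper's Algorithm 2, no noise is added at the last step:

```python
import torch

@torch.no_grad()
def sample(eps_theta, shape, betas):
    """Run the reverse process from X_T ~ N(0, I) down to X_0."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    sigmas = torch.sqrt(betas)  # sigma_t^2 = beta_t, one of the two choices above
    T = betas.shape[0]

    x = torch.randn(shape)  # X_T
    for t in reversed(range(T)):  # t = T-1, ..., 0 (0-indexed timesteps)
        eps = eps_theta(x, torch.full((shape[0],), t, dtype=torch.long))
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the final step
        x = mean + sigmas[t] * z
    return x
```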

Furthermore, we can simplify \[ \begin{aligned} L_{t-1}-c_3 &= \mathbb{E}_{X_0,\varepsilon}\left[ \frac{1}{2\sigma_t^2} \left\Vert\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon\right) -\mu_{\theta}(X_t,t)\right\Vert^2 \right]\\ &= \mathbb{E}_{X_0,\varepsilon}\left[ \frac{1}{2\sigma_t^2} \left\Vert\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon\right) -\frac{1}{\sqrt{\alpha_t}}\left(X_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon_{\theta}(X_t,t)\right) \right\Vert^2 \right]\\ &= \mathbb{E}_{X_0,\varepsilon}\left[ \frac{1}{2\sigma_t^2} \left\Vert\frac{1}{\sqrt{\alpha_t}}\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon -\frac{1}{\sqrt{\alpha_t}}\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\varepsilon_{\theta}(X_t,t) \right\Vert^2 \right]\\ &= \mathbb{E}_{X_0,\varepsilon}\left[ \frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)} \left\Vert\varepsilon-\varepsilon_{\theta}(\sqrt{\bar{\alpha}_t}X_0 +\sqrt{1-\bar{\alpha}_t}\varepsilon,t) \right\Vert^2 \right]\\ \end{aligned} \]

So this part of the loss is minimizing a weighted MSE between the true noise and the predicted noise.

\(L_0\)

Unlike the \(L_{t-1}\) terms, \(L_0\) is handled in a special way, since the original image data is discrete: \(p_{\theta}(X_0|X_1)\) is set to a discretized decoder obtained by integrating the Gaussian density over each pixel's quantization bin, so the model assigns proper probabilities to the discrete data and reconstructs the original image better. Accordingly, at the end of sampling, \(\mu_{\theta}(X_1,1)\) is displayed noiselessly (no noise is added at the last step).

To be specific \[ p_{\theta}(X_0|X_1)=\prod_{i=1}^D\int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)}\mathcal{N}(x;\mu_{\theta}^i(X_1,1),\sigma_1^2)dx \]

\[ \delta_{+}(x)=\begin{aligned} \begin{cases} \infty &\text{if}\ x=1\\ x+\frac{1}{255} &\text{if}\ x<1 \end{cases} \end{aligned} \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \delta_{-}(x)=\begin{aligned} \begin{cases} -\infty &\text{if}\ x=-1\\ x-\frac{1}{255} &\text{if}\ x>-1 \end{cases} \end{aligned} \]

Here, 255 comes from the maximum value of an 8-bit channel (or each channel of a 24-bit RGB image): the integers \(\{0,\dots,255\}\) are scaled linearly to \([-1,1]\), so adjacent pixel values are \(2/255\) apart and each bin has half-width \(1/255\).
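
A minimal PyTorch sketch of this discretized Gaussian likelihood (per pixel, for data scaled to \([-1,1]\)); the helper name is mine, not from the paper's code:

```python
import torch
from torch.distributions import Normal

def discretized_gaussian_log_likelihood(x0, mu, sigma):
    """log p_theta(X_0 | X_1) per pixel, with x0 on the [-1, 1] grid of 8-bit values."""
    dist = Normal(mu, sigma)
    # Integrate the Gaussian over the bin [x0 - 1/255, x0 + 1/255],
    # pushing the bounds to +-infinity at the edges of [-1, 1].
    upper = torch.where(x0 >= 1.0, torch.ones_like(x0), dist.cdf(x0 + 1.0 / 255))
    lower = torch.where(x0 <= -1.0, torch.zeros_like(x0), dist.cdf(x0 - 1.0 / 255))
    prob = (upper - lower).clamp(min=1e-12)  # guard against log(0)
    return torch.log(prob)
```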

Final Simple Loss function

\[ L_{\text{simple}}(\theta):=\mathbb{E}_{t,X_0,\varepsilon}\left[\Vert\varepsilon-\varepsilon_{\theta}(\sqrt{\bar{\alpha}_t}X_0+\sqrt{1-\bar{\alpha}_t}\varepsilon, t)\Vert^2\right] \]

This comes from the final form of \(L_{t-1}\) with the weighting coefficient dropped, and it subsumes \(L_0\) as the special case \(t=1\), approximating the discretized decoder by the same unweighted MSE (ignoring \(\sigma_1^2\) and the edge effects of the integration bounds). \(L_T\) is dropped since it is a constant.
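
A minimal PyTorch-style sketch of one training step with \(L_{\text{simple}}\), assuming a noise-prediction network `eps_theta(x, t)` (the names are my own):

```python
import torch
import torch.nn.functional as F

def training_step(eps_theta, x0, alpha_bars):
    """One stochastic estimate of L_simple: sample t and eps, noise x0, regress the noise."""
    B = x0.shape[0]
    t = torch.randint(0, alpha_bars.shape[0], (B,), device=x0.device)  # t ~ Uniform
    eps = torch.randn_like(x0)                                         # true noise
    ab = alpha_bars[t].view(B, *([1] * (x0.dim() - 1)))                # broadcast over pixels
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps             # closed-form forward sample
    return F.mse_loss(eps_theta(x_t, t), eps)                          # ||eps - eps_theta||^2
```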

Dropping the weight \(\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}\) actually improves sample quality: it down-weights the loss terms at small \(t\), where only a tiny amount of noise has to be removed, so training focuses on the harder denoising tasks at larger \(t\) (larger noise levels).
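
To see the weighting concretely, a minimal NumPy sketch (same illustrative schedule assumption as before) printing the ELBO coefficient \(\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}\) with \(\sigma_t^2=\beta_t\) at a few timesteps:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # illustrative linear schedule (assumption)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# ELBO weight on ||eps - eps_theta||^2 with sigma_t^2 = beta_t
weights = betas ** 2 / (2 * betas * alphas * (1 - alpha_bars))

for t in (0, 9, 99, 999):  # 0-indexed timesteps
    print(t + 1, weights[t])
# The coefficient is largest near t = 1 and much smaller at t = T, so replacing it
# by 1 in L_simple relatively down-weights the small-t terms.
```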

