Classifier-Free Diffusion Guidance
[2207.12598] Classifier-Free Diffusion Guidance
Jonathan Ho & Tim Salimans Google Research, Brain team {jonathanho,salimans}@google.com
Abstract
Classifier guidance is a recently introduced method to trade off mode coverage and sample fidelity in conditional diffusion models post training, in the same spirit as low temperature sampling or truncation in other types of generative models. Classifier guidance combines the score estimate of a diffusion model with the gradient of an image classifier and thereby requires training an image classifier separate from the diffusion model. It also raises the question of whether guidance can be performed without a classifier. We show that guidance can indeed be performed by a pure generative model without such a classifier: in what we call classifier-free guidance, we jointly train a conditional and an unconditional diffusion model, and we combine the resulting conditional and unconditional score estimates to attain a trade-off between sample quality and diversity similar to that obtained using classifier guidance.
† A short version of this paper appeared in the NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications: https://openreview.net/pdf?id=qw8AKxfYbI
1 Introduction
Figure 1: Classifier-free guidance on the malamute class for a 64x64 ImageNet diffusion model. Left to right: increasing amounts of classifier-free guidance, starting from non-guided samples on the left.
Figure 2: The effect of guidance on a mixture of three Gaussians, each mixture component representing data conditioned on a class. The leftmost plot is the non-guided marginal density. Left to right are densities of mixtures of normalized guided conditionals with increasing guidance strength.
Diffusion models have recently emerged as an expressive and flexible family of generative models, delivering competitive sample quality and likelihood scores on image and audio synthesis tasks (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021b; Kingma et al., 2021; Song et al., 2021a). These models have delivered audio synthesis performance rivaling the quality of autoregressive models with substantially fewer inference steps (Chen et al., 2021; Kong et al., 2021), and they have delivered ImageNet generation results outperforming BigGAN-deep (Brock et al., 2019) and VQ-VAE-2 (Razavi et al., 2019) in terms of FID score and classification accuracy score (Ho et al., 2021; Dhariwal & Nichol, 2021).
Dhariwal & Nichol (2021) proposed classifier guidance, a technique to boost the sample quality of a diffusion model using an extra trained classifier. Prior to classifier guidance, it was not known how to generate “low temperature” samples from a diffusion model similar to those produced by truncated BigGAN (Brock et al., 2019) or low temperature Glow (Kingma & Dhariwal, 2018): naive attempts, such as scaling the model score vectors or decreasing the amount of Gaussian noise added during diffusion sampling, are ineffective (Dhariwal & Nichol, 2021). Classifier guidance instead mixes a diffusion model’s score estimate with the input gradient of the log probability of a classifier. By varying the strength of the classifier gradient, Dhariwal & Nichol can trade off Inception score (Salimans et al., 2016) and FID score (Heusel et al., 2017) (or precision and recall) in a manner similar to varying the truncation parameter of BigGAN.
We are interested in whether classifier guidance can be performed without a classifier. Classifier guidance complicates the diffusion model training pipeline because it requires training an extra classifier, and this classifier must be trained on noisy data, so it is generally not possible to plug in a pre-trained classifier. Furthermore, because classifier guidance mixes a score estimate with a classifier gradient during sampling, classifier-guided diffusion sampling can be interpreted as attempting to confuse an image classifier with a gradient-based adversarial attack. This raises the question of whether classifier guidance is successful at boosting classifier-based metrics such as FID and Inception score (IS) simply because it is adversarial against such classifiers. Stepping in the direction of classifier gradients also bears some resemblance to GAN training, particularly with nonparametric generators; this also raises the question of whether classifier-guided diffusion models perform well on classifier-based metrics because they are beginning to resemble GANs, which are already known to perform well on such metrics.
To resolve these questions, we present classifier-free guidance, our guidance method which avoids any classifier entirely. Rather than sampling in the direction of the gradient of an image classifier, classifier-free guidance instead mixes the score estimates of a conditional diffusion model and a jointly trained unconditional diffusion model. By sweeping over the mixing weight, we attain a FID/IS tradeoff similar to that attained by classifier guidance. Our classifier-free guidance results demonstrate that pure generative diffusion models are capable of synthesizing extremely high fidelity samples comparable to those possible with other types of generative models.
2 Background
We train diffusion models in continuous time (Song et al., 2021b; Chen et al., 2021; Kingma et al., 2021): letting $\mathbf{x}\sim p(\mathbf{x})$ and $\mathbf{z}=\{\mathbf{z}_{\lambda}\,|\,\lambda\in[\lambda_{\mathrm{min}},\lambda_{\mathrm{max}}]\}$ for hyperparameters $\lambda_{\mathrm{min}} < \lambda_{\mathrm{max}}$, the forward process $q(\mathbf{z}|\mathbf{x})$ is the variance-preserving Markov process with

$$q(\mathbf{z}_{\lambda}|\mathbf{x})=\mathcal{N}(\alpha_{\lambda}\mathbf{x},\sigma_{\lambda}^{2}\mathbf{I}),\ \text{where}\ \alpha_{\lambda}^{2}=1/(1+e^{-\lambda}),\ \sigma_{\lambda}^{2}=1-\alpha_{\lambda}^{2} \tag{1}$$

$$q(\mathbf{z}_{\lambda}|\mathbf{z}_{\lambda'})=\mathcal{N}((\alpha_{\lambda}/\alpha_{\lambda'})\mathbf{z}_{\lambda'},\sigma_{\lambda|\lambda'}^{2}\mathbf{I}),\ \text{where}\ \lambda < \lambda',\ \sigma_{\lambda|\lambda'}^{2}=(1-e^{\lambda-\lambda'})\sigma_{\lambda}^{2} \tag{2}$$

Since $\alpha_{\lambda}^{2}/\sigma_{\lambda}^{2}=e^{\lambda}$, $\lambda$ is the log signal-to-noise ratio (log SNR) of $\mathbf{z}_{\lambda}$, and the forward process runs in the direction of decreasing $\lambda$.
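The noise schedule in Eq. 1 can be checked numerically with a short sketch. The function names below are illustrative and not from the paper; the closed form for the transition variance, $(1-e^{\lambda-\lambda'})\sigma_{\lambda}^{2}$ for $\lambda < \lambda'$, is the one implied by matching the marginal variances of Eq. 1, and the second helper computes it that way for comparison.

```python
import math

def alpha_sigma(lam):
    """Signal scale alpha and noise scale sigma at log SNR `lam` (Eq. 1)."""
    alpha_sq = 1.0 / (1.0 + math.exp(-lam))  # alpha^2 = sigmoid(lam)
    sigma_sq = 1.0 / (1.0 + math.exp(lam))   # sigma^2 = 1 - alpha^2 = sigmoid(-lam)
    return math.sqrt(alpha_sq), math.sqrt(sigma_sq)

def log_snr(lam):
    """Recover log(alpha^2 / sigma^2); equals `lam` up to rounding."""
    alpha, sigma = alpha_sigma(lam)
    return math.log(alpha ** 2 / sigma ** 2)

def transition_variance(lam, lam_prime):
    """Closed-form variance of q(z_lam | z_lam') for lam < lam'."""
    assert lam < lam_prime, "the forward process runs toward lower log SNR"
    _, sigma = alpha_sigma(lam)
    return (1.0 - math.exp(lam - lam_prime)) * sigma ** 2

def transition_variance_from_marginals(lam, lam_prime):
    """Same quantity, obtained by matching the marginal variances in Eq. 1."""
    alpha, sigma = alpha_sigma(lam)
    alpha_p, sigma_p = alpha_sigma(lam_prime)
    return sigma ** 2 - (alpha / alpha_p) ** 2 * sigma_p ** 2
```

The two variance helpers agree up to floating-point rounding, which is a quick way to confirm that the marginals and the transitions describe the same process.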
Conditioned on $\mathbf{x}$, the forward process can be described in reverse by the transitions $q(\mathbf{z}_{\lambda'}|\mathbf{z}_{\lambda},\mathbf{x})=\mathcal{N}(\tilde{\boldsymbol{\mu}}_{\lambda'|\lambda}(\mathbf{z}_{\lambda},\mathbf{x}),\tilde{\sigma}^{2}_{\lambda'|\lambda}\mathbf{I})$, where

$$\tilde{\boldsymbol{\mu}}_{\lambda'|\lambda}(\mathbf{z}_{\lambda},\mathbf{x})=e^{\lambda-\lambda'}(\alpha_{\lambda'}/\alpha_{\lambda})\mathbf{z}_{\lambda}+(1-e^{\lambda-\lambda'})\alpha_{\lambda'}\mathbf{x},\quad\tilde{\sigma}^{2}_{\lambda'|\lambda}=(1-e^{\lambda-\lambda'})\sigma_{\lambda'}^{2} \tag{3}$$
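As a sanity check on Eq. 3, the sketch below evaluates the posterior mean and variance for scalar inputs. The helper names are hypothetical; only the formulas come from the text. A useful property to look for is that a noise-free latent $\mathbf{z}_{\lambda}=\alpha_{\lambda}\mathbf{x}$ maps to the noise-free latent $\alpha_{\lambda'}\mathbf{x}$ at the next log SNR.

```python
import math

def alpha_sigma(lam):
    """Signal and noise scales at log SNR `lam` (Eq. 1)."""
    return (math.sqrt(1.0 / (1.0 + math.exp(-lam))),
            math.sqrt(1.0 / (1.0 + math.exp(lam))))

def posterior(z_lam, x, lam, lam_prime):
    """Mean and variance of q(z_lam' | z_lam, x) for lam < lam' (Eq. 3), scalar case."""
    alpha, _ = alpha_sigma(lam)
    alpha_p, sigma_p = alpha_sigma(lam_prime)
    r = math.exp(lam - lam_prime)                 # lies in (0, 1) because lam < lam'
    mean = r * (alpha_p / alpha) * z_lam + (1.0 - r) * alpha_p * x
    var = (1.0 - r) * sigma_p ** 2
    return mean, var
```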
The reverse process generative model starts from $p_{\theta}(\mathbf{z}_{\lambda_{\mathrm{min}}})=\mathcal{N}(\mathbf{0},\mathbf{I})$. We specify the transitions:

$$p_{\theta}(\mathbf{z}_{\lambda'}|\mathbf{z}_{\lambda})=\mathcal{N}(\tilde{\boldsymbol{\mu}}_{\lambda'|\lambda}(\mathbf{z}_{\lambda},\mathbf{x}_{\theta}(\mathbf{z}_{\lambda})),(\tilde{\sigma}^{2}_{\lambda'|\lambda})^{1-v}(\sigma^{2}_{\lambda|\lambda'})^{v}) \tag{4}$$
During sampling, we apply this transition along an increasing sequence $\lambda_{\mathrm{min}}=\lambda_{1} < \cdots < \lambda_{T}=\lambda_{\mathrm{max}}$ for $T$ timesteps.
The reverse process mean comes from an estimate $\mathbf{x}_{\theta}(\mathbf{z}_{\lambda})\approx\mathbf{x}$ plugged into $q(\mathbf{z}_{\lambda'}|\mathbf{z}_{\lambda},\mathbf{x})$ (Ho et al., 2020; Kingma et al., 2021) ($\mathbf{x}_{\theta}$ also receives $\lambda$ as input, but we suppress this to keep our notation clean). We parameterize $\mathbf{x}_{\theta}$ in terms of $\boldsymbol{\epsilon}$-prediction (Ho et al., 2020): $\mathbf{x}_{\theta}(\mathbf{z}_{\lambda})=(\mathbf{z}_{\lambda}-\sigma_{\lambda}\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda}))/\alpha_{\lambda}$, and we train on the objective

$$\mathbb{E}_{\boldsymbol{\epsilon},\lambda}\left[\|\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})-\boldsymbol{\epsilon}\|^{2}_{2}\right] \tag{5}$$
where $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, $\mathbf{z}_{\lambda}=\alpha_{\lambda}\mathbf{x}+\sigma_{\lambda}\boldsymbol{\epsilon}$, and $\lambda$ is drawn from a distribution $p(\lambda)$ over $[\lambda_{\mathrm{min}},\lambda_{\mathrm{max}}]$. This objective is denoising score matching (Vincent, 2011; Hyvärinen & Dayan, 2005) over multiple noise scales (Song & Ermon, 2019), and when $p(\lambda)$ is uniform, the objective is proportional to the variational lower bound on the marginal log likelihood of the latent variable model $\int p_{\theta}(\mathbf{x}|\mathbf{z})p_{\theta}(\mathbf{z})d\mathbf{z}$, ignoring the term for the unspecified decoder $p_{\theta}(\mathbf{x}|\mathbf{z})$ and for the prior at $\mathbf{z}_{\lambda_{\mathrm{min}}}$ (Kingma et al., 2021).
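The relationship between $\boldsymbol{\epsilon}$-prediction and the implied data estimate can be illustrated with a minimal sketch. It uses plain Python lists and an oracle that returns the true noise in place of a trained network $\boldsymbol{\epsilon}_{\theta}$; the function names and the example values are assumptions made for illustration.

```python
import math
import random

def alpha_sigma(lam):
    """Signal and noise scales at log SNR `lam` (Eq. 1)."""
    return (math.sqrt(1.0 / (1.0 + math.exp(-lam))),
            math.sqrt(1.0 / (1.0 + math.exp(lam))))

def x_from_eps(z, eps_hat, lam):
    """x_theta = (z - sigma * eps_hat) / alpha, applied elementwise."""
    alpha, sigma = alpha_sigma(lam)
    return [(zi - sigma * ei) / alpha for zi, ei in zip(z, eps_hat)]

def eps_loss(eps_hat, eps):
    """Squared L2 error between predicted and true noise (one sample of Eq. 5)."""
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps))

rng = random.Random(0)
x = [0.5, -1.0, 2.0]                      # toy data point
lam = 1.5                                 # arbitrary log SNR for the example
alpha, sigma = alpha_sigma(lam)
eps = [rng.gauss(0.0, 1.0) for _ in x]    # eps ~ N(0, I)
z = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]  # corrupt the data
x_rec = x_from_eps(z, eps, lam)           # an oracle noise prediction recovers x
```

With a perfect noise prediction the loss is zero and `x_rec` matches `x` up to rounding; a trained network only approximates this.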
If $p(\lambda)$ is not uniform, the objective can be interpreted as a weighted variational lower bound whose weighting can be tuned for sample quality (Ho et al., 2020; Kingma et al., 2021). We use a $p(\lambda)$ inspired by the discrete time cosine noise schedule of Nichol & Dhariwal (2021): we sample $\lambda$ via $\lambda=-2\log\tan(au+b)$ for uniformly distributed $u\in[0,1]$, where $b=\arctan(e^{-\lambda_{\mathrm{max}}/2})$ and $a=\arctan(e^{-\lambda_{\mathrm{min}}/2})-b$. This represents a hyperbolic secant distribution modified to be supported on a bounded interval. For finite timestep generation, we use $\lambda$ values corresponding to uniformly spaced $u\in[0,1]$, and the final generated sample is $\mathbf{x}_{\theta}(\mathbf{z}_{\lambda_{\mathrm{max}}})$.
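The sampling rule for $\lambda$ translates directly into code. The sketch below is a minimal illustration with hypothetical function names; the default endpoints of $\pm 20$ are an assumption for the example. Note that $u=0$ maps to $\lambda_{\mathrm{max}}$ and $u=1$ maps to $\lambda_{\mathrm{min}}$, so an increasing log SNR sequence for generation is obtained by sweeping $u$ from 1 down to 0.

```python
import math

def logsnr_from_u(u, lam_min=-20.0, lam_max=20.0):
    """lambda = -2 log tan(a u + b) for u in [0, 1]."""
    b = math.atan(math.exp(-lam_max / 2.0))
    a = math.atan(math.exp(-lam_min / 2.0)) - b
    return -2.0 * math.log(math.tan(a * u + b))

def sampling_logsnrs(num_steps, lam_min=-20.0, lam_max=20.0):
    """Increasing log SNR values from uniformly spaced u (num_steps >= 2)."""
    return [logsnr_from_u(1.0 - t / (num_steps - 1), lam_min, lam_max)
            for t in range(num_steps)]
```

For training, `u` would be drawn uniformly at random instead of taken from a grid.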
Because the loss for $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})$ is denoising score matching for all $\lambda$, the score $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})$ learned by our model estimates the gradient of the log-density of the distribution of our noisy data $\mathbf{z}_{\lambda}$, that is $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})\approx-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p(\mathbf{z}_{\lambda})$; note, however, that because we use unconstrained neural networks to define $\boldsymbol{\epsilon}_{\theta}$, there need not exist any scalar potential whose gradient is $\boldsymbol{\epsilon}_{\theta}$. Sampling from the learned diffusion model resembles using Langevin diffusion to sample from a sequence of distributions $p(\mathbf{z}_{\lambda})$ that converges to the distribution $p(\mathbf{x})$ of the original data $\mathbf{x}$.
In the case of conditional generative modeling, the data $\mathbf{x}$ is drawn jointly with conditioning information $\mathbf{c}$, e.g. a class label for class-conditional image generation. The only modification to the model is that the reverse process function approximator receives $\mathbf{c}$ as input, as in $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$.
3 Guidance
An interesting property of certain generative models, such as GANs and flow-based models, is the ability to perform truncated or low temperature sampling by decreasing the variance or range of noise inputs to the generative model at sampling time. The intended effect is to decrease the diversity of the samples while increasing the quality of each individual sample. Truncation in BigGAN (Brock et al., 2019), for example, yields a tradeoff curve between FID score and Inception score for low and high amounts of truncation, respectively. Low temperature sampling in Glow (Kingma & Dhariwal, 2018) has a similar effect.
Unfortunately, straightforward attempts at implementing truncation or low temperature sampling in diffusion models are ineffective. For example, scaling model scores or decreasing the variance of Gaussian noise in the reverse process causes the diffusion model to generate blurry, low quality samples (Dhariwal & Nichol, 2021).
3.1 Classifier guidance
To obtain a truncation-like effect in diffusion models, Dhariwal & Nichol (2021) introduce classifier guidance, where the diffusion score $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})\approx-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p(\mathbf{z}_{\lambda}|\mathbf{c})$ is modified to include the gradient of the log likelihood of an auxiliary classifier model $p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})$ as follows:

$$\tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})=\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})-w\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})\approx-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}[\log p(\mathbf{z}_{\lambda}|\mathbf{c})+w\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})],$$

where $w$ is a parameter that controls the strength of the classifier guidance. This modified score $\tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$ is then used in place of $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$ when sampling from the diffusion model, resulting in approximate samples from the distribution

$$\tilde{p}_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})\propto p_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})^{w}.$$
The effect is that of up-weighting the probability of data for which the classifier $p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})$ assigns high likelihood to the correct label: data that can be classified well scores high on the Inception score of perceptual quality (Salimans et al., 2016), which rewards generative models for this by design. Dhariwal & Nichol therefore find that by setting $w>0$ they can improve the Inception score of their diffusion model, at the expense of decreased diversity in their samples.
Figure 2 illustrates the effect of numerically solved guidance $\tilde{p}_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})\propto p_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})^{w}$ on a toy 2D example of three classes, in which the conditional distribution for each class is an isotropic Gaussian. The form of each conditional upon applying guidance is markedly non-Gaussian. As guidance strength is increased, each conditional places probability mass farther away from other classes and towards directions of high confidence given by logistic regression, and most of the mass becomes concentrated in smaller regions. This behavior can be seen as a simplistic manifestation of the Inception score boost and sample diversity decrease that occur when classifier guidance strength is increased in an ImageNet model.
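The qualitative behavior described for Figure 2 can be reproduced in one dimension with a short numerical sketch. This is not the paper's 2D experiment: the class means, the unit variances, the equal priors, the grid, and the use of the exact Bayes posterior as the classifier are all assumptions chosen for illustration.

```python
import math

MEANS = [-2.0, 0.0, 2.0]  # hypothetical class means, unit-variance conditionals

def gauss(z, mu):
    """Density of N(mu, 1) at z."""
    return math.exp(-0.5 * (z - mu) ** 2) / math.sqrt(2.0 * math.pi)

def class_posterior(z, c):
    """Exact p(c | z) under equal class priors, via Bayes' rule."""
    total = sum(gauss(z, mu) for mu in MEANS)
    return gauss(z, MEANS[c]) / total

def guided_moments(c, w, lo=-12.0, hi=12.0, n=4801):
    """Mean and std of the guided conditional p(z|c) p(c|z)^w, normalized on a grid."""
    step = (hi - lo) / (n - 1)
    zs = [lo + i * step for i in range(n)]
    dens = [gauss(z, MEANS[c]) * class_posterior(z, c) ** w for z in zs]
    norm = sum(dens)
    mean = sum(z * d for z, d in zip(zs, dens)) / norm
    var = sum((z - mean) ** 2 * d for z, d in zip(zs, dens)) / norm
    return mean, math.sqrt(var)
```

Calling `guided_moments(2, w)` for increasing `w` shows the guided conditional of the rightmost class moving away from the other classes while its spread shrinks, mirroring the trend described above.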
Applying classifier guidance with weight $w+1$ to an unconditional model would theoretically lead to the same result as applying classifier guidance with weight $w$ to a conditional model, because $p_{\theta}(\mathbf{z}_{\lambda}|\mathbf{c})p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})^{w}\propto p_{\theta}(\mathbf{z}_{\lambda})p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})^{w+1}$; or in terms of scores,

$$\begin{aligned}\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})-(w+1)\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})&\approx-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}[\log p(\mathbf{z}_{\lambda})+(w+1)\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})]\\&=-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}[\log p(\mathbf{z}_{\lambda}|\mathbf{c})+w\log p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})],\end{aligned}$$
but interestingly, Dhariwal & Nichol obtain their best results when applying classifier guidance to an already class-conditional model, as opposed to applying guidance to an unconditional model. For this reason, we will stay in the setup of guiding an already conditional model.
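The equivalence between guiding an unconditional model with weight $w+1$ and a conditional model with weight $w$ rests on a single application of Bayes' rule. The short derivation below spells out that step; it assumes, as the argument above implicitly does, that the classifier $p_{\theta}(\mathbf{c}|\mathbf{z}_{\lambda})$ matches the true posterior $p(\mathbf{c}|\mathbf{z}_{\lambda})$.

```latex
\begin{align*}
\log p(\mathbf{z}_{\lambda}|\mathbf{c})
  &= \log p(\mathbf{z}_{\lambda}) + \log p(\mathbf{c}|\mathbf{z}_{\lambda}) - \log p(\mathbf{c}), \\
\nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{z}_{\lambda}|\mathbf{c})
  &= \nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{z}_{\lambda})
   + \nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{c}|\mathbf{z}_{\lambda}), \\
\nabla_{\mathbf{z}_{\lambda}} \bigl[\log p(\mathbf{z}_{\lambda}) + (w+1)\log p(\mathbf{c}|\mathbf{z}_{\lambda})\bigr]
  &= \nabla_{\mathbf{z}_{\lambda}} \bigl[\log p(\mathbf{z}_{\lambda}|\mathbf{c}) + w\log p(\mathbf{c}|\mathbf{z}_{\lambda})\bigr].
\end{align*}
```

The class prior $\log p(\mathbf{c})$ does not depend on $\mathbf{z}_{\lambda}$, so it drops out of the gradient in the second line.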
3.2 Classifier-free guidance
While classifier guidance successfully trades off IS and FID as expected from truncation or low temperature sampling, it is nonetheless reliant on gradients from an image classifier and we seek to eliminate the classifier for the reasons stated in Section 1. Here, we describe classifier-free guidance, which achieves the same effect without such gradients. Classifier-free guidance is an alternative method of modifying $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$ to have the same effect as classifier guidance, but without a classifier. Algorithms 1 and 2 describe training and sampling with classifier-free guidance in detail.
Algorithm 1 Joint training of a diffusion model with classifier-free guidance
1: $p_{\mathrm{uncond}}$: probability of unconditional training
2: repeat
3:   $(\mathbf{x},\mathbf{c})\sim p(\mathbf{x},\mathbf{c})$   ▷ Sample data with conditioning from the dataset
4:   $\mathbf{c}\leftarrow\varnothing$ with probability $p_{\mathrm{uncond}}$   ▷ Randomly discard conditioning to train unconditionally
5:   $\lambda\sim p(\lambda)$   ▷ Sample log SNR value
6:   $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
7:   $\mathbf{z}_{\lambda}=\alpha_{\lambda}\mathbf{x}+\sigma_{\lambda}\boldsymbol{\epsilon}$   ▷ Corrupt data to the sampled log SNR value
8:   Take gradient step on $\nabla_{\theta}\left\|\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})-\boldsymbol{\epsilon}\right\|^{2}$   ▷ Optimization of denoising model
9: until converged
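The control flow of Algorithm 1 can be sketched in plain Python. To keep the example dependency-free, the neural network is replaced by a deliberately tiny stand-in: scalar data and one scalar weight per conditioning value, with `eps_hat = params[c] * z` and a hand-written gradient. That toy model, the names `NULL`, `sample_logsnr`, and `training_step`, the learning rate, and the log SNR endpoints are assumptions for illustration, not the paper's architecture or settings.

```python
import math
import random

NULL = None  # stands in for the null token used when conditioning is discarded

def sample_logsnr(rng, lam_min=-20.0, lam_max=20.0):
    """Draw a log SNR via lambda = -2 log tan(a u + b) with u uniform."""
    b = math.atan(math.exp(-lam_max / 2.0))
    a = math.atan(math.exp(-lam_min / 2.0)) - b
    return -2.0 * math.log(math.tan(a * rng.random() + b))

def training_step(params, dataset, p_uncond, rng, lr=0.01):
    """One iteration of the training loop for the toy per-condition linear model."""
    x, c = rng.choice(dataset)              # sample data with its conditioning
    if rng.random() < p_uncond:             # randomly discard the conditioning
        c = NULL
    lam = sample_logsnr(rng)                # sample a log SNR value
    eps = rng.gauss(0.0, 1.0)               # sample noise
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-lam)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(lam)))
    z = alpha * x + sigma * eps             # corrupt data to the sampled log SNR
    residual = params[c] * z - eps          # toy prediction minus true noise
    params[c] -= lr * 2.0 * residual * z    # gradient step on residual^2
    return c, residual ** 2
```

A call such as `training_step({0: 0.0, 1: 0.0, NULL: 0.0}, [(-1.0, 0), (1.0, 1)], 0.1, random.Random(0))` updates exactly one entry of `params`: the unconditional entry when the label is dropped, the class entry otherwise. A single shared network that receives the null token as an input plays the same role in the paper's setting.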
Algorithm 2 Conditional sampling with classifier-free guidance
1: $w$: guidance strength
2: $\mathbf{c}$: conditioning information for conditional sampling
3: $\lambda_{1},\dotsc,\lambda_{T}$: increasing log SNR sequence with $\lambda_{1}=\lambda_{\mathrm{min}}$, $\lambda_{T}=\lambda_{\mathrm{max}}$
4: $\mathbf{z}_{1}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
5: for $t=1,\dotsc,T$ do
     ▷ Form the classifier-free guided score at log SNR $\lambda_{t}$
6:   $\tilde{\boldsymbol{\epsilon}}_{t}=(1+w)\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{t},\mathbf{c})-w\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{t})$
     ▷ Sampling step (could be replaced by another sampler, e.g. DDIM)
7:   $\tilde{\mathbf{x}}_{t}=(\mathbf{z}_{t}-\sigma_{\lambda_{t}}\tilde{\boldsymbol{\epsilon}}_{t})/\alpha_{\lambda_{t}}$
8:   $\mathbf{z}_{t+1}\sim\mathcal{N}(\tilde{\boldsymbol{\mu}}_{\lambda_{t+1}|\lambda_{t}}(\mathbf{z}_{t},\tilde{\mathbf{x}}_{t}),(\tilde{\sigma}^{2}_{\lambda_{t+1}|\lambda_{t}})^{1-v}(\sigma^{2}_{\lambda_{t}|\lambda_{t+1}})^{v})$ if $t < T$ else $\mathbf{z}_{t+1}=\tilde{\mathbf{x}}_{t}$
9: end for
10: return $\mathbf{z}_{T+1}$
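Algorithm 2 can be exercised end to end on a toy problem where the exact scores are known in closed form. The sketch below is an illustration under stated assumptions: scalar data from two equally likely classes, each a unit-variance Gaussian; an analytic `exact_eps` standing in for a trained network, with `c=None` as the unconditional call; a transition variance $\sigma^{2}_{\lambda|\lambda'}=(1-e^{\lambda-\lambda'})\sigma^{2}_{\lambda}$ taken from matching the forward marginals; and a final step that returns the data estimate $\tilde{\mathbf{x}}_{T}$.

```python
import math
import random

MEANS = {0: -1.0, 1: 1.0}   # hypothetical class means
DATA_STD = 1.0              # hypothetical within-class standard deviation

def alpha_sigma(lam):
    """Signal and noise scales at log SNR `lam`."""
    return (math.sqrt(1.0 / (1.0 + math.exp(-lam))),
            math.sqrt(1.0 / (1.0 + math.exp(lam))))

def exact_eps(z, lam, c):
    """Analytic eps = -sigma * score for the toy data; c=None is unconditional."""
    alpha, sigma = alpha_sigma(lam)
    var = alpha ** 2 * DATA_STD ** 2 + sigma ** 2  # variance of z within a class
    if c is not None:
        return sigma * (z - alpha * MEANS[c]) / var
    # Equal-prior mixture: weight each class by its responsibility for z.
    logits = [-(z - alpha * m) ** 2 / (2.0 * var) for m in MEANS.values()]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(wt / total * sigma * (z - alpha * m) / var
               for wt, m in zip(weights, MEANS.values()))

def log_snr_sequence(num_steps, lam_min=-20.0, lam_max=20.0):
    """Increasing log SNR values from uniformly spaced u in [0, 1]."""
    b = math.atan(math.exp(-lam_max / 2.0))
    a = math.atan(math.exp(-lam_min / 2.0)) - b
    return [-2.0 * math.log(math.tan(a * (1.0 - t / (num_steps - 1)) + b))
            for t in range(num_steps)]

def sample(eps_model, c, w, lams, v, rng):
    """Guided ancestral sampling for scalar data."""
    z = rng.gauss(0.0, 1.0)                                   # initial latent
    for t, lam in enumerate(lams):
        alpha, sigma = alpha_sigma(lam)
        # Classifier-free guided noise estimate.
        eps_t = (1.0 + w) * eps_model(z, lam, c) - w * eps_model(z, lam, None)
        x_t = (z - sigma * eps_t) / alpha                     # data estimate
        if t == len(lams) - 1:
            return x_t                                        # final sample
        alpha_n, sigma_n = alpha_sigma(lams[t + 1])
        r = math.exp(lam - lams[t + 1])
        mean = r * (alpha_n / alpha) * z + (1.0 - r) * alpha_n * x_t
        # Log-space interpolation between the reverse and forward variances.
        var = ((1.0 - r) * sigma_n ** 2) ** (1.0 - v) * ((1.0 - r) * sigma ** 2) ** v
        z = mean + math.sqrt(var) * rng.gauss(0.0, 1.0)
```

With `w = 0` the sampler targets the plain conditional distribution; increasing `w` pushes samples for a class farther from the other class, which is the fidelity-for-diversity trade the section describes.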
Writing $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})=\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c}=\varnothing)$ for the score estimate with the conditioning discarded, sampling uses the following linear combination of the conditional and unconditional score estimates:

$$\tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})=(1+w)\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})-w\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda}) \tag{6}$$
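Eq. 6 is a one-line combination, and writing it out makes its extrapolation structure visible. The two hypothetical helpers below compute the same quantity: the first follows Eq. 6 literally, the second rewrites it as the conditional estimate plus $w$ times the difference between the conditional and unconditional estimates.

```python
def guided_eps(eps_cond, eps_uncond, w):
    """Eq. 6 applied elementwise: (1 + w) * eps_cond - w * eps_uncond."""
    return [(1.0 + w) * ec - w * eu for ec, eu in zip(eps_cond, eps_uncond)]

def guided_eps_extrapolated(eps_cond, eps_uncond, w):
    """Equivalent form: start at eps_cond and move w steps away from eps_uncond."""
    return [ec + w * (ec - eu) for ec, eu in zip(eps_cond, eps_uncond)]
```

Setting `w = 0` returns the conditional estimate unchanged, and positive `w` moves past it in the direction away from the unconditional estimate.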
Eq. 6 has no classifier gradient present, so taking a step in the $\tilde{\boldsymbol{\epsilon}}_{\theta}$ direction cannot be interpreted as a gradient-based adversarial attack on an image classifier. Furthermore, $\tilde{\boldsymbol{\epsilon}}_{\theta}$ is constructed from score estimates that are non-conservative vector fields due to the use of unconstrained neural networks, so there in general cannot exist a scalar potential such as a classifier log likelihood for which $\tilde{\boldsymbol{\epsilon}}_{\theta}$ is the classifier-guided score.
Despite the fact that there in general may not exist a classifier for which Eq. 6 is the classifier-guided score, it is in fact inspired by the gradient of an implicit classifier $p^{i}(\mathbf{c}|\mathbf{z}_{\lambda})\propto p(\mathbf{z}_{\lambda}|\mathbf{c})/p(\mathbf{z}_{\lambda})$. If we had access to exact scores $\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})$ and $\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda})$ (of $p(\mathbf{z}_{\lambda}|\mathbf{c})$ and $p(\mathbf{z}_{\lambda})$, respectively), then the gradient of this implicit classifier would be $\nabla_{\mathbf{z}_{\lambda}}\log p^{i}(\mathbf{c}|\mathbf{z}_{\lambda})=-\frac{1}{\sigma_{\lambda}}[\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})-\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda})]$, and classifier guidance with this implicit classifier would modify the score estimate into $\tilde{\boldsymbol{\epsilon}}^{*}(\mathbf{z}_{\lambda},\mathbf{c})=(1+w)\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})-w\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda})$. Note the resemblance to Eq. 6, but also note that $\tilde{\boldsymbol{\epsilon}}^{*}(\mathbf{z}_{\lambda},\mathbf{c})$ differs fundamentally from $\tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$. The former is constructed from the scaled classifier gradient $\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})-\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda})$; the latter is constructed from the estimate $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})-\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})$, and this expression is not in general the (scaled) gradient of any classifier, again because the score estimates are the outputs of unconstrained neural networks.
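The implicit-classifier argument can be written out step by step. The derivation below uses only the definitions already given: the implicit classifier $p^{i}(\mathbf{c}|\mathbf{z}_{\lambda})\propto p(\mathbf{z}_{\lambda}|\mathbf{c})/p(\mathbf{z}_{\lambda})$, the exact scores $\boldsymbol{\epsilon}^{*}=-\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}}\log p$, and the classifier guidance rule from Section 3.1.

```latex
\begin{align*}
\nabla_{\mathbf{z}_{\lambda}} \log p^{i}(\mathbf{c}|\mathbf{z}_{\lambda})
  &= \nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{z}_{\lambda}|\mathbf{c})
   - \nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{z}_{\lambda})
   = -\frac{1}{\sigma_{\lambda}}
     \bigl[\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})
         - \boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda})\bigr], \\
\tilde{\boldsymbol{\epsilon}}^{*}(\mathbf{z}_{\lambda},\mathbf{c})
  &= \boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})
   - w\,\sigma_{\lambda}\nabla_{\mathbf{z}_{\lambda}} \log p^{i}(\mathbf{c}|\mathbf{z}_{\lambda}) \\
  &= \boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})
   + w\bigl[\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})
          - \boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda})\bigr]
   = (1+w)\,\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda},\mathbf{c})
   - w\,\boldsymbol{\epsilon}^{*}(\mathbf{z}_{\lambda}).
\end{align*}
```

The normalizing constant of $p^{i}$ is independent of $\mathbf{z}_{\lambda}$ and therefore vanishes under the gradient in the first line.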
It is not obvious a priori that inverting a generative model using Bayes’ rule yields a good classifier that provides a useful guidance signal. For example, Grandvalet & Bengio (2004) find that discriminative models generally outperform implicit classifiers derived from generative models, even in artificial cases where the specification of those generative models exactly matches the data distribution. In cases such as ours, where we expect the model to be misspecified, classifiers derived by Bayes’ rule can be inconsistent (Grünwald & Langford, 2007) and we lose all guarantees on their performance. Nevertheless, in Section 4, we show empirically that classifier-free guidance is able to trade off FID and IS in the same way as classifier guidance. In Section 5 we discuss the implications of classifier-free guidance in relation to classifier guidance.
4 Experiments
Figure 3: Classifier-free guidance on 128x128 ImageNet. Left: non-guided samples, right: classifier-free guided samples with $w=3.0$. Interestingly, strongly guided samples such as these display saturated colors. See Fig. 8 for more.
We train diffusion models with classifier-free guidance on area-downsampled class-conditional ImageNet (Russakovsky et al., 2015), the standard setting for studying tradeoffs between FID and Inception scores starting from the BigGAN paper (Brock et al., 2019).
The purpose of our experiments is to serve as a proof of concept to demonstrate that classifier-free guidance is able to attain a FID/IS tradeoff similar to classifier guidance and to understand the behavior of classifier-free guidance, not necessarily to push sample quality metrics to state of the art on these benchmarks. For this purpose, we use the same model architectures and hyperparameters as the guided diffusion models of Dhariwal & Nichol (2021) (apart from continuous time training as specified in Section 2); those hyperparameter settings were tuned for classifier guidance and hence may be suboptimal for classifier-free guidance. Furthermore, since we amortize the conditional and unconditional models into the same architecture without an extra classifier, we in fact are using less model capacity than previous work. Nevertheless, our classifier-free guided models still produce competitive sample quality metrics and sometimes outperform prior work, as can be seen in the following sections.
4.1 Varying the classifier-free guidance strength
Here we experimentally verify the main claim of this paper: that classifier-free guidance is able to trade off IS and FID in a manner like classifier guidance or GAN truncation. We apply our proposed classifier-free guidance to $64\times 64$ and $128\times 128$ class-conditional ImageNet generation. In Table 1 and Fig. 4, we show sample quality effects of sweeping over the guidance strength $w$ on our $64\times 64$ ImageNet models; Table 2 and Fig. 5 show the same for our $128\times 128$ models. We consider $w\in\{0,0.1,0.2,\ldots,4\}$ and calculate FID and Inception Scores with 50000 samples for each value following the procedures of Heusel et al. (2017) and Salimans et al. (2016). All models used log SNR endpoints $\lambda_{\mathrm{min}}=-20$ and $\lambda_{\mathrm{max}}=20$. The $64\times 64$ models used sampler noise interpolation coefficient $v=0.3$ and were trained for 400 thousand steps; the $128\times 128$ models used $v=0.2$ and were trained for 2.7 million steps.
We obtain the best FID results with a small amount of guidance ($w=0.1$ or $w=0.3$, depending on the dataset) and the best IS result with strong guidance ($w\geq 4$). Between these two extremes we see a clear trade-off between these two metrics of perceptual quality: as $w$ grows past the FID-optimal value, FID worsens (increases) monotonically while IS increases monotonically. Our results compare favorably to Dhariwal & Nichol (2021) and Ho et al. (2021), and in fact our $128\times 128$ results are the state of the art in the literature. At $w=0.3$, our model's FID score on $128\times 128$ ImageNet outperforms the classifier-guided ADM-G, and at $w=4.0$, our model outperforms BigGAN-deep at both FID and IS when BigGAN-deep is evaluated at its best-IS truncation level.
Figures 1 , 6 , 3 , 7 and 8 show randomly generated samples from our model for different levels of guidance: here we clearly see that increasing classifier-free guidance strength has the expected effect of decreasing sample variety and increasing individual sample fidelity.
| Model | FID (↓) | IS (↑) |
|---|---|---|
| ADM (Dhariwal & Nichol, 2021) | 2.07 | - |
| CDM (Ho et al., 2021) | 1.48 | 67.95 |
| Ours, $p_{\mathrm{uncond}}=0.1/0.2/0.5$ | | |
| $w=0.0$ | 1.8 / 1.8 / 2.21 | 53.71 / 52.9 / 47.61 |
| $w=0.1$ | 1.55 / 1.62 / 1.91 | 66.11 / 64.58 / 56.1 |
| $w=0.2$ | 2.04 / 2.1 / 2.08 | 78.91 / 76.99 / 65.6 |
| $w=0.3$ | 3.03 / 2.93 / 2.65 | 92.8 / 88.64 / 74.92 |
| $w=0.4$ | 4.3 / 4 / 3.44 | 106.2 / 101.11 / 84.27 |
| $w=0.5$ | 5.74 / 5.19 / 4.34 | 119.3 / 112.15 / 92.95 |
| $w=0.6$ | 7.19 / 6.48 / 5.27 | 131.1 / 122.13 / 102 |
| $w=0.7$ | 8.62 / 7.73 / 6.23 | 141.8 / 131.6 / 109.8 |
| $w=0.8$ | 10.08 / 8.9 / 7.25 | 151.6 / 140.82 / 116.9 |
| $w=0.9$ | 11.41 / 10.09 / 8.21 | 161 / 150.26 / 124.6 |
| $w=1.0$ | 12.6 / 11.21 / 9.13 | 170.1 / 158.29 / 131.1 |
| $w=2.0$ | 21.03 / 18.79 / 16.16 | 225.5 / 212.98 / 183 |
| $w=3.0$ | 24.83 / 22.36 / 19.75 | 250.4 / 237.65 / 208.9 |
| $w=4.0$ | 26.22 / 23.84 / 21.48 | 260.2 / 248.97 / 225.1 |

Table 1: ImageNet 64x64 results ($w=0.0$ refers to non-guided models).
Figure 4: IS/FID curves over guidance strengths for ImageNet 64x64 models. Each curve represents a model with unconditional training probability $p_{\mathrm{uncond}}$. Accompanies Table 1. (Plot axes: IS and FID; one curve each for $p_{\mathrm{uncond}}=0.1$, $p_{\mathrm{uncond}}=0.2$ and $p_{\mathrm{uncond}}=0.5$.)
4.2 Varying the unconditional training probability
The main hyperparameter of classifier-free guidance at training time is $p_{\mathrm{uncond}}$, the probability of training on unconditional generation during joint training of the conditional and unconditional diffusion models. Here, we study the effect of training models with varying $p_{\mathrm{uncond}}$ on $64\times 64$ ImageNet.
Table 1 and Fig. 4 show the effects of $p_{\mathrm{uncond}}$ on sample quality. We trained models with $p_{\mathrm{uncond}}\in\{0.1,0.2,0.5\}$, all for 400 thousand training steps, and evaluated sample quality across various guidance strengths. We find that $p_{\mathrm{uncond}}=0.5$ consistently performs worse than $p_{\mathrm{uncond}}\in\{0.1,0.2\}$ across the entire IS/FID frontier; $p_{\mathrm{uncond}}=0.1$ and $p_{\mathrm{uncond}}=0.2$ perform about equally well.
Based on these findings, we conclude that only a relatively small portion of the model capacity of the diffusion model needs to be dedicated to the unconditional generation task in order to produce classifier-free guided scores effective for sample quality. Interestingly, for classifier guidance, Dhariwal & Nichol (2021) report that relatively small classifiers with little capacity are sufficient for effective classifier guided sampling, mirroring this phenomenon that we found with classifier-free guided models.
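The role of $p_{\mathrm{uncond}}$ during joint training can be illustrated with a minimal sketch. It assumes class labels are integers and that a sentinel value stands in for "no conditioning"; the name `NULL_LABEL` and the helper `drop_conditioning` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

NULL_LABEL = -1  # hypothetical sentinel meaning "train this example unconditionally"

def drop_conditioning(labels, p_uncond, rng):
    """Replace each label by NULL_LABEL independently with probability p_uncond."""
    labels = np.asarray(labels)
    drop = rng.random(labels.shape) < p_uncond
    return np.where(drop, NULL_LABEL, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, 1000, size=100_000)  # toy batch of class ids in [0, 1000)
dropped = drop_conditioning(labels, p_uncond=0.1, rng=rng)
frac_uncond = float(np.mean(dropped == NULL_LABEL))
```

In this sketch roughly a fraction $p_{\mathrm{uncond}}$ of each batch is seen without its label, so a single network is trained on both tasks.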
4.3 Varying the number of sampling steps
Since the number of sampling steps $T$ is known to have a major impact on the sample quality of a diffusion model, here we study the effect of varying $T$ on our $128\times 128$ ImageNet model. Table 2 and Fig. 5 show the effect of varying $T\in\{128,256,1024\}$ over a range of guidance strengths. As expected, sample quality improves when $T$ is increased, and for this model $T=256$ attains a good balance between sample quality and sampling speed.
Note that $T=256$ is approximately the same number of sampling steps used by ADM-G (Dhariwal & Nichol, 2021), which is outperformed by our model. However, it is important to note that each sampling step for our method requires evaluating the denoising model twice, once for the conditional $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c})$ and once for the unconditional $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})$. Because we used the same model architecture as ADM-G, the fair comparison in terms of sampling speed would be our $T=128$ setting, which underperforms compared to ADM-G in terms of FID score.
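The cost argument above amounts to simple bookkeeping, sketched here. The sketch counts only forward passes of the denoising model and ignores the cost of the classifier in classifier guidance; the helper name `network_evaluations` is hypothetical.

```python
def network_evaluations(num_steps, classifier_free):
    """Denoiser forward passes needed to draw one sample with num_steps sampling steps."""
    passes_per_step = 2 if classifier_free else 1  # conditional + unconditional vs. one pass
    return passes_per_step * num_steps

# Classifier-free guidance at the three step counts studied in this section.
evals = {T: network_evaluations(T, classifier_free=True) for T in (128, 256, 1024)}
```

Under this count, classifier-free guidance with $T=128$ uses as many denoiser passes as a single-pass sampler with 256 steps, which is why $T=128$ is the speed-matched setting.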
| Model | FID (↓) | IS (↑) |
|---|---|---|
| BigGAN-deep, max IS (Brock et al., 2019) | 25 | 253 |
| BigGAN-deep (Brock et al., 2019) | 5.7 | 124.5 |
| CDM (Ho et al., 2021) | 3.52 | 128.8 |
| LOGAN (Wu et al., 2019) | 3.36 | 148.2 |
| ADM-G (Dhariwal & Nichol, 2021) | 2.97 | - |
| Ours, $T=128/256/1024$ | | |
| $w=0.0$ | 8.11 / 7.27 / 7.22 | 81.46 / 82.45 / 81.54 |
| $w=0.1$ | 5.31 / 4.53 / 4.5 | 105.01 / 106.12 / 104.67 |
| $w=0.2$ | 3.7 / 3.03 / 3 | 130.79 / 132.54 / 130.09 |
| $w=0.3$ | 3.04 / 2.43 / 2.43 | 156.09 / 158.47 / 156 |
| $w=0.4$ | 3.02 / 2.49 / 2.48 | 183.01 / 183.41 / 180.88 |
| $w=0.5$ | 3.43 / 2.98 / 2.96 | 206.94 / 207.98 / 204.31 |
| $w=0.6$ | 4.09 / 3.76 / 3.73 | 227.72 / 228.83 / 226.76 |
| $w=0.7$ | 4.96 / 4.67 / 4.69 | 247.92 / 249.25 / 247.89 |
| $w=0.8$ | 5.93 / 5.74 / 5.71 | 265.54 / 267.99 / 265.52 |
| $w=0.9$ | 6.89 / 6.8 / 6.81 | 280.19 / 283.41 / 281.14 |
| $w=1.0$ | 7.88 / 7.86 / 7.8 | 295.29 / 297.98 / 294.56 |
| $w=2.0$ | 15.9 / 15.93 / 15.75 | 378.56 / 377.37 / 373.18 |
| $w=3.0$ | 19.77 / 19.77 / 19.56 | 409.16 / 407.44 / 405.68 |
| $w=4.0$ | 21.55 / 21.53 / 21.45 | 422.29 / 421.03 / 419.06 |

Table 2: ImageNet 128x128 results ($w=0.0$ refers to non-guided models).
Figure 5: IS/FID curves over guidance strengths for ImageNet 128x128 models. Each curve represents sampling with a different number of timesteps $T$. Accompanies Table 2. (Plot axes: IS and FID; legend entries: $T=128$, $T=256$, $T=512$.)
5 Discussion
The most practical advantage of our classifier-free guidance method is its extreme simplicity: it is only a one-line change of code during training (to randomly drop out the conditioning) and during sampling (to mix the conditional and unconditional score estimates). Classifier guidance, by contrast, complicates the training pipeline since it requires training an extra classifier. This classifier must be trained on noisy $\mathbf{z}_{\lambda}$, so it is not possible to plug in a standard pre-trained classifier.
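The sampling-time change can be sketched as follows, assuming the mixing rule takes the linear form $(1+w)\,\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda},\mathbf{c}) - w\,\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{\lambda})$ with guidance strength $w$. The arrays below are toy stand-ins for the two network outputs, and the function name `guided_eps` is hypothetical.

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Mix conditional and unconditional noise predictions with guidance strength w."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Toy stand-ins for the denoiser outputs at a single sampling step.
eps_cond = np.array([0.5, -1.0, 2.0])
eps_uncond = np.array([0.0, -0.5, 1.0])

no_guidance = guided_eps(eps_cond, eps_uncond, w=0.0)  # reduces to the conditional prediction
strong = guided_eps(eps_cond, eps_uncond, w=3.0)       # pushed away from the unconditional one
```

Setting `w=0.0` recovers the non-guided conditional model, matching the $w=0.0$ rows of Tables 1 and 2.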
Since classifier-free guidance is able to trade off IS and FID like classifier guidance without needing an extra trained classifier, we have demonstrated that guidance can be performed with a pure generative model. Furthermore, our diffusion models are parameterized by unconstrained neural networks, so their score estimates do not necessarily form conservative vector fields, unlike classifier gradients (Salimans & Ho, 2021). Our classifier-free guided sampler therefore follows step directions that do not resemble classifier gradients at all and cannot be interpreted as a gradient-based adversarial attack on a classifier. Hence our results show that the classifier-based IS and FID metrics can be boosted by pure generative models, using a sampling procedure that is not adversarial against image classifiers using classifier gradients.
We have also arrived at an intuitive explanation for how guidance works: it decreases the unconditional likelihood of the sample while increasing the conditional likelihood. Classifier-free guidance accomplishes this by decreasing the unconditional likelihood with a negative score term, which to our knowledge has not yet been explored and may find uses in other applications.
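This intuition can be written out formally. Assuming exact scores and the linear mixing rule with guidance strength $w$, and ignoring whether the resulting density is normalizable, the guided score is the score of a reweighted density in which the conditional term is raised to a positive power and the unconditional term to a negative one:

$$
\tilde{p}(\mathbf{z}_{\lambda}\mid\mathbf{c}) \;\propto\; p(\mathbf{z}_{\lambda}\mid\mathbf{c})^{1+w}\, p(\mathbf{z}_{\lambda})^{-w}
\quad\Longrightarrow\quad
\nabla_{\mathbf{z}_{\lambda}}\log\tilde{p}(\mathbf{z}_{\lambda}\mid\mathbf{c}) = (1+w)\,\nabla_{\mathbf{z}_{\lambda}}\log p(\mathbf{z}_{\lambda}\mid\mathbf{c}) \;-\; w\,\nabla_{\mathbf{z}_{\lambda}}\log p(\mathbf{z}_{\lambda}).
$$

The term $-w\,\nabla_{\mathbf{z}_{\lambda}}\log p(\mathbf{z}_{\lambda})$ is the negative score term referred to above.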
Classifier-free guidance as presented here relies on training an unconditional model, but in some cases this can be avoided. If the class distribution is known and there are only a few classes, we can use the fact that $\sum_{\mathbf{c}}p(\mathbf{x}|\mathbf{c})p(\mathbf{c})=p(\mathbf{x})$ to obtain an unconditional score from conditional scores without explicitly training for the unconditional score. Of course, this would require as many forward passes as there are possible values of $\mathbf{c}$ and would be inefficient for high dimensional conditioning.
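A toy check of this identity, as a sketch only: differentiating $\log\sum_{\mathbf{c}}p(\mathbf{x}|\mathbf{c})p(\mathbf{c})$ gives $\nabla\log p(\mathbf{x})=\sum_{\mathbf{c}}p(\mathbf{c}|\mathbf{x})\,\nabla\log p(\mathbf{x}|\mathbf{c})$, a posterior-weighted average of the conditional scores. The example assumes a one-dimensional mixture of three Gaussians with made-up parameters, where the class likelihoods needed for the weights $p(\mathbf{c}|\mathbf{x})$ are available in closed form; a score network does not provide these likelihoods directly. All function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def conditional_score(x, mean, std):
    """d/dx log N(x; mean, std^2), the score of one class-conditional density."""
    return -(x - mean) / std**2

def unconditional_score(x, means, stds, priors):
    """Score of p(x) = sum_c p(x|c) p(c), assembled from the per-class scores."""
    weights = priors * gaussian_pdf(x, means, stds)
    posterior = weights / weights.sum()              # p(c | x)
    return float(np.sum(posterior * conditional_score(x, means, stds)))

# Made-up three-class toy problem.
means = np.array([-2.0, 0.5, 3.0])
stds = np.array([0.7, 1.0, 0.5])
priors = np.array([0.2, 0.5, 0.3])

def log_p(x):
    return float(np.log(np.sum(priors * gaussian_pdf(x, means, stds))))

x0, h = 0.3, 1e-5
score_from_classes = unconditional_score(x0, means, stds, priors)
score_finite_diff = (log_p(x0 + h) - log_p(x0 - h)) / (2 * h)  # numerical reference
```

The two values agree up to finite-difference error, and the loop over classes inside `unconditional_score` mirrors the one-forward-pass-per-class cost noted above.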
A potential disadvantage of classifier-free guidance is sampling speed. Generally, classifiers can be smaller and faster than generative models, so classifier guided sampling may be faster than classifier-free guidance because the latter needs to run two forward passes of the diffusion model, one for the conditional score and another for the unconditional score. The necessity to run multiple passes of the diffusion model might be mitigated by changing the architecture to inject conditioning late in the network, but we leave this exploration for future work.
Finally, any guidance method that increases sample fidelity at the expense of diversity must face the question of whether decreased diversity is acceptable. There may be negative impacts in deployed models, since sample diversity is important to maintain in applications where certain parts of the data are underrepresented in the context of the rest of the data. It would be an interesting avenue of future work to try to boost sample quality while maintaining sample diversity.
6 Conclusion
We have presented classifier-free guidance, a method to increase sample quality while decreasing sample diversity in diffusion models. Classifier-free guidance can be thought of as classifier guidance without a classifier, and our results showing the effectiveness of classifier-free guidance confirm that pure generative diffusion models are capable of maximizing classifier-based sample quality metrics while entirely avoiding classifier gradients. We look forward to further explorations of classifier-free guidance in a wider variety of settings and data modalities.
7 Acknowledgements
We thank Ben Poole and Mohammad Norouzi for discussions.
References
Brock et al. (2019)
Andrew Brock, Jeff Donahue, and Karen Simonyan.
Large scale GAN training for high fidelity natural image synthesis.
In International Conference on Learning Representations, 2019.
Chen et al. (2021)
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan.
WaveGrad: Estimating gradients for waveform generation.
International Conference on Learning Representations, 2021.
Dhariwal & Nichol (2021)
Prafulla Dhariwal and Alex Nichol.
Diffusion models beat GANs on image synthesis.
arXiv preprint arXiv:2105.05233, 2021.
Grandvalet & Bengio (2004)
Yves Grandvalet and Yoshua Bengio.
Semi-supervised learning by entropy minimization.
In Proceedings of the 17th International Conference on Neural Information Processing Systems, pp. 529–536, 2004.
Grünwald & Langford (2007)
Peter Grünwald and John Langford.
Suboptimal behavior of Bayes and MDL in classification under misspecification.
Machine Learning, 66(2-3):119–149, 2007.
Heusel et al. (2017)
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter.
GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
Ho et al. (2020)
Jonathan Ho, Ajay Jain, and Pieter Abbeel.
Denoising diffusion probabilistic models.
In Advances in Neural Information Processing Systems, pp. 6840–6851, 2020.
Ho et al. (2021)
Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans.
Cascaded diffusion models for high fidelity image generation.
arXiv preprint arXiv:2106.15282, 2021.
Hyvärinen & Dayan (2005)
Aapo Hyvärinen and Peter Dayan.
Estimation of non-normalized statistical models by score matching.
Journal of Machine Learning Research, 6(4), 2005.
Kingma & Dhariwal (2018)
Diederik P Kingma and Prafulla Dhariwal.
Glow: Generative flow with invertible 1x1 convolutions.
In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.
Kingma et al. (2021)
Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho.
Variational diffusion models.
arXiv preprint arXiv:2107.00630, 2021.
Kong et al. (2021)
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro.
DiffWave: A Versatile Diffusion Model for Audio Synthesis.
International Conference on Learning Representations, 2021.
Nichol & Dhariwal (2021)
Alex Nichol and Prafulla Dhariwal.
Improved denoising diffusion probabilistic models.
International Conference on Machine Learning, 2021.
Razavi et al. (2019)
Ali Razavi, Aaron van den Oord, and Oriol Vinyals.
Generating diverse high-fidelity images with VQ-VAE-2.
In Advances in Neural Information Processing Systems, pp. 14837–14847, 2019.
Russakovsky et al. (2015)
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al.
ImageNet large scale visual recognition challenge.
International Journal of Computer Vision, 115(3):211–252, 2015.
Salimans & Ho (2021)
Tim Salimans and Jonathan Ho.
Should EBMs model the energy or the score?
In Energy Based Models Workshop-ICLR 2021, 2021.
Salimans et al. (2016)
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen.
Improved techniques for training GANs.
In Advances in Neural Information Processing Systems, pp. 2234–2242, 2016.
Sohl-Dickstein et al. (2015)
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli.
Deep unsupervised learning using nonequilibrium thermodynamics.
In International Conference on Machine Learning, pp. 2256–2265, 2015.
Song & Ermon (2019)
Yang Song and Stefano Ermon.
Generative modeling by estimating gradients of the data distribution.
In Advances in Neural Information Processing Systems, pp. 11895–11907, 2019.
Song et al. (2021a)
Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon.
Maximum likelihood training of score-based diffusion models.
arXiv e-prints, pp. arXiv–2101, 2021a.
Song et al. (2021b)
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential equations.
International Conference on Learning Representations, 2021b.
Vincent (2011)
Pascal Vincent.
A connection between score matching and denoising autoencoders.
Neural Computation, 23(7):1661–1674, 2011.
Wu et al. (2019)
Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, and Timothy Lillicrap.
LOGAN: Latent optimisation for generative adversarial networks.
arXiv preprint arXiv:1912.00953, 2019.
Appendix A Samples
(a) Non-guided conditional sampling: FID=1.80, IS=53.71
(b) Classifier-free guidance with $w=1.0$: FID=12.6, IS=170.1
(c) Classifier-free guidance with $w=3.0$: FID=24.83, IS=250.4
Figure 6: Classifier-free guidance on ImageNet 64x64. Left: random classes. Right: single class (malamute). The same random seed was used for sampling in each subfigure.
(a) Non-guided conditional sampling: FID=7.27, IS=82.45
(b) Classifier-free guidance with $w=1.0$: FID=7.86, IS=297.98
(c) Classifier-free guidance with $w=4.0$: FID=21.53, IS=421.03
Figure 7: Classifier-free guidance on ImageNet 128x128. Left: random classes. Right: single class (malamute). The same random seed was used for sampling in each subfigure.
Figure 8: More examples of classifier-free guidance on 128x128 ImageNet. Left: non-guided samples, right: classifier-free guided samples with $w=3.0$.