Isn't min-SNR training strategy counter-intuitive? #13054
Replies: 2 comments 3 replies
-
@drhead could you please explain this phenomenon?
-
Great question — this tripped me up initially too, but it makes sense once you think about it from a signal processing perspective. The key insight is that min-SNR weighting is not about task difficulty — it's about gradient signal quality. At high-noise timesteps (t → T), the noised input is almost pure noise, so each per-sample loss gradient is a very high-variance estimate of the true descent direction. Without min-SNR weighting, these noisy high-t gradients dominate training because their magnitude is large. This is analogous to a classical signal processing problem: a high-variance estimator with poor SNR can degrade overall system performance even if it carries some information. The optimal strategy (in a Wiener filter sense) is to attenuate components in proportion to their noise power. That's exactly what min-SNR does: it attenuates the loss weight at high-noise timesteps to prevent those noisy gradients from overwhelming the useful signal from low/mid-noise timesteps. Think of it this way: when you average readings from several sensors, you weight each one by its reliability, not by how hard its measurement problem is.
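The attenuation intuition can be checked numerically. Below is a minimal sketch (plain NumPy; the setup and names are mine, not from the paper): two unbiased estimators of the same quantity, one low-variance and one high-variance, combined with equal weights versus inverse-variance weights.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0
n = 100_000

# Two unbiased estimators of the same quantity with very different noise levels.
clean = true_value + rng.normal(0.0, 0.1, n)   # low-variance ("low-noise timestep")
noisy = true_value + rng.normal(0.0, 3.0, n)   # high-variance ("high-noise timestep")

# Equal weighting lets the noisy estimator dominate the error.
equal = 0.5 * clean + 0.5 * noisy

# Inverse-variance weighting attenuates the noisy component.
w_clean, w_noisy = 1 / 0.1**2, 1 / 3.0**2
inv_var = (w_clean * clean + w_noisy * noisy) / (w_clean + w_noisy)

print(np.var(equal))    # ~2.25: dominated by the noisy estimator
print(np.var(inv_var))  # ~0.01: close to using the clean estimator alone
```

Both combinations are unbiased, but only the inverse-variance one keeps the noisy component from wrecking the result — the same argument made above for high-t gradients.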
The paper (Section 3.2) shows this formally: the min-SNR-$\gamma$ weight for $x_0$-prediction is $w_t = \min\{\mathrm{SNR}(t), \gamma\}$ (equivalently $\min\{\mathrm{SNR}(t), \gamma\}/\mathrm{SNR}(t)$ for $\epsilon$-prediction). So it's not "easy tasks get high weight" — it's "timesteps where the gradient signal is reliable get preserved, noisy timesteps get attenuated." Classic matched filtering intuition applied to diffusion training.
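To make the weighting concrete, here is a minimal NumPy sketch of the min-SNR-$\gamma$ weight for $x_0$-prediction. It assumes the standard DDPM linear beta schedule; the variable names are mine for illustration:

```python
import numpy as np

# Standard DDPM linear beta schedule (assumed here for illustration).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

# SNR(t) = alpha_bar_t / (1 - alpha_bar_t): huge at t ~ 0, tiny at t ~ T.
snr = alphas_cumprod / (1.0 - alphas_cumprod)

# min-SNR-gamma weight for x0-prediction: clip the SNR at gamma.
gamma = 5.0
weights = np.minimum(snr, gamma)

print(weights[0])    # 5.0: capped at gamma for the lowest-noise timestep
print(weights[-1])   # tiny: high-noise gradients are heavily attenuated
```

The weight is monotonically non-increasing in t: early (low-noise) timesteps are capped at $\gamma$ so they can't dominate either, while late (high-noise) timesteps fall off with the SNR itself.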
-
Consider the case 'prediction_type' = 'sample' (predict $x_0$). The min-SNR training strategy assigns a weight to each sample's loss (as in the figure below, where SNR values greater than $\gamma$ are replaced with $\gamma$):

This is quite counter-intuitive: as t → 0 the denoising task of the diffusion model is easier than as t → T. One would therefore expect the weights at t → T to be higher than the weights at t → 0 — i.e., the inverse of the min-SNR weighting — but the paper https://arxiv.org/abs/2303.09556 suggests the opposite.
Can anyone explain this?