I understand diffusion models somewhat, but I simply don't understand how the actual denoising steps can be skipped in this method. Like isn't the whole point of diffusion, which this uses as a basis, that it must go stepwise from noise to image? Even distilling a diffusion model and accelerating it, it feels crazy that it could go from taking 50 steps to taking 1 or 2.
I've looked at a few papers but none seems to explain in simple terms what is going on here. It's like saying, we made a new car where you turn it on and you are at your destination. If anyone has come across an accessible explanation of this approach I'd love to look at it.
To use an imperfect analogy: they're learning how to get from point A to point B in a straight line, rather than driving a meandering path through the neighborhood and taking every back road.
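If it helps, here is a toy sketch of that idea. It is my own illustration, not the paper's method. Assume 1-D Gaussian data and noising of the form x_t = x_0 + t * noise; then the deterministic denoising path (the probability-flow ODE) has a closed-form solution. A stepwise sampler has to walk that path in small increments, while a function that already knows the endpoint of each path can jump there in one evaluation. Roughly speaking, the few-step models train a network to approximate that jump, because for real images no closed form exists. All names and constants below (`MU`, `S`, `T_MAX`, `euler_sample`, `jump`) are made up for the example.

```python
import math

# Toy setup (all values are assumptions for illustration):
MU, S = 2.0, 1.0   # data distribution is N(MU, S^2)
T_MAX = 3.0        # noise level we start sampling from

def velocity(x, t):
    # dx/dt of the probability-flow ODE when x_t = x_0 + t * noise
    # and the data is Gaussian, so the score is known exactly.
    return t * (x - MU) / (S**2 + t**2)

def euler_sample(x, steps):
    # The "back roads" route: integrate from t = T_MAX down to t = 0
    # with `steps` equal-sized Euler steps.
    dt = T_MAX / steps
    t = T_MAX
    for _ in range(steps):
        x = x - dt * velocity(x, t)
        t -= dt
    return x

def jump(x, t):
    # The "straight line" route: exact solution of the same ODE,
    # mapping a point at noise level t directly to its t = 0 endpoint.
    return MU + (x - MU) * S / math.sqrt(S**2 + t**2)

if __name__ == "__main__":
    x_start = 5.0  # some noisy starting point at t = T_MAX
    print("one jump:      ", jump(x_start, T_MAX))
    for n in (1, 5, 50, 2000):
        print(f"euler {n:>4} steps:", euler_sample(x_start, n))
```

With a single Euler step the answer is far off, and it only approaches the jump's result as the step count grows. The point is that the stepping is a property of the numerical solver, not of the underlying noise-to-image map, so nothing forbids learning that map directly.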
Read the paper: https://arxiv.org/abs/2410.11081