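Latent-variable models such as VAEs and diffusion models cannot evaluate the likelihood of their training data directly, because doing so requires marginalizing over the latent variables: a sum (or, for continuous latents, an integral) over every possible configuration, which is intractable. The evidence lower bound (ELBO) sidesteps this by providing a lower bound on the log-likelihood that can be estimated from samples and differentiated. Maximizing the bound raises a guaranteed floor under the true log-likelihood, and the gap between the two is exactly the KL divergence from the approximate posterior to the true posterior, so the bound is tight when the approximation is exact.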
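
The bound can be checked numerically on a model small enough that the exact log-likelihood is still computable. The sketch below is only an illustration: the three-state discrete latent, the probability tables, and the helper names `log_marginal` and `elbo` are all hypothetical choices made for this example, not part of any particular VAE or diffusion implementation.

```python
import math

def log_marginal(prior, lik):
    """Exact log p(x) = log sum_z p(z) p(x|z) for a discrete latent z."""
    return math.log(sum(pz * px for pz, px in zip(prior, lik)))

def elbo(q, prior, lik):
    """ELBO = sum_z q(z) [log p(z) + log p(x|z) - log q(z)]."""
    return sum(
        qz * (math.log(pz) + math.log(px) - math.log(qz))
        for qz, pz, px in zip(q, prior, lik)
        if qz > 0  # terms with q(z) = 0 contribute nothing
    )

# Hypothetical toy numbers: three latent states, one fixed observation x.
prior = [0.5, 0.3, 0.2]    # p(z)
lik = [0.10, 0.60, 0.30]   # p(x | z) for the observed x

log_px = log_marginal(prior, lik)

# An arbitrary approximate posterior gives a strict lower bound.
q_guess = [1 / 3, 1 / 3, 1 / 3]
elbo_guess = elbo(q_guess, prior, lik)

# The exact posterior p(z | x) closes the gap.
joint = [pz * px for pz, px in zip(prior, lik)]
posterior = [j / sum(joint) for j in joint]
elbo_tight = elbo(posterior, prior, lik)

print(f"log p(x)              = {log_px:.4f}")
print(f"ELBO, uniform q       = {elbo_guess:.4f}")
print(f"ELBO, exact posterior = {elbo_tight:.4f}")
```

With only three latent states the marginal is a three-term sum, so the comparison is exact. In a real model that sum is the intractable part, which is why only the ELBO is optimized.
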
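For diffusion models, the ELBO decomposes into a reconstruction term, a prior-matching term, and a sum of per-timestep KL divergences between Gaussians. Each of those KL terms has a closed form that, under the usual noise-prediction parameterization, reduces to a weighted squared error between the true and predicted noise; dropping the per-timestep weights gives the simple noise-prediction loss these models are trained with.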
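
To make the last step concrete, the sketch below checks numerically that one per-timestep KL term equals a weighted squared error on the noise. It assumes a DDPM-style parameterization in which both Gaussians share a fixed variance `sigma2_t` and the mean is written in terms of the noise. The schedule values, the vectors, and the helper names are hypothetical and chosen only for illustration.

```python
import math

def kl_isotropic_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p * I) || N(mu_q, var_q * I) ) with scalar variances."""
    d = len(mu_p)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_p, mu_q))
    return 0.5 * (d * var_p / var_q + sq_dist / var_q - d
                  + d * math.log(var_q / var_p))

# Hypothetical values for a single timestep t (not from a real schedule).
alpha_t = 0.98         # alpha_t = 1 - beta_t
alpha_bar_t = 0.50     # cumulative product of the alphas up to t
beta_t = 1.0 - alpha_t
sigma2_t = beta_t      # assumption: both Gaussians use this fixed variance

x_t = [0.3, -1.2, 0.8]         # noisy sample at step t
eps_true = [0.5, -0.1, 1.4]    # noise that was actually added
eps_pred = [0.2, 0.3, 1.0]     # hypothetical network prediction

def mean_from_noise(x, eps):
    """DDPM-style mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)."""
    c = beta_t / math.sqrt(1.0 - alpha_bar_t)
    return [(xi - c * ei) / math.sqrt(alpha_t) for xi, ei in zip(x, eps)]

mu_target = mean_from_noise(x_t, eps_true)   # mean of the true posterior step
mu_model = mean_from_noise(x_t, eps_pred)    # mean of the learned reverse step

# KL between the two Gaussians for this timestep.
kl = kl_isotropic_gaussians(mu_target, sigma2_t, mu_model, sigma2_t)

# The same quantity written as a weighted squared error on the noise.
weight = beta_t ** 2 / (2.0 * sigma2_t * alpha_t * (1.0 - alpha_bar_t))
noise_sq_err = sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred))
weighted_noise_loss = weight * noise_sq_err

print(f"KL term             = {kl:.6f}")
print(f"weighted noise loss = {weighted_noise_loss:.6f}")
```

The two printed values agree up to floating-point error. Replacing `weight` with 1 at every timestep leaves the plain squared error between `eps_true` and `eps_pred`, which is the simplified training loss.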
