Many models operate not on raw data but on a learned compression of it. An autoencoder, for example, squeezes a 512x512 image into a much smaller grid of latent values that preserves the semantic content while dropping imperceptible pixel-level detail.
Working in latent space is often the difference between a method being practical or not: latent diffusion models (the architecture behind Stable Diffusion) run their expensive iterative denoising loop on latents roughly 48 times smaller than the pixel grid, then decode the result back to an image in a single pass.
