Last week, Swiss software engineer Matthias Bühlmann discovered that the popular image synthesis model Stable Diffusion can compress existing bitmap images with fewer visual artifacts than JPEG or WebP at high compression ratios, although there are significant caveats.
Stable Diffusion is an AI image synthesis model that typically generates images based on textual descriptions (called “prompts”). The model learned this ability by studying millions of images pulled from the internet. During the training process, the model makes statistical associations between images and related words, distilling key information about each image into a much smaller representation and storing it as “weights,” which are mathematical values that represent what the AI model knows about images, so to speak.
When Stable Diffusion analyzes and “compresses” images into weighted form, they reside in what researchers call “latent space,” which is a way of saying that they exist as a kind of fuzzy potential that can be realized as images once decoded. With Stable Diffusion 1.4, the weights file is approximately 4 GB, but it represents information distilled from hundreds of millions of images.
While most people use Stable Diffusion with text prompts, Bühlmann cut out the text encoder and instead forced his images through Stable Diffusion’s image encoder, a process that takes a 512×512 source image and converts it into a 64×64 latent-space representation at a higher precision per value. At this point, the image exists at a much smaller data size than the original, but it can still be expanded (decoded) back into a 512×512 image with fairly good results.
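To see why the latent form is so much smaller, a back-of-envelope size comparison helps. The four-channel 64×64 latent shape matches Stable Diffusion’s image encoder; storing one byte per latent value is a simplifying assumption for illustration, not Bühlmann’s exact storage scheme:

```python
# Rough data-size comparison for the encode step described above.
# Assumes a 64x64 latent with 4 channels (Stable Diffusion's encoder output),
# quantized to one byte per value -- a simplification, not the exact scheme.
src_bytes = 512 * 512 * 3        # raw 512x512 RGB source
latent_bytes = 64 * 64 * 4 * 1   # quantized 64x64x4 latent
ratio = src_bytes / latent_bytes

print(f"source: {src_bytes} bytes")
print(f"latent: {latent_bytes} bytes")
print(f"ratio:  {ratio:.0f}x smaller")
```

Under these assumptions the latent representation is roughly 48 times smaller than the raw bitmap, before any further entropy coding.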
While conducting tests, Bühlmann found that images compressed with Stable Diffusion looked subjectively better at higher compression ratios (smaller file sizes) than JPEG or WebP. In one example, he shows a photo of a candy shop that has been compressed to 5.68 KB using JPEG, 5.71 KB using WebP, and 4.98 KB using Stable Diffusion. The Stable Diffusion image appears to have more resolved detail and fewer obvious compression artifacts than those compressed in the other formats.
Currently, however, Bühlmann’s method comes with significant limitations: it’s not good with faces or text, and in some cases it can actually hallucinate detailed features in the decoded image that aren’t present in the source. (You probably don’t want your image compressor inventing detail that isn’t really there.) Also, decoding requires the 4 GB Stable Diffusion weights file and extra decoding time.
Although this use of Stable Diffusion is unconventional and more of a fun hack than a practical solution, it could potentially point to novel future uses for image synthesis models. Bühlmann’s code can be found on Google Colab, and you will find more technical details about his experiment in his post on Towards AI.