Introducing Würstchen: Fast Diffusion for Image Generation

Würstchen is a diffusion model whose text-conditional component operates in a highly compressed latent space of images. Working in such a small latent space reduces the compute needed for both training and inference, which is what makes the model fast and efficient. This article covers how Würstchen works, why it is useful, and how to generate images with it.

What is Würstchen?

Würstchen is a diffusion model that uses two stages of compression to achieve a 42x spatial compression. By comparison, other latent diffusion models typically work with a spatial compression of 4x to 8x. The first stage of compression (Stage A) is a VQGAN (Vector Quantized Generative Adversarial Network), and the second (Stage B) is a Diffusion Autoencoder. Together, these two models form the Decoder, which decodes compressed images back into pixel space. A third model, the Prior (Stage C), is the text-conditional component that operates in the highly compressed latent space.
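To get a feel for what a 42x spatial compression means compared with 8x, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and rounding up to a whole latent cell is an assumption, not necessarily how the real pipeline rounds.

```python
import math


def latent_side(pixels: int, factor: float) -> int:
    """Latent grid side length for a per-side compression factor.

    Rounding up is an assumption made for this illustration.
    """
    return math.ceil(pixels / factor)


def latent_positions(height: int, width: int, factor: float) -> int:
    """Number of spatial positions in the latent grid."""
    return latent_side(height, factor) * latent_side(width, factor)


# Compare a typical 8x latent space with Würstchen's 42x for a 1024x1536 image.
for factor in (8, 42):
    h = latent_side(1024, factor)
    w = latent_side(1536, factor)
    total = latent_positions(1024, 1536, factor)
    print(f"{factor}x compression: 1024x1536 image -> {h}x{w} latent grid ({total} positions)")
```

Because the compression applies to each side, the number of spatial positions the text-conditional model has to process shrinks roughly with the square of the factor.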

Why Another Text-to-Image Model?

Würstchen is an attractive alternative to other text-to-image models, such as Stable Diffusion XL, due to its speed and efficiency. It can generate images much faster than competitor models while using significantly less memory. This makes it an ideal choice for researchers and organizations with limited computational resources.

How to Use Würstchen?

Würstchen can be used through the Diffusers library, as in the example below, or through the Demo, which lets you try the model without installing any additional software.

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

# Load the combined Würstchen pipeline (Prior + Decoder) in half precision on the GPU.
pipeline = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")

caption = "Anthropomorphic cat dressed as a firefighter"
# Generate four images for the caption at 1024x1536.
images = pipeline(
    caption,
    height=1024,
    width=1536,
    prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,  # default timestep schedule for the Prior (Stage C)
    prior_guidance_scale=4.0,
    num_images_per_prompt=4,
).images

What Image Sizes Does Würstchen Work On?

Würstchen was trained on image resolutions between 1024x1024 and 1536x1536. It sometimes also produces good results at resolutions outside that range, such as 1024x2048. The Prior (Stage C) adapts extremely fast to new resolutions, so fine-tuning it for higher resolutions should be computationally cheap.

Models on the Hub

All checkpoints for Würstchen can be found on the Hugging Face Hub, and future demos and model weights will be published there as well.

Diffusers Integration

Würstchen is fully integrated with the Diffusers Library, which provides various goodies and optimizations out of the box. These include:

Automatic use of PyTorch 2 SDPA accelerated attention
Support for the xFormers flash attention implementation
Model offload to move unused components to CPU while they are not in use
Sequential CPU offload for situations where memory is really precious
Prompt weighting with the Compel library
Support for the mps device on Apple Silicon Macs
Use of generators for reproducibility
Sensible defaults for inference to produce high-quality results in most situations
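As a rough sketch of how two of the items above (model offload and generators) might be combined, consider the helper below. The function name `generate_reproducible` and its parameters are hypothetical and not part of Diffusers. It assumes a CUDA GPU and that the `accelerate` package is installed, which model offloading relies on.

```python
def generate_reproducible(prompt: str, seed: int = 0, low_memory: bool = True):
    """Hypothetical helper: seeded generation with optional model offload."""
    # Imported lazily so that defining the helper does not require a GPU setup.
    import torch
    from diffusers import AutoPipelineForText2Image

    pipeline = AutoPipelineForText2Image.from_pretrained(
        "warp-ai/wuerstchen", torch_dtype=torch.float16
    )
    if low_memory:
        # Keep components on the CPU and move each to the GPU only while it runs.
        pipeline.enable_model_cpu_offload()
    else:
        pipeline = pipeline.to("cuda")

    # A seeded generator makes the sampled noise, and so the images, repeatable.
    generator = torch.Generator(device="cpu").manual_seed(seed)
    return pipeline(prompt, height=1024, width=1536, generator=generator).images
```

Calling the helper twice with the same prompt and seed on the same setup should give matching images, while changing the seed gives a different sample.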

Optimisation Technique 1: Flash Attention

Starting from version 2.0, PyTorch has integrated a highly optimized and resource-friendly version of the attention mechanism called torch.nn.functional.scaled_dot_product_attention, or SDPA. Depending on the nature of the input, this function taps into multiple underlying optimizations, and it is both faster and more memory-efficient than a naive attention implementation. As noted in the list above, Diffusers uses SDPA automatically, so the usual pipeline call needs no changes:

images = pipeline(caption, height=1024, width=1536, prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS, prior_guidance_scale=4.0, num_images_per_prompt=4).images

For an in-depth look at how diffusers leverages SDPA, check out the documentation.

Optimisation Technique 2: Torch Compile

If you're on the hunt for an extra performance boost, you can make use of torch.compile. It is best to apply it to both the prior's and decoder's main model for the biggest increase in performance.

pipeline.prior_prior = torch.compile(pipeline.prior_prior, mode="reduce-overhead", fullgraph=True)
pipeline.decoder = torch.compile(pipeline.decoder, mode="reduce-overhead", fullgraph=True)

Bear in mind that the initial inference step will take a long time (up to 2 minutes) while the models are being compiled. After that, you can run inference as usual:

images = pipeline(caption, height=1024, width=1536, prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS, prior_guidance_scale=4.0, num_images_per_prompt=4).images

The good news is that compilation happens only once. After that, inference is consistently faster for the same image resolutions.
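One simple way to observe the one-time compilation cost is to time several consecutive calls and compare the first against the rest. The sketch below uses only the standard library. `time_calls` is a hypothetical helper, and `fake_inference` is a stand-in that merely simulates a slow first call so the example runs without a GPU.

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call fn repeatedly and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in for a compiled pipeline: the first call pays a simulated warm-up cost.
_state = {"warmed_up": False}


def fake_inference() -> None:
    if not _state["warmed_up"]:
        time.sleep(0.05)  # simulated one-time compilation
        _state["warmed_up"] = True
    time.sleep(0.005)  # simulated steady-state inference


timings = time_calls(fake_inference, repeats=3)
print([f"{t:.3f}s" for t in timings])
```

With the real pipeline, `fn` could be something like `lambda: pipeline(caption, height=1024, width=1536)`. The first duration would then include compilation, and later ones would reflect steady-state speed.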

How Was the Model Trained?

The ability to train this model was only possible through compute resources provided by Stability AI. We want to say a special thank you to Stability for giving us the possibility to pursue this kind of research, with the chance to make it accessible to so many more people!

Resources

Further information about this model can be found in the official diffusers documentation. All the checkpoints can be found on the hub. You can try out the demo here. Join our Discord if you want to discuss future projects or even contribute with your own ideas! Training code and more can be found in the official GitHub repository.

Source: https://huggingface.co/blog/wuerstchen
