Welcome aMUSEd: Efficient Text-to-Image Generation

Efficient Text-to-Image Generation with aMUSEd

We are excited to present aMUSEd, an efficient non-diffusion text-to-image model that is an open reproduction of Google's MUSE. aMUSEd's generation quality is not the best, but we are releasing a research preview with a permissive license to encourage the community to explore non-diffusion frameworks like Masked Image Modeling (MIM) for image generation.

How it Works

aMUSEd employs a Masked Image Model (MIM) methodology, which requires fewer inference steps than the commonly used latent diffusion approach. This not only improves the model's efficiency but also enhances its interpretability. The figure below presents a pictorial overview of how aMUSEd works:

During training:

Input images are tokenized using a VQGAN to obtain image tokens.
The image tokens are then masked according to a cosine masking schedule.
The masked tokens (conditioned on the prompt embeddings computed using a CLIP-L/14 text encoder) are passed to a U-ViT model that predicts the masked patches.
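To make the masking step more concrete, here is a minimal sketch of a cosine masking schedule in plain Python. It assumes the MaskGIT-style recipe of sampling a value `r` uniformly and masking a `cos(pi/2 * r)` fraction of the tokens. The names `MASK_ID`, `cosine_mask_ratio`, and `mask_tokens` are hypothetical, the token ids are random stand-ins for real VQGAN outputs, and the vocabulary size of 8192 is only illustrative. The actual training code may differ in its details.

```python
import math
import random

MASK_ID = -1  # hypothetical placeholder id for a masked token


def cosine_mask_ratio(r: float) -> float:
    """Map r in [0, 1] to a mask ratio in [0, 1] along a cosine curve."""
    return math.cos(0.5 * math.pi * r)


def mask_tokens(tokens, rng):
    """Mask a random subset of tokens whose size follows the cosine schedule."""
    r = rng.random()  # r is drawn uniformly from [0, 1)
    ratio = cosine_mask_ratio(r)
    num_to_mask = max(1, math.ceil(ratio * len(tokens)))  # mask at least one token
    masked_positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_ID if i in masked_positions else t for i, t in enumerate(tokens)]
    return masked, masked_positions


rng = random.Random(0)
image_tokens = [rng.randrange(8192) for _ in range(16)]  # stand-in for VQGAN token ids
masked_tokens, positions = mask_tokens(image_tokens, rng)
```

The model would then be trained to recover the original ids at the masked positions, given the unmasked tokens and the prompt embeddings.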

During inference:

The input prompt is embedded using the CLIP-L/14 text encoder.
Iterate until N steps are reached:
- Start with randomly masked tokens and pass them to the U-ViT model along with the prompt embeddings.
- Predict the masked tokens and keep only a certain percentage of the most confident predictions, as determined by N and the mask schedule. Mask the remaining ones and pass them back to the U-ViT model.
Pass the final output to the VQGAN decoder to obtain the final image.
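The inference loop described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the diffusers implementation. `dummy_predict` is a hypothetical stand-in for the U-ViT that returns a random token and a random confidence per position. The sketch also assumes the sequence starts fully masked and that a cosine schedule decides how many tokens stay masked after each step. Real implementations typically add sampling noise as well, which is omitted here.

```python
import math
import random

MASK_ID = -1  # hypothetical placeholder id for a masked token


def dummy_predict(tokens, rng, vocab_size=8192):
    """Stand-in for the U-ViT: returns a (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(seq_len, num_steps, rng):
    tokens = [MASK_ID] * seq_len  # assumption: start from a fully masked sequence
    for step in range(num_steps):
        predictions = dummy_predict(tokens, rng)
        # Cosine schedule: fraction of tokens that stays masked after this step.
        ratio = math.cos(0.5 * math.pi * (step + 1) / num_steps)
        num_masked_next = math.floor(ratio * seq_len)

        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        # Rank the still-masked positions by confidence, most confident first.
        masked.sort(key=lambda i: predictions[i][1], reverse=True)
        num_to_keep = len(masked) - num_masked_next
        for i in masked[:num_to_keep]:
            tokens[i] = predictions[i][0]  # commit the most confident predictions
        # The remaining positions stay masked and are predicted again next step.
    return tokens


decoded = iterative_decode(seq_len=16, num_steps=4, rng=random.Random(0))
```

After the last step the schedule reaches zero, so every position holds a predicted token id. In the real pipeline these ids would be handed to the VQGAN decoder.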

aMUSEd shares many similarities with MUSE, but there are some notable differences:

aMUSEd doesn’t follow a two-stage approach for predicting the final masked patches.
Instead of using T5 for text conditioning, CLIP-L/14 is used for computing the text embeddings.
Following Stable Diffusion XL (SDXL), additional conditioning, such as image size and cropping, is passed to the U-ViT. This is referred to as “micro-conditioning”.
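As a rough illustration of what micro-conditioning can look like, the sketch below turns a handful of scalars (image size and crop offsets) into a single conditioning vector. It follows the Fourier-feature idea described in the SDXL paper, where each scalar is embedded independently and the embeddings are concatenated. The function names, the embedding width of 8, and the exact set of values are assumptions for illustration only. They are not taken from the aMUSEd code.

```python
import math


def sinusoidal_embedding(value: float, dim: int = 8):
    """Embed one scalar as `dim` sine/cosine features at geometrically spaced frequencies."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def micro_conditioning_vector(width, height, crop_top, crop_left, dim=8):
    """Concatenate the embeddings of each conditioning scalar into one flat vector."""
    vector = []
    for value in (width, height, crop_top, crop_left):
        vector.extend(sinusoidal_embedding(float(value), dim))
    return vector


# Example: a 512x512 image with no cropping.
cond = micro_conditioning_vector(width=512, height=512, crop_top=0, crop_left=0)
```

A vector like this would be fed to the model alongside the text conditioning, so that size and crop information can be controlled at inference time.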

Using aMUSEd in diffusers

aMUSEd comes fully integrated into diffusers. To use it, we first need to install the libraries:

pip install -U diffusers accelerate transformers -q

Let’s start with text-to-image generation:

import torch
from diffusers import AmusedPipeline

pipe = AmusedPipeline.from_pretrained(
    "amused/amused-512", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "A mecha robot in a favela in expressionist style"
negative_prompt = "low quality, ugly"

image = pipe(prompt, negative_prompt=negative_prompt, generator=torch.manual_seed(0)).images[0]
image

We can study how num_inference_steps affects the quality of the images under a fixed seed:

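from diffusers.utils import make_image_grid

# Generate the same prompt with an increasing number of inference steps.
images = []
for num_steps in [5, 10, 15]:
    image = pipe(
        prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=num_steps,
        generator=torch.manual_seed(0),  # re-seed so only the step count varies
    ).images[0]
    images.append(image)

# Show the three results side by side.
grid = make_image_grid(images, rows=1, cols=3)
grid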

Crucially, because of its small size (only ~800M parameters, including the text encoder and VQGAN), aMUSEd is very fast. The figure below provides a comparative study of the inference latencies of different models, including aMUSEd:

Fine-tuning aMUSEd

We provide a simple training script for fine-tuning aMUSEd on custom datasets. With the 8-bit Adam optimizer and float16 precision, it’s possible to fine-tune aMUSEd with just under 11 GB of GPU VRAM. With LoRA, the memory requirements drop further to just 7 GB.
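To get a feel for why LoRA lowers the memory footprint, here is a back-of-the-envelope sketch. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors instead, which amounts to `rank * (d_out + d_in)` trainable parameters. The layer size (1024 x 1024) and rank (16) below are illustrative assumptions, not aMUSEd's actual configuration, and the helper names are hypothetical. Trainable parameter count is only one contributor to memory use, so this does not reproduce the VRAM figures quoted above.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters with LoRA: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 1024, 1024, 16  # illustrative values only
full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
fraction = lora / full

print(f"full fine-tuning: {full} trainable parameters")
print(f"LoRA (rank {rank}): {lora} trainable parameters ({fraction:.2%} of full)")
```

Fewer trainable parameters also mean fewer gradients and optimizer states to store, which is where much of the saving comes from.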

aMUSEd comes with an OpenRAIL license, and hence, it’s commercially friendly to adapt. Refer to this directory for more details on fine-tuning.

Limitations

aMUSEd is not a state-of-the-art image generation model in terms of image quality. We released aMUSEd to encourage the community to explore non-diffusion frameworks like MIM for image generation. We believe MIM’s potential is underexplored, given its benefits:

Inference efficiency
Smaller size, enabling on-device applications
Task transfer without requiring expensive fine-tuning
Advantages of well-established components from the language modeling world

(Note that the original work on MUSE is closed-source.)

For a detailed description of the quantitative evaluation of aMUSEd, refer to the technical report.

Resources

Papers:

Muse: Text-To-Image Generation via Masked Generative Transformers
aMUSEd: An Open MUSE Reproduction
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Simple diffusion: End-to-end diffusion for high resolution images (U-ViT)
LoRA: Low-Rank Adaptation of Large Language Models

Code + misc:

aMUSEd training code
aMUSEd documentation
aMUSEd fine-tuning code
aMUSEd models

Acknowledgements

Suraj led training. William led data and supported training. Patrick von Platen supported both training and data and provided general guidance. Robin Rombach did the VQGAN training and provided general guidance. Isamu Isozaki helped with insightful discussions and made code contributions.

Thanks to Patrick von Platen and Pedro Cuenca for their reviews on the blog post draft.

Source: https://huggingface.co/blog/amused