eDiff-I: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers
NVIDIA Corporation
TL;DR: eDiff-I is a new generation of generative AI content creation tool that offers unprecedented text-to-image synthesis, along with instant style transfer and intuitive paint-with-words capabilities.
We propose eDiff-I, a diffusion model for synthesizing images from text. Motivated by the empirical observation that the behavior of diffusion models differs at different stages of sampling, we propose to train an ensemble of expert denoising networks, each specialized for a specific noise interval. Our model is conditioned on T5 text embeddings, CLIP image embeddings, and CLIP text embeddings. Our approach can generate photorealistic images corresponding to any input text prompt. In addition to text-to-image synthesis, we present two additional capabilities: (1) style transfer, which enables us to control the style of the generated sample using a reference style image, and (2) "Paint with words", an application where the user can generate images by painting segmentation maps on a canvas, which is very handy for crafting the image they have in mind.
Pipeline
Our pipeline consists of a cascade of three diffusion models: a base model that synthesizes samples at 64x64 resolution, and two super-resolution stacks that progressively upsample the images to 256x256 and 1024x1024 resolution, respectively. Our models take an input caption and first compute the T5 XXL text embedding and the CLIP text embedding. We optionally use CLIP image embeddings computed from a reference image, which can serve as a style vector. These embeddings are then fed into our cascaded diffusion models, which progressively generate images at 1024x1024 resolution.
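Below is a minimal sketch of how the cascaded generation flow fits together. The encoders and the sampling loops are placeholders, and the function names, tensor shapes, and generate() helper are our own illustrative assumptions rather than the released model's API; only the data flow mirrors the pipeline described above.

```python
import torch

def encode_text(caption):
    # Stand-ins for the T5-XXL and CLIP text encoders (shapes are placeholders).
    t5_emb = torch.randn(1, 77, 4096)
    clip_txt_emb = torch.randn(1, 77, 768)
    return t5_emb, clip_txt_emb

def diffusion_sample(shape, cond, low_res=None):
    # Placeholder for an iterative denoising loop conditioned on `cond`
    # (and on the low-resolution input for the super-resolution stages).
    return torch.randn(shape)

def generate(caption, style_image_emb=None):
    t5_emb, clip_txt_emb = encode_text(caption)
    cond = {"t5": t5_emb, "clip_text": clip_txt_emb, "clip_image": style_image_emb}
    x64 = diffusion_sample((1, 3, 64, 64), cond)                      # base model
    x256 = diffusion_sample((1, 3, 256, 256), cond, low_res=x64)      # SR stack 1
    x1024 = diffusion_sample((1, 3, 1024, 1024), cond, low_res=x256)  # SR stack 2
    return x1024

image = generate("a panda riding a bike")
```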
Denoising experts
In diffusion models, image synthesis happens via an iterative denoising process that gradually generates images from random noise. In the figure shown below, we start from pure random noise, which is gradually denoised over multiple steps to finally produce an image of a panda riding a bike. In conventional diffusion model training, a single model is trained to denoise across the entire range of noise levels. In our framework, we instead train an ensemble of expert denoisers that are specialized for denoising in different intervals of the generative process. The use of such expert denoisers leads to improved synthesis capabilities.
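To make the expert-denoiser idea concrete, here is a toy sampling loop that routes each denoising step to a noise-interval-specific expert. The two-expert split, the sigma schedule, and the stub denoiser are illustrative assumptions standing in for the trained networks, not the actual eDiff-I configuration.

```python
import torch

def stub_denoiser(x_t, sigma, cond):
    # Stand-in for a trained denoising network; a real expert predicts the
    # denoised image (or the noise) from x_t, the noise level, and the conditioning.
    return x_t - 0.1 * sigma * torch.randn_like(x_t)

# One expert for high noise levels (early steps, global layout) and one for
# low noise levels (late steps, fine detail). In eDiff-I the experts are
# obtained by progressively branching a shared model during training.
experts = {"high_noise": stub_denoiser, "low_noise": stub_denoiser}

def sample(cond, steps=50, sigma_max=80.0, sigma_min=0.002, split_sigma=1.0):
    x = sigma_max * torch.randn(1, 3, 64, 64)          # start from pure noise
    for sigma in torch.linspace(sigma_max, sigma_min, steps):
        key = "high_noise" if sigma > split_sigma else "low_noise"
        x = experts[key](x, sigma, cond)               # route the step to its expert
    return x

image = sample(cond={"caption": "a panda riding a bike"})
```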
Results
Compared to the publicly available text-to-image methods Stable Diffusion and DALL-E 2, our model consistently leads to improved synthesis quality.
Style transfer
Our method enables style transfer when CLIP image embeddings are used. From a reference style image, we first extract the CLIP image embedding, which serves as a style reference vector. In the figure shown below, the left panel is the style reference, the middle panel shows results with style conditioning enabled, and the right panel shows results with style conditioning disabled. When style conditioning is used, our model generates outputs that are faithful to both the input style and the input caption. When style conditioning is disabled, the model generates images in a natural style.
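As a usage illustration, the snippet below toggles style conditioning by passing or omitting a CLIP image embedding, reusing the hypothetical generate() helper from the pipeline sketch above; the encoder stub and tensor shapes are assumptions.

```python
import torch

def clip_image_embedding(style_image):
    # Stand-in for the CLIP image encoder applied to a reference style image.
    return torch.randn(1, 768)

style_image = torch.rand(1, 3, 224, 224)              # reference style image
style_emb = clip_image_embedding(style_image)

# Style conditioning enabled: output follows both the caption and the reference style.
stylized = generate("a castle on a hill at sunrise", style_image_emb=style_emb)

# Style conditioning disabled: output follows the caption in a natural style.
natural = generate("a castle on a hill at sunrise", style_image_emb=None)
```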
Paint with words
Our method allows users to control the location of objects mentioned in the text prompt by selecting phrases and scribbling them onto the canvas. The model then uses the prompt together with these maps to generate images that are consistent with both the caption and the input map.
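One way to realize this behavior is to add a mask-derived bias to the image-to-text cross-attention logits, so that painted regions attend more strongly to the tokens of the phrase drawn there; the sketch below follows that idea, but the shapes, the constant weight, and the helper name are our illustrative assumptions rather than the exact formulation used in eDiff-I.

```python
import torch
import torch.nn.functional as F

def paint_with_words_attention(q, k, v, phrase_masks, weight=1.0):
    """
    q: image queries, shape (num_pixels, d)
    k, v: text keys/values, shape (num_tokens, d)
    phrase_masks: (num_tokens, num_pixels) binary maps, 1 where the user
                  painted the region for that token's phrase, 0 elsewhere.
    """
    d = q.shape[-1]
    logits = q @ k.t() / d ** 0.5                 # (num_pixels, num_tokens)
    logits = logits + weight * phrase_masks.t()   # boost attention in painted regions
    attn = F.softmax(logits, dim=-1)
    return attn @ v

# Toy usage: 64 image tokens, 8 text tokens, 16-dim features.
q = torch.randn(64, 16)
k = torch.randn(8, 16)
v = torch.randn(8, 16)
masks = torch.zeros(8, 64)
masks[2, :32] = 1.0   # the phrase for token 2 is painted on the left half
out = paint_with_words_attention(q, k, v, masks)
```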
Benefit of using denoising experts
We illustrate the benefit of using denoising experts by visualizing samples generated by our approach and comparing them with a baseline that does not use denoising experts. The two left panels in the figure below show the baseline without experts, while the two right panels show the results obtained with our expert models. We find that denoising experts greatly improve faithfulness to the input text.
Comparison between T5 and CLIP text embeddings
We study the effect of using CLIP text embeddings and T5 embeddings in isolation, and compare with the full system that uses both CLIP and T5 embeddings. We observe that images generated in the CLIP-text-only setting often contain the correct foreground objects but tend to miss fine-grained details. Images generated in the T5-text-only setting are of higher quality, but they sometimes contain incorrect objects. Using CLIP + T5 yields the best performance.
Style Variations
Our method can also generate images with different styles, which can be specified in the input caption.
Citation
@article{balaji2022eDiff-I,
title={eDiff-I: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers},
author={Yogesh Balaji and Seungjun Nah and Xun Huang and Arash Vahdat and Jiaming Song and Karsten Kreis and Miika Aittala and Timo Aila and Samuli Laine and Bryan Catanzaro and Tero Karras and Ming-Yu Liu},
journal={arXiv preprint arXiv:2211.01324},
year={2022}
}