eDiff-I: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers
NVIDIA Corporation
TL;DR: eDiff-I is a new generation of generative AI content creation tool that offers unprecedented text-to-image synthesis, along with instant style transfer and intuitive paint-with-words capabilities.
We propose eDiff-I, a diffusion model for synthesizing images from text. Motivated by the empirical observation that the behavior of diffusion models differs at different stages of sampling, we propose to train an ensemble of expert denoising networks, each specialized for a specific noise interval. Our model is conditioned on T5 text embeddings, CLIP image embeddings, and CLIP text embeddings. Our approach can generate photorealistic images corresponding to any input text prompt. In addition to text-to-image synthesis, we present two additional capabilities: (1) style transfer, which enables us to control the style of the generated sample using a reference style image, and (2) "Paint with words", an application where the user can generate images by painting segmentation maps on a canvas, which is very handy for crafting the image they have in mind.
Pipeline
Our pipeline consists of a cascade of three diffusion models: a base model that synthesizes samples at 64x64 resolution, and two super-resolution stacks that progressively upsample the images to 256x256 and 1024x1024 resolution, respectively. Our models take an input caption and first compute the T5 XXL text embedding and the CLIP text embedding. We optionally use CLIP image embeddings computed from a reference image, which can serve as a style vector. These embeddings are then fed into our cascaded diffusion models, which progressively generate images at 1024x1024 resolution.
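Below is a minimal sketch of how the cascaded generation flow fits together. The encoders and the sampling loops are placeholders, and the function names, tensor shapes, and generate() helper are our own illustrative assumptions rather than the released model's API; only the data flow mirrors the pipeline described above.

```python
import torch

def encode_text(caption):
    # Stand-ins for the T5-XXL and CLIP text encoders (shapes are placeholders).
    t5_emb = torch.randn(1, 77, 4096)
    clip_txt_emb = torch.randn(1, 77, 768)
    return t5_emb, clip_txt_emb

def diffusion_sample(shape, cond, low_res=None):
    # Placeholder for an iterative denoising loop conditioned on `cond`
    # (and on the low-resolution input for the super-resolution stages).
    return torch.randn(shape)

def generate(caption, style_image_emb=None):
    t5_emb, clip_txt_emb = encode_text(caption)
    cond = {"t5": t5_emb, "clip_text": clip_txt_emb, "clip_image": style_image_emb}
    x64 = diffusion_sample((1, 3, 64, 64), cond)                      # base model
    x256 = diffusion_sample((1, 3, 256, 256), cond, low_res=x64)      # SR stack 1
    x1024 = diffusion_sample((1, 3, 1024, 1024), cond, low_res=x256)  # SR stack 2
    return x1024

image = generate("a panda riding a bike")
```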
Denoising experts
In diffusion models, image synthesis happens via an iterative denoising process that gradually generates images from random noise. In the figure shown below, we start from pure random noise, which is gradually denoised over multiple steps to finally produce an image of a panda riding a bike. In conventional diffusion model training, a single model is trained to denoise across the entire range of noise levels. In our framework, we instead train an ensemble of expert denoisers that are specialized for denoising in different intervals of the generative process. The use of such expert denoisers leads to improved synthesis capabilities.
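To make the expert-denoiser idea concrete, here is a toy sampling loop that routes each denoising step to a noise-interval-specific expert. The two-expert split, the sigma schedule, and the stub denoiser are illustrative assumptions standing in for the trained networks, not the actual eDiff-I configuration.

```python
import torch

def stub_denoiser(x_t, sigma, cond):
    # Stand-in for a trained denoising network; a real expert predicts the
    # denoised image (or the noise) from x_t, the noise level, and the conditioning.
    return x_t - 0.1 * sigma * torch.randn_like(x_t)

# One expert for high noise levels (early steps, global layout) and one for
# low noise levels (late steps, fine detail). In eDiff-I the experts are
# obtained by progressively branching a shared model during training.
experts = {"high_noise": stub_denoiser, "low_noise": stub_denoiser}

def sample(cond, steps=50, sigma_max=80.0, sigma_min=0.002, split_sigma=1.0):
    x = sigma_max * torch.randn(1, 3, 64, 64)          # start from pure noise
    for sigma in torch.linspace(sigma_max, sigma_min, steps):
        key = "high_noise" if sigma > split_sigma else "low_noise"
        x = experts[key](x, sigma, cond)               # route the step to its expert
    return x

image = sample(cond={"caption": "a panda riding a bike"})
```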
Results
Compared to the publicly available text-to-image methods Stable Diffusion and DALL-E 2, our model consistently leads to improved synthesis quality.
Style transfer
Our method enables style transfer when CLIP image embeddings are used. From a reference style image, we first extract the CLIP image embedding, which serves as a style reference vector. In the figure shown below, the left panel is the style reference, the middle panel shows results with style conditioning enabled, and the right panel shows results with style conditioning disabled. When style conditioning is used, our model generates outputs that are faithful to both the input style and the input caption. When style conditioning is disabled, the model generates images in a natural style.
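As a usage illustration, the snippet below toggles style conditioning by passing or omitting a CLIP image embedding, reusing the hypothetical generate() helper from the pipeline sketch above; the encoder stub and tensor shapes are assumptions.

```python
import torch

def clip_image_embedding(style_image):
    # Stand-in for the CLIP image encoder applied to a reference style image.
    return torch.randn(1, 768)

style_image = torch.rand(1, 3, 224, 224)              # reference style image
style_emb = clip_image_embedding(style_image)

# Style conditioning enabled: output follows both the caption and the reference style.
stylized = generate("a castle on a hill at sunrise", style_image_emb=style_emb)

# Style conditioning disabled: output follows the caption in a natural style.
natural = generate("a castle on a hill at sunrise", style_image_emb=None)
```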
Paint with words
Our method allows users to control the location of objects mentioned in the text prompt by selecting phrases and scribbling them onto the canvas. The model then uses the prompt together with these maps to generate images that are consistent with both the caption and the input map.
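One way to realize this behavior is to add a mask-derived bias to the image-to-text cross-attention logits, so that painted regions attend more strongly to the tokens of the phrase drawn there; the sketch below follows that idea, but the shapes, the constant weight, and the helper name are our illustrative assumptions rather than the exact formulation used in eDiff-I.

```python
import torch
import torch.nn.functional as F

def paint_with_words_attention(q, k, v, phrase_masks, weight=1.0):
    """
    q: image queries, shape (num_pixels, d)
    k, v: text keys/values, shape (num_tokens, d)
    phrase_masks: (num_tokens, num_pixels) binary maps, 1 where the user
                  painted the region for that token's phrase, 0 elsewhere.
    """
    d = q.shape[-1]
    logits = q @ k.t() / d ** 0.5                 # (num_pixels, num_tokens)
    logits = logits + weight * phrase_masks.t()   # boost attention in painted regions
    attn = F.softmax(logits, dim=-1)
    return attn @ v

# Toy usage: 64 image tokens, 8 text tokens, 16-dim features.
q = torch.randn(64, 16)
k = torch.randn(8, 16)
v = torch.randn(8, 16)
masks = torch.zeros(8, 64)
masks[2, :32] = 1.0   # the phrase for token 2 is painted on the left half
out = paint_with_words_attention(q, k, v, masks)
```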
Benefit of using denoising experts
We illustrate the benefit of using denoising experts by visualizing samples generated by our approach and comparing them with a baseline that does not use denoising experts. The two left panels in the figure below show the baseline without experts, while the two right panels show the results obtained with our expert models. We find that denoising experts greatly improve faithfulness to the input text.
Comparison between T5 and CLIP text embeddings
We study the effect of using CLIP text embeddings and T5 embeddings in isolation, and compare with the full system that uses both CLIP and T5 embeddings. We observe that images generated in the CLIP-text-only setting often contain the correct foreground objects but tend to miss fine-grained details. Images generated in the T5-text-only setting are of higher quality, but they sometimes contain incorrect objects. Using CLIP + T5 yields the best performance.
Style Variations
Our method can also generate images with different styles, which can be specified in the input caption.
Citation
@article{balaji2022eDiff-I,
title={eDiff-I: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers},
author={Yogesh Balaji and Seungjun Nah and Xun Huang and Arash Vahdat and Jiaming Song and Karsten Kreis and Miika Aittala and Timo Aila and Samuli Laine and Bryan Catanzaro and Tero Karras and Ming-Yu Liu},
journal={arXiv preprint arXiv:2211.01324},
year={2022}
}