SDXL: the best open-source text-to-image generator AI

Tags: latent diffusion models, stable diffusion, text-to-image generation. Published Aug 04, 2023

Stability AI just released a new version of their Stable Diffusion image generation models. This new version, called SDXL, includes several improvements over previous versions that probably make it the best open-source text-to-image generation model to date.

 

Improvements over Stable Diffusion 2.1

Although SDXL is a latent diffusion model (LDM) like its predecessors, its creators have included changes to the model structure that fix issues from previous versions. Let’s take a look at these improvements in a bit more detail:

 

Two-stage image generation with bigger U-Nets

SDXL generates latent features from noise in two stages. First, the base U-Net creates a latent feature map. That map is then handed to a second U-Net trained to refine and polish the features of its input. The models are also much larger: the U-Net in Stable Diffusion 2.1 has 865M parameters, while Stability AI quotes 3.5B parameters for the SDXL base model and 6.6B for the combined base-and-refiner pipeline (figures that cover the whole models rather than the U-Nets alone).

The cross-attention layers that allow the U-Nets to condition their outputs on our prompts have also been rearranged across the network to produce better results.
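
The two-stage hand-off described in this section can be pictured as one denoising schedule shared by two models: the base model runs the first, noisier part and the refiner finishes the rest. The sketch below only does the bookkeeping for that split. The function name and the `handoff` fraction are hypothetical names chosen for illustration, not part of SDXL itself (the Hugging Face diffusers library exposes a similar idea through its `denoising_end` and `denoising_start` arguments, but its exact rounding may differ).

```python
def split_denoising_steps(total_steps: int, handoff: float) -> tuple[int, int]:
    """Split a denoising schedule between the base and refiner stages.

    `handoff` is the fraction of the schedule handled by the base model
    (a hypothetical knob for this sketch, e.g. 0.8 = the first 80% of steps).
    """
    if total_steps < 1:
        raise ValueError("total_steps must be at least 1")
    if not 0.0 <= handoff <= 1.0:
        raise ValueError("handoff must be between 0 and 1")
    base_steps = round(total_steps * handoff)
    return base_steps, total_steps - base_steps


# Assumed example values: 40 steps in total, base model handles 80% of them.
base_steps, refiner_steps = split_denoising_steps(total_steps=40, handoff=0.8)
print(base_steps, refiner_steps)  # 32 8
```

A larger `handoff` leaves less work for the refiner, which is reasonable if we think of the refiner as a stage that only polishes fine detail.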

 

Mechanisms to prevent cropping and blurring

Previous versions of Stable Diffusion often generated images whose subjects were cut off at the edges. Because the training images were randomly cropped as a data augmentation step to increase the variability of the inputs, those models learned to reproduce the cropping they saw in the data. To prevent this, SDXL accepts crop coordinates and resolution values as additional inputs. These let us control how much cropping (if any) appears in the generated images, as well as the level of feature detail appropriate to the resolution at which we are generating them.
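
To make the idea concrete, here is a minimal sketch of how such conditioning values could be assembled. During training, the offset of the crop that was actually applied is recorded next to the image sizes; at generation time, we simply ask for a zero offset. The helper names and the six-number layout (original size, crop offset, target size) are assumptions made for this illustration, not a description of Stability AI's code.

```python
import random


def sample_crop(orig_h, orig_w, crop_h, crop_w, rng):
    """Pick a random crop window and return its (top, left) corner."""
    top = rng.randint(0, orig_h - crop_h)
    left = rng.randint(0, orig_w - crop_w)
    return top, left


def conditioning_vector(original_size, crop_top_left, target_size):
    """Flatten the three pairs into one list of six integers (assumed layout)."""
    return [*original_size, *crop_top_left, *target_size]


rng = random.Random(0)

# Training time: the model is told which crop was applied to this image.
crop = sample_crop(1536, 2048, 1024, 1024, rng)
train_cond = conditioning_vector((1536, 2048), crop, (1024, 1024))

# Generation time: a (0, 0) offset asks for an image that looks uncropped.
infer_cond = conditioning_vector((1024, 1024), (0, 0), (1024, 1024))
print(train_cond)
print(infer_cond)
```

Since the model has seen both cropped and uncropped examples labeled with their offsets, choosing the offset ourselves steers it toward the framing we want.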

 

Trained for different aspect ratios

To allow SDXL to work with different aspect ratios, the network has been fine-tuned with batches of images with varying widths and heights.
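
One common way to build batches like these is to sort images into "buckets" of fixed resolutions and draw each batch from a single bucket. The sketch below enumerates buckets with roughly the same pixel count and assigns an image to the bucket with the closest aspect ratio. The target area of 1024 x 1024 pixels, the 64-pixel step, and the side limits are assumed values for this illustration and may not match the exact training setup.

```python
def make_buckets(target_pixels=1024 * 1024, step=64, min_side=512, max_side=2048):
    """Enumerate (height, width) pairs whose area is close to target_pixels."""
    buckets = []
    for height in range(min_side, max_side + 1, step):
        # Width that best matches the target area, snapped to a multiple of step.
        width = round(target_pixels / height / step) * step
        if min_side <= width <= max_side:
            buckets.append((height, width))
    return buckets


def nearest_bucket(height, width, buckets):
    """Return the bucket whose aspect ratio is closest to the image's."""
    ratio = height / width
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


buckets = make_buckets()
# A 1080 x 1920 (height x width) landscape image lands in a wide bucket.
print(nearest_bucket(1080, 1920, buckets))  # (768, 1344)
```

Every image in a batch then shares one shape, so the batch can be stacked into a single tensor while the dataset as a whole still covers many aspect ratios.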

 

Improved autoencoder

Like its predecessors, SDXL does not generate the final images from noise directly. Instead, it first generates the features of an image from noise in a lower-dimensional latent space, and then, using an autoencoder, those features are converted into the final image.
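
A quick calculation shows why working in the latent space is cheaper. The numbers used below, a spatial downsampling factor of 8 and 4 latent channels, are the values commonly used by Stable Diffusion autoencoders. They are stated here as assumptions, since this article does not give them.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Shape (channels, height, width) of the latent for an RGB image.

    downsample=8 and channels=4 are assumed autoencoder settings.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be multiples of the downsampling factor")
    return channels, height // downsample, width // downsample


def compression_ratio(height, width, downsample=8, channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, downsample, channels)
    return (3 * height * width) / (c * h * w)


print(latent_shape(1024, 1024))       # (4, 128, 128)
print(compression_ratio(1024, 1024))  # 48.0
```

Under these assumptions, the U-Nets operate on 48 times fewer values than they would in pixel space, and the autoencoder's decoder is responsible for turning the result back into a full-resolution image.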

To improve the finer details of the generated images, the researchers have changed the way in which they train this autoencoder: they use a larger batch size and track the weights with an exponential moving average.
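
Tracking weights exponentially is usually done with an exponential moving average (EMA): a second copy of the weights is kept, and after every training step it moves a small fraction of the way toward the current weights. The sketch below shows that update on a plain list of numbers. The decay value and the toy "training" loop are assumptions for illustration only.

```python
def ema_update(ema_weights, weights, decay=0.999):
    """Move each averaged weight a fraction (1 - decay) toward the current one."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy example: a single weight that jitters around 1.0 from step to step.
ema = [0.0]
for step in range(200):
    noisy_weight = 1.1 if step % 2 == 0 else 0.9
    ema = ema_update(ema, [noisy_weight], decay=0.9)

print(ema)  # ends up close to 1.0, with most of the jitter smoothed out
```

The averaged copy changes more smoothly than the raw weights, which is why it is often the copy used at inference time.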

 

Where to try SDXL

The model code and weights are available on GitHub. In addition, Stability AI has two web apps that let us easily generate and edit images: DreamStudio and Clipdrop (this article is not sponsored by either of them).

 

DreamStudio

DreamStudio has the best interface for a text-to-image model I have tried so far. You can choose the number of images to be generated, the target resolution, negative keywords, and many other options that allow you to control the output. After an image is generated, you can edit the results and regenerate the parts of the image you don't like.

To generate images, you must purchase credits, which are spent according to the computational requirements of each request. 1,000 credits (~5,000 images) cost $10.

 

Clipdrop

Clipdrop offers similar image generation options in a less polished interface. It also includes other image editing tools, such as background removal and relighting. The application offers a subscription for €7 per month, as well as a usage-based API.

 

Comparison with Midjourney v5

The quality of the images generated by SDXL rivals that of the images generated by the latest version of Midjourney (v5 at the time of writing). Here are a few examples:

 

“A paper boy from the 1920s delivering newspapers on his bike.”

 

“A smiling young man holding a sign that reads good morning.”

 

“A prism decomposing a light ray in its colors.”

 

 
