
Learning Notes on Textual Inversion for Stable Diffusion

Author: Timothy Wong

Published: 24 February 2023

Training Textual Embedding

There is a very good guide on Textual Inversion written by Benny Cheung; please read that first. I followed the steps there and created my own embeddings. A textual embedding is primarily used to teach the Stable Diffusion model your own concept, e.g. a particular face, a specific object, or a drawing style.
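
For readers who prefer code to prose, here is a minimal sketch of the mechanic textual inversion relies on: a new placeholder token is added to the text encoder's vocabulary, and only that token's embedding vector is optimized during training while the rest of the model stays frozen. The token name `<my-face>` and the use of the transformers library are illustrative assumptions, not the WebUI's actual training code.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the placeholder token and grow the embedding table by one row.
tokenizer.add_tokens("<my-face>")  # "<my-face>" is a hypothetical token name
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids("<my-face>")

# Freeze the whole text encoder except the embedding table...
for param in text_encoder.parameters():
    param.requires_grad = False
embeddings = text_encoder.get_input_embeddings()
embeddings.weight.requires_grad = True

# ...and mask gradients so only the new token's row is ever updated.
mask = torch.zeros_like(embeddings.weight)
mask[token_id] = 1.0
embeddings.weight.register_hook(lambda grad: grad * mask)
```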

Here are some artworks created with the Stable Diffusion model:

Input images

This note provides supplementary information from my own learning.

Data Preparation

It is very important to train the embedding with good-quality data. Overwhelming the training data with highly correlated images may confuse the embedding.

Since our goal is to train the embedding of a face (but not the hairstyle, background, etc.), it is important to include a wide variety of training data. This means we need images that vary in zoom level, background, hairstyle, and clothing.

Here are some of the photos selected from the training set. They have a variety of zoom levels, backgrounds, hairstyles, clothing, etc.:

Input images

Training Process

The first step is to create an embedding: navigate to Train > Create Embedding. Several fields are required at this step, such as the embedding name, the initialization text, and the number of vectors per token.

Next, go to the Preprocess images tab. Select the source and destination directories. The dimensions of the images should be 512 × 512. Check the box Use BLIP for caption, which automatically generates a caption for each image, then click Preprocess.
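
As a rough illustration of the resize step, the sketch below center-crops and scales a folder of photos to 512 × 512 with Pillow. The directory names are hypothetical, and the WebUI's own preprocessor additionally handles captioning and other options.

```python
from pathlib import Path
from PIL import Image, ImageOps

src, dst = Path("raw_photos"), Path("preprocessed")  # hypothetical paths
dst.mkdir(exist_ok=True)

for path in src.glob("*.jpg"):
    img = Image.open(path).convert("RGB")
    # Center-crop to a square, then resize to 512 x 512.
    img = ImageOps.fit(img, (512, 512), Image.LANCZOS)
    img.save(dst / path.name)
```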

Upon inspection of the preprocessed data, I found that the quality of the generated captions was not always good enough, so I manually corrected some of them.

Afterwards, move to the Train tab. I found it better to use a decreasing learning rate to speed up training of the embedding. I used 5e-2:100, 5e-3:500, 5e-4:1000, 5e-5:1500, 5e-6 and kept the batch size at 1.
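
My understanding of this schedule syntax is that each rate:step pair applies that learning rate up to the given step, and the final bare rate applies for the remainder of training. The toy function below illustrates that reading:

```python
# Interpret the WebUI-style learning-rate schedule string (illustrative only).
def lr_at(step, schedule="5e-2:100, 5e-3:500, 5e-4:1000, 5e-5:1500, 5e-6"):
    for part in schedule.split(","):
        if ":" in part:
            rate, until = part.split(":")
            if step <= int(until):
                return float(rate)
        else:
            return float(part)  # final rate with no end step

for step in (50, 300, 800, 1200, 2000):
    print(step, lr_at(step))
# 50 -> 0.05, 300 -> 0.005, 800 -> 0.0005, 1200 -> 5e-05, 2000 -> 5e-06
```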

Notably, my device has relatively low specs, and I encountered out-of-memory issues during training. To mitigate this problem, the training images are resized to 256 × 256 within the Train tab (the preprocessed files remain 512 × 512).

Once training has started, it creates sample images at fixed step intervals. As training progresses, the embedding should produce images that resemble the training data. I noticed that training beyond 2000 steps had very little effect on the embedding.

Generate Images using txt2img

Once the embedding is trained, the embedding file needs to be moved into the embeddings folder of the WebUI installation in order to be discovered by the WebUI.
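
As an aside, the same embedding file can also be loaded outside the WebUI with the diffusers library. The sketch below is a hedged example: the model id, the file name my_embedding.pt, and the token string are placeholders for your own values.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id
    torch_dtype=torch.float16,
).to("cuda")

# Register the trained embedding under the token used in prompts.
pipe.load_textual_inversion("my_embedding.pt", token="my_embedding")

image = pipe("a portrait photo of my_embedding").images[0]
image.save("portrait.png")
```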

After a few attempts, I found that the effect of the embedding is too strong. For instance, invoking the embedding alone creates multiple faces in the same picture. The pictures below show an early attempt at rendering the face:

Generated images

To balance the effect of the embedding in the prompt, you can adjust the weight of individual terms: in the WebUI prompt syntax, wrapping a term in parentheses () increases its weight and wrapping it in square brackets [] decreases it, with each additional level compounding the effect.

After some experimentation, I managed to generate some artistic pictures using the embedding. In my case, the effect of the embedding is reduced using a prompt such as:

A (((oil painting))) of [[[[[my_embedding]]]]] by (((Leonardo da Vinci)))
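
For intuition, the WebUI by default multiplies a term's weight by 1.1 for each level of parentheses and divides it by 1.1 for each level of square brackets, so the prompt above roughly works out as follows (a back-of-the-envelope sketch, assuming the default 1.1 multiplier):

```python
# Effective term weights under the assumed 1.1-per-level rule.
def weight(levels, increase=True):
    return 1.1 ** levels if increase else 1.1 ** -levels

print(round(weight(3), 2))                  # (((oil painting)))     -> 1.33
print(round(weight(5, increase=False), 2))  # [[[[[my_embedding]]]]] -> 0.62
```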

Here are some of the results:

Generated images