AI Art Generation Handbook/Training/DreamBooth

What is DreamBooth?
We will focus exclusively on training with DreamBooth, so let's get on with it.

DreamBooth is a method for customizing and personalizing text-to-image diffusion models. It can achieve excellent results with only a small amount of training data. DreamBooth was originally developed on top of Imagen, and trained models can be exported as ckpt files and loaded into various UIs. However, the Imagen model and its pre-trained weights are not publicly available, so DreamBooth was initially not usable with Stable Diffusion.

Later, the diffusers library implemented DreamBooth and fully adapted it to Stable Diffusion. DreamBooth can overfit quickly, so to obtain high-quality images you must find a "sweet spot" between the number of training steps and the learning rate. It is recommended to use a low learning rate and gradually increase the number of steps until the results are satisfactory.

Installing Dreambooth
Downloading the Dreambooth extension is very easy. First, go to the "Extensions" tab, then click the "Available" tab below it.

Next, search for the extension named sd_dreambooth_extension and click the "Install" button to install it.

Click on the "Installed" tab and click "Apply & Restart UI". To ensure a proper installation, also close webui.bat and re-open it.

Note: The instructions here are based on the developer's last update at rev 926ae20

Training In Dreambooth / Model tab
(1) Click on the Dreambooth tab.

(2) To create a new model from scratch, type the Name of the model and select one of your installed models under Source Checkpoint. Ensure 512x Model is ticked.

(3) Click Create Model

(4) Wait ~1 minute for the WebUI to generate the template folder

Training In Dreambooth / Settings tab
You can click Performance (WIP) to apply the recommended settings for Dreambooth training on your PC

There are checkboxes in General, namely

(i) Use LORA - recommended for training on PCs with 8 GB ~ 12 GB of VRAM

(ii) Train Imagic Only - training with a single image only

Intervals
In the "Intervals" section, "Training Steps Per Image (Epoch)" specifies the number of training steps to run per image.

For the training steps, try 175 steps per image (for Train Object/Style).

Note: The "Train Person" and "Train Object/Style" sections on the "Concepts" page have different optimal values. The higher the step count, the longer training takes.

For example, if there are three images uploaded and you want to train for a total of 1500 steps, then each image would be trained for 500 steps.
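
The relationship above is simple multiplication; a small sketch (the helper name is illustrative, not part of the extension):

```python
# Total training steps implied by the "Training Steps Per Image
# (Epoch)" setting: each image is trained for that many steps.
def total_training_steps(num_images: int, steps_per_image: int) -> int:
    return num_images * steps_per_image

# Three images at 500 steps each -> 1500 total steps,
# matching the example above.
print(total_training_steps(3, 500))  # 1500
```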

"Pause After N Epochs" specifies how many epochs to run between each pause; it is set to 0 (no pauses).

"Amount of time to pause between Epochs (s)" works together with the previous setting and is also set to 0.

"Save Model Frequency" indicates how often (in epochs) to save a checkpoint. Each save is approximately 4-5 GB, so if disk space is limited, a higher value can be set.
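
A back-of-the-envelope estimate of the disk space those saves consume, using the ~4-5 GB per checkpoint figure above (the helper name and 4.5 GB default are illustrative):

```python
# Rough disk usage if a ~4.5 GB checkpoint is written every
# `save_every_epochs` epochs over `total_epochs` of training.
def checkpoint_disk_gb(total_epochs: int, save_every_epochs: int,
                       gb_per_save: float = 4.5) -> float:
    return (total_epochs // save_every_epochs) * gb_per_save

print(checkpoint_disk_gb(100, 25))  # 4 saves x 4.5 GB = 18.0 GB
```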

"Save Preview(s) Frequency (Epochs)" indicates how often (in epochs) to generate preview images. It is generally set to 5 and may affect training speed.

Batching
"Batch Size" is used to speed up training time, but it increases GPU memory usage.

"Gradient Accumulation Steps" is the number of steps to accumulate before the gradient is applied and backpropagated. Make sure the number of images used in training is divisible by the product of Batch Size and Gradient Accumulation Steps. For example, if both values are set to 2, the effective batch size is 4, so the number of training images should be divisible by 4.
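
The divisibility rule can be sketched as follows (function names are illustrative): the weights update once per "effective batch" of Batch Size x Gradient Accumulation Steps images, and the image count should divide evenly by that product.

```python
# Effective batch: how many images contribute to one weight update.
def effective_batch(batch_size: int, grad_accum_steps: int) -> int:
    return batch_size * grad_accum_steps

# Check that the training-image count divides evenly by the
# effective batch, as recommended above.
def image_count_ok(num_images: int, batch_size: int,
                   grad_accum_steps: int) -> bool:
    return num_images % effective_batch(batch_size, grad_accum_steps) == 0

print(effective_batch(2, 2))     # 4
print(image_count_ok(12, 2, 2))  # True: 12 divides evenly by 4
print(image_count_ok(10, 2, 2))  # False: 10 does not
```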

"Setting Gradients to None When Zeroing" uses more GPU memory.

"Gradient Checkpointing" reduces GPU memory usage.

Learning Rate
The learning rate is a value, typically between 0.0 and 1.0, used during neural network training. The model's goal is to minimize the loss, which represents the difference between the predicted output for an input and that input's true label. Values between 0.000001 and 0.00000175 are recommended for good results. Depending on the model's complexity, a simpler object might require a learning rate of only 0.001.

However, a value that is too high may produce a model that does not quite match what you envisioned (a wild card), while a value that is too low may produce outputs that look nearly identical to the training images (overfitting).
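
To build intuition for what the learning rate controls, here is a toy sketch of one gradient-descent step on the loss L(w) = (w - 3)^2, whose gradient is 2(w - 3). This is illustrative only, not DreamBooth's actual loss:

```python
# One gradient-descent step: move the weight against the gradient,
# scaled by the learning rate.
def gd_step(w: float, lr: float) -> float:
    grad = 2.0 * (w - 3.0)  # gradient of (w - 3)^2
    return w - lr * grad

w = 0.0
for _ in range(1000):
    w = gd_step(w, lr=1e-2)  # a small rate converges slowly but stably
print(round(w, 4))  # close to 3.0, the minimum of the loss
```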

 "Learning Rate Scheduler" 

The following schedulers are tested and confirmed to work with "Train Object/Style", and are explained below:

Constant with Warmup: Imagine you're trying to climb a mountain, but your legs are stiff. You need to warm up first by doing some stretches and light exercises to get your blood flowing before you can start hiking.

Similarly, with this learning rate scheduler, we start with a low learning rate and gradually increase it over a certain number of warmup steps, so the model can adjust to the training data; after the warmup, the learning rate is held constant.

Linear with Warmup: Imagine the same warmup before hiking up the mountain. Once you've warmed up, you hike at full pace, then gradually ease off as you approach the summit.

This is similar to how the linear with warmup scheduler works. We start with a low learning rate and gradually increase it linearly over a certain number of warmup steps, then decrease it linearly over the remaining training steps.

Polynomial Learning Rate: Imagine climbing a mountain in very unpredictable weather. You want to adjust your speed so that you can navigate the hike safely and avoid falling off the mountain. If you go too fast during rain or strong winds, you risk losing control and falling; if you go too slow, you won't reach your destination on time.

This is similar to how the polynomial learning rate scheduler works. We start with a high learning rate and gradually decrease it over time using a polynomial function. The degree of the polynomial determines how quickly the learning rate decreases, and we can adjust this to find the optimal speed for training the model.
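
The three schedules can be sketched per training step as follows. These are simplified versions of what libraries such as diffusers provide via get_scheduler; exact formulas vary between implementations:

```python
# Constant with warmup: ramp up from 0, then hold at the base rate.
def constant_with_warmup(step, base_lr, warmup_steps):
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr

# Linear with warmup: same ramp, then decay linearly to 0.
def linear_with_warmup(step, base_lr, warmup_steps, total_steps):
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# Polynomial: decay from the base rate; `power` sets how fast.
def polynomial(step, base_lr, total_steps, power=2.0):
    frac = min(step / total_steps, 1.0)
    return base_lr * (1.0 - frac) ** power

base = 2e-6  # a typical DreamBooth learning rate
print(constant_with_warmup(50, base, warmup_steps=100))      # mid-warmup
print(linear_with_warmup(550, base, 100, total_steps=1000))  # decaying
print(polynomial(500, base, total_steps=1000))               # 25% of base
```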

Tuning
Use EMA: This option is generally not selected. EMA (Exponential Moving Average) keeps an exponential moving average of the weights during gradient descent to avoid overfitting in the final iterations; it is not important for fine-tuning. It can improve the quality of the model, but uses more VRAM during training.

Use 8-bit Adam: Select this option to use 8-bit Adam from bitsandbytes; it reduces memory usage. For mixed precision, choose fp16 or bf16; bf16 is recommended for better performance.

Memory Attention: Choose Xformers (if using Torch 1.x) to speed up the training process. There is not much difference in memory usage between Xformers and Flash_attention (an attention mechanism that can reduce memory usage). Default is the fastest but uses the most VRAM; Xformers has average speed and VRAM usage; Flash_attention is the slowest but uses the least VRAM.

P.S.: Note that there are unconfirmed rumours that Torch 2.0 removes the need for the Memory Attention setting

Cache Latents: This option is like having a storage space where the model can save the intermediate results it computes while processing data during training. When it needs to process the same data again, it can use the cached results instead of recalculating everything from scratch. This can speed up the training process, but it requires more VRAM to store the cached results.

Train U-NET: When this option is enabled, the U-NET network is trained concurrently with the diffusion model to improve the quality of image generation. The U-NET takes the output of the diffusion model and further refines it to produce a higher-quality image. This option *may* improve the overall quality of the generated images (on a case-by-case basis), but it may require additional GPU memory and increase training time.

Set Ratio of Text Encoder Training: The best value for face images is 0.7, and for style images 0.2. If there is not enough GPU memory, set it to 0. Generally, training the U-Net may yield better results.

Offset Noise: If the value is set to 0, the feature is disabled and the model will not learn to adjust brightness and contrast. If enabled, Dreambooth adds some random noise to the brightness and contrast of the input images, which allows the model to learn to adjust these parameters and create more realistic and varied output images. A positive value will increase the brightness/contrast of the generated images, while a negative value will decrease them. However, it is highly recommended to use input images taken under different lighting conditions to simulate realistic-looking images.
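
A common formulation of the offset-noise trick is to add, on top of the usual per-pixel Gaussian noise, a small constant shift per image and channel (broadcast over height and width), scaled by the offset value. The sketch below uses numpy for a self-contained demo; the function name and shapes are illustrative, not the extension's API:

```python
import numpy as np

def noise_with_offset(shape, offset=0.1, seed=0):
    """Standard Gaussian noise plus a per-image, per-channel constant
    shift scaled by `offset`. With offset=0 this reduces to plain
    Gaussian noise, i.e. the feature disabled."""
    b, c, h, w = shape
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(shape)          # ordinary per-pixel noise
    shift = rng.standard_normal((b, c, 1, 1))  # one value per channel
    return base + offset * shift               # shift broadcasts over H, W

noise = noise_with_offset((1, 4, 64, 64), offset=0.1)
print(noise.shape)  # (1, 4, 64, 64)
```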

Freeze CLIP Normalization Layers: When enabled, this option may help prevent overfitting, which is when the model becomes too specialized to the training data and does not generalize well to new data. By freezing these layers, the model is forced to learn more meaningful and robust representations of the input data, which can improve its ability to generalize. It does not increase VRAM usage, but it may increase training time since the model cannot learn from the normalization layers during training.

Clip Skip: If Clip Skip is set to a high value (> 1), training essentially skips that many of the text encoder's final layers, so images are generated from a more limited understanding of the input text, which can lead to less coherent and less relevant outputs.

Training in Dreambooth / Generate tab
(1) For Image Generation Library, select Diffusers

(2) For Image Generation Scheduler, try the PNDM or LMSDiscrete scheduler first for better results. If training is overfitting very quickly, DDIM usually performs much better than PNDM and LMSDiscrete on an overfitted model.

Settings known to use more VRAM

 * High Batch Size
 * Set Gradients to None When Zeroing
 * Use EMA
 * Full Precision
 * Default Memory attention
 * Cache Latents
 * Text Encoder

Settings that lower VRAM usage

 * Low Batch Size
 * Gradient Checkpointing
 * fp16/bf16 precision
 * xformers/flash_attention
 * Step Ratio of Text Encoder Training 0 (no text encoder)