This repository is a quick demo of fine-tuning the diffusion model FLUX.1 [dev] with QLoRA. The goal of this work was threefold:
- I need a professional photo for job applications.
- I need a portfolio for job applications.
- I want to make Christmas cards, etc. with my family in them.
So this is an experiment in LoRA training with restricted resources.
The Experiments section is included near the top so that results can be viewed before diving into the ideas behind them.
This work makes small modifications to the FLUX LoRA quantization research example.
FLUX.1 [dev] was selected because it is a current state-of-the-art open-source diffusion model for image generation. Unfortunately, I do not have the RTX 4090 that most published examples are trained on. A few other sources already provide quantized models, but it is not clear whether they fit on the GPUs that I own: an RTX 3060 (12 GiB GDDR6) or an RX 5500 XT OC Gaming (8 GiB GDDR5).
So here we use the bitsandbytes library to quantize the model down to a manageable size for my card, and we construct our own dataset to fine-tune on.
I did not want to use a UI such as the AUTOMATIC1111 web UI or ComfyUI because they hide the training process and the actual code that goes into making the model.
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method for LLMs and now diffusion models as well. It works by freezing the pretrained weights W and decomposing the weight update into a product of two low-rank matrices, ΔW = BA, where B has shape d×r and A has shape r×k with rank r much smaller than d or k. Only A and B are trained, which drastically reduces the number of trainable parameters.
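As a minimal sketch of this idea in PyTorch (class and parameter names here are illustrative, not the repository's training code):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Low-rank factors: delta_W = B @ A has rank <= r
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank update
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```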
Since my RTX 3060 is part of the NVIDIA Ampere architecture, the bitsandbytes library supports 4-bit quantization with NF4 (Normal Float 4), which the QLoRA paper empirically shows performs better than FP4 quantization.
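In diffusers, that quantization choice is expressed through a `BitsAndBytesConfig`. A rough sketch of loading the FLUX transformer in NF4 (the exact flags used in this repo may differ; the model ID follows the public FLUX.1 [dev] checkpoint):

```python
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# NF4 4-bit quantization; bfloat16 compute is well supported on Ampere.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)
```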
I manually tagged images of myself (JRussell), my wife (ARussell), and our son (CRussell) and implemented a wrapper to build an image dataset, updating the example code to load my dataset instead of the original one. Due to privacy concerns, I won't upload the images from the dataset, but I will provide the textual descriptions, since they are relevant to understanding the tuned model.
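A minimal sketch of such a wrapper (the actual class, file layout, and field names in this repo may differ; this assumes one `.txt` caption file alongside each image):

```python
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class CaptionedImageDataset(Dataset):
    """Pairs each image with a same-named .txt caption file."""
    def __init__(self, root: str, transform=None):
        self.paths = sorted(Path(root).glob("*.jpg"))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        image = Image.open(path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        caption = path.with_suffix(".txt").read_text().strip()
        return {"image": image, "caption": caption}
```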
Most LoRA how-to guides suggest describing everything in the scene except the one concept you want the adapter to learn, and most train their adapter on a single concept. I wanted to be able to generate images with my entire family in them and therefore needed multiple trigger tokens.
I also elected to caption manually rather than generate captions with tools like WD14 or BLIP, because that way I can better understand the impact of each caption on what the model generates.
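To make the style concrete, a hypothetical caption in this multi-trigger format (illustrative only, not an actual entry from the dataset) might read: `JRussell, a man with short hair, and ARussell, a woman with long hair, holding CRussell, a toddler, standing in front of a decorated Christmas tree, warm indoor lighting`.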