How I Use Pivotal Tuning to Make Better LoRAs in Stable Diffusion

Pivotal tuning is a training technique in which embeddings and networks, such as the UNet and text encoder, are trained simultaneously within the same framework. It is not widely used among casual LoRA trainers, but it has been acknowledged in several publications [1,2,3] and is implemented in cloneofsimo’s LoRA repository. Its benefits have been demonstrated, yet limited support on both the training side and the generation side has hindered wider adoption.
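As a rough illustration of the joint training described above, the following dependency-free Python toy optimizes a trainable embedding and a rank-1 weight difference together while the base weight stays frozen. Every name and number here (`train_pivotal_toy`, the 2x2 identity base weight, the target vector) is an assumption made for illustration. It is not the code of cloneofsimo’s repository or of any real trainer, which would operate on the UNet and text encoder with an autodiff framework.

```python
def train_pivotal_toy(steps=200, lr=0.05):
    """Jointly train an embedding and a rank-1 weight delta on a toy problem."""
    base_weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen "network" weight, never updated
    embedding = [0.5, 0.5]                  # trainable; stands in for an init-word vector
    lora_down = [0.1, -0.1]                 # trainable rank-1 factor (row vector)
    lora_up = [0.0, 0.0]                    # trainable rank-1 factor, zero-init so the delta starts at 0
    target = [1.0, -1.0]                    # hypothetical output the concept should produce
    losses = []

    for _ in range(steps):
        # Forward pass: out = (base_weight + lora_up * lora_down^T) @ embedding
        down_dot_emb = sum(d * e for d, e in zip(lora_down, embedding))
        base_out = [sum(w * e for w, e in zip(row, embedding)) for row in base_weight]
        out = [b + u * down_dot_emb for b, u in zip(base_out, lora_up)]

        losses.append(sum((o - t) ** 2 for o, t in zip(out, target)))
        grad_out = [2.0 * (o - t) for o, t in zip(out, target)]

        # Backward pass, derived by hand for this tiny model.
        up_dot_grad = sum(u * g for u, g in zip(lora_up, grad_out))
        grad_embedding = [
            sum(base_weight[i][j] * grad_out[i] for i in range(2)) + lora_down[j] * up_dot_grad
            for j in range(2)
        ]
        grad_down = [up_dot_grad * e for e in embedding]
        grad_up = [g * down_dot_emb for g in grad_out]

        # One optimizer step updates the embedding AND the weight delta together.
        embedding = [e - lr * g for e, g in zip(embedding, grad_embedding)]
        lora_down = [d - lr * g for d, g in zip(lora_down, grad_down)]
        lora_up = [u - lr * g for u, g in zip(lora_up, grad_up)]

    return base_weight, embedding, (lora_down, lora_up), losses


base_weight, embedding, lora, losses = train_pivotal_toy()
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.6f}")
```

The point of the sketch is only the last three update lines: the embedding and the weight difference receive gradients from the same loss in the same step, while the base weight is left alone.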

Pivotal tuning offers several advantages for both the training process and the quality of the generated outputs:

It reduces the risk of text encoder corruption.
Custom naming becomes feasible, so there is no need to search for trigger words that avoid inappropriate token conflicts.
It separates the core characteristics stored in the embedding from the fine-tuned weight differences, leading to more efficient training.
Splitting the work between the embedding and the network improves transferability, allowing smoother integration with other models and datasets.
Appropriate initialization words can be given to the embedding without affecting how the text encoder interprets the input.

To illustrate the practical implementation of pivotal tuning, the author introduces the gekidol model, which employs 16 embeddings: 15 dedicated to characters and one for style. Because the triggers are embeddings rather than fixed words, renaming them is simple and raises no concerns about how the text encoder understands a specific trigger word.
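A minimal sketch of why renaming is cheap, assuming embeddings are stored as a plain mapping from trigger name to learned vectors: only the key changes, and the vectors themselves stay identical. The function `rename_embedding` and the example names and values are hypothetical and are not taken from the gekidol model.

```python
def rename_embedding(embeddings, old_name, new_name):
    """Return a new mapping where old_name is replaced by new_name; vectors are untouched."""
    if old_name not in embeddings:
        raise KeyError(f"unknown embedding: {old_name}")
    if new_name in embeddings:
        raise ValueError(f"name already in use: {new_name}")
    return {
        (new_name if name == old_name else name): vectors
        for name, vectors in embeddings.items()
    }


# Hypothetical store: one character embedding and one style embedding,
# each holding a list of token vectors (toy 3-dimensional values).
store = {
    "character_a": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "style_x": [[0.7, 0.8, 0.9]],
}
renamed = rename_embedding(store, "character_a", "my_heroine")
print(sorted(renamed))
```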

The author presents concrete examples of two key advantages: reduced reliance on the fine-tuned text encoder and preservation of character traits through embeddings. Splitting the work this way minimizes potential text encoder corruption, as depicted in the visual representations provided, particularly for anime-type models. The author also notes that learned concepts and characteristics transfer better across different models, and attributes this to embeddings being more adaptable than weight differences.

While recognizing the practical challenges associated with pivotal tuning, the author recommends HCP-Diffusion, an actively developed trainer that supports pivotal tuning and offers additional functionality, including the ability to weight reconstruction losses and to train DreamArtist++. Despite some limitations and bugs, the author endorses HCP-Diffusion for its flexibility and configurability, which make it a preferred choice for advanced users.
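One possible reading of weighting reconstruction losses is that each training sample's reconstruction error is scaled by a weight before averaging, so that some images count more than others. The sketch below shows that idea in plain Python. The function `weighted_reconstruction_loss` and its arguments are hypothetical and do not reflect HCP-Diffusion's actual configuration or API.

```python
def weighted_reconstruction_loss(predictions, targets, weights):
    """Weighted average of per-sample mean squared errors."""
    if not (len(predictions) == len(targets) == len(weights)):
        raise ValueError("predictions, targets, and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_sum = 0.0
    for pred, target, weight in zip(predictions, targets, weights):
        # Plain mean squared error for this one sample.
        per_sample_mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
        weighted_sum += weight * per_sample_mse
    return weighted_sum / total_weight


# Toy values: the second sample is wrong but only counts half as much.
loss = weighted_reconstruction_loss(
    predictions=[[1.0, 2.0], [3.0, 4.0]],
    targets=[[1.0, 2.0], [1.0, 2.0]],
    weights=[1.0, 0.5],
)
print(round(loss, 4))
```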

The post also discusses the integration of pivotal tuning into narugo1992’s automatic pipeline and the ongoing efforts to optimize it within their dataset construction strategy. In addition, the author addresses the organizational complexity that comes with multiple embeddings, highlighting solutions proposed by the community, including a bundle system for managing embeddings and LoRAs more effectively.
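The bundle idea can be sketched as packing the LoRA weights and their embeddings into one flat mapping, distinguished by key prefix, so that a single file carries everything a concept needs. The prefixes `lora.` and `bundle_emb.` and the helper names below are assumptions chosen for illustration; the real key layout used by any given web UI or trainer may differ.

```python
LORA_PREFIX = "lora."        # hypothetical prefix for network weight differences
EMB_PREFIX = "bundle_emb."   # hypothetical prefix for bundled embeddings


def pack_bundle(lora_weights, embeddings):
    """Merge LoRA weights and embeddings into one flat, prefix-keyed mapping."""
    bundle = {LORA_PREFIX + key: value for key, value in lora_weights.items()}
    bundle.update({EMB_PREFIX + name: vectors for name, vectors in embeddings.items()})
    return bundle


def unpack_bundle(bundle):
    """Split a flat bundle back into (lora_weights, embeddings)."""
    lora_weights, embeddings = {}, {}
    for key, value in bundle.items():
        if key.startswith(EMB_PREFIX):
            embeddings[key[len(EMB_PREFIX):]] = value
        elif key.startswith(LORA_PREFIX):
            lora_weights[key[len(LORA_PREFIX):]] = value
        else:
            raise ValueError(f"unrecognized bundle key: {key}")
    return lora_weights, embeddings


# Toy round trip with placeholder values instead of real tensors.
lora = {"unet.block0.down": [0.1, 0.2], "unet.block0.up": [0.3, 0.4]}
embs = {"character_a": [[0.5, 0.6]], "style_x": [[0.7, 0.8]]}
bundle = pack_bundle(lora, embs)
restored_lora, restored_embs = unpack_bundle(bundle)
```

With this layout a user handles one artifact instead of one LoRA file plus many separate embedding files, which is the organizational problem the bundle proposal targets.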

Despite these advantages, the widespread adoption of pivotal tuning is impeded by several challenges, including the management of multiple embeddings during fine-tuning and the ongoing development needed to improve compatibility with various platforms. The community’s active engagement in addressing these challenges nonetheless underscores the potential of pivotal tuning as a valuable technique in the development of advanced diffusion models.

[1] Kumari, N., Zhang, B., Zhang, R., Shechtman, E., & Zhu, J. Y. (2023). Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1931-1941).

[2] Smith, J. S., Hsu, Y. C., Zhang, L., Hua, T., Kira, Z., Shen, Y., & Jin, H. (2023). Continual diffusion: Continual customization of text-to-image diffusion with c-lora. arXiv preprint arXiv:2304.06027.

[3] Gu, Y., Wang, X., Wu, J. Z., Shi, Y., Chen, Y., Fan, Z., … & Shou, M. Z. (2023). Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models. arXiv preprint arXiv:2305.18292.