Skip to content

[ECCV 2024] Official repository of ECCV 2024 paper: Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

License

Notifications You must be signed in to change notification settings

YasminZhang/EBAMA

Repository files navigation

🔥 [ECCV 2024] Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Static Badge

This repository hosts the code and resources associated with our paper on multiple-object generation and attribute binding in text-to-image generation models like Stable Diffusion.

Abstract

Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a $z$-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects attention towards their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models.

Envirioment Setup

Clone this repository and create a conda environment:

conda env create -f environment.yaml
conda activate ebama

If you rather use an existing environment, just run:

pip install -r requirements.txt

Finally, run:

python -m spacy download en_core_web_trf

to install the transformer-based spaCy NLP parser.

Datasets

In this work, we use the following datasets:

  • AnE dataset from Attend-and-Excite. We provide the AnE dataset ane_data.py in the data folder.
  • DVMP dataset from SynGen. Please follow the repo to randomly generate the DVMP dataset.
  • ABC-6K dataset from StrDiffusion. We provide the full ABC-6K dataset ABC-6K.txt in the data folder and a subset of the dataset in data_abc.py.

EBAMA (our method)

To test our method on a specific prompt, run:

python inference.py --prompt "a purple crown and a blue suitcase" --seed 12345

Note that this will download the stable diffusion model CompVis/stable-diffusion-v1-4. If you rather use an existing copy of the model, provide the absolute path using --model_path. For example, you can use runwayml/stable-diffusion-v1-5 for Stable Diffusion v1.5.

Metrics

We mainly use the following metrics to evaluate the generated images:

  • Text-Image Full Similarity
  • Text-Image Min Similarity
  • Text-Caption Similarity

Besides, we also provide the code to compute the following metrics as defined in Attend-and-Excite:

  • Text-Image Max Similarity
  • Text-Image Avg Similarity

We provide the evaluation code in the metrics folder. To evaluate the generated images and captions, for example, run:

python metrics/compute_clip_similarity.py  

You can define the paths to the generated images and captions and save path in metrics/path_name

Credits

We would like to give credits to the following repositories, from which we adapted certain code components for our research:

Citation

Static Badge

If you find this code or our results useful, please cite as:

@inproceedings{zhang2024object,
	address = {Cham},
	author = {Zhang, Yasi and Yu, Peiyu and Wu, Ying Nian},
	booktitle = {Computer Vision -- ECCV 2024},
	editor = {Leonardis, Ale{\v{s}} and Ricci, Elisa and Roth, Stefan and Russakovsky, Olga and Sattler, Torsten and Varol, G{\"u}l},
	isbn = {978-3-031-72946-1},
	pages = {55--71},
	publisher = {Springer Nature Switzerland},
	title = {Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models},
	year = {2025}}

Star History

Star History Chart

About

[ECCV 2024] Official repository of ECCV 2024 paper: Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages