This demo application ("demoDiffusion") showcases the acceleration of Stable Diffusion and ControlNet pipelines using TensorRT.
git clone git@github.com:NVIDIA/TensorRT.git -b release/10.8 --single-branch
cd TensorRT
Install nvidia-docker using these instructions.
docker run --rm -it --gpus all -v $PWD:/workspace nvcr.io/nvidia/pytorch:25.01-py3 /bin/bash
NOTE: The demo supports CUDA>=11.8
python3 -m pip install --upgrade pip
pip install --pre tensorrt-cu12
Check your installed version using:
python3 -c 'import tensorrt;print(tensorrt.__version__)'
NOTE: Alternatively, you can download and install TensorRT packages from NVIDIA TensorRT Developer Zone.
export TRT_OSSPATH=/workspace
cd $TRT_OSSPATH/demo/Diffusion
pip3 install -r requirements.txt
NOTE: demoDiffusion has been tested on systems with NVIDIA H100, A100, L40, T4, and RTX4090 GPUs, and the following software configuration.
diffusers 0.31.0
onnx 1.15.0
onnx-graphsurgeon 0.5.2
onnxruntime 1.16.3
polygraphy 0.49.9
tensorrt 10.8.0.43
tokenizers 0.13.3
torch 2.2.0
transformers 4.42.2
controlnet-aux 0.0.6
nvidia-modelopt 0.15.1
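To spot-check a few of these versions inside the container, a quick one-liner like the following can be used (the package set shown is just an illustrative subset):
python3 -c "import tensorrt, torch, diffusers, transformers, onnx; print(tensorrt.__version__, torch.__version__, diffusers.__version__, transformers.__version__, onnx.__version__)"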
python3 demo_txt2img.py --help
python3 demo_img2img.py --help
python3 demo_inpaint.py --help
python3 demo_controlnet.py --help
python3 demo_txt2img_xl.py --help
python3 demo_txt2img_flux.py --help
To download model checkpoints for the Stable Diffusion pipelines, obtain a read access token for the HuggingFace Hub. See instructions.
export HF_TOKEN=<your access token>
python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN
Run the below command to generate an image with SD1.5 or SD2.1 in INT8
python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --int8
Run the below command to generate an image with SD1.5 or SD2.1 in FP8. (FP8 is only supported on Hopper and Ada.)
python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --fp8
wget https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg -O sketch-mountains-input.jpg
python3 demo_img2img.py "A fantasy landscape, trending on artstation" --hf-token=$HF_TOKEN --input-image=sketch-mountains-input.jpg
wget https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png -O dog-on-bench.png
wget https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png -O dog-mask.png
python3 demo_inpaint.py "a mecha robot sitting on a bench" --hf-token=$HF_TOKEN --input-image=dog-on-bench.png --mask-image=dog-mask.png
NOTE: Inpainting is only supported in versions 1.5 and 2.0.
python3 demo_controlnet.py "Stormtrooper's lecture in beautiful lecture hall" --controlnet-type depth --hf-token=$HF_TOKEN --denoising-steps 20 --onnx-dir=onnx-cnet-depth --engine-dir=engine-cnet-depth
NOTE: --input-image must be a pre-processed image corresponding to --controlnet-type. If unspecified, a sample image will be downloaded. Supported controlnet types include: canny, depth, hed, mlsd, normal, openpose, scribble, and seg.
Multiple ControlNet types can also be specified to combine the conditionings. When specifying multiple conditionings, controlnet scales should also be provided. The scales signify the importance of each conditioning relative to the others. For example, to condition using openpose and canny with scales of 1.0 and 0.8 respectively, the arguments provided would be --controlnet-type openpose canny and --controlnet-scale 1.0 0.8. Note that the number of controlnet scales provided must match the number of controlnet types.
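A combined run might look like the following sketch, which reuses the demo_controlnet.py interface shown above (the prompt and output directories are placeholders):
python3 demo_controlnet.py "Stormtrooper's lecture in beautiful lecture hall" --controlnet-type openpose canny --controlnet-scale 1.0 0.8 --hf-token=$HF_TOKEN --onnx-dir=onnx-cnet-openpose-canny --engine-dir=engine-cnet-openpose-canny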
Run the below command to generate an image with Stable Diffusion XL
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --version=xl-1.0
The optional refiner model may be enabled by specifying --enable-refiner and separate directories for storing refiner ONNX and engine files using --onnx-refiner-dir and --engine-refiner-dir respectively.
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --version=xl-1.0 --enable-refiner --onnx-refiner-dir=onnx-refiner --engine-refiner-dir=engine-refiner
python3 demo_txt2img_xl.py "Picture of a rustic Italian village with Olive trees and mountains" --version=xl-1.0 --lora-path "ostris/crayon_style_lora_sdxl" "ostris/watercolor_style_lora_sdxl" --lora-weight 0.3 0.7 --onnx-dir onnx-sdxl-lora --engine-dir engine-sdxl-lora --build-enable-refit
Run the below command to generate an image with Stable Diffusion XL in INT8
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8
Run the below command to generate an image with Stable Diffusion XL in FP8. (FP8 is only supported on Hopper and Ada.)
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --fp8
Note that INT8 & FP8 quantization is only supported for SDXL, SD1.5, SD2.1 and SD2.1-base, and won't work with LoRA weights. FP8 quantization is only supported on Hopper and Ada. Some prompts may produce better results with fewer denoising steps (e.g. --denoising-steps 20), but this will repeat the calibration, ONNX export, and engine building processes for the U-Net.
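As an illustrative variation of the INT8 SDXL command above (the step count here is only an example and will trigger recalibration):
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --version xl-1.0 --onnx-dir onnx-sdxl --engine-dir engine-sdxl --int8 --denoising-steps 20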
For step-by-step tutorials to run INT8 & FP8 inference on stable diffusion models, please refer to examples in TensorRT ModelOpt diffusers sample.
LCM-LoRA produces good quality images in 4 to 8 denoising steps instead of the 30+ needed by the base model. Note that we use the LCM scheduler and disable classifier-free guidance by setting --guidance-scale to 0.
LoRA weights are fused into the ONNX and finalized TensorRT plan files in this example.
python3 demo_txt2img_xl.py "Einstein" --version xl-1.0 --lora-path "latent-consistency/lcm-lora-sdxl" --lora-weight 1.0 --onnx-dir onnx-sdxl-lcm-nocfg --engine-dir engine-sdxl-lcm-nocfg --denoising-steps 4 --scheduler LCM --guidance-scale 0.0
Even faster image generation than LCM, producing coherent images in just 1 step. Note: SDXL Turbo works best at 512x512 resolution with the EulerA scheduler and classifier-free guidance disabled.
python3 demo_txt2img_xl.py "Einstein" --version xl-turbo --onnx-dir onnx-sdxl-turbo --engine-dir engine-sdxl-turbo --denoising-steps 1 --scheduler EulerA --guidance-scale 0.0 --width 512 --height 512
Run the command below to generate an image using Stable Diffusion 3
python3 demo_txt2img_sd3.py "A vibrant street wall covered in colorful graffiti, the centerpiece spells \"SD3 MEDIUM\", in a storm of colors" --version sd3 --hf-token=$HF_TOKEN
You can also specify an input image conditioning as shown below
wget https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png -O dog-on-bench.png
python3 demo_txt2img_sd3.py "dog wearing a sweater and a blue collar" --version sd3 --input-image dog-on-bench.png --hf-token=$HF_TOKEN
Note that a denoising percentage is applied to the number of denoising steps when an input image conditioning is provided. Its default value is 0.6. This parameter can be updated using --denoising-percentage.
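For example, to lower the percentage for the image-conditioned run above (0.4 is just an illustrative value):
python3 demo_txt2img_sd3.py "dog wearing a sweater and a blue collar" --version sd3 --input-image dog-on-bench.png --denoising-percentage 0.4 --hf-token=$HF_TOKEN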
Download the pre-exported ONNX model
git lfs install
git clone https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt-1-1-tensorrt onnx-svd-xt-1-1
cd onnx-svd-xt-1-1 && git lfs pull && cd ..
SVD-XT-1.1 (25 frames at resolution 576x1024)
python3 demo_img2vid.py --version svd-xt-1.1 --onnx-dir onnx-svd-xt-1-1 --engine-dir engine-svd-xt-1-1 --hf-token=$HF_TOKEN
Run the command below to generate a video in FP8.
python3 demo_img2vid.py --version svd-xt-1.1 --onnx-dir onnx-svd-xt-1-1 --engine-dir engine-svd-xt-1-1 --hf-token=$HF_TOKEN --fp8
NOTE: There is a bug in HuggingFace; you can work around it by applying the following change from this PR:
# Convert num_frames to a plain Python int before repeat_interleave
if torch.is_tensor(num_frames):
    num_frames = num_frames.item()
emb = emb.repeat_interleave(num_frames, dim=0)
You may also specify a custom conditioning image using --input-image:
python3 demo_img2vid.py --version svd-xt-1.1 --onnx-dir onnx-svd-xt-1-1 --engine-dir engine-svd-xt-1-1 --input-image https://www.hdcarwallpapers.com/walls/2018_chevrolet_camaro_zl1_nascar_race_car_2-HD.jpg --hf-token=$HF_TOKEN
NOTE: The min and max guidance scales are configured using --min-guidance-scale and --max-guidance-scale respectively.
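For example (the guidance-scale values below are purely illustrative):
python3 demo_img2vid.py --version svd-xt-1.1 --onnx-dir onnx-svd-xt-1-1 --engine-dir engine-svd-xt-1-1 --min-guidance-scale 1.0 --max-guidance-scale 3.0 --hf-token=$HF_TOKEN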
Run the below command to generate an image using Stable Cascade
python3 demo_stable_cascade.py --onnx-opset=16 "Anthropomorphic cat dressed as a pilot" --onnx-dir onnx-sc --engine-dir engine-sc
The lite versions of the models are also supported using the command below
python3 demo_stable_cascade.py --onnx-opset=16 "Anthropomorphic cat dressed as a pilot" --onnx-dir onnx-sc-lite --engine-dir engine-sc-lite --lite
NOTE: The pipeline is only enabled for the BF16 model weights
NOTE: The pipeline only supports ONNX export using Opset 16.
NOTE: The denoising steps and guidance scale for the Prior and Decoder models are configured using --prior-denoising-steps, --prior-guidance-scale, --decoder-denoising-steps, and --decoder-guidance-scale respectively.
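As a sketch, these options could be combined with the Stable Cascade command above (all values below are illustrative, not recommended defaults):
python3 demo_stable_cascade.py --onnx-opset=16 "Anthropomorphic cat dressed as a pilot" --onnx-dir onnx-sc --engine-dir engine-sc --prior-denoising-steps 20 --prior-guidance-scale 4.0 --decoder-denoising-steps 10 --decoder-guidance-scale 0.0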
Install Git LFS:
sudo apt-get install git-lfs
Download ONNX models for the desired pipeline and precision:
# login to huggingface-cli using the $HF_TOKEN
git config --global credential.helper store # set the 'store' credential helper as default
huggingface-cli login --token $HF_TOKEN --add-to-git-credential
# Example for flux.1-dev BF16 pipeline. Models will be downloaded to ./onnx-flux-dev after the script is run.
./scripts/download_flux_onnx_models.sh --version "flux.1-dev" --precision "bf16"
# View supported configurations
./scripts/download_flux_onnx_models.sh --help
# FP16 (requires >48GB VRAM for native export)
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN
# BF16
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --bf16 --onnx-dir onnx-flux-dev/ --model-onnx-dirs=transformer:onnx-flux-dev/transformer_bf16/ --engine-dir engine-flux-dev/bf16
# FP8
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --fp8 --onnx-dir onnx-flux-dev/ --model-onnx-dirs=transformer:onnx-flux-dev/transformer_fp8/ --engine-dir engine-flux-dev/fp8 --build-static-batch
# FP4
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --fp4 --onnx-dir onnx-flux-dev/ --model-onnx-dirs=transformer:onnx-flux-dev/transformer_fp4/ --engine-dir engine-flux-dev/fp4
# FP16 (requires >48GB VRAM for native export)
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell"
# BF16
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell" --bf16 --onnx-dir onnx-flux-schnell/ --model-onnx-dirs=transformer:onnx-flux-schnell/transformer_bf16 --engine-dir engine-flux-schnell/bf16
# FP8
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell" --fp8 --onnx-dir onnx-flux-schnell/ --model-onnx-dirs=transformer:onnx-flux-schnell/transformer_fp8 --engine-dir engine-flux-schnell/fp8 --build-static-batch
# FP4
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --version="flux.1-schnell" --fp4 --onnx-dir onnx-flux-schnell/ --model-onnx-dirs=transformer:onnx-flux-schnell/transformer_fp4 --engine-dir engine-flux-schnell/fp4
Download an example input image:
wget "https://miro.medium.com/v2/resize:fit:640/format:webp/1*iD8mUonHMgnlP0qrSx3qPg.png" -O yellow.png
Run the image-to-image pipeline:
python3 demo_img2img_flux.py "A home with 2 floors and windows. The front door is purple" --hf-token=$HF_TOKEN --input-image yellow.png --image-strength 0.95 --bf16 --onnx-dir onnx-flux-dev/bf16 --engine-dir engine-flux-dev/
wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png
FP8 ControlNet pipelines require downloading a calibration dataset and providing its path. You can use the datasets provided by Black Forest Labs here: depth | canny. Use the --calibration-dataset flag to specify the path, which defaults to ./{depth/canny}-eval/benchmark if not provided. Note that the dataset should have inputs/ and prompts/ directories underneath the provided path, matching the format of the BFL dataset.
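For reference, an illustrative layout of the calibration dataset path (based on the description above) looks like this:
depth-eval/benchmark/inputs/     # conditioning images
depth-eval/benchmark/prompts/    # corresponding text prompts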
# BF16
python3 demo_img2img_flux.py "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts." --version="flux.1-dev-depth" --hf-token=$HF_TOKEN --guidance-scale 10 --control-image robot.png --bf16 --denoising-steps 30 --onnx-dir onnx-flux-dev-depth/ --model-onnx-dirs=transformer:onnx-flux-dev-depth/transformer_bf16 --engine-dir engine-flux-dev-depth/bf16
# FP8 using pre-exported ONNX models
python3 demo_img2img_flux.py "A robot made of exotic candies" --version="flux.1-dev-depth" --hf-token=$HF_TOKEN --guidance-scale 10 --control-image robot.png --fp8 --denoising-steps 30 --onnx-dir onnx-flux-dev-depth/ --model-onnx-dirs=transformer:onnx-flux-dev-depth/transformer_fp8 --engine-dir engine-flux-dev-depth/fp8 --build-static-batch
# FP8 using native ONNX export
rm -rf onnx/* engine/* && python3 demo_img2img_flux.py "A robot made of exotic candies" --version="flux.1-dev-depth" --hf-token=$HF_TOKEN --guidance-scale 10 --control-image robot.png --fp8 --denoising-steps 30
# FP4
python3 demo_img2img_flux.py "A robot made of exotic candies" --version="flux.1-dev-depth" --hf-token=$HF_TOKEN --guidance-scale 10 --control-image robot.png --fp4 --denoising-steps 30 --onnx-dir onnx-flux-dev-depth/ --model-onnx-dirs=transformer:onnx-flux-dev-depth/transformer_fp4 --engine-dir engine-flux-dev-depth/fp4
# BF16
python3 demo_img2img_flux.py "a robot made out of gold" --version="flux.1-dev-canny" --hf-token=$HF_TOKEN --guidance-scale 30 --control-image robot.png --bf16 --onnx-dir onnx-flux-dev-canny/ --model-onnx-dirs=transformer:onnx-flux-dev-canny/transformer_bf16 --engine-dir engine-flux-dev-canny/bf16
# FP8 using pre-exported ONNX models
python3 demo_img2img_flux.py "a robot made out of gold" --version="flux.1-dev-canny" --hf-token=$HF_TOKEN --guidance-scale 30 --control-image robot.png --fp8 --onnx-dir onnx-flux-dev-canny/ --model-onnx-dirs=transformer:onnx-flux-dev-canny/transformer_fp8 --engine-dir engine-flux-dev-canny/fp8 --build-static-batch
# FP8 using native ONNX export
rm -rf onnx/* engine/* && python3 demo_img2img_flux.py "a robot made out of gold" --version="flux.1-dev-canny" --hf-token=$HF_TOKEN --guidance-scale 30 --control-image robot.png --fp8 --calibration-dataset {custom/dataset/path}
# FP4
python3 demo_img2img_flux.py "a robot made out of gold" --version="flux.1-dev-canny" --hf-token=$HF_TOKEN --guidance-scale 30 --control-image robot.png --fp4 --onnx-dir onnx-flux-dev-canny/ --model-onnx-dirs=transformer:onnx-flux-dev-canny/transformer_fp4 --engine-dir engine-flux-dev-canny/fp4
Use the --onnx-export-only flag to export ONNX models on a higher-VRAM device. The exported ONNX models can then be used on a device with lower VRAM for the engine build and inference steps.
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --onnx-export-only
- --low-vram: Enables model offloading for reduced VRAM usage.
- --ws: Enables weight streaming in TensorRT engines.
- --t5-ws-percentage and --transformer-ws-percentage: Set runtime weight-streaming budgets.
- --build-static-batch: Builds all engines using static dimensions to lower the required activation memory. This limits inference to the specified spatial dimensions (see the combined example below).
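As a sketch, the FP8 Flux pipeline above could be run with reduced VRAM by combining these flags (the exact combination to use depends on your GPU):
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --fp8 --onnx-dir onnx-flux-dev/ --model-onnx-dirs=transformer:onnx-flux-dev/transformer_fp8/ --engine-dir engine-flux-dev/fp8 --build-static-batch --low-vram --ws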
Memory usage captured below excludes the ONNX export step and assumes use of the --build-static-batch flag to reduce activation VRAM usage. Users can either use pre-exported ONNX models or export the models separately on a higher-VRAM device using --onnx-export-only.
| Precision | Default VRAM Usage | With --low-vram |
|---|---|---|
| FP16 | 39.3 GB | 23.9 GB |
| BF16 | 35.7 GB | 23.9 GB |
| FP8 | 24.6 GB | 14.9 GB |
| FP4 | 21.67 GB | 11.1 GB |
NOTE: The FP8 and FP4 Pipelines are supported on Hopper/Ada/Blackwell devices only. The FP4 pipeline is most performant on Blackwell devices.
The directories specified in --model-onnx-dirs will override the directory set in --onnx-dir. Unspecified models will continue to use the directory set in --onnx-dir.
Suppose the model storage locations are as follows:
- Transformer model ONNX files are saved at ./onnx_folder_1/transformer and ./onnx_folder_1/transformer.opt.
- VAE model ONNX files are saved in ./onnx_folder_2/vae and ./onnx_folder_2/vae.opt.
- Other models (t5 and clip) are still under ./onnx/.
The corresponding command to run the pipeline:
python3 demo_txt2img_flux.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN --onnx-dir=onnx --model-onnx-dirs=transformer:onnx_folder_1,vae:onnx_folder_2
- The noise scheduler can be set using --scheduler <scheduler>. Note: not all schedulers are available for every version.
- To accelerate engine building, use --timing-cache <path to cache file>. The cache file will be created if it does not already exist. Note that performance may degrade if cache files are used across multiple GPU targets. It is recommended to use timing caches only during development; to achieve the best performance in deployment, build engines without a timing cache.
- Specify new directories for storing ONNX and engine files when switching between versions, LoRAs, ControlNets, etc. This can be done using --onnx-dir <new onnx dir> and --engine-dir <new engine dir>.
- Inference performance can be improved by enabling CUDA graphs using --use-cuda-graph. Enabling CUDA graphs requires fixed input shapes, so this flag must be combined with --build-static-batch and cannot be combined with --build-dynamic-shape (see the example after this list).
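As a sketch combining several of these tips with the SDXL command shown earlier (the cache filename and directories are placeholders):
python3 demo_txt2img_xl.py "a photo of an astronaut riding a horse on mars" --hf-token=$HF_TOKEN --version=xl-1.0 --timing-cache timing.cache --use-cuda-graph --build-static-batch --onnx-dir onnx-sdxl-tuned --engine-dir engine-sdxl-tuned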