Hi,
I saw the reference perf numbers you provided are reported as time/step, but I'm wondering about the end-to-end (e2e) time.
I ran the same workload on an H100-SXM5, at bs=4, fp32, 4 steps, guidance scale 8 (same as the recommended setting).
The per-step duration is 0.1647 s, so 0.1647 × 4 steps = 0.6588 s. But if I insert timing calls in Python, the measured e2e duration is 1.147 s. I assume the trailing VAE decoder and CLIP text encoder take up ~50% of the entire pipeline? Does that time consumption match your experiments? If so, do you have plans to optimize those non-UNet components?
The actual code I used:
```python
from diffusers import DiffusionPipeline
import torch
import argparse
import time
import os

parser = argparse.ArgumentParser(description="Generate images using a diffusion model.")
parser.add_argument('--bs', type=int, default=1, help='Batch size for image generation')
args = parser.parse_args()

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32)
pipe.to("cuda")

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
prompt_lst = [prompt] * args.bs
num_inference_steps = 4

output_dir = "generated_images"
os.makedirs(output_dir, exist_ok=True)

warmup = 5
iters = 50  # renamed from `iter` to avoid shadowing the builtin

# Warm-up runs (not timed)
for _ in range(warmup):
    images = pipe(prompt=prompt_lst, num_inference_steps=num_inference_steps,
                  guidance_scale=8.0, lcm_origin_steps=50, output_type="pil").images

s_time = time.time()
for _ in range(iters):
    images = pipe(prompt=prompt_lst, num_inference_steps=num_inference_steps,
                  guidance_scale=8.0, lcm_origin_steps=50, output_type="pil").images
e_time = time.time()

print((e_time - s_time) / iters)              # average e2e latency per call (s)
print(args.bs / ((e_time - s_time) / iters))  # throughput (images/s)
```
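To check where the extra ~0.5 s goes, one could time each pipeline stage separately rather than only the whole call. Below is a minimal, hedged sketch of a reusable per-stage timer: the `timed` context manager is my own helper (not part of diffusers), and the commented usage with `pipe.text_encoder` / `pipe.vae` assumes the standard diffusers pipeline attribute names. On GPU, a `torch.cuda.synchronize()` on both sides of the timed region is needed for meaningful wall-clock numbers, since CUDA kernels launch asynchronously.

```python
import time
from contextlib import contextmanager

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # allow the helper to work without torch installed
    _HAS_CUDA = False


@contextmanager
def timed(name, results):
    """Accumulate wall-clock time for a named stage into the `results` dict."""
    if _HAS_CUDA:
        # Flush pending kernels so we don't attribute earlier work to this stage.
        torch.cuda.synchronize()
    start = time.time()
    yield
    if _HAS_CUDA:
        # Wait for this stage's kernels to finish before reading the clock.
        torch.cuda.synchronize()
    results[name] = results.get(name, 0.0) + time.time() - start


# Hypothetical usage against the pipeline's sub-modules (attribute names assumed):
# results = {}
# with timed("text_encoder", results):
#     ...  # run pipe.text_encoder on the tokenized prompt
# with timed("unet_steps", results):
#     ...  # run the 4 denoising steps
# with timed("vae_decode", results):
#     ...  # run pipe.vae.decode on the final latents
# print(results)
```

Summing the per-stage entries and comparing against the total e2e time would confirm whether the text encoder and VAE decoder really account for roughly half of the pipeline.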