FLUX.1-dev FP8 Example Code: tmpxft_00000788_00000000-10_fp8_marlin.cudafe1.cpp #10467

Closed · nitinmukesh opened this issue Jan 6, 2025 · 4 comments
Labels: bug (Something isn't working)

Comments


nitinmukesh commented Jan 6, 2025

Describe the bug

Unable to run inference using FLUX.1-dev FP8.

Reproduction

https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux#single-file-loading-for-the-fluxtransformer2dmodel

import torch
from diffusers import FluxTransformer2DModel, FluxPipeline
from transformers import T5EncoderModel, CLIPTextModel
from optimum.quanto import freeze, qfloat8, quantize

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

# Load the FP8 single-file checkpoint, then quantize and freeze the
# transformer weights with optimum-quanto.
transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors",
    torch_dtype=dtype,
)
quantize(transformer, weights=qfloat8)
freeze(transformer)

# Quantize and freeze the T5 text encoder the same way.
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

# Build the pipeline without these two components, then attach the quantized ones.
pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2

pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=20,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]

image.save("flux-fp8-dev.png")

Logs

See attached FP8_logs.txt.

System Info

Windows 11

(venv) C:\ai1\diffuser_t2i>python --version
Python 3.10.11

(venv) C:\ai1\diffuser_t2i>echo %CUDA_PATH%
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6
(venv) C:\ai1\diffuser_t2i>pip list
Package            Version
------------------ ------------
accelerate         1.2.1
aiofiles           23.2.1
annotated-types    0.7.0
anyio              4.7.0
certifi            2024.12.14
charset-normalizer 3.4.1
click              8.1.8
colorama           0.4.6
diffusers          0.33.0.dev0
einops             0.8.0
exceptiongroup     1.2.2
fastapi            0.115.6
ffmpy              0.5.0
filelock           3.16.1
fsspec             2024.12.0
gguf               0.13.0
gradio             5.9.1
gradio_client      1.5.2
h11                0.14.0
httpcore           1.0.7
httpx              0.28.1
huggingface-hub    0.25.2
idna               3.10
imageio            2.36.1
imageio-ffmpeg     0.5.1
importlib_metadata 8.5.0
Jinja2             3.1.5
markdown-it-py     3.0.0
MarkupSafe         2.1.5
mdurl              0.1.2
mpmath             1.3.0
networkx           3.4.2
ninja              1.11.1.3
numpy              2.2.1
opencv-python      4.10.0.84
optimum-quanto     0.2.6
orjson             3.10.13
packaging          24.2
pandas             2.2.3
pillow             11.1.0
pip                23.0.1
protobuf           5.29.2
psutil             6.1.1
pydantic           2.10.4
pydantic_core      2.27.2
pydub              0.25.1
Pygments           2.18.0
python-dateutil    2.9.0.post0
python-multipart   0.0.20
pytz               2024.2
PyYAML             6.0.2
regex              2024.11.6
requests           2.32.3
rich               13.9.4
ruff               0.8.6
safehttpx          0.1.6
safetensors        0.5.0
semantic-version   2.10.0
sentencepiece      0.2.0
setuptools         65.5.0
shellingham        1.5.4
six                1.17.0
sniffio            1.3.1
starlette          0.41.3
sympy              1.13.1
tokenizers         0.21.0
tomlkit            0.13.2
torch              2.5.1+cu124
torchvision        0.20.1+cu124
tqdm               4.67.1
transformers       4.47.1
typer              0.15.1
typing_extensions  4.12.2
tzdata             2024.2
urllib3            2.3.0
uvicorn            0.34.0
websockets         14.1
wheel              0.45.1
zipp               3.21.0

Who can help?

You tell me who will help me resolve this issue :)

nitinmukesh added the bug label Jan 6, 2025
nitinmukesh commented Jan 6, 2025

I ran the command again without any changes:

(venv) C:\ai1\diffuser_t2i>python FLUX.py
Downloading shards: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 293.45it/s]
Loading checkpoint shards: 100%|████████████████████████████████████| 2/2 [00:01<00:00,  1.88it/s]
Loading pipeline components...:  20%|██████▏                        | 1/5 [00:00<00:00,  4.64it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████| 5/5 [00:01<00:00,  3.37it/s]
C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Traceback (most recent call last):
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2104, in _run_ninja_build
    subprocess.run(
  File "C:\Program Files\Python310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\ai1\diffuser_t2i\FLUX.py", line 24, in <module>
    image = pipe(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\diffusers\pipelines\flux\pipeline_flux.py", line 783, in __call__
    ) = self.encode_prompt(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\diffusers\pipelines\flux\pipeline_flux.py", line 370, in encode_prompt
    prompt_embeds = self._get_t5_prompt_embeds(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\diffusers\pipelines\flux\pipeline_flux.py", line 256, in _get_t5_prompt_embeds
    prompt_embeds = self.text_encoder_2(text_input_ids.to(device), output_hidden_states=False)[0]
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\accelerate\hooks.py", line 708, in pre_forward
    module.to(self.execution_device)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\transformers\modeling_utils.py", line 3164, in to
    return super().to(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 1340, in to
    return self._apply(convert)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
    module._apply(fn)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
    module._apply(fn)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 927, in _apply
    param_applied = fn(param)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 1326, in convert
    return t.to(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\qbytes.py", line 272, in __torch_function__
    return func(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\qbytes.py", line 298, in __torch_dispatch__
    return WeightQBytesTensor.create(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\qbytes.py", line 139, in create
    return MarlinF8QBytesTensor(qtype, axis, size, stride, data, scale, requires_grad)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\marlin\fp8\qbits.py", line 79, in __init__
    data_packed = MarlinF8PackedTensor.pack(data)  # pack fp8 data to in32, and apply marlier re-ordering.
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\marlin\fp8\packed.py", line 183, in pack
    data_int32 = torch.ops.quanto.pack_fp8_marlin(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\__init__.py", line 167, in gptq_marlin_repack
    return ext.lib.gptq_marlin_repack(b_q_weight, perm, size_k, size_n, num_bits)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\extension.py", line 44, in lib
    self._lib = load(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1314, in load
    return _jit_compile(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1721, in _jit_compile
    _write_ninja_file_and_build_library(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1833, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2120, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'quanto_cuda': [1/2] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc --generate-dependencies-with-compile --dependency-output gemm_cuda.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\TH -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" "-IC:\Program Files\Python310\Include" -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17 --expt-extended-lambda --use_fast_math -DQUANTO_CUDA_ARCH=890 -c C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu -o gemm_cuda.cuda.o
FAILED: gemm_cuda.cuda.o
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc --generate-dependencies-with-compile --dependency-output gemm_cuda.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\TH -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" "-IC:\Program Files\Python310\Include" -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17 --expt-extended-lambda --use_fast_math -DQUANTO_CUDA_ARCH=890 -c C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu -o gemm_cuda.cuda.o
C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(95): error: identifier "__asm__" is undefined
    __asm__ __volatile__(
    ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(98): error: expected a ")"
        : "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[0]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[1]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[2]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[3])
        ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(104): error: identifier "__asm__" is undefined
    __asm__ __volatile__(
    ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(107): error: expected a ")"
        : "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[0]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[1]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[2]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[3])
        ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(126): error: identifier "__asm__" is undefined
    __asm__ __volatile__(
    ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(129): error: expected a ")"
        : "=f"(((float *)C_warp)[0]), "=f"(((float *)C_warp)[1]), "=f"(((float *)C_warp)[2]), "=f"(((float *)C_warp)[3])
        ^

6 errors detected in the compilation of "C:/ai1/diffuser_t2i/venv/lib/site-packages/optimum/quanto/library/extensions/cuda/awq/v2/gemm_cuda.cu".
gemm_cuda.cu
ninja: build stopped: subcommand failed.


nitinmukesh commented Jan 6, 2025

So here is the culprit: huggingface/optimum-quanto#360. The quanto CUDA kernels use GCC-style __asm__ inline assembly, which MSVC (the host compiler nvcc uses on Windows) does not accept, hence the "identifier __asm__ is undefined" errors above.

I guess the examples could mention that optimum-quanto is not supported on Windows. Several others are facing the same issue, with no solution.

pip uninstall optimum-quanto worked for me. Moving on to the next example.
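Something like this hypothetical fail-fast guard in the example scripts would at least make the limitation explicit (an illustrative sketch, not an existing diffusers or quanto check):

import platform

# optimum-quanto's CUDA extension fails to JIT-compile under MSVC,
# so bail out early on Windows instead of dying mid-inference.
if platform.system() == "Windows":
    raise RuntimeError(
        "optimum-quanto's CUDA kernels do not build on Windows "
        "(see huggingface/optimum-quanto#360); use a different "
        "quantization backend such as bitsandbytes."
    )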


KMiNT21 commented Jan 6, 2025

> I guess the examples could mention that optimum-quanto is not supported on Windows. Several others are facing the same issue, with no solution.

You can use bitsandbytes instead. It works perfectly.
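For reference, a minimal sketch of the bitsandbytes route (assuming diffusers >= 0.31 with bitsandbytes installed; the 4-bit NF4 settings below are illustrative, not the only option):

import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

# Quantize the transformer at load time. No CUDA extension is JIT-compiled
# here, so this path avoids the MSVC build failure on Windows.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=dtype,
)
transformer = FluxTransformer2DModel.from_pretrained(
    bfl_repo,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=dtype,
)

pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=transformer, torch_dtype=dtype)
# CPU offload works with bnb-quantized components in recent diffusers.
pipe.enable_model_cpu_offload()

image = pipe(
    "A cat holding a sign that says hello world",
    guidance_scale=3.5,
    num_inference_steps=20,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-dev-bnb.png")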


nitinmukesh commented Jan 6, 2025

@KMiNT21

Done. Thanks for the information.
