FLUX.1-dev FP8 Example Code: tmpxft_00000788_00000000-10_fp8_marlin.cudafe1.cpp #10467

Closed · nitinmukesh opened this issue Jan 6, 2025 · 4 comments
Labels: bug (Something isn't working)

Comments


nitinmukesh commented Jan 6, 2025

Describe the bug

Unable to run inference using FLUX.1-dev FP8.

Reproduction

https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux#single-file-loading-for-the-fluxtransformer2dmodel

import torch
from diffusers import FluxTransformer2DModel, FluxPipeline
from transformers import T5EncoderModel, CLIPTextModel
from optimum.quanto import freeze, qfloat8, quantize

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

# Load the FP8 single-file checkpoint, then quantize and freeze the
# transformer weights with optimum-quanto.
transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors",
    torch_dtype=dtype,
)
quantize(transformer, weights=qfloat8)
freeze(transformer)

# Quantize and freeze the T5 text encoder the same way.
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

# Build the pipeline without these two components, then attach the quantized ones.
pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2

pipe.enable_model_cpu_offload()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=20,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]

image.save("flux-fp8-dev.png")

Logs

See attached FP8_logs.txt.

System Info

Windows 11

(venv) C:\ai1\diffuser_t2i>python --version
Python 3.10.11

(venv) C:\ai1\diffuser_t2i>echo %CUDA_PATH%
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6
(venv) C:\ai1\diffuser_t2i>pip list
Package            Version
------------------ ------------
accelerate         1.2.1
aiofiles           23.2.1
annotated-types    0.7.0
anyio              4.7.0
certifi            2024.12.14
charset-normalizer 3.4.1
click              8.1.8
colorama           0.4.6
diffusers          0.33.0.dev0
einops             0.8.0
exceptiongroup     1.2.2
fastapi            0.115.6
ffmpy              0.5.0
filelock           3.16.1
fsspec             2024.12.0
gguf               0.13.0
gradio             5.9.1
gradio_client      1.5.2
h11                0.14.0
httpcore           1.0.7
httpx              0.28.1
huggingface-hub    0.25.2
idna               3.10
imageio            2.36.1
imageio-ffmpeg     0.5.1
importlib_metadata 8.5.0
Jinja2             3.1.5
markdown-it-py     3.0.0
MarkupSafe         2.1.5
mdurl              0.1.2
mpmath             1.3.0
networkx           3.4.2
ninja              1.11.1.3
numpy              2.2.1
opencv-python      4.10.0.84
optimum-quanto     0.2.6
orjson             3.10.13
packaging          24.2
pandas             2.2.3
pillow             11.1.0
pip                23.0.1
protobuf           5.29.2
psutil             6.1.1
pydantic           2.10.4
pydantic_core      2.27.2
pydub              0.25.1
Pygments           2.18.0
python-dateutil    2.9.0.post0
python-multipart   0.0.20
pytz               2024.2
PyYAML             6.0.2
regex              2024.11.6
requests           2.32.3
rich               13.9.4
ruff               0.8.6
safehttpx          0.1.6
safetensors        0.5.0
semantic-version   2.10.0
sentencepiece      0.2.0
setuptools         65.5.0
shellingham        1.5.4
six                1.17.0
sniffio            1.3.1
starlette          0.41.3
sympy              1.13.1
tokenizers         0.21.0
tomlkit            0.13.2
torch              2.5.1+cu124
torchvision        0.20.1+cu124
tqdm               4.67.1
transformers       4.47.1
typer              0.15.1
typing_extensions  4.12.2
tzdata             2024.2
urllib3            2.3.0
uvicorn            0.34.0
websockets         14.1
wheel              0.45.1
zipp               3.21.0

Who can help?

You tell me who will help me resolve this issue :)

nitinmukesh added the bug label Jan 6, 2025
nitinmukesh commented Jan 6, 2025

I ran the command again without any changes:

(venv) C:\ai1\diffuser_t2i>python FLUX.py
Downloading shards: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 293.45it/s]
Loading checkpoint shards: 100%|████████████████████████████████████| 2/2 [00:01<00:00,  1.88it/s]
Loading pipeline components...:  20%|██████▏                        | 1/5 [00:00<00:00,  4.64it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████| 5/5 [00:01<00:00,  3.37it/s]
C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Traceback (most recent call last):
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2104, in _run_ninja_build
    subprocess.run(
  File "C:\Program Files\Python310\lib\subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\ai1\diffuser_t2i\FLUX.py", line 24, in <module>
    image = pipe(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\diffusers\pipelines\flux\pipeline_flux.py", line 783, in __call__
    ) = self.encode_prompt(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\diffusers\pipelines\flux\pipeline_flux.py", line 370, in encode_prompt
    prompt_embeds = self._get_t5_prompt_embeds(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\diffusers\pipelines\flux\pipeline_flux.py", line 256, in _get_t5_prompt_embeds
    prompt_embeds = self.text_encoder_2(text_input_ids.to(device), output_hidden_states=False)[0]
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\accelerate\hooks.py", line 708, in pre_forward
    module.to(self.execution_device)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\transformers\modeling_utils.py", line 3164, in to
    return super().to(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 1340, in to
    return self._apply(convert)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
    module._apply(fn)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
    module._apply(fn)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 900, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 927, in _apply
    param_applied = fn(param)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\nn\modules\module.py", line 1326, in convert
    return t.to(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\qbytes.py", line 272, in __torch_function__
    return func(*args, **kwargs)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\qbytes.py", line 298, in __torch_dispatch__
    return WeightQBytesTensor.create(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\qbytes.py", line 139, in create
    return MarlinF8QBytesTensor(qtype, axis, size, stride, data, scale, requires_grad)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\marlin\fp8\qbits.py", line 79, in __init__
    data_packed = MarlinF8PackedTensor.pack(data)  # pack fp8 data to in32, and apply marlier re-ordering.
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\tensor\weights\marlin\fp8\packed.py", line 183, in pack
    data_int32 = torch.ops.quanto.pack_fp8_marlin(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\__init__.py", line 167, in gptq_marlin_repack
    return ext.lib.gptq_marlin_repack(b_q_weight, perm, size_k, size_n, num_bits)
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\extension.py", line 44, in lib
    self._lib = load(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1314, in load
    return _jit_compile(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1721, in _jit_compile
    _write_ninja_file_and_build_library(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 1833, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "C:\ai1\diffuser_t2i\venv\lib\site-packages\torch\utils\cpp_extension.py", line 2120, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'quanto_cuda': [1/2] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc --generate-dependencies-with-compile --dependency-output gemm_cuda.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\TH -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" "-IC:\Program Files\Python310\Include" -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17 --expt-extended-lambda --use_fast_math -DQUANTO_CUDA_ARCH=890 -c C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu -o gemm_cuda.cuda.o
FAILED: gemm_cuda.cuda.o
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin\nvcc --generate-dependencies-with-compile --dependency-output gemm_cuda.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4068 -Xcompiler /wd4067 -Xcompiler /wd4624 -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=quanto_cuda -DTORCH_API_INCLUDE_EXTENSION_H -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\TH -IC:\ai1\diffuser_t2i\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\include" "-IC:\Program Files\Python310\Include" -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_89,code=compute_89 -gencode=arch=compute_89,code=sm_89 -std=c++17 --expt-extended-lambda --use_fast_math -DQUANTO_CUDA_ARCH=890 -c C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu -o gemm_cuda.cuda.o
C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(95): error: identifier "__asm__" is undefined
    __asm__ __volatile__(
    ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(98): error: expected a ")"
        : "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[0]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[1]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[2]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[3])
        ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(104): error: identifier "__asm__" is undefined
    __asm__ __volatile__(
    ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(107): error: expected a ")"
        : "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[0]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[1]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[2]), "=r"(((unsigned *)(shared_warp + (ax0_0 * 8)))[3])
        ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(126): error: identifier "__asm__" is undefined
    __asm__ __volatile__(
    ^

C:\ai1\diffuser_t2i\venv\lib\site-packages\optimum\quanto\library\extensions\cuda\awq\v2\gemm_cuda.cu(129): error: expected a ")"
        : "=f"(((float *)C_warp)[0]), "=f"(((float *)C_warp)[1]), "=f"(((float *)C_warp)[2]), "=f"(((float *)C_warp)[3])
        ^

6 errors detected in the compilation of "C:/ai1/diffuser_t2i/venv/lib/site-packages/optimum/quanto/library/extensions/cuda/awq/v2/gemm_cuda.cu".
gemm_cuda.cu
ninja: build stopped: subcommand failed.


nitinmukesh commented Jan 6, 2025

So here is the culprit: huggingface/optimum-quanto#360. The quanto CUDA kernels use GCC-style __asm__ inline assembly, which MSVC (the host compiler nvcc uses on Windows) does not accept, hence the "identifier __asm__ is undefined" errors above.

I guess the examples could mention that optimum-quanto is not supported on Windows. Several others are facing the same issue, with no solution.

pip uninstall optimum-quanto worked for me. Moving on to the next example.
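Something like this hypothetical fail-fast guard in the example scripts would at least make the limitation explicit (an illustrative sketch, not an existing diffusers or quanto check):

import platform

# optimum-quanto's CUDA extension fails to JIT-compile under MSVC,
# so bail out early on Windows instead of dying mid-inference.
if platform.system() == "Windows":
    raise RuntimeError(
        "optimum-quanto's CUDA kernels do not build on Windows "
        "(see huggingface/optimum-quanto#360); use a different "
        "quantization backend such as bitsandbytes."
    )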


KMiNT21 commented Jan 6, 2025

> I guess the examples could mention that optimum-quanto is not supported on Windows. Several others are facing the same issue, with no solution.

You can use bitsandbytes instead. It works perfectly.
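For reference, a minimal sketch of the bitsandbytes route (assuming diffusers >= 0.31 with bitsandbytes installed; the 4-bit NF4 settings below are illustrative, not the only option):

import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel, FluxPipeline

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

# Quantize the transformer at load time. No CUDA extension is JIT-compiled
# here, so this path avoids the MSVC build failure on Windows.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=dtype,
)
transformer = FluxTransformer2DModel.from_pretrained(
    bfl_repo,
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=dtype,
)

pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=transformer, torch_dtype=dtype)
# CPU offload works with bnb-quantized components in recent diffusers.
pipe.enable_model_cpu_offload()

image = pipe(
    "A cat holding a sign that says hello world",
    guidance_scale=3.5,
    num_inference_steps=20,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-dev-bnb.png")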


nitinmukesh commented Jan 6, 2025

@KMiNT21

Done. Thanks for the information.
