
The possibility of supporting GPUs with other architectures #33

Open
ziyaxuanyi opened this issue Nov 11, 2024 · 1 comment

Comments

@ziyaxuanyi

Could support be extended to graphics cards with other architectures, such as the 3090? I tested on a 3090 and found that FP8 quantization not only fails to accelerate the model, it actually slows inference down significantly.

@aredden
Owner

aredden commented Nov 14, 2024

Well, fp8 matmul is only possible on Ada devices, since those GPUs have CUDA instructions for performing matrix multiplication directly on fp8 tensors. If you don't have an Ada device, the only thing that can be done is dequantize the tensor into float16, bfloat16, or float32 and then do the matrix multiplication afterwards, which is of course significantly slower than a direct matmul on fp8 tensors. For a 3090, that is the only way to use a float8 tensor.
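
A minimal sketch (not the repository's actual code) of the fallback path described above: the capability check and the helper names here are illustrative assumptions, but on a pre-Ada GPU such as the RTX 3090 the float8 weight can only be dequantized back to the activation dtype before a regular matmul, so the fp8 storage saves memory at the cost of extra work per forward pass.

```python
import torch

def supports_fp8_matmul() -> bool:
    # Hardware fp8 matmul requires compute capability 8.9 (Ada) or newer;
    # an RTX 3090 (Ampere, 8.6) returns False here.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

def linear_fp8_fallback(x: torch.Tensor,
                        w_fp8: torch.Tensor,
                        w_scale: torch.Tensor) -> torch.Tensor:
    # Dequantize the float8 weight to the activation dtype, then run a
    # normal matmul -- the only option on non-Ada GPUs.
    w = w_fp8.to(x.dtype) * w_scale
    return x @ w.t()

# Hypothetical usage: x in bf16, w_fp8 stored as torch.float8_e4m3fn
# with a per-tensor scale produced at quantization time.
# y = linear_fp8_fallback(x, w_fp8, w_scale)
```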
