Could support be extended to graphics cards with other architectures, such as the 3090? I tested on a 3090 and found that FP8 quantization not only fails to accelerate the model, it also slows down inference significantly.
Well, FP8 matmul is only possible on Ada-generation devices, since those have CUDA instructions for performing matrix multiplication directly on FP8 tensors. Without such a device, the only option is to dequantize the tensor to float16, bfloat16, or float32 and then do the matrix multiplication afterwards, which is of course significantly slower than a direct matmul on FP8 tensors. On a 3090, that dequantize-then-matmul path is the only way to use a float8 tensor.
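To make the fallback concrete, here is a minimal PyTorch sketch of what that path looks like on a GPU without FP8 tensor cores: the float8 weight is just a storage format, so it gets cast back to bfloat16 (and rescaled) before an ordinary matmul. The function name, the per-tensor scale convention, and the shapes are illustrative assumptions, not this repo's actual code; it only shows why the 3090 saves memory but not compute.

```python
import torch

def fp8_linear_fallback(x: torch.Tensor, w_fp8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    """Hypothetical matmul path for GPUs without FP8 tensor cores (e.g. a 3090 / Ampere).

    The FP8 weight is dequantized to bfloat16 first, then a regular bfloat16
    matmul is performed. This halves weight memory versus bf16 storage, but the
    extra dequantization work is why it can be slower than plain bf16 inference.
    (Ada/Hopper GPUs can instead multiply the FP8 tensors directly.)
    """
    # Dequantize: cast the float8 tensor to a compute dtype and re-apply
    # the per-tensor scale that was used during quantization.
    w = w_fp8.to(torch.bfloat16) * w_scale.to(torch.bfloat16)
    return x.to(torch.bfloat16) @ w.t()


if __name__ == "__main__":
    # Illustrative shapes; torch.float8_e4m3fn requires a reasonably recent PyTorch.
    x = torch.randn(4, 1024, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(2048, 1024, device="cuda", dtype=torch.bfloat16)
    scale = w.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    y = fp8_linear_fallback(x, w_fp8, scale)
    print(y.shape, y.dtype)
```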