Could support be extended to graphics cards with other architectures, such as the 3090? I tested on a 3090 and found that FP8 quantization not only fails to accelerate the model, it also slows down inference significantly.
Well, FP8 matmul is only possible on Ada-generation devices, since those have CUDA instructions for performing matrix multiplication directly on FP8 tensors. Without such a device, the only option is to dequantize the tensor to float16, bfloat16, or float32 and then do the matrix multiplication afterwards, which is of course significantly slower than a direct matmul on FP8 tensors. On a 3090, that dequantize-then-matmul path is the only way to use a float8 tensor.
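To make the fallback concrete, here is a minimal PyTorch sketch of what that path looks like on a GPU without FP8 tensor cores: the float8 weight is just a storage format, so it gets cast back to bfloat16 (and rescaled) before an ordinary matmul. The function name, the per-tensor scale convention, and the shapes are illustrative assumptions, not this repo's actual code; it only shows why the 3090 saves memory but not compute.

```python
import torch

def fp8_linear_fallback(x: torch.Tensor, w_fp8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    """Hypothetical matmul path for GPUs without FP8 tensor cores (e.g. a 3090 / Ampere).

    The FP8 weight is dequantized to bfloat16 first, then a regular bfloat16
    matmul is performed. This halves weight memory versus bf16 storage, but the
    extra dequantization work is why it can be slower than plain bf16 inference.
    (Ada/Hopper GPUs can instead multiply the FP8 tensors directly.)
    """
    # Dequantize: cast the float8 tensor to a compute dtype and re-apply
    # the per-tensor scale that was used during quantization.
    w = w_fp8.to(torch.bfloat16) * w_scale.to(torch.bfloat16)
    return x.to(torch.bfloat16) @ w.t()


if __name__ == "__main__":
    # Illustrative shapes; torch.float8_e4m3fn requires a reasonably recent PyTorch.
    x = torch.randn(4, 1024, device="cuda", dtype=torch.bfloat16)
    w = torch.randn(2048, 1024, device="cuda", dtype=torch.bfloat16)
    scale = w.abs().max() / torch.finfo(torch.float8_e4m3fn).max
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)
    y = fp8_linear_fallback(x, w_fp8, scale)
    print(y.shape, y.dtype)
```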