The precision of FP16 is enough for the additions and multiplications of FP8 values, even though FP8-acc-FP32 (FP8 inputs with an FP32 accumulator) has the same FLOPS as FP8-acc-FP16 on H100. Specifying the FP16 compute type states the actual precision requirement more exactly, and the FLOPS of FP8-acc-FP16 may be boosted on future architectures.
Please keep in mind that with Tensor Cores the accumulator is used to store the result of not just a single multiplication and addition of two FP8 values, but of a long series of such multiply-adds. Every element of that sum can have a different magnitude, and FP16 precision can be inadequate to produce an accurate output. In internal experiments we saw convergence issues when using the lower-precision accumulator.
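As a rough illustration of that point (a self-contained sketch, not TransformerEngine code, assuming a compiler with `_Float16` support such as GCC 12+ on x86-64), the program below accumulates one long dot product in an FP16 accumulator and in an FP32 accumulator and compares both against an FP64 reference:

```cpp
// Sketch only: one long reduction, accumulated in FP16 vs FP32.
// Build with e.g. g++ -O2 accum_demo.cpp (needs _Float16 support).
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(0);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    const int n = 1 << 16;   // a long reduction, like the K dimension of a GEMM

    double ref = 0.0;        // FP64 reference accumulator
    float acc32 = 0.0f;      // FP32 accumulator (what TE requests)
    _Float16 acc16 = 0;      // FP16 accumulator (what the question proposes)

    for (int i = 0; i < n; ++i) {
        // Stand-ins for dequantized FP8 operands; the point here is the
        // accumulator precision, not the operand format.
        float a = dist(rng), b = dist(rng);
        float p = a * b;
        ref   += p;
        acc32 += p;
        // Every partial sum is rounded back to FP16 before the next add.
        acc16 = static_cast<_Float16>(acc16 + static_cast<_Float16>(p));
    }

    std::printf("FP64 reference sum   : %.6f\n", ref);
    std::printf("FP32 accumulator err : %.6f\n", static_cast<double>(acc32) - ref);
    std::printf("FP16 accumulator err : %.6f\n", static_cast<double>(acc16) - ref);
    return 0;
}
```

With tens of thousands of terms, each FP16 add rounds the running sum to roughly 11 significant bits, so the error grows with the reduction length, while the FP32 accumulator stays close to the FP64 reference.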
For reference, the FP8 GEMM setup the question points to: TransformerEngine/transformer_engine/common/gemm/cublaslt_gemm.cu, lines 119 to 135 at bfe21c3.
(Image: H100 Tensor Core FLOPS figures cited above, from https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper.)
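The referenced lines are not reproduced here. As a minimal sketch of where a cuBLASLt matmul pins its compute (accumulation) type, a descriptor for an FP8 GEMM might be created as below; this assumes CUDA 12.x with cublasLt and is not a copy of TE's actual setup. To my knowledge the documented compute type for FP8 matmuls in cublasLt is CUBLAS_COMPUTE_32F, i.e. FP32 accumulation, consistent with the choice discussed above.

```cpp
// Sketch only: where the compute type is pinned for a cublasLt matmul.
// Assumes CUDA 12.x; link with -lcublasLt; error checking omitted.
#include <cublasLt.h>

cublasLtMatmulDesc_t make_fp8_matmul_desc() {
    cublasLtMatmulDesc_t op_desc = nullptr;

    // CUBLAS_COMPUTE_32F requests FP32 accumulation; CUDA_R_32F is the
    // scale type used for alpha/beta.
    cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // cublasLt's FP8 matmuls additionally expect the TN layout
    // (A transposed, B non-transposed).
    cublasOperation_t trans_a = CUBLAS_OP_T;
    cublasOperation_t trans_b = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_TRANSA,
                                   &trans_a, sizeof(trans_a));
    cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_TRANSB,
                                   &trans_b, sizeof(trans_b));
    return op_desc;
}
```

A hypothetical FP8-with-FP16-accumulation mode would correspond to something like CUBLAS_COMPUTE_16F here; whether the library exposes that combination for FP8 inputs aside, the reply above is about what an FP16 accumulator would do to accuracy.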