The precision of FP16 is enough for the additions and multiplications of FP8 values, even though FP8-acc-FP32 (FP8 inputs with an FP32 accumulator) has the same FLOPS as FP8-acc-FP16 on H100. Specifying the FP16 compute type states the actual precision requirement more exactly, and the FLOPS of FP8-acc-FP16 may be boosted on future architectures.
Please keep in mind that with Tensor Cores the accumulator is used to store the result of not just a single multiplication and addition of two FP8 values, but of a long series of such multiply-adds. Every element of that sum can have a different magnitude, and FP16 precision can be inadequate to produce an accurate output. In internal experiments we saw convergence issues when using the lower-precision accumulator.
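As a rough illustration of that point (a self-contained sketch, not TransformerEngine code, assuming a compiler with `_Float16` support such as GCC 12+ on x86-64), the program below accumulates one long dot product in an FP16 accumulator and in an FP32 accumulator and compares both against an FP64 reference:

```cpp
// Sketch only: one long reduction, accumulated in FP16 vs FP32.
// Build with e.g. g++ -O2 accum_demo.cpp (needs _Float16 support).
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(0);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    const int n = 1 << 16;   // a long reduction, like the K dimension of a GEMM

    double ref = 0.0;        // FP64 reference accumulator
    float acc32 = 0.0f;      // FP32 accumulator (what TE requests)
    _Float16 acc16 = 0;      // FP16 accumulator (what the question proposes)

    for (int i = 0; i < n; ++i) {
        // Stand-ins for dequantized FP8 operands; the point here is the
        // accumulator precision, not the operand format.
        float a = dist(rng), b = dist(rng);
        float p = a * b;
        ref   += p;
        acc32 += p;
        // Every partial sum is rounded back to FP16 before the next add.
        acc16 = static_cast<_Float16>(acc16 + static_cast<_Float16>(p));
    }

    std::printf("FP64 reference sum   : %.6f\n", ref);
    std::printf("FP32 accumulator err : %.6f\n", static_cast<double>(acc32) - ref);
    std::printf("FP16 accumulator err : %.6f\n", static_cast<double>(acc16) - ref);
    return 0;
}
```

With tens of thousands of terms, each FP16 add rounds the running sum to roughly 11 significant bits, so the error grows with the reduction length, while the FP32 accumulator stays close to the FP64 reference.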
For reference, the FP8 GEMM setup the question points to: TransformerEngine/transformer_engine/common/gemm/cublaslt_gemm.cu, lines 119 to 135 at bfe21c3.
(Image: H100 Tensor Core FLOPS figures cited above, from https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper.)
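The referenced lines are not reproduced here. As a minimal sketch of where a cuBLASLt matmul pins its compute (accumulation) type, a descriptor for an FP8 GEMM might be created as below; this assumes CUDA 12.x with cublasLt and is not a copy of TE's actual setup. To my knowledge the documented compute type for FP8 matmuls in cublasLt is CUBLAS_COMPUTE_32F, i.e. FP32 accumulation, consistent with the choice discussed above.

```cpp
// Sketch only: where the compute type is pinned for a cublasLt matmul.
// Assumes CUDA 12.x; link with -lcublasLt; error checking omitted.
#include <cublasLt.h>

cublasLtMatmulDesc_t make_fp8_matmul_desc() {
    cublasLtMatmulDesc_t op_desc = nullptr;

    // CUBLAS_COMPUTE_32F requests FP32 accumulation; CUDA_R_32F is the
    // scale type used for alpha/beta.
    cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // cublasLt's FP8 matmuls additionally expect the TN layout
    // (A transposed, B non-transposed).
    cublasOperation_t trans_a = CUBLAS_OP_T;
    cublasOperation_t trans_b = CUBLAS_OP_N;
    cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_TRANSA,
                                   &trans_a, sizeof(trans_a));
    cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_TRANSB,
                                   &trans_b, sizeof(trans_b));
    return op_desc;
}
```

A hypothetical FP8-with-FP16-accumulation mode would correspond to something like CUBLAS_COMPUTE_16F here; whether the library exposes that combination for FP8 inputs aside, the reply above is about what an FP16 accumulator would do to accuracy.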