Add kernel's config cache and improve TMA alignment #47
When testing deep_gemm.gemm_fp8_fp8_bf16_nt, we observed that for certain inputs, especially small matrix shapes, the overhead outside the GEMM kernel's execution was non-negligible. Using PyTorch's profiler, we found that the functions get_col_major_tma_aligned_tensor and get_best_configs introduce significant overhead.
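For reference, a minimal sketch of how this overhead can be observed with torch.profiler. The shapes, scale layout, and the tuple-based calling convention below follow DeepGEMM's documented usage but are illustrative rather than exact:

```python
import torch
import deep_gemm

m, k, n = 64, 7168, 4096  # a small-M case where host-side overhead dominates

# FP8 operands with per-128-channel scaling factors (illustrative layout).
x = torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn)
x_scales = torch.rand(m, k // 128, device='cuda', dtype=torch.float32)
y = torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn)
y_scales = torch.rand(n // 128, k // 128, device='cuda', dtype=torch.float32)
out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)

with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU,
                    torch.profiler.ProfilerActivity.CUDA]) as prof:
    deep_gemm.gemm_fp8_fp8_bf16_nt((x, x_scales), (y, y_scales), out)

# CPU time spent in get_col_major_tma_aligned_tensor / get_best_configs
# shows up here alongside the actual kernel time.
print(prof.key_averages().table(sort_by='cpu_time_total', row_limit=20))
```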

We tried to perform the memory alignment in advance by calling get_col_major_tma_aligned_tensor before invoking deep_gemm.gemm_fp8_fp8_bf16_nt, but the alignment-related operations were still triggered during the execution of deep_gemm.gemm_fp8_fp8_bf16_nt. Additionally, get_best_configs runs redundantly for identical inputs, producing the same results each time and introducing unnecessary overhead.
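The pre-alignment pattern we attempted looks roughly like the following sketch; the loop simply stands in for repeated calls with the same tensors:

```python
from deep_gemm import get_col_major_tma_aligned_tensor

# Align the LHS scale tensor once, outside the hot path. Before this fix,
# the 2D case was still treated as unaligned inside gemm_fp8_fp8_bf16_nt,
# so the alignment copy ran again on every call.
x_scales = get_col_major_tma_aligned_tensor(x_scales)

for _ in range(100):
    deep_gemm.gemm_fp8_fp8_bf16_nt((x, x_scales), (y, y_scales), out)
```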
First, we fixed the improper handling of 2D matrices in get_col_major_tma_aligned_tensor. Second, we added a cache so that the optimal configuration is obtained before the GEMM kernel launches without repeating the search, which reduces overhead for inputs of the same size.
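A minimal sketch of the shape-keyed cache, assuming get_best_configs is deterministic for a given problem shape and SM count; the exact signature and key fields in the real implementation may differ:

```python
import functools


@functools.lru_cache(maxsize=None)
def _cached_best_configs(m, n, k, num_groups, num_sms):
    # get_best_configs returns the same result for a fixed shape, so
    # identical inputs now pay the configuration search cost only once
    # per process. (Signature shown here is illustrative.)
    return get_best_configs(m, n, k, num_groups, num_sms)
```

The GEMM entry point would then call the cached wrapper instead of re-running the configuration search on every invocation.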