Add kernel's config cache and improve TMA alignment #47

Open · wants to merge 3 commits into main
Conversation

@DXHPC commented on Mar 7, 2025

When testing deep_gemm.gemm_fp8_fp8_bf16_nt, we observed that for certain inputs, especially matrices with small shapes, the overhead outside of the GEMM kernel's execution was non-negligible. Using PyTorch's profiler, we found that the calls to get_col_major_tma_aligned_tensor and get_best_configs introduce significant overhead.
[Screenshot: PyTorch profiler trace, 2025-03-07 11:53:19]
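A minimal sketch of how this host-side overhead can be observed with torch.profiler; the shapes, FP8 casts, and unit scales below are illustrative assumptions, not the exact test behind the screenshot:

```python
# Hedged sketch for reproducing the measurement; not the PR's exact test setup.
import torch
import deep_gemm
from torch.profiler import profile, ProfilerActivity

m, n, k = 128, 4096, 7168  # a small-M shape where host overhead dominates (assumed)

x = torch.randn(m, k, device='cuda', dtype=torch.bfloat16)
y = torch.randn(n, k, device='cuda', dtype=torch.bfloat16)

# (tensor, scale) pairs: per-token-per-128-channel scales for the LHS and
# per-128x128-block scales for the RHS; real tests use the repo's casting helpers.
x_fp8 = (x.to(torch.float8_e4m3fn), torch.ones(m, k // 128, device='cuda'))
y_fp8 = (y.to(torch.float8_e4m3fn), torch.ones(n // 128, k // 128, device='cuda'))
out = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    deep_gemm.gemm_fp8_fp8_bf16_nt(x_fp8, y_fp8, out)
    torch.cuda.synchronize()

# The table surfaces the host-side work launched before the GEMM kernel
# (e.g. the copy performed by get_col_major_tma_aligned_tensor) next to the kernel time.
print(prof.key_averages().table(sort_by='cpu_time_total', row_limit=10))
```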

We tried to perform the memory alignment in advance by calling get_col_major_tma_aligned_tensor before invoking deep_gemm.gemm_fp8_fp8_bf16_nt, but the alignment-related operations were still triggered during the execution of deep_gemm.gemm_fp8_fp8_bf16_nt. Additionally, get_best_configs is executed repeatedly for identical inputs, yielding the same results and introducing unnecessary overhead.
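A minimal sketch of that pre-alignment attempt; the shapes and scale layout are illustrative, and the top-level re-export of get_col_major_tma_aligned_tensor is assumed:

```python
# Hedged sketch only: shapes and the import path are assumptions.
import torch
import deep_gemm

m, k = 128, 7168
# Per-128-channel LHS scales, as expected by the FP8 GEMM (layout assumed).
lhs_scales = torch.randn(m, k // 128, device='cuda', dtype=torch.float32)

# Align once, ahead of the hot path.
lhs_scales = deep_gemm.get_col_major_tma_aligned_tensor(lhs_scales)

# Before this change, passing the already-aligned 2-D tensor through the function
# again still produced a fresh copy, so the same work was repeated inside
# gemm_fp8_fp8_bf16_nt; after the 2-D fix it is expected to be a no-op.
realigned = deep_gemm.get_col_major_tma_aligned_tensor(lhs_scales)
print(realigned.data_ptr() == lhs_scales.data_ptr())
```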
First, we fixed the improper handling of 2D matrices in get_col_major_tma_aligned_tensor, so a tensor that is already aligned is no longer re-processed. Second, we added a cache for the optimal configuration computed before the GEMM kernel launches, which removes the redundant search for inputs of the same size.
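A rough sketch of the caching idea, assuming get_best_configs is keyed purely on the GEMM problem size; the PR's actual implementation, the function's real signature, and its module path may differ:

```python
# Hedged sketch only: the signature and module path below are assumptions.
import functools

from deep_gemm.jit_kernels.gemm import get_best_configs  # path assumed


@functools.lru_cache(maxsize=None)
def get_best_configs_cached(m: int, n: int, k: int, num_groups: int, num_sms: int):
    # Run the config search only on a cache miss; repeated GEMMs with the same
    # problem size reuse the stored tuning result.
    return get_best_configs(m, n, k, num_groups, num_sms)
```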

@soundOfDestiny (Collaborator) commented
Thanks!

> the overhead outside of the GEMM kernel's execution was non-negligible

Indeed. get_col_major_tma_aligned_tensor is for demo use and can be fused into one CUDA kernel when we want to achieve the best performance.
