Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS #44

@sazczmh sazczmh commented Mar 6, 2025

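This PR tunes the configuration algorithm to minimize the impact of tail effects. For M=4096, N=7168, K=16384, this gives a performance gain of approximately 1.6% (1348 to 1369 TFLOPS). Across the other shapes tested, the change ranges from -1.10% to +1.96%.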

| M | N | K | Base_BlockS | Computation (base) | Tail_BlockS | Computation (tail) | Speedup |
|---|---|---|---|---|---|---|---|
| 64 | 2112 | 7168 | 132 | 195 TFLOPS | 132 | 194 TFLOPS | -0.51% |
| 64 | 24576 | 1536 | 132 | 281 TFLOPS | 128 | 283 TFLOPS | 0.71% |
| 64 | 32768 | 512 | 132 | 217 TFLOPS | 128 | 218 TFLOPS | 0.46% |
| 64 | 7168 | 16384 | 132 | 329 TFLOPS | 132 | 328 TFLOPS | -0.30% |
| 64 | 4096 | 7168 | 132 | 269 TFLOPS | 132 | 268 TFLOPS | -0.37% |
| 64 | 7168 | 2048 | 132 | 272 TFLOPS | 132 | 269 TFLOPS | -1.10% |
| 128 | 2112 | 7168 | 132 | 341 TFLOPS | 132 | 340 TFLOPS | -0.29% |
| 128 | 24576 | 1536 | 132 | 521 TFLOPS | 128 | 520 TFLOPS | -0.19% |
| 128 | 32768 | 512 | 132 | 358 TFLOPS | 128 | 365 TFLOPS | 1.96% |
| 128 | 7168 | 16384 | 132 | 631 TFLOPS | 132 | 633 TFLOPS | 0.32% |
| 128 | 4096 | 7168 | 132 | 505 TFLOPS | 132 | 504 TFLOPS | -0.20% |
| 128 | 7168 | 2048 | 132 | 486 TFLOPS | 132 | 485 TFLOPS | -0.21% |
| 4096 | 2112 | 7168 | 132 | 1054 TFLOPS | 124 | 1071 TFLOPS | 1.61% |
| 4096 | 24576 | 1536 | 132 | 992 TFLOPS | 132 | 988 TFLOPS | -0.40% |
| 4096 | 32768 | 512 | 132 | 595 TFLOPS | 132 | 591 TFLOPS | -0.67% |
| 4096 | 7168 | 16384 | 132 | 1348 TFLOPS | 128 | 1369 TFLOPS | 1.56% |
| 4096 | 4096 | 7168 | 132 | 1325 TFLOPS | 128 | 1333 TFLOPS | 0.60% |
| 4096 | 7168 | 2048 | 132 | 1026 TFLOPS | 128 | 1022 TFLOPS | -0.39% |
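
For readers unfamiliar with tail effects, here is a minimal sketch of the underlying wave arithmetic. It is not the code in this PR. It assumes one output tile per SM per wave and a hypothetical 128x128 tile size; the helper names `wave_stats` and `min_sms_same_waves` are made up for illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def wave_stats(m: int, n: int, block_m: int, block_n: int, num_sms: int):
    """Return (tiles, waves, tiles_in_last_wave) for an M x N output.

    Assumption: each SM processes one output tile per wave.
    """
    tiles = ceil_div(m, block_m) * ceil_div(n, block_n)
    waves = ceil_div(tiles, num_sms)
    last_wave_tiles = tiles - (waves - 1) * num_sms
    return tiles, waves, last_wave_tiles


def min_sms_same_waves(m: int, n: int, block_m: int, block_n: int, num_sms: int) -> int:
    """Smallest SM count that still finishes in the same number of waves."""
    tiles, waves, _ = wave_stats(m, n, block_m, block_n, num_sms)
    return ceil_div(tiles, waves)


# Hypothetical 128x128 tiles on a 132-SM GPU, M=4096, N=7168:
# 1792 tiles need 14 waves, and the last wave fills only 76 of 132 SMs.
print(wave_stats(4096, 7168, 128, 128, 132))          # (1792, 14, 76)
# 128 SMs also finish in 14 waves, with every wave full.
print(min_sms_same_waves(4096, 7168, 128, 128, 132))  # 128
```

Under these assumptions the sketch gives 128 for the 4096x7168 shape, which happens to match the Tail_BlockS value listed for it. The tile sizes and selection logic actually used in the PR may differ, so other rows will not necessarily reproduce.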

Combined with the FFMA interleaving optimization, the gain for 4096x7168x16384 is approximately 4%, reaching up to 1402 TFLOPS.

| M | N | K | Base_BlockS | Computation (base) | Tail+FFMA_BlockS | Computation (tail+FFMA) | Speedup |
|---|---|---|---|---|---|---|---|
| 64 | 2112 | 7168 | 132 | 195 TFLOPS | 132 | 194 TFLOPS | -0.51% |
| 64 | 24576 | 1536 | 132 | 281 TFLOPS | 128 | 282 TFLOPS | 0.36% |
| 64 | 32768 | 512 | 132 | 217 TFLOPS | 128 | 219 TFLOPS | 0.92% |
| 64 | 7168 | 16384 | 132 | 329 TFLOPS | 132 | 328 TFLOPS | -0.30% |
| 64 | 4096 | 7168 | 132 | 269 TFLOPS | 132 | 270 TFLOPS | 0.37% |
| 64 | 7168 | 2048 | 132 | 272 TFLOPS | 132 | 271 TFLOPS | -0.37% |
| 128 | 2112 | 7168 | 132 | 341 TFLOPS | 132 | 339 TFLOPS | -0.59% |
| 128 | 24576 | 1536 | 132 | 521 TFLOPS | 128 | 523 TFLOPS | 0.38% |
| 128 | 32768 | 512 | 132 | 358 TFLOPS | 128 | 355 TFLOPS | -0.84% |
| 128 | 7168 | 16384 | 132 | 631 TFLOPS | 132 | 633 TFLOPS | 0.32% |
| 128 | 4096 | 7168 | 132 | 505 TFLOPS | 132 | 511 TFLOPS | 1.19% |
| 128 | 7168 | 2048 | 132 | 486 TFLOPS | 132 | 485 TFLOPS | -0.21% |
| 4096 | 2112 | 7168 | 132 | 1054 TFLOPS | 124 | 1065 TFLOPS | 1.04% |
| 4096 | 24576 | 1536 | 132 | 992 TFLOPS | 132 | 1009 TFLOPS | 1.71% |
| 4096 | 32768 | 512 | 132 | 595 TFLOPS | 132 | 592 TFLOPS | -0.50% |
| 4096 | 7168 | 16384 | 132 | 1348 TFLOPS | 128 | 1402 TFLOPS | 4.01% |
| 4096 | 4096 | 7168 | 132 | 1325 TFLOPS | 128 | 1355 TFLOPS | 2.26% |
| 4096 | 7168 | 2048 | 132 | 1026 TFLOPS | 128 | 1042 TFLOPS | 1.56% |
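
The Speedup column appears consistent with the relative change between the two TFLOPS figures in each row. A small sketch of that arithmetic follows, for anyone who wants to re-check a row or convert a measured kernel time into TFLOPS. The function names are illustrative, and the `2 * M * N * K` operation count is the usual convention for a dense GEMM, assumed here rather than taken from the PR.

```python
def speedup_percent(base_tflops: float, new_tflops: float) -> float:
    """Relative throughput change versus the baseline, in percent."""
    return (new_tflops / base_tflops - 1.0) * 100.0


def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of an M x N x K GEMM, counting 2*M*N*K floating-point ops."""
    return 2.0 * m * n * k / seconds / 1e12


# Rows for M=4096, N=7168, K=16384 from the two tables above.
print(round(speedup_percent(1348, 1369), 2))  # 1.56 (tail tuning only)
print(round(speedup_percent(1348, 1402), 2))  # 4.01 (tail tuning + FFMA interleaving)
```

Because the table reports TFLOPS rounded to integers, recomputed percentages can differ from the listed ones in the last digit for some rows.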

Tested on H100-SXM with CUDA 12.8.
