Performance: Configuration algorithms tuned to minimize the impact of tail effects, now up to 1402 TFLOPS #44

sazczmh · 2025-03-06T08:42:57Z

The configuration algorithm is tuned to minimize the impact of tail effects. For M=4096, N=7168, K=16384 this gives a performance gain of approximately 1.6%.

M	N	K	Base_BlockS	Base_Computation	Tail_BlockS	Tail_Computation	Speedup
64	2112	7168	132	195 TFLOPS	132	194 TFLOPS	-0.51%
64	24576	1536	132	281 TFLOPS	128	283 TFLOPS	0.71%
64	32768	512	132	217 TFLOPS	128	218 TFLOPS	0.46%
64	7168	16384	132	329 TFLOPS	132	328 TFLOPS	-0.30%
64	4096	7168	132	269 TFLOPS	132	268 TFLOPS	-0.37%
64	7168	2048	132	272 TFLOPS	132	269 TFLOPS	-1.10%
128	2112	7168	132	341 TFLOPS	132	340 TFLOPS	-0.29%
128	24576	1536	132	521 TFLOPS	128	520 TFLOPS	-0.19%
128	32768	512	132	358 TFLOPS	128	365 TFLOPS	1.96%
128	7168	16384	132	631 TFLOPS	132	633 TFLOPS	0.32%
128	4096	7168	132	505 TFLOPS	132	504 TFLOPS	-0.20%
128	7168	2048	132	486 TFLOPS	132	485 TFLOPS	-0.21%
4096	2112	7168	132	1054 TFLOPS	124	1071 TFLOPS	1.61%
4096	24576	1536	132	992 TFLOPS	132	988 TFLOPS	-0.40%
4096	32768	512	132	595 TFLOPS	132	591 TFLOPS	-0.67%
4096	7168	16384	132	1348 TFLOPS	128	1369 TFLOPS	1.56%
4096	4096	7168	132	1325 TFLOPS	128	1333 TFLOPS	0.60%
4096	7168	2048	132	1026 TFLOPS	128	1022 TFLOPS	-0.39%
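
For readers unfamiliar with tail effects, here is a minimal sketch of the general idea, not the algorithm in this PR. It assumes a persistent kernel with 128x128 output tiles and 132 SMs; the tile sizes and the helper name `pick_num_blocks` are hypothetical. When the tile count is not a multiple of the block count, the last wave leaves some SMs idle, so the sketch picks the smallest block count that finishes in the same number of waves.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return (a + b - 1) // b


def pick_num_blocks(m: int, n: int, block_m: int = 128, block_n: int = 128,
                    num_sms: int = 132) -> tuple[int, int, int]:
    """Return (num_tiles, num_waves, num_blocks) for an M x N output.

    block_m, block_n and num_sms are illustrative assumptions,
    not values taken from the PR.
    """
    # Number of output tiles the kernel has to compute.
    num_tiles = ceil_div(m, block_m) * ceil_div(n, block_n)
    # Waves needed if every SM runs one block at a time.
    num_waves = ceil_div(num_tiles, num_sms)
    # Smallest block count that still finishes in num_waves waves,
    # so the work is spread as evenly as possible.
    num_blocks = ceil_div(num_tiles, num_waves)
    return num_tiles, num_waves, num_blocks


print(pick_num_blocks(4096, 7168))  # (1792, 14, 128)
```

With these assumed tile sizes, 1792 tiles on 132 blocks take 14 waves, and the last wave has only 76 tiles. With 128 blocks, each block processes exactly 14 tiles. This coincides with the 128 reported for the 4096x7168 shapes, but the actual tuning may also consider other factors.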

Together with the FFMA interleaving optimization, the gain for 4096x7168x16384 is approximately 4%, reaching up to 1402 TFLOPS.

M	N	K	Base_BlockS	Base_Computation	Tail+FFMA_BlockS	Tail+FFMA_Computation	Speedup
64	2112	7168	132	195 TFLOPS	132	194 TFLOPS	-0.51%
64	24576	1536	132	281 TFLOPS	128	282 TFLOPS	0.36%
64	32768	512	132	217 TFLOPS	128	219 TFLOPS	0.92%
64	7168	16384	132	329 TFLOPS	132	328 TFLOPS	-0.30%
64	4096	7168	132	269 TFLOPS	132	270 TFLOPS	0.37%
64	7168	2048	132	272 TFLOPS	132	271 TFLOPS	-0.37%
128	2112	7168	132	341 TFLOPS	132	339 TFLOPS	-0.59%
128	24576	1536	132	521 TFLOPS	128	523 TFLOPS	0.38%
128	32768	512	132	358 TFLOPS	128	355 TFLOPS	-0.84%
128	7168	16384	132	631 TFLOPS	132	633 TFLOPS	0.32%
128	4096	7168	132	505 TFLOPS	132	511 TFLOPS	1.19%
128	7168	2048	132	486 TFLOPS	132	485 TFLOPS	-0.21%
4096	2112	7168	132	1054 TFLOPS	124	1065 TFLOPS	1.04%
4096	24576	1536	132	992 TFLOPS	132	1009 TFLOPS	1.71%
4096	32768	512	132	595 TFLOPS	132	592 TFLOPS	-0.50%
4096	7168	16384	132	1348 TFLOPS	128	1402 TFLOPS	4.01%
4096	4096	7168	132	1325 TFLOPS	128	1355 TFLOPS	2.26%
4096	7168	2048	132	1026 TFLOPS	128	1042 TFLOPS	1.56%
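
The Speedup column in both tables appears to be the relative change in throughput over the base configuration. A minimal sketch under that assumption (the helper name `speedup_percent` is hypothetical):

```python
def speedup_percent(base_tflops: float, new_tflops: float) -> float:
    """Relative throughput change over the baseline, in percent."""
    return (new_tflops / base_tflops - 1.0) * 100.0


# 4096x7168x16384: tail tuning only, then tail tuning plus FFMA interleaving.
print(f"{speedup_percent(1348, 1369):.2f}%")  # 1.56%
print(f"{speedup_percent(1348, 1402):.2f}%")  # 4.01%
```

Since the reported TFLOPS values are rounded to integers, differences of a few tenths of a percent on the smaller shapes may be within measurement noise.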

Tested on H100-SXM with CUDA 12.8.

4adae1f