Here we provide scripts to benchmark the speed of different kernels, including SageAttention, FlashAttention2, and FlashAttention3. Please ensure that the `flash-attn` package is installed, as we use its benchmark API for performance evaluation.
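
For orientation, a minimal timing sketch is shown below. It assumes the `benchmark_forward` helper from `flash_attn.utils.benchmark` (the benchmark API referred to above); the shapes, repeat count, and FLOP formula are illustrative choices, not the exact configuration of the provided scripts.

```python
# Minimal sketch of timing an attention kernel and converting to TFLOPS.
# Illustrative only; the bench_*.py scripts in this folder are the reference.
import torch
from flash_attn import flash_attn_func
from flash_attn.utils.benchmark import benchmark_forward

batch, heads, seqlen, headdim = 4, 32, 4096, 128  # illustrative shapes
q, k, v = (torch.randn(batch, seqlen, heads, headdim,
                       dtype=torch.float16, device="cuda") for _ in range(3))

# Time the attention forward pass only.
_, m = benchmark_forward(flash_attn_func, q, k, v,
                         repeats=100, desc="FlashAttention2", verbose=False)

# Non-causal forward pass: QK^T and PV matmuls, 2*b*h*s^2*d FLOPs each.
flops = 4 * batch * heads * seqlen * seqlen * headdim
print(f"FlashAttention2: {flops / m.mean * 1e-12:.1f} TFLOPS")
```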
To benchmark FlashAttention3 and its FP8 variant, make sure to follow the installation steps below, since the FlashAttention3 interface is not yet stable:
```bash
git clone https://github.com/Dao-AILab/flash-attention.git --recursive
cd flash-attention
git checkout b7d29fb3b79f0b78b1c369a52aaa6628dabfb0d7 # 2.7.2 release
cd hopper
python setup.py install
```
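
Once installed, a quick import check can confirm the FlashAttention3 build. The module name `flash_attn_interface` below is our assumption for this pinned commit; adjust it if your build exposes a different name.

```python
# Sanity check that the FlashAttention3 (hopper) build is importable.
# `flash_attn_interface` is assumed to be the module name at the pinned commit.
import flash_attn_interface
print("FlashAttention3 loaded from:", flash_attn_interface.__file__)
```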
Some kernels support passing the following arguments:

- `--quant_gran`: quantization granularity for Q and K (`per_warp` or `per_thread` in the examples below); the sketch after this list illustrates the idea.
- `--pv_accum_dtype`: accumulation data type for PV. Options containing a `+` correspond to the two-level accumulation strategy, illustrated after the example usage.
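
To make the granularity option concrete, here is a rough PyTorch sketch of symmetric INT8 quantization with one scale per group of rows. The real CUDA kernels presumably tie the grouping to their warp/thread tiling; the helper name and group sizes here are illustrative only, not the kernels' implementation.

```python
# Conceptual illustration of quantization granularity: finer granularity means
# more scale factors. Not the actual CUDA kernel; group sizes are made up.
import torch

def quantize_int8(x: torch.Tensor, group_rows: int):
    """Symmetric INT8 quantization with one scale per group of `group_rows` rows."""
    rows, cols = x.shape
    grouped = x.reshape(rows // group_rows, group_rows * cols)
    scale = grouped.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per group
    q = torch.clamp(torch.round(grouped / scale), -127, 127).to(torch.int8)
    return q.reshape(rows, cols), scale.squeeze(1)

k = torch.randn(128, 64)                                # e.g. 128 tokens, head_dim 64
q_coarse, s_coarse = quantize_int8(k, group_rows=32)    # coarser: fewer scales
q_fine, s_fine = quantize_int8(k, group_rows=4)         # finer: more scales
print(s_coarse.numel(), "scales vs.", s_fine.numel(), "scales")
```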
Example usage:

```bash
# on RTX 4090
python bench_qk_int8_pv_fp8_cuda.py --pv_accum_dtype fp32+fp32 --quant_gran per_warp
# on H100
python bench_qk_int8_pv_fp8_cuda_sm90.py --pv_accum_dtype fp32+fp32 --quant_gran per_thread
```
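
As a rough intuition for the two-level accumulation options (e.g. `fp32+fp32` above), the toy example below contrasts accumulating a long sum directly in a limited-precision register with periodically flushing partial sums into an FP32 buffer. FP16 merely stands in for the on-chip low-precision accumulator, and the flush period is an arbitrary illustrative value; the real kernels do this inside the CUDA MMA pipeline.

```python
# Toy illustration of one-level vs. two-level accumulation (not the kernel code).
import torch

vals = torch.full((4096,), 0.3, dtype=torch.float32)

# One-level: accumulate everything in a low-precision register.
# FP16 stands in for the limited-precision accumulator; error grows with length.
acc_lo = torch.tensor(0.0, dtype=torch.float16)
for v in vals:
    acc_lo = acc_lo + v.half()

# Two-level: keep a short-lived low-precision partial sum, but flush it into an
# FP32 buffer every few steps, which bounds the rounding error.
acc_hi = torch.tensor(0.0)
partial = torch.tensor(0.0, dtype=torch.float16)
for i, v in enumerate(vals, 1):
    partial = partial + v.half()
    if i % 64 == 0:            # flush period, roughly "per block"; illustrative
        acc_hi += partial.float()
        partial.zero_()
acc_hi += partial.float()

print(f"exact={vals.sum().item():.1f}  "
      f"one-level={acc_lo.item():.1f}  two-level={acc_hi.item():.1f}")
```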
We provide benchmarking results on RTX 4090, L20, A100, A800, A6000, RTX 3090, H20, and H100 GPUs.
Note: the TOPS results refer only to the attention kernel itself, excluding the quantization and smoothing steps.