Implement fused kernel for FP8 scale update #593
Conversation
/te-ci pytorch
From my profiling, amax_and_scale_update?
Right, instead of writing another kernel in the framework-aware portion, let's actually add the Paddle kernel to the common part and use it in both places.
Add unit test. Signed-off-by: Tim Moon <tmoon@nvidia.com>
One thing that'll make this tricky is that the Paddle kernel is in-place. The PyTorch kernel includes the amax history roll, so making that in-place would make the kernel much more complicated than it's worth. Perhaps we should change the Paddle implementation to be out-of-place?
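For reference, here is a rough Python sketch of the history-roll step under discussion. The function name and layout are illustrative, not the actual Transformer Engine code; the point is that an out-of-place roll is trivial, while an in-place one has to shift rows without clobbering entries it has not read yet.

```python
import torch

def roll_amax_history(amax_history: torch.Tensor) -> torch.Tensor:
    """Illustrative out-of-place roll of the amax history.

    Assumes amax_history has shape [history_length, num_tensors]. torch.roll
    allocates a new tensor, so the old history stays intact while the new one
    is built; an in-place variant would need to be careful about the order in
    which rows are read and overwritten.
    """
    new_history = torch.roll(amax_history, shifts=-1, dims=0)
    new_history[0].zero_()  # clear the slot that collects the next step's amaxes
    return new_history
```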
I've moved the fused kernel to the core C++ library and modified the PyTorch and Paddle extensions so they both use it. It turns out that making the kernel work in-place was not difficult. This includes the Paddle bugfix from #633.
/te-ci
Force-pushed from 0d4a3cd to 58b9a23
/te-ci
/te-ci
The PyTorch and JAX tests are passing in pipeline 12465152, and the Paddle tests are failing due to a problem with cuDNN v9 in the upstream Paddle container. The Paddle tests pass when I use an older Paddle container in pipeline 12474691. This is ready to merge.
/te-ci
Have we tested e2e numerics for this change against the previous versions? @timmoon10 |
The tests pass, but I haven't tried full-scale training runs. |
Force-pushed from d54bbbe to c597131
@ksivaman The numerics are tested thoroughly in https://github.com/NVIDIA/TransformerEngine/blob/7fc00c0819b3249d1abbb935719a81202d097e9d/tests/pytorch/test_recipe.py, so I think full-scale convergence runs are excessively careful.
/te-ci
Running GPT-175B, I've found that training is often bottlenecked by GPU kernel launch overheads, especially in the forward pass. Profiling the Python code shows that ~20% of the Transformer layer forward pass is spent in amax_and_scale_update (compared to 9% spent launching GEMM kernels). This function is called in every forward pass of Linear, LayerNormLinear, and LayerNormMLP, and each call involves ~10 small GPU operations (sketched below).
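To make the overhead concrete, here is a minimal, unfused Python sketch of a delayed-scaling amax/scale update, roughly the sequence of small operations mentioned above. The helper name and the exact recipe details (max over the history window, power-of-two scale with a margin) are assumptions for illustration rather than the exact Transformer Engine implementation; the fused kernel does the equivalent work in a single launch.

```python
import torch

def amax_and_scale_update_reference(amax_history, scale, fp8_max, margin=0):
    """Unfused reference update; each line below is roughly one small GPU operation.

    Assumed recipe: amax = max over the history, scale = 2^(floor(log2(fp8_max / amax)) - margin).
    """
    amax = torch.max(amax_history, dim=0).values              # reduce over the history window
    amax_history = torch.roll(amax_history, -1, 0)            # shift the history
    amax_history[0].zero_()                                   # clear the slot for the next step
    exp = torch.floor(torch.log2(fp8_max / amax)) - margin    # scale exponent
    new_scale = torch.pow(2.0, exp)                           # power-of-two scaling factor
    new_scale = torch.where(amax > 0.0, new_scale, scale)     # keep the old scale if amax is zero
    new_scale = torch.where(torch.isfinite(amax), new_scale, scale)  # ...or non-finite
    new_scale_inv = 1.0 / new_scale                           # reciprocal used by the FP8 kernels
    return amax_history, new_scale, new_scale_inv
```

Each of these launches costs little GPU time but nontrivial CPU time, which is why fusing them into one kernel helps most in the launch-bound regime.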
nvFuser and torch.compile do fuse some of these operations, leading to some improvement in GPU runtime, but the extra CPU overhead results in somewhat worse performance in the CPU-bound case.

This is an experiment with using a hand-written kernel to reduce these overheads. Alternative approaches: