Implement fused kernel for FP8 scale update #593
Conversation
/te-ci pytorch
From my profiling, amax_and_scale_update?
Right, instead of writing another kernel in the framework-aware portion, let's actually add the Paddle kernel to the common part and use it in both places.
Add unit test. Signed-off-by: Tim Moon <tmoon@nvidia.com>
One thing that'll make this tricky is that the Paddle kernel is in-place. The PyTorch kernel includes the amax history roll, so making that in-place would make the kernel much more complicated than it's worth. Perhaps we should change the Paddle implementation to be out-of-place?
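For reference, here is a rough Python sketch of the history-roll step under discussion. The function name and layout are illustrative, not the actual Transformer Engine code; the point is that an out-of-place roll is trivial, while an in-place one has to shift rows without clobbering entries it has not read yet.

```python
import torch

def roll_amax_history(amax_history: torch.Tensor) -> torch.Tensor:
    """Illustrative out-of-place roll of the amax history.

    Assumes amax_history has shape [history_length, num_tensors]. torch.roll
    allocates a new tensor, so the old history stays intact while the new one
    is built; an in-place variant would need to be careful about the order in
    which rows are read and overwritten.
    """
    new_history = torch.roll(amax_history, shifts=-1, dims=0)
    new_history[0].zero_()  # clear the slot that collects the next step's amaxes
    return new_history
```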
I've moved the fused kernel to the core C++ library and modified the PyTorch and Paddle extensions so they both use it. It turns out that making the kernel work in-place was not difficult. This includes the Paddle bugfix from #633.
/te-ci
Force-pushed from 0d4a3cd to 58b9a23
/te-ci
/te-ci
The PyTorch and JAX tests are passing in pipeline 12465152, and the Paddle tests are failing due to a problem with cuDNN v9 in the upstream Paddle container. The Paddle tests pass when I use an older Paddle container in pipeline 12474691. This is ready to merge.
/te-ci
Have we tested e2e numerics for this change against the previous versions? @timmoon10 |
The tests pass, but I haven't tried full-scale training runs. |
Force-pushed from d54bbbe to c597131
@ksivaman The numerics are tested thoroughly in https://github.com/NVIDIA/TransformerEngine/blob/7fc00c0819b3249d1abbb935719a81202d097e9d/tests/pytorch/test_recipe.py, so I think full-scale convergence runs are excessively careful.
/te-ci
Running GPT-175B, I've found that training is often bottlenecked by GPU kernel launch overheads, especially in the forward pass. Profiling the Python code shows that ~20% of the Transformer layer forward pass is spent in amax_and_scale_update (compared to 9% spent launching GEMM kernels). This function is called in every forward pass of Linear, LayerNormLinear, and LayerNormMLP, and each call involves ~10 small GPU operations (sketched below).
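To make the overhead concrete, here is a minimal, unfused Python sketch of a delayed-scaling amax/scale update, roughly the sequence of small operations mentioned above. The helper name and the exact recipe details (max over the history window, power-of-two scale with a margin) are assumptions for illustration rather than the exact Transformer Engine implementation; the fused kernel does the equivalent work in a single launch.

```python
import torch

def amax_and_scale_update_reference(amax_history, scale, fp8_max, margin=0):
    """Unfused reference update; each line below is roughly one small GPU operation.

    Assumed recipe: amax = max over the history, scale = 2^(floor(log2(fp8_max / amax)) - margin).
    """
    amax = torch.max(amax_history, dim=0).values              # reduce over the history window
    amax_history = torch.roll(amax_history, -1, 0)            # shift the history
    amax_history[0].zero_()                                   # clear the slot for the next step
    exp = torch.floor(torch.log2(fp8_max / amax)) - margin    # scale exponent
    new_scale = torch.pow(2.0, exp)                           # power-of-two scaling factor
    new_scale = torch.where(amax > 0.0, new_scale, scale)     # keep the old scale if amax is zero
    new_scale = torch.where(torch.isfinite(amax), new_scale, scale)  # ...or non-finite
    new_scale_inv = 1.0 / new_scale                           # reciprocal used by the FP8 kernels
    return amax_history, new_scale, new_scale_inv
```

Each of these launches costs little GPU time but nontrivial CPU time, which is why fusing them into one kernel helps most in the launch-bound regime.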
nvFuser and torch.compile do fuse some of these operations, leading to some improvement in GPU runtime, but the extra CPU overhead results in somewhat worse performance in the CPU-bound case.

This is an experiment with using a hand-written kernel to reduce these overheads. Alternative approaches: