[common] Generalized MXFP8 gated kernels w.r.t. input tensor dimensions #1449

Oleg-Goncharov · 2025-02-01T02:34:33Z

Description

This PR lifts the restrictions on the input tensor shape for the gated MXFP8 fused kernels. As in PR#1437, the number of rows (for higher-dimensional tensors, the product of all dimensions except the last) can now be arbitrary.
The number of columns (for higher-dimensional tensors, the size of the last dimension) must be a multiple of 32.
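
To illustrate the shape rule above, here is a minimal C++ sketch that flattens an N-dimensional shape to rows and columns and checks the multiple-of-32 constraint. The names `Shape2D`, `flatten_to_2d`, and `is_supported_shape` are hypothetical and are not taken from Transformer Engine; the actual check in the library may be structured differently.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical helper type for illustration only (not a Transformer Engine API).
struct Shape2D {
  std::size_t rows;  // product of all dimensions except the last
  std::size_t cols;  // size of the last dimension
};

// Collapse an N-dimensional shape into (rows, cols) as described in the PR text.
inline Shape2D flatten_to_2d(const std::vector<std::size_t>& shape) {
  if (shape.empty()) {
    throw std::invalid_argument("shape must have at least one dimension");
  }
  const std::size_t cols = shape.back();
  const std::size_t rows = std::accumulate(shape.begin(), shape.end() - 1,
                                           std::size_t{1},
                                           std::multiplies<std::size_t>());
  return {rows, cols};
}

// Rows may be arbitrary; the last dimension must be a multiple of 32.
inline bool is_supported_shape(const std::vector<std::size_t>& shape) {
  constexpr std::size_t kColumnMultiple = 32;
  return flatten_to_2d(shape).cols % kColumnMultiple == 0;
}
```

Under these assumptions, a shape such as `{3, 5, 64}` flattens to 15 rows by 64 columns and passes the check, while `{16, 48}` is rejected because 48 is not a multiple of 32.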

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Removed the isFullTile requirement in the quantize_gated function.
Added a second tensor of scaling factors for the gate part of the gated output tensor.
Gated MXFP8 test suite: added the alignment/padding requirement that the dimensions of the scaling-factor tensors be multiples of [128, 4] for row-wise scaling and [4, 128] for column-wise scaling.
Restricted the last dimension of the gated input tensor to multiples of 32.
Modified the exp2f_rcp function in the test suites so that it returns 1.0 when biased_exp == 0, to avoid NaN in the amax = 0 scenario.
Commented out the NVTE_CHECK in CheckScaleTensorShape in transformer_engine.cpp to suppress errors raised for tensor shapes that are now supported.
Added an additional tensor descriptor to the gated kernels to properly handle the boundaries of the activation and gate parts of the input/output tensors.
Refactored the test suite.
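
Two of the items above (the scaling-tensor padding and the exp2f_rcp guard) can be sketched in a few lines of C++. This is an illustration under stated assumptions, not the code from this PR: it assumes one scaling factor per block of 32 elements, ceiling division when a dimension is not a multiple of 32, and an E8M0 exponent bias of 127. The helper names `ceil_div`, `round_up`, `rowwise_scale_shape`, and `colwise_scale_shape` are hypothetical, and the `exp2f_rcp` body below is a reconstruction from the description, not a copy of the test-suite function.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only (not Transformer Engine APIs).
struct ScaleShape {
  std::size_t rows;
  std::size_t cols;
};

constexpr std::size_t ceil_div(std::size_t a, std::size_t b) { return (a + b - 1) / b; }
constexpr std::size_t round_up(std::size_t a, std::size_t m) { return ceil_div(a, m) * m; }

// Assumption: one scaling factor per 32 consecutive elements.
constexpr std::size_t kBlockSize = 32;

// Row-wise scaling: one scale per 32 elements along a row,
// with the scale tensor padded to multiples of [128, 4].
inline ScaleShape rowwise_scale_shape(std::size_t rows, std::size_t cols) {
  return {round_up(rows, 128), round_up(ceil_div(cols, kBlockSize), 4)};
}

// Column-wise scaling: one scale per 32 elements along a column,
// with the scale tensor padded to multiples of [4, 128].
inline ScaleShape colwise_scale_shape(std::size_t rows, std::size_t cols) {
  return {round_up(ceil_div(rows, kBlockSize), 4), round_up(cols, 128)};
}

// Reciprocal of the power-of-two scale encoded by a biased exponent.
// Returns 1.0 when biased_exp == 0, as described in the change list.
// Assumption: the exponent bias is 127.
inline float exp2f_rcp(std::uint8_t biased_exp) {
  constexpr int kExponentBias = 127;
  if (biased_exp == 0) {
    return 1.0f;
  }
  return std::exp2(static_cast<float>(kExponentBias - static_cast<int>(biased_exp)));
}
```

With these assumptions, a 100 x 64 input would need a 128 x 4 row-wise scale tensor and a 4 x 128 column-wise scale tensor, which shows why the padded shapes can be much larger than the number of scales actually written.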

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

for more information, see https://pre-commit.ci

timmoon10

LGTM, pending testing. Started pipeline 23308831.

tests/cpp/operator/test_cast_mxfp8_gated_swiglu.cu

transformer_engine/common/transformer_engine.cpp

transformer_engine/common/util/cast_gated_kernels.cuh

Fixed scaling tensor alignment/padding

f061a29

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Oleg-Goncharov added the bug, testing, and 2.0.0 labels on Feb 1, 2025

Oleg-Goncharov requested review from timmoon10 and ptrendx February 1, 2025 02:34

[pre-commit.ci] auto fixes from pre-commit.com hooks

9ec386a

for more information, see https://pre-commit.ci

timmoon10 approved these changes Feb 1, 2025

timmoon10 reviewed Feb 1, 2025

transformer_engine/common/util/cast_gated_kernels.cuh (outdated, resolved)

ptrendx and others added 3 commits February 1, 2025 20:40

Merge branch 'release_v2.0' into pr_mxfp8_gated_scales_fix

56594ea

Changes from review

c385529

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

cf8b2fe

for more information, see https://pre-commit.ci

ptrendx approved these changes Feb 2, 2025

Oleg-Goncharov and others added 12 commits February 4, 2025 01:01

Fixed alignment and padding in scaled tensors. Refactoring.

33a03ca

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Skipped scenarios for non-mod(32) tensors

01955fc

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

a90dc70

for more information, see https://pre-commit.ci

Fixes

1a0cb00

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

More fixes

6e1a2be

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

Some fixes to the CPU reference

8339f0b

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

229eb7a

for more information, see https://pre-commit.ci

Fixed typo in the kernel. Restricted the last dim to multiples of 32

14b2c47

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Fixed TMA writes overlap

7923042

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

dfccb1b

for more information, see https://pre-commit.ci

Merge branch 'release_v2.0' into pr_mxfp8_gated_scales_fix

f0a15b3

Remove the largest test cases for numerical stability

72d5257

Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

ptrendx merged commit ce8b127 into NVIDIA:release_v2.0 Feb 5, 2025
11 checks passed
