
[PyTorch/C++] Comm+GEMM overlap compatibility with QuantizedTensor #1427

Merged

Conversation

Collaborator

@denera denera commented Jan 28, 2025

Description

This PR updates TE/common and TE/PyTorch API for comm+GEMM overlap to support the new QuantizedTensor abstraction.
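
For context, here is a minimal sketch of how FP8 comm+GEMM (userbuffers) overlap is typically driven from the PyTorch API this PR touches. It is not code from the PR; the entry points shown (initialize_ub, ub_overlap_rs, ub_name) and their exact signatures are assumptions that may differ between TE versions.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling

# Assumed setup: torchrun with one rank per GPU, NCCL backend.
torch.distributed.init_process_group(backend="nccl")
tp_group = torch.distributed.new_group()  # tensor-parallel group (all ranks here)
tp_size = torch.distributed.get_world_size(tp_group)

seq_len, batch, hidden = 2048, 2, 4096

# Allocate the userbuffers workspace that the overlapped all-gather /
# reduce-scatter collectives operate on (name/signature assumed).
te.initialize_ub([seq_len * batch, hidden], tp_size, use_fp8=True, dtype=torch.bfloat16)

# Row-parallel projection whose reduce-scatter is overlapped with the GEMM.
proj = te.Linear(
    hidden, hidden,
    parallel_mode="row",
    tp_group=tp_group,
    sequence_parallel=True,
    ub_overlap_rs=True,
    ub_name="proj",
    params_dtype=torch.bfloat16,
    device="cuda",
)

# Row-parallel input: the feature dimension is split across the TP group.
x = torch.randn(seq_len * batch, hidden // tp_size, device="cuda", dtype=torch.bfloat16)
with te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling()):
    y = proj(x)  # output is reduce-scattered along the sequence dimension
```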

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • [x] I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@denera denera added the 2.0.0 label Jan 28, 2025
@denera denera requested review from timmoon10 and ptrendx January 28, 2025 01:56
@denera denera self-assigned this Jan 28, 2025
…th cppqtensor

Signed-off-by: Alp Dener <adener@nvidia.com>

CommOverlap objects can now return overlap buffers to PyTorch as QuantizedTensors

Signed-off-by: Alp Dener <adener@nvidia.com>

updated comm+GEMM overlap test for pure GEMM, both BF16 and FP8 working with QuantizedTensor

Signed-off-by: Alp Dener <adener@nvidia.com>

te.Linear and te.LayerNormMLP updated for TP overlap w/ QuantizedTensor. All overlaps work in BF16. All overlaps except bulk WGRAD work in FP8.

Signed-off-by: Alp Dener <adener@nvidia.com>

completed TP overlap QuantizedTensor updates for LayerNormLinear, but issues with quantized normalization

Signed-off-by: Alp Dener <adener@nvidia.com>

all overlaps working with bf16, all but bulk WGRAD working with FP8

Signed-off-by: Alp Dener <adener@nvidia.com>

all overlaps work with Float8Tensor, except bulk wgrad in LayerNormMLP (works in other modules)

Signed-off-by: Alp Dener <adener@nvidia.com>

all overlaps working with QuantizedTensor in BF16 and FP8

Signed-off-by: Alp Dener <adener@nvidia.com>

cleaned up pytest formatting

Signed-off-by: Alp Dener <adener@nvidia.com>
@denera denera force-pushed the blackwell-cppqtensor-tp-overlap-v2.0 branch from 9ba5009 to f1dcf35 on January 28, 2025 02:38
pre-commit-ci bot and others added 2 commits January 28, 2025 02:38
…and updated test sizing

Signed-off-by: Alp Dener <adener@nvidia.com>
@ksivaman ksivaman self-requested a review January 28, 2025 20:42
Comment on lines 142 to 150
# Configure quantizer for normalization output
if fp8 and input_quantizer is None:
    raise ValueError("Missing quantizer for input tensor")
if fp8:
    if any([ub_overlap_ag_fprop, ub_overlap_rs_fprop]) and isinstance(
        FP8GlobalStateManager.get_fp8_recipe(), BlockScaling
    ):
        raise NotImplementedError("Comm+GEMM overlap does not support MXFP8 block scaling")

    if input_quantizer is None:
        raise ValueError("Missing quantizer for input tensor")
Collaborator

  • Putting UB logic here makes the comment incorrect
  • This won't generalize when we add more quantization schemes. Instead of assuming that all recipes except MXFP8 support UB, we should only assume that FP8 delayed scaling supports UB (see the sketch after the suggested change below).
Suggested change
-# Configure quantizer for normalization output
-if fp8 and input_quantizer is None:
-    raise ValueError("Missing quantizer for input tensor")
-if fp8:
-    if any([ub_overlap_ag_fprop, ub_overlap_rs_fprop]) and isinstance(
-        FP8GlobalStateManager.get_fp8_recipe(), BlockScaling
-    ):
-        raise NotImplementedError("Comm+GEMM overlap does not support MXFP8 block scaling")
-    if input_quantizer is None:
-        raise ValueError("Missing quantizer for input tensor")
+# Check if overlapped communication is supported
+if (
+    fp8
+    and (ub_overlap_ag_fprop or ub_overlap_rs_fprop)
+    and not FP8GlobalStateManager.get_fp8_recipe().delayed()
+):
+    raise NotImplementedError("Comm+GEMM overlap is only supported with FP8 delayed scaling")
+# Configure quantizer for normalization output
+if fp8:
+    if input_quantizer is None:
+        raise ValueError("Missing quantizer for input tensor")

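To make the second bullet concrete, here is a minimal sketch of the kind of centralized check being suggested: gate userbuffers (UB) overlap on recipes explicitly known to support it, rather than excluding MXFP8 by name. The helper names and import path below are illustrative assumptions, not code from this PR; the recipe test mirrors the `.delayed()` check used in the suggested change above.

```python
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager


def _ub_overlap_supported() -> bool:
    """Hypothetical helper: True only for recipes known to support
    comm+GEMM (userbuffers) overlap -- currently FP8 delayed scaling."""
    return FP8GlobalStateManager.get_fp8_recipe().delayed()


def check_ub_overlap(fp8: bool, ub_overlap_requested: bool) -> None:
    """Raise if UB overlap is requested under an unsupported recipe."""
    # New quantization schemes remain unsupported until they are explicitly
    # added to _ub_overlap_supported(), instead of being allowed by default.
    if fp8 and ub_overlap_requested and not _ub_overlap_supported():
        raise NotImplementedError(
            "Comm+GEMM overlap is only supported with FP8 delayed scaling"
        )
```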
transformer_engine/pytorch/module/layernorm_linear.py (outdated review thread, resolved)
transformer_engine/pytorch/module/layernorm_linear.py (outdated review thread, resolved)
@timmoon10 timmoon10 self-requested a review January 28, 2025 22:16
@timmoon10 timmoon10 self-requested a review January 28, 2025 23:44
Collaborator

@timmoon10 timmoon10 left a comment


Overall looks reasonable, although I have some stylistic suggestions. This is fine in our last-minute scramble to restore UB support with FP8. Next we will need to think about extending it to support MXFP8 and other quantization schemes.

transformer_engine/pytorch/cpp_extensions/gemm.py (outdated review thread, resolved)
Comment on lines 112 to 125
# Prepare input tensor
# Note: Cast to expected dtype and perform tensor-parallel communication
inputmat = inp
inputmat_total = None
with_input_all_gather = parallel_mode == "column" and sequence_parallel
with_input_all_gather_nccl = (
    parallel_mode == "column" and sequence_parallel and not ub_overlap_ag_fprop
)
own_quantized_input = False
if fp8:
    if any([ub_overlap_ag_fprop, ub_overlap_rs_fprop]) and isinstance(
        FP8GlobalStateManager.get_fp8_recipe(), BlockScaling
    ):
        raise NotImplementedError("Comm+GEMM overlap does not support MXFP8 block scaling")

Collaborator

Same as #1427 (comment):

Suggested change
-# Prepare input tensor
-# Note: Cast to expected dtype and perform tensor-parallel communication
-inputmat = inp
-inputmat_total = None
-with_input_all_gather = parallel_mode == "column" and sequence_parallel
-with_input_all_gather_nccl = (
-    parallel_mode == "column" and sequence_parallel and not ub_overlap_ag_fprop
-)
-own_quantized_input = False
-if fp8:
-    if any([ub_overlap_ag_fprop, ub_overlap_rs_fprop]) and isinstance(
-        FP8GlobalStateManager.get_fp8_recipe(), BlockScaling
-    ):
-        raise NotImplementedError("Comm+GEMM overlap does not support MXFP8 block scaling")
+# Check if overlapped communication is supported
+if (
+    fp8
+    and (ub_overlap_ag_fprop or ub_overlap_rs_fprop)
+    and not FP8GlobalStateManager.get_fp8_recipe().delayed()
+):
+    raise NotImplementedError("Comm+GEMM overlap is only supported with FP8 delayed scaling")
+# Prepare input tensor
+# Note: Cast to expected dtype and perform tensor-parallel communication
+inputmat = inp
+inputmat_total = None
+with_input_all_gather_nccl = (
+    parallel_mode == "column" and sequence_parallel and not ub_overlap_ag_fprop
+)
+own_quantized_input = False
+if fp8:

transformer_engine/pytorch/module/linear.py (outdated review thread, resolved)
tests/pytorch/distributed/run_gemm_with_overlap.py (outdated review thread, resolved)
transformer_engine/pytorch/module/layernorm_mlp.py (outdated review thread, resolved)
transformer_engine/pytorch/module/layernorm_mlp.py (outdated review thread, resolved)
transformer_engine/pytorch/module/layernorm_mlp.py (outdated review thread, resolved)
denera and others added 4 commits January 30, 2025 07:06
…ests

Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman
Member

/te-ci pytorch

ksivaman and others added 9 commits February 3, 2025 04:08
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Member

@ksivaman ksivaman left a comment


LGTM pending CI; 23439789

@timmoon10 timmoon10 merged commit d715c83 into NVIDIA:release_v2.0 Feb 4, 2025
11 checks passed