[PyTorch] Stop storing fused weight tensor in linear modules #719
Conversation
Stop storing fused buffers in linear modules. Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci pytorch
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
block = te.Linear(...).to(dtype=dtype).cuda()
I see the existing implementation reallocates the fused weight tensor correctly:
It is strange to me that the tests were previously passing with bit-wise exact accuracy though. I wonder if initializing the weights directly in FP16/BF16 instead of initializing in FP32 and casting resulted in tensors that were not bit-wise identical.
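For context, here is a minimal sketch of the two construction paths being compared, assuming `params_dtype` is the TE option that sets the parameter dtype; the exact initialization order inside TE may differ:

```python
import torch
import transformer_engine.pytorch as te

# Construct the module directly in the target dtype, as the updated tests do.
torch.manual_seed(0)
native = te.Linear(128, 128, params_dtype=torch.bfloat16).cuda()

# Previous test pattern: initialize in FP32, then cast the whole module.
torch.manual_seed(0)
cast = te.Linear(128, 128).to(dtype=torch.bfloat16).cuda()

# The parameters are not guaranteed to be bit-wise identical, since rounding
# to BF16 can happen at different points (during init vs. once when casting).
for p_native, p_cast in zip(native.parameters(), cast.parameters()):
    print(torch.equal(p_native.data, p_cast.data))
```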
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci pytorch
Signed-off-by: Tim Moon <tmoon@nvidia.com>
/te-ci pytorch
/te-ci pytorch
The failed test is unrelated.
Anything holding up this being merged?
/te-ci pytorch
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
/te-ci pytorch
/te-ci pytorch
* Support noop concat without providing full tensor: stop storing fused buffers in linear modules.
* Debug noop cat func
* Construct TE modules in tests with correct dtypes
* Add tolerances to numerical tests
* Use plain PyTorch concat when exporting to ONNX

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
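One of the listed changes, falling back to a plain PyTorch concat during ONNX export, could look roughly like the sketch below; the helper name and structure are illustrative, not the actual TE code:

```python
import torch

def cat_param_views(tensors, dim=0):
    # While tracing for ONNX export, skip any pointer/stride inspection and
    # emit an ordinary concat so the exported graph stays simple.
    if torch.onnx.is_in_onnx_export():
        return torch.cat(tensors, dim=dim)
    # Otherwise, a fused-buffer check (see the PR description below) could be
    # used to skip the copy when the inputs already form one contiguous buffer.
    return torch.cat(tensors, dim=dim)
```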
We've encountered excessive memory usage with FSDP-like workflows because `Linear` and `LayerNormLinear` sometimes store a tensor for fused weights. The weight params may be manipulated externally, e.g. to deallocate them or make them views into an all-gather buffer, but the modules hold on to the original buffers and prevent any memory savings. This PR updates the `_noop_cat` utility function so that it no longer requires the full output tensor, but can figure things out by inspecting the pointers and strides. There is one functional difference: `Linear` and `LayerNormLinear` no longer make any attempt to readjust the params if they are not contiguous, but will just concatenate as needed. Besides, if FSDP is misconfigured it will just wipe out the existing buffers and create new misaligned weights in every forward pass, so there's no hope that TE can repair the situation.

This is related to #570, which removed the fused weight tensors in cases without split weight params.
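The pointer-and-stride inspection can be illustrated with a small self-contained check. This is a simplified sketch for concatenation along dim 0 only, not the actual `_noop_cat` implementation:

```python
import torch

def is_noop_cat(tensors):
    """Return True if concatenating `tensors` along dim 0 would be a no-op,
    i.e. they are already contiguous, adjacent views of a single buffer."""
    if not all(t.is_contiguous() for t in tensors):
        return False
    expected_ptr = tensors[0].data_ptr()
    for t in tensors:
        if t.data_ptr() != expected_ptr:
            return False
        expected_ptr += t.numel() * t.element_size()
    return True

# Views produced by splitting one buffer need no concat; fresh copies do.
full = torch.zeros(6, 4)
parts = torch.split(full, [2, 4], dim=0)
print(is_noop_cat(parts))                       # True: adjacent views of one buffer
print(is_noop_cat([p.clone() for p in parts]))  # typically False: separate buffers
```

When such a check passes, the module can keep using the params' existing storage as the fused weight instead of allocating and holding a separate buffer, which is what lets FSDP-style workflows actually reclaim the memory.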