Add NVTX ranges to categorize execution #1447

minitu · 2025-01-31T23:50:56Z

Description

Adds NVTX ranges to categorize different parts of the execution.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

minitu · 2025-02-01T00:08:45Z

Please review @timmoon10

timmoon10

LGTM. I have some very minor renaming suggestions. We can merge once we fix merge conflicts and the TE main branch has been reconciled with release_v2.0.

For future reference, my comments on an earlier version of this PR: The overall design is delicate, but I can't think of a better approach. The torch.cuda.nvtx.range context is nicer than range_push/range_pop, but its CPU overhead is too high (1.7 us compared to 0.5 us).
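To make the overhead trade-off above concrete, here is a minimal sketch of the two styles being compared. It makes no claim about how the numbers quoted above were measured. The `range_push` and `range_pop` functions below are no-op stand-ins for `torch.cuda.nvtx.range_push` and `torch.cuda.nvtx.range_pop` so the snippet runs without a GPU, and `nvtx_range`, `with_context`, `with_push_pop`, and `seconds_per_call` are hypothetical names introduced only for illustration. The absolute timings will not match the figures above; the sketch only shows the extra work a generator-based context manager does on each entry and exit.

```python
import timeit
from contextlib import contextmanager


# No-op stand-ins for torch.cuda.nvtx.range_push / range_pop (assumption:
# swapped in so the sketch runs on a machine without CUDA or PyTorch).
def range_push(msg: str) -> None:
    pass


def range_pop() -> None:
    pass


@contextmanager
def nvtx_range(msg: str):
    """Context-manager style: push on entry, pop on exit, even on error."""
    range_push(msg)
    try:
        yield
    finally:
        range_pop()


def with_context() -> None:
    # Convenient and exception-safe, but builds a generator on every use.
    with nvtx_range("forward"):
        pass


def with_push_pop() -> None:
    # Two plain function calls; the caller must keep them balanced.
    range_push("forward")
    range_pop()


def seconds_per_call(fn, iters: int = 100_000) -> float:
    """Average wall-clock time of a single call to fn."""
    return timeit.timeit(fn, number=iters) / iters


if __name__ == "__main__":
    ctx = seconds_per_call(with_context)
    pair = seconds_per_call(with_push_pop)
    print(f"context manager: {ctx * 1e6:.2f} us, push/pop: {pair * 1e6:.2f} us")
```

Since the explicit pair gives up the automatic cleanup of the `with` block, unbalanced pushes and pops become possible, which is why some consistency checking on the pop side is useful.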

transformer_engine/pytorch/module/layernorm_linear.py

transformer_engine/pytorch/module/linear.py

minitu · 2025-02-06T22:02:37Z

@timmoon10 Sounds good, thanks!

timmoon10 · 2025-02-06T22:24:42Z

I've added an option to nvtx_range_pop to explicitly specify the NVTX range, and it will throw an error if the range being popped is not the expected one. This should help catch mismatched and mangled NVTX ranges. In the future, TE should replace its existing NVTX markers with these utility functions.

Signed-off-by: Jaemin Choi <jaeminc@nvidia.com> Signed-off-by: Tim Moon <tmoon@nvidia.com> Co-authored-by: Jaemin Choi <jaeminc@nvidia.com> Co-authored-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2025-02-07T23:52:21Z

/te-ci pytorch

minitu force-pushed the llama31_automated_breakdown branch from 58fa25a to 9e5abb5 Compare February 1, 2025 00:06

timmoon10 reviewed Feb 6, 2025

timmoon10 force-pushed the llama31_automated_breakdown branch from 9e21a84 to 2914da2 Compare February 6, 2025 22:19

timmoon10 force-pushed the llama31_automated_breakdown branch from 2914da2 to 54db46e Compare February 7, 2025 23:45

timmoon10 changed the base branch from release_v2.0 to main February 7, 2025 23:45

Add NVTX ranges to categorize execution

908461a

timmoon10 force-pushed the llama31_automated_breakdown branch from 54db46e to 908461a Compare February 7, 2025 23:51

timmoon10 approved these changes Feb 7, 2025
