clean CP implementation for flash attention and cuDNN 9.6 #1387

xrennvidia · 2024-12-30T19:10:03Z

Description

Flash attention: only use varlen kernels for the THD format.
cuDNN >= 9.6: fix the implementation for the packed format of softmax_lse, enable THD+GQA, and enable SWA with KV all_gather communication.
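
For readers unfamiliar with the THD (packed, variable-length) layout mentioned above, here is a minimal sketch of the cumulative-sequence-length bookkeeping that varlen kernels typically rely on. It is plain Python rather than code from this PR, and the helper names `build_cu_seqlens` and `split_packed` are hypothetical, chosen only for illustration. The example includes a 0-length sequence, the case one of the commits below adds to the CP THD test.

```python
from itertools import accumulate


def build_cu_seqlens(seqlens):
    """Return cumulative offsets [0, l0, l0+l1, ...] for a packed batch.

    Hypothetical helper: sequence i occupies tokens
    cu_seqlens[i] .. cu_seqlens[i + 1] - 1 of the packed tensor.
    """
    return [0] + list(accumulate(seqlens))


def split_packed(tokens, cu_seqlens):
    """Slice a packed (THD-style) token list back into per-sequence lists."""
    return [tokens[start:end] for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


seqlens = [3, 0, 2]                      # the middle sequence is empty
cu_seqlens = build_cu_seqlens(seqlens)   # offsets into the packed token dimension
packed = list(range(sum(seqlens)))       # stand-in for the "T" (total tokens) axis
sequences = split_packed(packed, cu_seqlens)
```

An empty sequence shows up as two equal consecutive offsets, so code that consumes `cu_seqlens` must accept zero-length slices.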

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactor

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

xrennvidia · 2024-12-30T21:59:29Z

/te-ci pytorch L1

cyanguwa

LGTM

transformer_engine/pytorch/attention.py

cyanguwa · 2025-01-06T12:03:56Z

/te-ci pytorch L0

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

xrennvidia · 2025-01-07T22:18:20Z

/te-ci pytorch L1

xrennvidia and others added 21 commits November 28, 2024 19:50

make the pad_between_seqs check ignore padding at the end

057de06

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

change CP THD test to make it consider 0-length sequence

4bcc1ce

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

minor change to flash func name

62af46f

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

only use varlen func of flash attention while qkv_format is THD

fdc83fe

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

try to converge code of flash and fused attentions

a7e14bf

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

Merge branch 'main' into xren/cp_optim

cc2f3bc

Merge branch 'main' into xren/cp_optim

6041e8d

Merge branch 'main' into xren/cp_optim

685b08c

Merge branch 'main' into xren/cp_optim

d69198a

Merge branch 'main' into xren/cp_optim

7157ad7

Merge branch 'main' into xren/cp_optim

2bccdd0

fix bwd compute with P2P

0e62ee1

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

remove redundant out_per_step view

1706ec4

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

enable cudnn>9.6 and THD+GQA

620f86c

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

enable CP with FusedAttn+SWA+All_Gather

9ca6f8d

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

enable CP with FusedAttn+SWA+All_Gather

7131139

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

code cleaning for cu_seqlens

da50f5b

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d9443d4

for more information, see https://pre-commit.ci

fix some pylint errors

09b4f4f

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

minor import change for pylint

1b0eac9

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

more fix for pylint

0dc2553

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

xrennvidia requested a review from cyanguwa December 31, 2024 08:19

cyanguwa approved these changes Jan 6, 2025

transformer_engine/pytorch/attention.py (review comment resolved)

xrennvidia added 2 commits January 7, 2025 13:56

fix lse_seqlen in thd out correction

bdedcb8

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

Merge branch 'main' into xren/cp_optim

2722165

Merge branch 'main' into xren/cp_optim

86109a6

xrennvidia merged commit 560bccf into NVIDIA:main Jan 8, 2025
14 checks passed

xrennvidia deleted the xren/cp_optim branch January 8, 2025 18:13
