[PyTorch] upgrade context parallelism implementations #572

Merged
cyanguwa merged 52 commits into NVIDIA:main from xren/cp_with_fused_attn
Jan 10, 2024

Conversation

xrennvidia
Collaborator

  1. Port the cuDNN Flash Attn API to the CP implementation
  2. Support both unidirectional and bidirectional attention
  3. Make the CP implementation work with window_size values of [-1, -1] and [-1, 0] (sketched below)
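In the flash-attn convention, window_size = (left, right) bounds how far each query can attend backward and forward, with -1 meaning unbounded, so [-1, -1] is full bidirectional attention and [-1, 0] is causal attention. A minimal standalone sketch of that masking rule (an illustration of the convention, not this PR's code):

import torch

def window_mask(seq_len, window_size=(-1, -1)):
    # True = attention allowed; window_size = (left, right), -1 = unbounded.
    left, right = window_size
    q_idx = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    k_idx = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    allow = torch.ones(seq_len, seq_len, dtype=torch.bool)
    if left >= 0:
        allow &= (q_idx - k_idx) <= left    # limit how far back a query looks
    if right >= 0:
        allow &= (k_idx - q_idx) <= right   # limit how far ahead a query looks
    return allow

# [-1, -1] keeps everything visible (bidirectional); [-1, 0] is lower-triangular (causal).
assert window_mask(4, (-1, -1)).all()
assert torch.equal(window_mask(4, (-1, 0)), torch.tril(torch.ones(4, 4, dtype=torch.bool)))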

Signed-off-by: xren <xren@nvidia.com>
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
@cyanguwa
Collaborator

cyanguwa commented Jan 5, 2024

/te-ci pytorch

q_inputs[i%2], kv_inputs[i%2][0], kv_inputs[i%2][1], TE_DType[q.dtype],
tex.NVTE_Fused_Attn_Backend.NVTE_F16_arbitrary_seqlen,
attn_scale=softmax_scale, dropout=dropout_p,
qkv_layout="bshd_bshd_bshd", attn_mask_type="causal",
)
else:
Collaborator

FlashAttention and FusedAttention might not be mutually exclusive; they could both be True or both be False. Maybe it's better to pass another flag, use_flash_attention, in here?
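A minimal standalone sketch of the suggestion, with a hypothetical helper name, just to show the two flags being treated independently (not the actual function signature in this PR):

def select_attn_backend(use_fused_attention: bool, use_flash_attention: bool) -> str:
    # Hypothetical helper: both flags are explicit inputs, so "not fused" is not
    # silently interpreted as "flash".
    if use_fused_attention:
        return "fused"    # cuDNN fused attention path
    if use_flash_attention:
        return "flash"    # flash-attn path
    return "unfused"      # framework-native fallback

assert select_attn_backend(False, False) == "unfused"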

@cyanguwa
Collaborator

cyanguwa commented Jan 6, 2024

/te-ci pytorch

@Infi-zc

Infi-zc commented Jan 8, 2024

Hi @xrennvidia! Thanks for the contributions to context parallelism. I have been using context parallelism, and it has effectively addressed the issue of training with excessively long contexts. However, since handling input sequences with variable-length sub-sequences (called varlen in flash_attn) natively could make attention computation more efficient, we are also interested in training with varlen and context parallelism. May I ask whether there are any plans to support context parallelism with variable sequence lengths?

@Infi-zc

Infi-zc commented Jan 8, 2024

> Hi @xrennvidia! Thanks for the contributions to context parallelism. I have been using context parallelism, and it has effectively addressed the issue of training with excessively long contexts. However, since handling input sequences with variable-length sub-sequences (called varlen in flash_attn) natively could make attention computation more efficient, we are also interested in training with varlen and context parallelism. May I ask whether there are any plans to support context parallelism with variable sequence lengths?

Hi @cyanguwa, I was wondering whether there are any plans to support variable-length sequences in context parallelism within Transformer Engine. I'm interested to know whether it's on the roadmap. Thanks!

@cyanguwa
Collaborator

cyanguwa commented Jan 8, 2024

> Hi @xrennvidia! Thanks for the contributions to context parallelism. I have been using context parallelism, and it has effectively addressed the issue of training with excessively long contexts. However, since handling input sequences with variable-length sub-sequences (called varlen in flash_attn) natively could make attention computation more efficient, we are also interested in training with varlen and context parallelism. May I ask whether there are any plans to support context parallelism with variable sequence lengths?
>
> Hi @cyanguwa, I was wondering whether there are any plans to support variable-length sequences in context parallelism within Transformer Engine. I'm interested to know whether it's on the roadmap. Thanks!

I think there is some plan to add variable-length support on the cuDNN side, but I don't think there is a definitive timeline at the moment.

@xrennvidia
Collaborator Author

xrennvidia commented Jan 8, 2024

> Hi @xrennvidia! Thanks for the contributions to context parallelism. I have been using context parallelism, and it has effectively addressed the issue of training with excessively long contexts. However, since handling input sequences with variable-length sub-sequences (called varlen in flash_attn) natively could make attention computation more efficient, we are also interested in training with varlen and context parallelism. May I ask whether there are any plans to support context parallelism with variable sequence lengths?

Happy to know that CP is helpful for you :)

We still do not have a concrete plan to support variable sequence lengths. I think you mean the thd format, right? That format does not have a sequence dimension, which makes it difficult to do sequence partitioning. Also, if the sequence length varies across inputs, it is very hard to partition sequences with load balancing.

Will consider this anyway, but there is no expected ETA.
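For readers unfamiliar with the layouts, a small illustration (mine, not Transformer Engine code) of why bshd shards cleanly along the sequence dimension while thd does not:

import torch

# bshd layout: [batch, seq, heads, head_dim]; every sample is padded to the same
# sequence length, so a CP rank can take an equal slice of the seq dimension.
bshd = torch.zeros(2, 8, 4, 16)
rank0_shard = bshd[:, :4]  # first half of every sequence

# thd layout: tokens of all samples packed into one dimension [tokens, heads,
# head_dim], with per-sample boundaries given by cu_seqlens (lengths 5 and 3 here).
thd = torch.zeros(8, 4, 16)
cu_seqlens = torch.tensor([0, 5, 8])
# Cutting thd in half (thd[:4]) would give a rank only part of sample 0 and none of
# sample 1, so both the per-sample split and the per-rank compute become unbalanced.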

@cyanguwa
Collaborator

cyanguwa commented Jan 9, 2024

/te-ci pytorch

Collaborator

@cyanguwa cyanguwa left a comment

Tentative CI passed: 79121842. LGTM.

@Infi-zc

Infi-zc commented Jan 10, 2024

> Hi @xrennvidia! Thanks for the contributions to context parallelism. I have been using context parallelism, and it has effectively addressed the issue of training with excessively long contexts. However, since handling input sequences with variable-length sub-sequences (called varlen in flash_attn) natively could make attention computation more efficient, we are also interested in training with varlen and context parallelism. May I ask whether there are any plans to support context parallelism with variable sequence lengths?
>
> Happy to know that CP is helpful for you :)
>
> We still do not have a concrete plan to support variable sequence lengths. I think you mean the thd format, right? That format does not have a sequence dimension, which makes it difficult to do sequence partitioning. Also, if the sequence length varies across inputs, it is very hard to partition sequences with load balancing.
>
> Will consider this anyway, but there is no expected ETA.

Got it, thanks for the reply!

@Infi-zc

Infi-zc commented Jan 10, 2024

> Hi @xrennvidia! Thanks for the contributions to context parallelism. I have been using context parallelism, and it has effectively addressed the issue of training with excessively long contexts. However, since handling input sequences with variable-length sub-sequences (called varlen in flash_attn) natively could make attention computation more efficient, we are also interested in training with varlen and context parallelism. May I ask whether there are any plans to support context parallelism with variable sequence lengths?
>
> Hi @cyanguwa, I was wondering whether there are any plans to support variable-length sequences in context parallelism within Transformer Engine. I'm interested to know whether it's on the roadmap. Thanks!
>
> I think there is some plan to add variable-length support on the cuDNN side, but I don't think there is a definitive timeline at the moment.

Got it, thanks for the reply!

@cyanguwa cyanguwa merged commit 94f54d7 into NVIDIA:main Jan 10, 2024
21 checks passed
# [b, 2, sq//2, np, hn] -> [b, sq, np, hn]
q_inputs[i%2] = q.view(q.shape[0], -1, *q.shape[-2:])
# [2, b, 2, sk//2, np, hn] -> [2, b, sk//2, np, hn]
kv_inputs[i%2] = kv_inputs[i%2][:, :, 0, ...].contiguous()

Could you please explain why only half of the KV sequence from the neighbor rank is used when i <= rank? Is it correct that we are missing the other half of kv_inputs?
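For what it's worth, a small standalone illustration of the reasoning, under two assumptions that are not stated in this excerpt: each rank r holds global chunks r and 2*cp_size-1-r after the load-balancing reorder, and at ring step i the locally held KV originated from rank (r - i) mod cp_size. Under those assumptions, when 0 < i <= rank the second KV chunk lies entirely after every local query token, so the causal mask would zero out its contribution anyway and dropping it loses nothing:

# Example values; any 0 < i <= rank behaves the same way.
cp_size, rank, i = 4, 2, 1
src = (rank - i) % cp_size                    # rank whose KV arrived at this step
q_chunks = [rank, 2 * cp_size - 1 - rank]     # local query chunks: [2, 5]
kv_chunks = [src, 2 * cp_size - 1 - src]      # received KV chunks:  [1, 6]

# Chunk-level causal rule: a query chunk attends a strictly earlier KV chunk with
# no extra mask, and cannot attend a strictly later KV chunk at all.
assert all(kv_chunks[0] < qc for qc in q_chunks)  # first KV half: keep, no mask
assert all(kv_chunks[1] > qc for qc in q_chunks)  # second KV half: fully masked, drop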

@xrennvidia xrennvidia deleted the xren/cp_with_fused_attn branch January 10, 2024 20:03

if causal:
# [b, s, np, hn] -> [b, 2, s//2, np, hn]
q, k, v = [x.view(x.shape[0], 2, x.shape[1]//2, *x.shape[2:]) for x in [q, k, v]]

Hi @xrennvidia, are you trying to load-balance causal attention att(Q, K_i, V_i)? If so, I have done this for both fwd and bwd with a Triton kernel, and there is no need for users to handle it outside of the kernels:

https://github.com/yiakwy-xpu-ml-framework-team/triton/blob/add_support_flash_attention_v3/python/triton/ops/flash_attention.py
