
Add flex attention backend #203

Open
Leymore wants to merge 4 commits into main
Conversation

@Leymore commented Mar 3, 2025

This PR implements NATTEN using FlexAttention.

The provided interfaces are flex_na1d, flex_na2d, and flex_na3d, which can replace na1d, na2d, and na3d, but support only the kernel_size, dilation, and is_causal arguments.

Since flex_attention requires kernel compilation, the first run may take longer than usual.

Usage

import torch
from natten.flex import flex_na2d

# NATTEN layout: (batch, height, width, heads, head_dim).
batch_size, image_height, image_width, num_head, head_dim = 1, 64, 64, 8, 64
query = torch.randn(batch_size, image_height, image_width, num_head, head_dim, device='cuda')
key = torch.randn(batch_size, image_height, image_width, num_head, head_dim, device='cuda')
value = torch.randn(batch_size, image_height, image_width, num_head, head_dim, device='cuda')

# Output has the same shape as the query.
output = flex_na2d(query, key, value, kernel_size=11, dilation=1, is_causal=False)

Bug

  • When head_dim=64, num_tokens<=32, the dtype is torch.float16 or torch.bfloat16, and torch.compile(flex_attention) is used, flex_natten produces NaN gradients during backward propagation, even though the forward results remain correct (see the reproduction sketch below).
    • This is a bug in the flex_attention kernel, but the exact trigger conditions have not been fully identified yet. The issue will be reported to the PyTorch team.
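
For reference, a minimal reproduction sketch of the reported case (the exact shapes and the sum-reduction used for backward are assumptions; only the conditions head_dim=64, num_tokens<=32, half precision, and compiled flex_attention come from the report):

import torch
from torch.nn.attention.flex_attention import flex_attention

# Hypothetical reproduction: head_dim=64, num_tokens<=32, half precision,
# compiled flex_attention. Shapes are (batch, heads, seq_len, head_dim).
compiled_flex_attention = torch.compile(flex_attention)

batch, heads, seq_len, head_dim = 1, 8, 32, 64
query, key, value = (
    torch.randn(batch, heads, seq_len, head_dim, device='cuda', dtype=torch.float16, requires_grad=True)
    for _ in range(3)
)

output = compiled_flex_attention(query, key, value)  # forward output is reported to be correct
output.sum().backward()
print(torch.isnan(query.grad).any())  # reportedly True under these conditions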

TODO

  • Optimize the dilation implementation (a hedged sketch of one possible mask-based formulation follows below).
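
As a reference for that item, here is a hedged sketch of how a dilated 1D neighborhood can be expressed as a flex_attention mask_mod. The edge handling is simplified (NATTEN shifts the window near sequence boundaries so every query keeps exactly kernel_size neighbors), and this is an illustration, not the implementation in this PR:

import torch
from torch.nn.attention.flex_attention import create_block_mask

def na1d_mask_mod(kernel_size, dilation, is_causal):
    # A query attends to keys in the same dilation group whose distance is
    # within half the (dilated) kernel. Boundary window shifting is omitted.
    half = (kernel_size // 2) * dilation

    def mask_mod(b, h, q_idx, kv_idx):
        same_group = (q_idx % dilation) == (kv_idx % dilation)
        in_window = (q_idx - kv_idx).abs() <= half
        keep = same_group & in_window
        if is_causal:
            keep = keep & (kv_idx <= q_idx)
        return keep

    return mask_mod

# Example: block mask for 4096 tokens, kernel_size=11, dilation=2.
block_mask = create_block_mask(
    na1d_mask_mod(kernel_size=11, dilation=2, is_causal=False),
    B=None, H=None, Q_LEN=4096, KV_LEN=4096, device='cuda',
)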

@alihassanijr (Member) commented Mar 3, 2025

Left a few comments, mostly nits. Never mind; I had to fix a few things in the unit tests, and applied those changes myself.

Leftover items:

  • Changelog and documentation
  • Verify whether we still need Python's native call cache; torch.compile definitely has its own, so the Python-side one may just end up being redundant.

@Leymore (Author) commented Mar 4, 2025

Updated. It looks like the wrapper around the compiled flex_attention can be removed, while the one around the flex mask cannot: the create_block_mask function is not simply compiling something, so torch.compile's cache does not cover it.
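
To make the caching point concrete, a hedged sketch (not this PR's actual code; the function name and cache key are assumptions): torch.compile already caches the compiled attention kernel, but create_block_mask builds a fresh BlockMask on every call, so a Python-side cache keyed on the mask parameters is still useful.

import functools
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# The compiled kernel is cached by torch.compile itself; no extra wrapper needed.
compiled_flex_attention = torch.compile(flex_attention)

@functools.lru_cache(maxsize=None)
def cached_na1d_block_mask(seq_len, kernel_size, is_causal, device='cuda'):
    # Cache the BlockMask by its defining parameters instead of rebuilding it
    # on every forward pass. (Simplified 1D neighborhood rule, no dilation.)
    half = kernel_size // 2

    def mask_mod(b, h, q_idx, kv_idx):
        keep = (q_idx - kv_idx).abs() <= half
        if is_causal:
            keep = keep & (kv_idx <= q_idx)
        return keep

    return create_block_mask(mask_mod, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, device=device)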
