Quantized attention that achieves speedups of 2-3x over FlashAttention and 3-5x over xformers, without losing end-to-end metrics across language, image, and video models.
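For context, here is a minimal PyTorch sketch of the general idea: quantize Q and K to INT8 before the QKᵀ matmul, then fold the quantization scales back into the score matrix. The per-tensor symmetric scheme and function names are illustrative assumptions; this repository's actual kernels are CUDA/Triton and use a more elaborate quantization scheme.

```python
import torch

def quantize_int8(x: torch.Tensor):
    # Per-tensor symmetric INT8 quantization: map max |x| to 127.
    # (Illustrative; real kernels typically quantize per block.)
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = (x / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def quantized_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    # Quantize Q and K to INT8; an optimized kernel would run the
    # QK^T matmul on integer tensor cores. Emulated in float here.
    q_i8, sq = quantize_int8(q)
    k_i8, sk = quantize_int8(k)
    # Dequantize by folding both scales into the attention scores.
    scores = (q_i8.float() @ k_i8.float().transpose(-2, -1)) * (sq * sk)
    scores = scores / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Sanity check against full-precision attention.
q, k, v = (torch.randn(8, 128, 64) for _ in range(3))
ref = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, dim=-1) @ v
out = quantized_attention(q, k, v)
print((out - ref).abs().max())  # small error despite 8-bit Q/K
```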
cuda triton attention vit quantization video-generation mlsys inference-acceleration efficient-attention llm llm-infra video-generate
Updated Apr 21, 2025 - Cuda