jp6/cu128/: sageattention-2.1.1 metadata and description


Accurate and efficient plug-and-play low-bit attention.

author: SageAttention team
description_content_type: text/markdown
license: Apache 2.0
requires_python: >=3.9

Because this project isn't in the mirror_whitelist, no releases from root/pypi are included.

File: sageattention-2.1.1-cp312-cp312-linux_aarch64.whl
Size: 10 MB
Type: Python Wheel
Python: 3.12

SageAttention

This repository provides the official implementation of SageAttention and SageAttention2, which achieve significant speedups on most GPUs without losing accuracy across models, in a plug-and-play way.

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
Paper: https://arxiv.org/abs/2410.02367
Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, Jianfei Chen

SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
Paper: https://arxiv.org/abs/2411.10958
Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen


Current Features

Project Updates

FlashAttention2    FlashAttention3    FlashAttention3-FP8    SageAttention
25'34''            17'32''            12'14''                12'07''

Results for CogVideoX1.5-5B on NVIDIA H20 GPU


Installation

Base environment

Install Package

For the stable Triton-only version, refer to SageAttention-1 and install using pip:

pip install sageattention==1.0.6

To use SageAttention 2.1.1, please compile from source:

git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
python setup.py install  # or pip install -e .
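
After the build completes, a quick import check (a minimal sanity check, not part of the upstream instructions) confirms that the compiled extension loads:

python -c "from sageattention import sageattn; print(sageattn)"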

To benchmark the speed against FlashAttention3, please compile FlashAttention3 from source:

git clone https://github.com/Dao-AILab/flash-attention.git --recursive
cd flash-attention
git checkout b7d29fb3b79f0b78b1c369a52aaa6628dabfb0d7 # 2.7.2 release
cd hopper
python setup.py install

How to Use

from sageattention import sageattn
attn_output = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
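
A minimal runnable sketch (the tensor shapes below are illustrative; sageattn expects FP16/BF16 CUDA tensors, and "HND" means (batch, num_heads, seq_len, head_dim)):

import torch
from sageattention import sageattn

# Illustrative shapes: batch=2, heads=16, seq_len=1024, head_dim=64 ("HND" layout).
q = torch.randn(2, 16, 1024, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 16, 1024, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 16, 1024, 64, dtype=torch.float16, device="cuda")

attn_output = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(attn_output.shape)  # (2, 16, 1024, 64)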

Available APIs:

For optimal speed and accuracy on custom devices and models, we strongly recommend referring to this file for detailed guidance.

Note: Support for different sequence lengths between q and k,v and group-query attention is available.
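
A hedged sketch of these two cases (the head counts and sequence lengths are illustrative, and we assume the number of query heads is a multiple of the number of key/value heads):

import torch
from sageattention import sageattn

# Grouped-query attention: 32 query heads share 8 key/value heads,
# and the key/value sequence is longer than the query sequence.
q = torch.randn(1, 32, 1024, 128, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 4096, 128, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 4096, 128, dtype=torch.float16, device="cuda")

out = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
print(out.shape)  # expected: (1, 32, 1024, 128)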

Plug-and-play Example

We can replace scaled_dot_product_attention easily. Take CogVideoX as an example:

Add the following code and run:

import torch.nn.functional as F

+ from sageattention import sageattn
+ F.scaled_dot_product_attention = sageattn

Specifically,

cd example
python cogvideox-2b.py --compile --attention_type sage

This produces a video in ./example with no loss in quality, faster than running python cogvideox-2b.py --compile alone. More examples and guidance can be found under the example/ directory.

Note: Not all models work with F.scaled_dot_product_attention = sageattn. Strictly speaking, you should replace the original attention by modifying the attention class of the target model. For image and video models, we suggest replacing only the attention in the DiT (see example/mochi.py for details).
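
As a hedged alternative to the blanket patch above, the sketch below (the wrapper name sdpa_with_sage and its fallback policy are our own, not part of the library) routes only the plain case to sageattn and keeps PyTorch's SDPA whenever a mask, dropout, or extra arguments are passed:

import torch.nn.functional as F
from sageattention import sageattn

_original_sdpa = F.scaled_dot_product_attention

def sdpa_with_sage(query, key, value, attn_mask=None, dropout_p=0.0,
                   is_causal=False, **kwargs):
    # Route the plain case to SageAttention; fall back to the original SDPA
    # whenever a mask, dropout, or any extra keyword is requested.
    if attn_mask is None and dropout_p == 0.0 and not kwargs:
        return sageattn(query, key, value, tensor_layout="HND", is_causal=is_causal)
    return _original_sdpa(query, key, value, attn_mask=attn_mask,
                          dropout_p=dropout_p, is_causal=is_causal, **kwargs)

F.scaled_dot_product_attention = sdpa_with_sage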

Kernel Benchmarking

We provide a benchmarking script to compare the speed of different kernels including SageAttention, FlashAttention2 and FlashAttention3. Please refer to the benchmark/ directory for more details.
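
The official scripts live under benchmark/; as a rough standalone sketch (shapes and iteration counts are illustrative, not taken from those scripts), kernel latency can be compared with CUDA events:

import torch
import torch.nn.functional as F
from sageattention import sageattn

def time_kernel(fn, *args, iters=100, warmup=20):
    # Average latency in milliseconds, measured with CUDA events after warmup.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

q = torch.randn(4, 32, 4096, 128, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

print("SDPA:", time_kernel(F.scaled_dot_product_attention, q, k, v), "ms")
print("Sage:", time_kernel(lambda a, b, c: sageattn(a, b, c, tensor_layout="HND"), q, k, v), "ms")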

Performance

Speed of Kernels

Here, 8+8 denotes a kernel with INT8 quantization for the $QK^\top$ matmul and FP8 quantization for the $PV$ matmul; 8+16 keeps INT8 for $QK^\top$ but computes $PV$ in FP16 with an FP16 accumulator.
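
For context, attention computes $O = PV$ where $P = \mathrm{softmax}\!\left(QK^\top/\sqrt{d}\right)$; the two matmuls named in these labels are exactly $QK^\top$ and $PV$.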


Note: The reported TOPS refer only to the attention kernel itself, excluding quantization and smoothing.

End-to-end Performance

End-to-End Accuracy:


End-to-End Speedup:


Citation

If you use this code or find our work valuable, please cite:

@inproceedings{zhang2025sageattention,
  title={SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration}, 
  author={Zhang, Jintao and Wei, Jia and Zhang, Pengle and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025}
}

@inproceedings{zhang2024sageattention2,
  title={Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization},
  author={Zhang, Jintao and Huang, Haofeng and Zhang, Pengle and Wei, Jia and Zhu, Jun and Chen, Jianfei},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}