jp6/cu124/: awq-0.1.0 metadata and description


An efficient and accurate low-bit weight quantization (INT3/4) method for LLMs.

classifiers
  • Programming Language :: Python :: 3
  • License :: OSI Approved :: Apache Software License
description_content_type text/markdown
requires_dist
  • accelerate
  • sentencepiece
  • tokenizers >=0.12.1
  • torch >=2.0.0
  • torchvision
  • transformers ==4.36.2
  • lm-eval ==0.3.0
  • texttable
  • toml
  • attributedict
  • protobuf
  • gradio ==3.35.2
  • gradio-client ==0.2.9
  • fastapi
  • uvicorn
  • pydantic ==1.10.14
requires_python >=3.8
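
Several of the dependencies above are pinned to exact versions, so it can help to compare an existing environment against them before installing. The following is a minimal sketch using only the standard library. The helper names (`version_key`, `satisfies`, `check_pins`, `read_installed`) are made up for illustration, and the version comparison is a simplification that handles plain dotted numbers rather than full PEP 440.

```python
from importlib import metadata

# Pins copied from the requires_dist list above (only the versioned entries).
PINS = {
    "tokenizers": (">=", "0.12.1"),
    "torch": (">=", "2.0.0"),
    "transformers": ("==", "4.36.2"),
    "lm-eval": ("==", "0.3.0"),
    "gradio": ("==", "3.35.2"),
    "gradio-client": ("==", "0.2.9"),
    "pydantic": ("==", "1.10.14"),
}


def version_key(version):
    """Turn '2.3.1+cu124' into (2, 3, 1). Simplified, not full PEP 440."""
    public = version.split("+")[0]  # drop a local version label such as +cu124
    key = []
    for piece in public.split("."):
        leading = ""
        for ch in piece:
            if not ch.isdigit():
                break
            leading += ch
        key.append(int(leading) if leading else 0)
    return tuple(key)


def satisfies(installed, op, required):
    """Check an installed version against a '==' or '>=' pin."""
    a, b = version_key(installed), version_key(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that 2.0 compares equal to 2.0.0
    b += (0,) * (width - len(b))
    return a == b if op == "==" else a >= b


def check_pins(installed_versions):
    """Return {name: message} for every pin that is missing or violated."""
    problems = {}
    for name, (op, required) in PINS.items():
        installed = installed_versions.get(name)
        if installed is None:
            problems[name] = f"not installed (need {op}{required})"
        elif not satisfies(installed, op, required):
            problems[name] = f"found {installed}, need {op}{required}"
    return problems


def read_installed(names):
    """Look up the installed version of each distribution, skipping absent ones."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    for name, message in check_pins(read_installed(PINS)).items():
        print(f"{name}: {message}")
```

Running the script prints one line per pin that the current environment does not meet and prints nothing when every pin is satisfied.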

Because this project isn't in the mirror_whitelist, no releases from root/pypi are included.

File awq-0.1.0-py3-none-any.whl
Size 108 KB
Type Python Wheel
Python 3

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

[Paper][Slides][Video]

Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.

The current release supports:

Thanks to AWQ, TinyChat can deliver more efficient responses with LLM/VLM chatbots through 4-bit inference.

TinyChat on RTX 4090: W4A16 is 3.4x faster than FP16

TinyChat on Orin: W4A16 is 3.2x faster than FP16

TinyChat also supports inference with vision language models (e.g., VILA, LLaVA). In the following examples, W4A16 quantized models from the VILA family are launched with TinyChat.

TinyChat with VILA on 4090

TinyChat with VILA on Orin

Check out TinyChat, which offers a turn-key solution for on-device inference of LLMs and VLMs on resource-constrained edge platforms. With TinyChat, it is now possible to run large models efficiently on small, low-power devices, even without an Internet connection.

Install

  1. Clone this repository and navigate to the AWQ folder
git clone https://github.com/mit-han-lab/llm-awq
cd llm-awq
  2. Install the package
conda create -n awq python=3.10 -y
conda activate awq
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install the efficient W4A16 (4-bit weight, 16-bit activation) CUDA kernel and the optimized FP16 kernels (e.g. layernorm, positional encodings).
cd awq/kernels
python setup.py install
  4. To run AWQ and TinyChat with the VILA-1.5 model family, install VILA:
git clone git@github.com:Efficient-Large-Model/VILA.git
cd VILA
pip install -e .

AWQ Model Zoo

We provide pre-computed AWQ search results for multiple model families, including LLaMA, OPT, Vicuna, and LLaVA. To get the pre-computed AWQ search results, run:

# git lfs install  # install git lfs if not already
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache

The detailed support list:

Models       Sizes                        INT4-g128  INT3-g128
VILA-1.5     3B/8B/13B/40B                ✅         ✅
Llama3       8B/70B                       ✅         ✅
VILA         7B/13B                       ✅
Llama2       7B/13B/70B                   ✅         ✅
LLaMA        7B/13B/30B/65B               ✅         ✅
OPT          125m/1.3B/2.7B/6.7B/13B/30B  ✅         ✅
CodeLlama    7B/13B/34B                   ✅         ✅
StarCoder    15.5B                        ✅         ✅
Vicuna-v1.1  7B/13B                       ✅
LLaVA-v0     13B                          ✅

Note: The table above lists only the models for which we have prepared AWQ search results. AWQ also supports other models, such as LLaVA-v1.5 7B, but you may need to run the AWQ search yourself to quantize them.

Examples

AWQ can be easily applied to various LMs thanks to its good generalization, including instruction-tuned models and multi-modal LMs. It provides an easy-to-use tool to reduce the serving cost of LLMs.

Here we provide two examples of applying AWQ, Vicuna-7B (chatbot) and LLaVA-13B (visual reasoning), under the ./examples directory. AWQ reduces the GPU memory needed for model serving and speeds up token generation. Its quantization is accurate, so the quantized models still produce sound reasoning outputs. You should be able to observe memory savings when running the models with 4-bit weights.
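
To get a feel for the size of the memory savings, the weight footprint can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes one FP16 scale and one 4-bit zero point per group of 128 weights (an illustrative layout, not necessarily the one the kernels use), and it counts only the quantized weights, ignoring activations, the KV cache, and any layers kept in FP16. The function names are hypothetical.

```python
def weight_gib(num_params, bits_per_weight):
    """Memory taken by the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def grouped_bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight once per-group metadata is amortized.

    scale_bits and zero_bits are assumptions made for illustration.
    """
    return w_bit + (scale_bits + zero_bits) / group_size


params = 7e9  # nominal parameter count of a "7B" model
fp16 = weight_gib(params, 16)
int4 = weight_gib(params, grouped_bits_per_weight())
print(f"FP16 weights:      {fp16:.2f} GiB")
print(f"INT4-g128 weights: {int4:.2f} GiB ({fp16 / int4:.2f}x smaller)")
```

Under these assumptions the per-group metadata adds only about 0.16 bits per weight, so the weights shrink by a little less than the ideal 4x.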

Note that we perform AWQ using only textual calibration data, even though we run on multi-modal input. Please refer to ./examples for details.

Usage

We provide several sample scripts to run AWQ (please refer to ./scripts). We use Llama3-8B as an example.

  1. Perform AWQ search and save search results (we already did it for you):
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --w_bit 4 --q_group_size 128 \
    --run_awq --dump_awq awq_cache/llama3-8b-w4-g128.pt
  2. Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo quantization)
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/llama3-8b-w4-g128.pt \
    --q_backend fake
  3. Generate real quantized weights (INT4)
mkdir quant_cache
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/llama3-8b-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/llama3-8b-w4-g128-awq.pt
  4. Load and evaluate the real quantized model (you should now see lower GPU memory usage)
python -m awq.entry --model_path /PATH/TO/LLAMA3/llama3-8b \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant_cache/llama3-8b-w4-g128-awq.pt
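
The evaluation step with `--q_backend fake` is described above as simulated pseudo quantization. As a rough illustration of the general idea, the sketch below rounds each group of weights to `w_bit`-bit integer codes and immediately maps them back to floats, so the model still runs in floating point while carrying the quantization error. This is a pure-Python sketch under stated assumptions, not the package's implementation: the function name `pseudo_quantize` is illustrative, it uses plain min/max asymmetric quantization, and it leaves out the activation-aware search that AWQ performs before quantizing.

```python
def pseudo_quantize(weights, w_bit=4, group_size=128):
    """Quantize each group to w_bit unsigned integers, then map back to floats."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    max_int = 2**w_bit - 1
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # One scale and one zero point per group; guard against a constant group.
        scale = max(hi - lo, 1e-5) / max_int
        zero = round(-lo / scale)
        for w in group:
            q = min(max(round(w / scale) + zero, 0), max_int)  # integer code
            out.append((q - zero) * scale)                      # dequantized value
    return out


# Toy data: 256 values in [-1, 1], split into two groups of 128.
weights = [((i * 37) % 101 - 50) / 50 for i in range(256)]
dequantized = pseudo_quantize(weights)
worst = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"worst-case error: {worst:.4f}")
```

Each group ends up with at most 16 distinct values at `w_bit=4`, which is why a smaller group size generally tracks the original weights more closely at the cost of more per-group metadata.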

Results on Vision-Language Models (VILA-1.5)

AWQ also seamlessly supports large multi-modal models (LMMs). We demonstrate the results on the recent VILA-1.5 model family.

VILA-1.5-3B   VQA-v2  GQA   VizWiz  ScienceQA  TextVQA  POPE  MME      MMBench  MMBench-CN  SEED
FP16          80.4    61.5  53.5    69.0       60.4     85.9  1442.4   63.4     52.7        60.9
AWQ-INT4      80.0    61.1  53.8    67.8       60.4     85.9  1437.3   63.3     51.4        59.8

VILA-1.5-8B   VQA-v2  GQA   VizWiz  ScienceQA  TextVQA  POPE  MME      MMBench  MMBench-CN  SEED
FP16          80.9    61.9  58.7    79.9       66.3     84.4  1577.01  72.3     66.2        64.2
AWQ-INT4      80.3    61.7  59.3    79.0       65.4     82.9  1593.65  71.0     64.9        64.0

VILA-1.5-13B  VQA-v2  GQA   VizWiz  ScienceQA  TextVQA  POPE  MME      MMBench  MMBench-CN  SEED
FP16          82.8    64.3  62.6    80.1       65.0     86.3  1569.55  74.9     66.3        65.1
AWQ-INT4      82.7    64.5  63.3    79.7       64.7     86.7  1531.35  74.7     66.7        65.1

VILA-1.5-40B  VQA-v2  GQA   VizWiz  ScienceQA  TextVQA  POPE  MME      MMBench  MMBench-CN  SEED
FP16          84.3    64.6  62.2    87.2       73.6     87.3  1726.82  82.4     80.2        69.1
AWQ-INT4      84.1    64.4  61.3    86.7       73.2     88.2  1714.79  83.2     79.6        68.9

Inference speed (tokens/sec)

Model                    Precision  A100   4090   Orin
VILA1.5-3B               fp16       104.6  137.6  25.4
VILA1.5-3B-AWQ           int4       182.8  215.5  42.5
VILA1.5-3B-S2            fp16       104.3  137.2  24.6
VILA1.5-3B-S2-AWQ        int4       180.2  219.3  40.1
Llama-3-VILA1.5-8B       fp16       74.9   57.4   10.2
Llama-3-VILA1.5-8B-AWQ   int4       168.9  150.2  28.7
VILA1.5-13B              fp16       50.9   OOM    6.1
VILA1.5-13B-AWQ          int4       115.9  105.7  20.6
VILA1.5-40B              fp16       OOM    OOM    --
VILA1.5-40B-AWQ          int4       57.0   OOM    --
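
The throughput numbers are easier to compare as int4-over-fp16 speedups. The short sketch below derives those ratios from the table. The figures are copied from the table as-is, the dictionary layout and the `speedups` helper are just one convenient way to organize them, and VILA1.5-40B is left out because it has no fp16 baseline to compare against.

```python
# Tokens/sec copied from the table above; None marks an OOM cell.
DEVICES = ("A100", "4090", "Orin")
SPEED = {
    "VILA1.5-3B":         {"fp16": (104.6, 137.6, 25.4), "int4": (182.8, 215.5, 42.5)},
    "VILA1.5-3B-S2":      {"fp16": (104.3, 137.2, 24.6), "int4": (180.2, 219.3, 40.1)},
    "Llama-3-VILA1.5-8B": {"fp16": (74.9, 57.4, 10.2),   "int4": (168.9, 150.2, 28.7)},
    "VILA1.5-13B":        {"fp16": (50.9, None, 6.1),    "int4": (115.9, 105.7, 20.6)},
}


def speedups(table):
    """Return {model: {device: int4/fp16 ratio}} for cells where both numbers exist."""
    result = {}
    for model, rows in table.items():
        ratios = {}
        for device, fp16, int4 in zip(DEVICES, rows["fp16"], rows["int4"]):
            if fp16 is not None and int4 is not None:
                ratios[device] = int4 / fp16
        result[model] = ratios
    return result


for model, ratios in speedups(SPEED).items():
    print(model, {device: f"{ratio:.2f}x" for device, ratio in ratios.items()})
```

On these figures the speedup ranges from roughly 1.6x for the 3B models to roughly 3.4x for VILA1.5-13B on Orin, with the larger models gaining more.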

Reference

If you find AWQ useful or relevant to your research, please kindly cite our paper:

@inproceedings{lin2023awq,
  title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
  author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  booktitle={MLSys},
  year={2024}
}

Related Projects

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers

Vicuna and FastChat

LLaVA: Large Language and Vision Assistant

VILA: On Pre-training for Visual Language Models