Add open source community (#21)
OpenSourceRonin authored Sep 25, 2024
1 parent 535dcb1 commit 41db6e0
Showing 2 changed files with 38 additions and 13 deletions.
README.md (37 additions & 12 deletions)
VPTQ can compress 70B, even the 405B model, to 1-2 bits without retraining and maintain high accuracy.
* Lightweight Quantization Algorithm: it takes only ~17 hours to quantize the 405B Llama-3.1 model
* Agile Quantization Inference: low decode overhead, best throughput, and best time-to-first-token (TTFT)

## [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low bit-widths (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extreme low-bit levels. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.

Read the full [**Tech Report**](https://github.com/microsoft/VPTQ/blob/main/VPTQ_tech_report.pdf) for details.
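To make the vector-quantization idea concrete, here is a minimal, self-contained sketch of plain VQ on a toy weight matrix: weights are grouped into short vectors, each vector is replaced by the index of its nearest codebook centroid, and a table lookup reconstructs an approximation. This illustrates only the general principle, not the VPTQ algorithm itself; the matrix size, vector length, and codebook size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "weight matrix" and VQ parameters (arbitrary sizes, for illustration only).
W = rng.normal(size=(64, 64)).astype(np.float32)
vector_len = 8        # group weights into vectors of length 8
codebook_size = 256   # 256 centroids -> 8-bit indices -> 1 bit per weight

# Reshape the weights into a list of vectors.
vectors = W.reshape(-1, vector_len)

# Build a codebook with a few rounds of plain k-means (illustrative only).
codebook = vectors[rng.choice(len(vectors), codebook_size, replace=False)]
for _ in range(10):
    # Assign each vector to its nearest centroid (squared Euclidean distance).
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d.argmin(axis=1)
    # Update each centroid as the mean of its assigned vectors.
    for k in range(codebook_size):
        members = vectors[assign == k]
        if len(members) > 0:
            codebook[k] = members.mean(axis=0)

# The "quantized" model is just indices + codebook; reconstruction is a table lookup.
W_hat = codebook[assign].reshape(W.shape)

bits_per_weight = np.log2(codebook_size) / vector_len
print(f"{bits_per_weight:.3f} bits/weight, "
      f"reconstruction MSE = {((W - W_hat) ** 2).mean():.4f}")
```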

### Early Results from Tech Report
VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; with well-chosen parameters, VPTQ can achieve even better results, especially in model accuracy and inference speed.

<img src="assets/vptq.png" width="500">
| Model | bitwidth | W2↓ | C4↓ | AvgQA↑ | tok/s↑ | mem(GB) | cost/h↓ |
|:-----------:|:----:|:----:|:----:|:----:|:----:|:-----:|:--:|
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7 | 19.54 | 19 |
| | 2.11 | 3.92 | 5.71 | 68.7 | 9.7 | 20.01 | 19 |


## Installation and Evaluation

### Dependencies

- python 3.10+
- torch >= 2.2.0
- transformers >= 4.44.0
```bash
export PATH=/usr/local/cuda-12/bin/:$PATH  # adjust to your CUDA installation
```

*Compiling the CUDA kernels will take several minutes.*
```bash
pip install git+https://github.com/microsoft/VPTQ.git --no-build-isolation
```
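Optionally, you can confirm that the environment matches the dependency list above before building. This is a small sketch; it assumes `torch` and `transformers` are already installed, and that `packaging` is available (it ships as a dependency of `transformers`):

```python
import shutil

import torch
import transformers
from packaging import version  # installed alongside transformers

# Minimum versions from the dependency list above.
assert version.parse(torch.__version__.split("+")[0]) >= version.parse("2.2.0")
assert version.parse(transformers.__version__) >= version.parse("4.44.0")

# Building the CUDA kernels needs a CUDA-capable torch and nvcc on PATH.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA seen by torch:", torch.version.cuda)
print("nvcc on PATH:", shutil.which("nvcc"))
```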

### Models from Open Source Community

⚠️ This repository only provides the model quantization algorithm.

⚠️ The open-source community [VPTQ-community](https://huggingface.co/VPTQ-community) provides models based on the technical report and the quantization algorithm.

⚠️ This repository cannot guarantee the performance of those models.

| Model Series | Collections |
|:----------------------:|:-----------:|
| Llama 3.1 8B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-8b-instruct-without-finetune-66f2b70b1d002ceedef02d2e) |
| Llama 3.1 70B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-70b-instruct-without-finetune-66f2bf454d3dd78dfee2ff11) |
| Qwen 2.5 7B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-7b-instruct-without-finetune-66f3e9866d3167cc05ce954a) |
| Qwen 2.5 72B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-qwen-25-72b-instruct-without-finetune-66f3bf1b3757dfa1ecb481c0) |
| Llama 3.1 405B Instruct | [HF 🤗](https://huggingface.co/collections/VPTQ-community/vptq-llama-31-405b-instruct-without-finetune-66f4413f9ba55e1a9e52cfb0) |
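If you prefer, a community checkpoint from one of these collections can be pre-downloaded with `huggingface_hub` (optional, since `from_pretrained` also downloads on first use). The repository id below is the Llama 3.1 70B model used in the examples that follow:

```python
from huggingface_hub import snapshot_download

# Download (or reuse from the local cache) a community-quantized checkpoint.
local_dir = snapshot_download(
    repo_id="VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft"
)
print("checkpoint cached at:", local_dir)
```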

### Language Generation Example
To generate text with a quantized model, run the following command:

The model [*VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft*](https://huggingface.co/VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft) (~1.875 bits) is provided by the open-source community. The repository cannot guarantee its performance.

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft --prompt="Explain: Do Not Go Gentle into That Good Night"
```
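As a rough sanity check on the "~1.875 bit" figure above: if the suffix of the model name is read as vector length `v8` and codebook size `k32768` (an assumption about the naming convention, not something documented here), the per-weight index cost works out to log2(32768) / 8:

```python
import math

vector_len = 8         # assumed from "v8" in the model name
codebook_size = 32768  # assumed from "k32768" in the model name

bits_per_weight = math.log2(codebook_size) / vector_len  # 15 / 8
print(f"~{bits_per_weight:.3f} bits per weight")  # ~1.875, ignoring codebook storage overhead
```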

### Terminal Chatbot Example
Launch a terminal chatbot with the command below. Note that you must use a chat (instruction-tuned) model for this to work:

```bash
python -m vptq --model=VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft --chat
```

### Python API Example
Using the Python API:

```python
import vptq
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("LLaMa-2-7b-1.5bi-vptq")
m = vptq.AutoModelForCausalLM.from_pretrained("LLaMa-2-7b-1.5bi-vptq", device_map='auto')
tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft", device_map='auto')

inputs = tokenizer("Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
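To see tokens appear as they are generated instead of all at once, the standard `transformers` streamer can be passed to the same `generate` call. This builds on the snippet above and assumes `m` and `tokenizer` are already loaded:

```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as soon as they are produced.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=2, streamer=streamer)
```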

### Gradio Web App Example
Set an environment variable to control whether a public share link is created:
`export SHARE_LINK=1`
```bash
python -m vptq.app
```
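For reference, here is a minimal sketch of how a `SHARE_LINK`-style flag is typically wired into a Gradio app. It uses a placeholder responder and is not the repository's actual `vptq.app` implementation:

```python
import os

import gradio as gr


def respond(message, history):
    # Placeholder echo responder; the real app streams tokens from a VPTQ-quantized model.
    return f"(echo) {message}"


demo = gr.ChatInterface(respond)
# Create a public share link only when SHARE_LINK=1 is exported.
demo.launch(share=os.environ.get("SHARE_LINK") == "1")
```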


## Road Map
- [ ] Merge the quantization algorithm into the public repository.
- [ ] Submit the VPTQ method to various inference frameworks (e.g., vLLM, llama.cpp).

## Acknowledgement
* We are deeply grateful for the inspiration provided by the papers QUIP, QUIP#, GPTVQ, AQLM, WoodFisher, GPTQ, and OBC.

## Publication

Accepted at EMNLP 2024 (Main Conference):
```bibtex
@inproceedings{
```

## Limitations of VPTQ
* ⚠️ VPTQ should only be used for research and experimental purposes. Further testing and validation are needed before it is used in production.
* ⚠️ This repository only provides the model quantization algorithm. The open-source community may provide models based on the technical report and quantization algorithm, but this repository cannot guarantee the performance of those models.
* ⚠️ VPTQ has not been tested across all potential applications and domains, and we cannot guarantee its accuracy and effectiveness on other tasks or scenarios.
* ⚠️ All of our tests use English text; other languages are not covered by the current testing.

vptq/app.py (1 addition & 1 deletion)

```diff
  from vptq.app_utils import get_chat_loop_generator

- chat_completion = get_chat_loop_generator("VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8k65536-4096")
+ chat_completion = get_chat_loop_generator("VPTQ-community/Meta-Llama-3.1-70B-Instruct-v8-k32768-0-woft")

  def respond(
```
