This PR adds support for 8-bit quantization in the ggml q8_0 format. The main change is `MetalQuantizedCompiler`, which takes a set of weight nodes and converts their downstream matmuls and gathers into QuantizedMatmuls and QuantizedGathers. Each of these ops takes its weights in q8 and does the dequantization internally. In the mistral example, there is a `MetalQ8Loader` which demonstrates how to load weights from a q8 gguf file and returns the set of weight nodes.
Mistral Decode Speed on M1 Pro:
Average token generated in 56.18ms (17.80 tok/s)