Engine | Float32 | Float16 | Int8 | Int4 | CUDA | ROCM | Mac M1/M2 | Training |
---|---|---|---|---|---|---|---|---|
candle | ⚠️ | ✅ | ⚠️ | ⚠️ | ✅ | ❌ | 🚧 | ❌ |
llama.cpp | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ |
ctranslate | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | 🚧 | ❌ |
onnx | ✅ | ✅ | ❌ | ❌ | ✅ | ⚠️ | ❌ | ❌ |
transformers (pytorch) | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 | ✅ | ✅ |
vllm | ✅ | ✅ | ❌ | ✅ | ✅ | 🚧 | ❌ | ❌ |
exllamav2 | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | ❌ | ❌ |
ctransformers | ❌ | ❌ | ✅ | ✅ | ✅ | 🚧 | 🚧 | ❌ |
AutoGPTQ | ✅ | ✅ | ⚠️ | ⚠️ | ✅ | ❌ | ❌ | ❌ |
AutoAWQ | ❌ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
DeepSpeed-MII | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ⚠️ |
PyTorch Lightning | ✅ | ✅ | ✅ | ✅ | ✅ | ⚠️ | ⚠️ | ✅ |
Optimum Nvidia | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
Nvidia TensorRT-LLM | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
- ✅ Supported
- ❌ Not Supported
- ⚠️ There is a catch related to this
- 🚧 It is supported but not implemented in this current version
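
To make the precision columns concrete, here is a minimal sketch (not this repository's benchmark harness) of loading one model at Float16, Int8, and Int4 with `transformers` plus `bitsandbytes`; the model ID is only an example, and in practice you would load one variant at a time.

```python
# Minimal sketch, assuming `transformers`, `accelerate` and `bitsandbytes`
# are installed and a CUDA GPU is available. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example model

# Float16: half the memory of Float32, usually negligible quality loss.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Int8: weights are quantized to 8-bit as the model loads.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Int4: 4-bit weights; the matmuls still run in float16.
model_int4 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)
```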
The rows below are named after the engines, except when the name is Generic, in which case the nuance applies to all the engines.
Name | Type | Description |
---|---|---|
candle | ⚠️ | Metal backend is supported, but it gives terrible performance even on small models like Phi2. For AMD ROCM there is no support as per this issue. |
candle | 🚧 | Latest performance numbers for Candle are not implemented in this version. If you want to see the numbers, please check out archive.md, which contains the benchmark numbers for Llama 2 7B. |
ctranslate2 | ⚠️ | ROCM is not supported; however, work is in progress to bring this feature to CTranslate2. No support for Mac M1/M2. |
onnxruntime | ⚠️ | ONNXRuntime in general supports ROCM, but for LLMs, ONNXRuntime with HuggingFace Optimum only supports the CUDAExecutionProvider right now (a runnable sketch follows below this table). On CPU it is available but super slow. |
pytorch lightning | ⚠️ | ROCM is supported but not tested for PyTorch Lightning. See this issue. |
pytorch lightning | ⚠️ | Metal is supported in PyTorch Lightning, but for Llama 2 7B Chat or Mistral 7B it is super slow. |
AutoGPTQ | ⚠️ | AutoGPTQ is a weight-only quantization algorithm: the activations still remain in either float32 or float16 (a toy example follows below this table). We used a 4-bit weight-quantized model for our benchmark experiments. |
Generic | 🚧 | For all the engines that support Metal, please check out archive.md, which contains the benchmark numbers for Llama 2 7B. |
Deepspeed | ⚠️ | DeepSpeed supports training; however, for inference we have used DeepSpeed MII. |
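
As a concrete illustration of the onnxruntime nuance above, the sketch below runs an ONNX-exported causal LM through HuggingFace Optimum with the CUDAExecutionProvider. This is an illustrative example (it assumes `optimum` with the `onnxruntime-gpu` extra is installed, and the model ID is just a placeholder), not this repository's benchmark code.

```python
# Minimal sketch: LLM inference with ONNXRuntime via HuggingFace Optimum.
# Assumes `optimum[onnxruntime-gpu]` is installed and a CUDA GPU is present.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)

# export=True converts the PyTorch checkpoint to ONNX on the fly; `provider`
# selects the ONNX Runtime backend. For LLMs through Optimum, the
# CUDAExecutionProvider is the supported GPU path right now.
model = ORTModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    provider="CUDAExecutionProvider",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```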
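
The AutoGPTQ nuance (weight-only quantization) can likewise be shown with a toy example: the weights are stored as 4-bit integers plus a scale, while the activations are never quantized, so the weights are dequantized back to float16 right before the matmul. The shapes and the per-tensor scheme here are simplified for illustration and are not AutoGPTQ's actual kernels.

```python
# Toy weight-only quantization: int4-range weights, fp16 activations.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float16)  # fp16 weight matrix
x = rng.standard_normal((1, 4)).astype(np.float16)  # fp16 activations (never quantized)

# Symmetric 4-bit quantization: integers in [-8, 7] plus one fp16 scale.
scale = np.float16(np.abs(w).max() / 7.0)
w_q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)  # what gets stored

# At inference time: dequantize the weights, then matmul in fp16.
w_dq = w_q.astype(np.float16) * scale
y = x @ w_dq

print("max abs weight error:", np.abs(w - w_dq).max())
```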