Merge pull request #1025 from bghira/main
update docs
bghira authored Oct 4, 2024
2 parents 6100762 + 6c8420d commit 565aedd
Showing 1 changed file: OPTIONS.md (15 additions, 3 deletions)
#### Optimum Quanto
Provided by Hugging Face, the optimum-quanto library has robust support across all supported platforms.

- `int8-quanto` is the most broadly compatible and probably produces the best results
  - fastest training for the RTX 4090, and probably other GPUs
  - uses hardware-accelerated matmul on CUDA devices for int8 and int4
    - int4 is still abysmally slow
  - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
- `fp8uz-quanto` is an experimental fp8 variant for CUDA and ROCm devices
  - better supported on AMD silicon such as Instinct or newer architectures
  - can be slightly faster than `int8-quanto` on a 4090 for training, but not for inference (1 second slower)
  - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
- `fp8-quanto` will not (currently) use fp8 matmul, and does not work on Apple systems
  - does not have hardware fp8 matmul yet on CUDA or ROCm devices, so it will possibly be noticeably slower than int8
    - uses the MARLIN kernel for fp8 GEMM
  - incompatible with dynamo; dynamo will automatically be disabled if the combination is attempted
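
These precision levels are selected through the trainer's quantisation option. A minimal sketch, assuming SimpleTuner's `--base_model_precision` argument and a `TRAINER_EXTRA_ARGS` variable in `config/config.env` (adjust to match your setup):

```bash
# config/config.env (sketch; flag and variable names assumed, not verified)
# Quantise the base model to int8 via optimum-quanto:
TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --base_model_precision=int8-quanto"
```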

#### TorchAO

A newer library from PyTorch, AO allows us to replace the linears and 2D convolutions with quantised counterparts.

- `int8-torchao` will reduce memory consumption to the same level as any of Quanto's precision levels
  - at the time of writing, runs slightly slower (11s/iter) than Quanto does (9s/iter) on Apple MPS
  - when not using `torch.compile`, same speed and memory use as `int8-quanto` on CUDA devices; the speed profile on ROCm is unknown
  - when using `torch.compile`, slower than `int8-quanto`
- `fp8-torchao` is not enabled due to bugs in the implementation.
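
To illustrate what the AO path does, here is a minimal, self-contained sketch using torchao's `quantize_` API on a toy model. SimpleTuner performs the equivalent replacement internally; its exact call path may differ:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# A toy model standing in for a network's linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Replace each nn.Linear's weights with int8 equivalents, in place.
quantize_(model, int8_weight_only())

# The module is called exactly as before; dequantisation happens inside the op.
out = model(torch.randn(2, 1024))
```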

#### Torch Dynamo

To enable `torch.compile()`, add the following line to `config/config.env`:

```bash
TRAINING_DYNAMO_BACKEND=inductor
```

If you wish to use additional features like max-autotune, run the following:

```bash
accelerate config
```

Carefully answer the questions and use bf16 mixed precision training when prompted. Say **yes** to using Dynamo, **no** to fullgraph, and **yes** to max-autotune.
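
For reference, those answers end up in Accelerate's saved config (typically `~/.cache/huggingface/accelerate/default_config.yaml`). A sketch of the relevant entries, with the file path and surrounding fields depending on your install:

```yaml
# Excerpt; the generated file contains additional fields.
mixed_precision: bf16
dynamo_config:
  dynamo_backend: INDUCTOR
  dynamo_mode: max-autotune
  dynamo_use_dynamic: false
  dynamo_use_fullgraph: false
```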

Note that the first several steps of training will be slower than usual because of compilation occurring in the background.

---
