Merge pull request #1025 from bghira/main
update docs
bghira authored Oct 4, 2024
2 parents 6100762 + 6c8420d commit 565aedd
Showing 1 changed file: OPTIONS.md (15 additions, 3 deletions)
#### Optimum Quanto
Provided by Hugging Face, the optimum-quanto library has robust support across all supported platforms.

- `int8-quanto` is the most broadly compatible and probably produces the best results
  - fastest training for the RTX 4090, and probably other GPUs
  - uses hardware-accelerated matmul on CUDA devices for int8 and int4
    - int4 is still abysmally slow
  - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
- `fp8uz-quanto` is an experimental fp8 variant for CUDA and ROCm devices
  - better supported on AMD silicon such as Instinct or newer architectures
  - can be slightly faster than `int8-quanto` on a 4090 for training, but not for inference (1 second slower)
  - works with `TRAINING_DYNAMO_BACKEND=inductor` (`torch.compile()`)
- `fp8-quanto` will not (currently) use fp8 matmul, and does not work on Apple systems
  - does not have hardware fp8 matmul yet on CUDA or ROCm devices, so it will possibly be noticeably slower than int8
    - uses the MARLIN kernel for fp8 GEMM
  - incompatible with dynamo; dynamo will automatically be disabled if the combination is attempted
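
These precision levels are selected through the trainer's quantisation option. A minimal sketch, assuming SimpleTuner's `--base_model_precision` argument and a `TRAINER_EXTRA_ARGS` variable in `config/config.env` (adjust to match your setup):

```bash
# config/config.env (sketch; flag and variable names assumed, not verified)
# Quantise the base model to int8 via optimum-quanto:
TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --base_model_precision=int8-quanto"
```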

#### TorchAO

A newer library from PyTorch, AO allows us to replace the linears and 2D convolutions with quantised counterparts.

- `int8-torchao` will reduce memory consumption to the same level as any of Quanto's precision levels
  - at the time of writing, runs slightly slower (11s/iter) than Quanto does (9s/iter) on Apple MPS
  - when not using `torch.compile`, same speed and memory use as `int8-quanto` on CUDA devices; the speed profile on ROCm is unknown
  - when using `torch.compile`, slower than `int8-quanto`
- `fp8-torchao` is not enabled due to bugs in the implementation.
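
To illustrate what the AO path does, here is a minimal, self-contained sketch using torchao's `quantize_` API on a toy model. SimpleTuner performs the equivalent replacement internally; its exact call path may differ:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# A toy model standing in for a network's linear layers.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Replace each nn.Linear's weights with int8 equivalents, in place.
quantize_(model, int8_weight_only())

# The module is called exactly as before; dequantisation happens inside the op.
out = model(torch.randn(2, 1024))
```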

#### Torch Dynamo

To enable `torch.compile()`, add the following line to `config/config.env`:

```bash
TRAINING_DYNAMO_BACKEND=inductor
```

If you wish to use additional features like max-autotune, run the following:

```bash
accelerate config
```

Carefully answer the questions and use bf16 mixed precision training when prompted. Say **yes** to using Dynamo, **no** to fullgraph, and **yes** to max-autotune.
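
For reference, those answers end up in Accelerate's saved config (typically `~/.cache/huggingface/accelerate/default_config.yaml`). A sketch of the relevant entries, with the file path and surrounding fields depending on your install:

```yaml
# Excerpt; the generated file contains additional fields.
mixed_precision: bf16
dynamo_config:
  dynamo_backend: INDUCTOR
  dynamo_mode: max-autotune
  dynamo_use_dynamic: false
  dynamo_use_fullgraph: false
```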

Note that the first several steps of training will be slower than usual because of compilation occurring in the background.

---
