
CPU Offloading w/ FSDP - gradient accumulation is potentially broken #414

Open
JamesKunstle opened this issue Jan 27, 2025 · 0 comments
From the FSDP docs:
"FSDP currently does not support gradient accumulation outside no_sync() when using CPU offloading. This is because FSDP uses the newly-reduced gradient instead of accumulating with any existing gradient, which can lead to incorrect results."

https://pytorch.org/docs/stable/fsdp.html
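
A minimal sketch (not from the issue) of the pattern the quoted docs imply: with CPU offloading enabled, every micro-batch except the last runs under `no_sync()`, and only the final backward triggers the reduce-scatter. The toy model, `accum_steps`, and the random micro-batches are illustrative assumptions; it assumes launch via `torchrun` so the process-group environment variables are set.

```python
import contextlib

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP


def main():
    # Assumes torchrun has set RANK / WORLD_SIZE / MASTER_ADDR etc.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Toy model purely for illustration.
    model = nn.Linear(16, 16).cuda()
    fsdp_model = FSDP(model, cpu_offload=CPUOffload(offload_params=True))
    optimizer = torch.optim.SGD(fsdp_model.parameters(), lr=1e-3)

    accum_steps = 4  # assumed accumulation factor
    micro_batches = [torch.randn(8, 16, device="cuda") for _ in range(accum_steps)]

    for i, x in enumerate(micro_batches):
        is_last = i == accum_steps - 1
        # Per the docs quoted above, accumulating *outside* no_sync() with
        # CPU offloading can overwrite existing gradients instead of adding
        # to them, so only the last micro-batch runs the synchronized backward.
        ctx = contextlib.nullcontext() if is_last else fsdp_model.no_sync()
        with ctx:
            loss = fsdp_model(x).pow(2).mean() / accum_steps
            loss.backward()

    optimizer.step()
    optimizer.zero_grad()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```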
