All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- checkpointing: use dummy tensor to ensure backward pass is called [#701]
- checkpointing: ensure internal fwd counter is not incremented in eval mode [#709]
- FSDP: fixed bug where buffers returned in `state_dict()` could still be half precision when `mixed_precision` is set to `True`.
- setup.py: hide CUDA extensions behind `BUILD_CUDA_EXTENSIONS` envvar [#634]
- checkpointing: rename and move the `checkpoint_activations` wrapper [#654]
- FSDP: fix `local_state_dict` potentially calling a child class's `state_dict` [#574]
- FSDP: fix extra process groups being created by default. Old behavior can cause excessive GPU memory usage [#678] [#681]
- FSDP: fix forward pass not overlapping compute and allgather [#671]
- FSDP: improved frozen weight support [#657]
- FSDP: workaround AMP autocast cache issue with `clear_autocast_cache` flag [#650]
- FSDP: Rename API arg `cpu_offload` to `move_params_to_cpu` to better reflect functionality. We will deprecate `cpu_offload` in an upcoming release [#676]
- MoE: several fixes [#666] [#667] [#668]
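A minimal sketch of the `cpu_offload` to `move_params_to_cpu` rename above, assuming the FSDP constructor of this release accepts the new keyword and that a `torch.distributed` process group is already initialized (not an authoritative example):

```python
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
module = nn.Linear(1024, 1024)

# New spelling of the flag; the old `cpu_offload=True` still works for now
# but is slated for deprecation per the entry above.
sharded_module = FSDP(module, move_params_to_cpu=True)
```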
- SDP: re-expose the module property [#647]
- wrap: support wrapping based on `wrapper_config` [#685]
- FSDP: added `force_input_to_fp32` flag for SyncBatchNorm [#659]
- FSDP: better memory usage for reduce bucket [#633]
- FSDP: added `local_metadata_dict` to save sharding-related information [#683]
- FSDP: added `consolidate_shard_weights` to reconstruct the consolidated (non-sharded) model weights from sharded weights and metadata saved on disk [#683]
- Experimental SyncBatchNorm [#662] [#680]
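A hedged sketch of how the two new checkpoint helpers above could fit together; the method names come from the entries, but the exact signatures, file names, and argument order here are assumptions:

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def save_shard(fsdp_model: FSDP, rank: int) -> None:
    # On each rank, save the sharded weights plus the sharding metadata.
    torch.save(fsdp_model.local_state_dict(), f"shard_{rank}.pt")
    torch.save(fsdp_model.local_metadata_dict(), f"metadata_{rank}.pt")

def rebuild_full_state_dict(world_size: int) -> dict:
    # Offline, reconstruct the consolidated (non-sharded) weights from the
    # saved shards and metadata, without instantiating the model.
    weights = [torch.load(f"shard_{r}.pt") for r in range(world_size)]
    metadata = [torch.load(f"metadata_{r}.pt") for r in range(world_size)]
    return FSDP.consolidate_shard_weights(weights, metadata)
```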
- FSDP: Consolidate cpu_adam optimizer state dict (#607)
- FSDP: handle models with multiple forward passes and checkpointing (#621)
- FSDP & SDP: check before calling
_specify_ddp_gpu_num
(#626) - FSDP: relax checking root condition (#620)
- SDP: removing an assert which does not always seem accurate (#625)
- FSDP: changing FSDP init to bypass pg validation (#619)
- OSS: brought test coverage to 100% (#618)
- [offload] Add API, tutorial, and smaller docstring changes (#576)
- FSDP: fixing training with frozen weights (#614)
- SDP: privatizing all the things (#611)
- FSDP: Make `_get_default_cuda_device` more robust to modules without params (#606)
- OffloadModel: Add back the previous codepath for using OffloadModel without activation checkpointing (#608)
- FSDP: Add no broadcast optim state option (#560)
- ShardedDDP: Properly handle .eval() mode (#587)
- ShardedDDP: Handle model being moved back to CPU prior to state consolidation (#573)
- FSDP: much faster state consolidation (#595)
- FSDP: Add gradient pre-dedivide to prevent overflow with large world sizes (#565)
- Offload: (experimental) Fix activation offloading to CPU (#588)
- FSDP: changed `auto_wrap_bn` utility function so that a single FSDP group is optional (#556)
- FSDP: optimizer state load/save (#537)
- FSDP: fix weight init when using apply() (#543)
- Multiprocess Pipe: retired old implementation
- Experimental: xpipe
- ShardedDDP deferred init (#558)
- Experimental: Add spectrain support (#372)
- FSDP: enabled pytorch SyncBN (no asserting) (#527)
- FSDP: added `auto_wrap_bn` utility function (#531)
- OSS: fix a compatibility problem with Lightning w.r.t. the optimizer state dict (#510)
- FSDP: fixed a bug when part of the autograd graph is traversed multiple times in mixed precision mode (#513)
- FSDP docs (#455)
- `enable_wrap` and `auto_wrap` APIs (#446)
- Added `experimental.nn.OffloadModel` API for training large models on a single GPU (#432)
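A minimal sketch of the `enable_wrap` / `auto_wrap` pattern from the entry above; the `wrapper_cls` keyword, the explicit root wrap, and the import paths are assumptions based on how later fairscale releases document this API:

```python
import torch.nn as nn
from fairscale.nn import auto_wrap, enable_wrap
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Inside the context, auto_wrap recursively wraps eligible submodules in FSDP
# using the keyword arguments passed to enable_wrap.
with enable_wrap(wrapper_cls=FSDP, mixed_precision=True):
    model = auto_wrap(model)
model = FSDP(model)  # wrap the root module last
```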
- OSS: fix a broken state dict when using non-contiguous param groups
- Several SDP fixes around performance and corner cases
- Many FSDP fixes
- AdaScale & SDP/FSDP test added but not officially supported
- FullyShardedDataParallel (FSDP) (#413)
- ShardedDDP fp16 grad reduction option (#402)
- Expose experimental algorithms within the pip package (#410)
- Catch corner case when the model is too small with respect to the world size, and shards are empty (#406)
- Memory leak in `checkpoint_wrapper` (#412)
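For reference, `checkpoint_wrapper` is applied roughly as below; the import path and the `offload_to_cpu` keyword are assumptions rather than a guaranteed signature for this release:

```python
import torch.nn as nn
from fairscale.nn import checkpoint_wrapper

block = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

# Activation checkpointing: activations are recomputed during the backward
# pass instead of being stored, trading extra compute for lower memory.
block = checkpoint_wrapper(block, offload_to_cpu=False)
```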
- ShardedDDP and OSS handle model trainability changes during training (#369)
- ShardedDDP state dict load/save bug (#386)
- ShardedDDP handle train/eval modes (#393)
- AdaScale handling custom scaling factors (#401)
- ShardedDDP manual reduce option for checkpointing (#389)
- Checkpointing model wrapper (#376)
- Faster OSS, flatbuffers (#371)
- Small speedup in OSS clipgradnorm (#363)
- Bug in ShardedDDP with 0.1.5 depending on the init (KeyError / OSS)
- Much refactoring in Pipe (#357, #358, #360, #362, #370, #373)
- Better pip integration / resident pytorch (#375)
- Pytorch compatibility for OSS checkpoints (#310)
- Elastic checkpoints for OSS, world size can vary in between save and loads (#310)
- Tensor views for OSS bucketing, reduced CPU use (#300)
- Bucket calls in ShardedDDP, for faster inter node communications (#327)
- FlattenParamWrapper, which flattens module parameters into a single tensor seamlessly (#317)
- AMPnet experimental support (#304)
- ShardedDDP properly handles device changes via `.to()` (#353)
- Add a new interface for AdaScale, AdaScaleWrapper, which makes it compatible with OSS (#347)
- Missing cu files in the pip package
- Release numbering within python and from pypi
- AdaScale:
  . Added gradient accumulation feature (#202)
  . Added support for `torch.lr_scheduler` (#229)
  . Added support for `add_param_groups` (#266)
  . Added support for `scale != world_size` (#266)
- AdaScale: smoothing factor value fixed when using gradient accumulation (#235)
- Pipe: documentation on balancing functions (#243)
- ShardedDDP: handle typical NLP models
- ShardedDDP: better partitioning when finetuning
- make sure pip package includes header files (#221)
- ShardedDataParallel with autoreduce (#157)
- cpu support for Pipe (#188)
- ShardedOptim: Distributed Grad Scaler (for torch AMP) (#182)
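A hedged sketch combining the sharded grad scaler above with the OSS optimizer wrapper; the class and module paths are assumptions about the fairscale API of this era, and a `torch.distributed` process group is assumed to be initialized:

```python
import torch
from fairscale.optim import OSS
from fairscale.optim.grad_scaler import ShardedGradScaler

# Assumes torch.distributed.init_process_group(...) has already been called.
model = torch.nn.Linear(128, 128).cuda()
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
scaler = ShardedGradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(8, 128, device="cuda")).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```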
- OSS-aware clip grads, bridge sharded states (#167)
- oss: add `rank_local_state_dict` staticmethod (#174)
- support for PyTorch 1.7.0 (#171)
- Add implementation of AdaScale (#139)
- pip package install (#196, #200)
- multi-process pipe
- multiple OSS fixes
- MegaTron+OSS DDP fix
- add ddp that works with oss with `reduce()` not `all_reduce()` (#19)
- support for PyTorch v1.6
- add mixed precision Adam (#40)
- Adam optimizer state scaling (#44)
- properly restore a sharded optim state (#39)
- OSS restore state to proper device (#46)
- optim/oss: support optimizers with additional step kwargs (#53)
- optim/oss: fix state cast (#56)
- fix eval for `oss_ddp` (#55)
- optim/oss: work correctly with LRScheduler (#58)
- Initial release.