All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- checkpointing: use dummy tensor to ensure backward pass is called [#701]
- checkpointing: ensure internal fwd counter is not incremented in eval mode [#709]
- FSDP: fixed bug where buffers returned in `state_dict()` could still be half precision when `mixed_precision` is set to `True`.
- setup.py: hide CUDA extensions behind `BUILD_CUDA_EXTENSIONS` envvar [#634]
- checkpointing: rename and move the `checkpoint_activations` wrapper [#654]
- FSDP: fix `local_state_dict` potentially calling a child class's `state_dict` [#574]
- FSDP: fix extra process groups being created by default. Old behavior can cause excessive GPU memory usage [#678] [#681]
- FSDP: fix forward pass not overlapping compute and allgather [#671]
- FSDP: improved frozen weight support [#657]
- FSDP: workaround AMP autocast cache issue with `clear_autocast_cache` flag [#650]
- FSDP: Rename API arg `cpu_offload` to `move_params_to_cpu` to better reflect functionality. We will deprecate `cpu_offload` in an upcoming release [#676]
- MoE: several fixes [#666] [#667] [#668]
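A minimal sketch of the `cpu_offload` to `move_params_to_cpu` rename above, assuming the FSDP constructor of this release accepts the new keyword and that a `torch.distributed` process group is already initialized (not an authoritative example):

```python
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
module = nn.Linear(1024, 1024)

# New spelling of the flag; the old `cpu_offload=True` still works for now
# but is slated for deprecation per the entry above.
sharded_module = FSDP(module, move_params_to_cpu=True)
```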
- SDP: re-expose the module property [#647]
- wrap: support wrapping based on `wrapper_config` [#685]
- FSDP: added `force_input_to_fp32` flag for SyncBatchNorm [#659]
- FSDP: better memory usage for reduce bucket [#633]
- FSDP: added `local_metadata_dict` to save sharding-related information [#683]
- FSDP: added `consolidate_shard_weights` to reconstruct the consolidated (non-sharded) model weights from sharded weights and metadata saved on disk [#683]
- Experimental SyncBatchNorm [#662] [#680]
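A hedged sketch of how the two new checkpoint helpers above could fit together; the method names come from the entries, but the exact signatures, file names, and argument order here are assumptions:

```python
import torch
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

def save_shard(fsdp_model: FSDP, rank: int) -> None:
    # On each rank, save the sharded weights plus the sharding metadata.
    torch.save(fsdp_model.local_state_dict(), f"shard_{rank}.pt")
    torch.save(fsdp_model.local_metadata_dict(), f"metadata_{rank}.pt")

def rebuild_full_state_dict(world_size: int) -> dict:
    # Offline, reconstruct the consolidated (non-sharded) weights from the
    # saved shards and metadata, without instantiating the model.
    weights = [torch.load(f"shard_{r}.pt") for r in range(world_size)]
    metadata = [torch.load(f"metadata_{r}.pt") for r in range(world_size)]
    return FSDP.consolidate_shard_weights(weights, metadata)
```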
- FSDP: Consolidate cpu_adam optimizer state dict (#607)
- FSDP: handle models with multiple forward passes and checkpointing (#621)
- FSDP & SDP: check before calling
_specify_ddp_gpu_num
(#626) - FSDP: relax checking root condition (#620)
- SDP: removing an assert which does not always seem accurate (#625)
- FSDP: changing FSDP init to bypass pg validation (#619)
- OSS: brought test coverage to 100% (#618)
- [offload] Add API, tutorial, and smaller docstring changes (#576)
- FSDP: fixing training with frozen weights (#614)
- SDP: privatizing all the things (#611)
- FSDP: Make `_get_default_cuda_device` more robust to modules without params (#606)
- OffloadModel: Add back the previous codepath for using OffloadModel without activation checkpointing (#608)
- FSDP: Add no broadcast optim state option (#560)
- ShardedDDP: Properly handle .eval() mode (#587)
- ShardedDDP: Handle model being moved back to CPU prior to state consolidation (#573)
- FSDP: much faster state consolidation (#595)
- FSDP: Add gradient pre-dedivide to prevent overflow with large world sizes (#565)
- Offload: (experimental) Fix activation offloading to CPU (#588)
- FSDP: changed `auto_wrap_bn` utility function so that a single FSDP group is optional (#556)
- FSDP: optimizer state load/save (#537)
- FSDP: fix weight init when using apply() (#543)
- Multiprocess Pipe: retired old implementation
- Experimental: xpipe
- ShardedDDP deferred init (#558)
- Experimental: Add spectrain support (#372)
- FSDP: enabled pytorch SyncBN (no asserting) (#527)
- FSDP: added `auto_wrap_bn` utility function (#531)
- OSS: fix a compatibility problem with Lightning w.r.t. the optimizer state dict (#510)
- FSDP: fixed a bug when part of the autograd graph is traversed multiple times in mixed precision mode (#513)
- FSDP docs (#455)
- `enable_wrap` and `auto_wrap` APIs (#446)
- Added `experimental.nn.OffloadModel` API for training large models on a single GPU (#432)
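A minimal sketch of the `enable_wrap` / `auto_wrap` pattern from the entry above; the `wrapper_cls` keyword, the explicit root wrap, and the import paths are assumptions based on how later fairscale releases document this API:

```python
import torch.nn as nn
from fairscale.nn import auto_wrap, enable_wrap
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# Inside the context, auto_wrap recursively wraps eligible submodules in FSDP
# using the keyword arguments passed to enable_wrap.
with enable_wrap(wrapper_cls=FSDP, mixed_precision=True):
    model = auto_wrap(model)
model = FSDP(model)  # wrap the root module last
```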
- OSS: fix a broken state dict when using non-contiguous param groups
- Several SDP fixes around performance and corner cases
- Many FSDP fixes
- AdaScale & SDP/FSDP test added but not officially supported
- FullyShardedDataParallel (FSDP) (#413)
- ShardedDDP fp16 grad reduction option (#402)
- Expose experimental algorithms within the pip package (#410)
- Catch corner case when the model is too small with respect to the world size, and shards are empty (#406)
- Memory leak in `checkpoint_wrapper` (#412)
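For reference, `checkpoint_wrapper` is applied roughly as below; the import path and the `offload_to_cpu` keyword are assumptions rather than a guaranteed signature for this release:

```python
import torch.nn as nn
from fairscale.nn import checkpoint_wrapper

block = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))

# Activation checkpointing: activations are recomputed during the backward
# pass instead of being stored, trading extra compute for lower memory.
block = checkpoint_wrapper(block, offload_to_cpu=False)
```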
- ShardedDDP and OSS handle model trainability changes during training (#369)
- ShardedDDP state dict load/save bug (#386)
- ShardedDDP handle train/eval modes (#393)
- AdaScale handling custom scaling factors (#401)
- ShardedDDP manual reduce option for checkpointing (#389)
- Checkpointing model wrapper (#376)
- Faster OSS, flatbuffers (#371)
- Small speedup in OSS clipgradnorm (#363)
- Bug in ShardedDDP with 0.1.5 depending on the init (KeyError / OSS)
- Much refactoring in Pipe (#357, #358, #360, #362, #370, #373)
- Better pip integration / resident pytorch (#375)
- Pytorch compatibility for OSS checkpoints (#310)
- Elastic checkpoints for OSS, world size can vary in between save and loads (#310)
- Tensor views for OSS bucketing, reduced CPU use (#300)
- Bucket calls in ShardedDDP, for faster inter node communications (#327)
- FlattenParamWrapper, which flattens module parameters into a single tensor seamlessly (#317)
- AMPnet experimental support (#304)
- ShardedDDP properly handles device changes via `.to()` (#353)
- Add a new interface for AdaScale, AdaScaleWrapper, which makes it compatible with OSS (#347)
- Missing cu files in the pip package
- Release numbering within python and from pypi
- AdaScale:
  . Added gradient accumulation feature (#202)
  . Added support for `torch.lr_scheduler` (#229)
  . Added support for `add_param_groups` (#266)
  . Added support for `scale != world_size` (#266)
- AdaScale: smoothing factor value fixed when using gradient accumulation (#235)
- Pipe: documentation on balancing functions (#243)
- ShardedDDP: handle typical NLP models
- ShardedDDP: better partitioning when finetuning
- make sure pip package includes header files (#221)
- ShardedDataParallel with autoreduce (#157)
- cpu support for Pipe (#188)
- ShardedOptim: Distributed Grad Scaler (for torch AMP) (#182)
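A hedged sketch combining the sharded grad scaler above with the OSS optimizer wrapper; the class and module paths are assumptions about the fairscale API of this era, and a `torch.distributed` process group is assumed to be initialized:

```python
import torch
from fairscale.optim import OSS
from fairscale.optim.grad_scaler import ShardedGradScaler

# Assumes torch.distributed.init_process_group(...) has already been called.
model = torch.nn.Linear(128, 128).cuda()
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
scaler = ShardedGradScaler()

with torch.cuda.amp.autocast():
    loss = model(torch.randn(8, 128, device="cuda")).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```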
- OSS-aware clip grads, bridge sharded states (#167)
- oss: add `rank_local_state_dict` staticmethod (#174)
- support for PyTorch 1.7.0 (#171)
- Add implementation of AdaScale (#139)
- pip package install (#196, #200)
- multi-process pipe
- multiple OSS fixes
- MegaTron+OSS DDP fix
- add ddp that works with oss with `reduce()` not `all_reduce()` (#19)
- support for PyTorch v1.6
- add mixed precision Adam (#40)
- Adam optimizer state scaling (#44)
- properly restore a sharded optim state (#39)
- OSS restore state to proper device (#46)
- optim/oss: support optimizers with additional step kwargs (#53)
- optim/oss: fix state cast (#56)
- fix eval for `oss_ddp` (#55)
- optim/oss: work correctly with LRScheduler (#58)
- Initial release.