v0.2.0
🚀 Streaming v0.2.0
Streaming v0.2.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.0
New Features
- Elastic world size deterministic shuffle
  Shuffled or not, StreamingDataset now collectively traverses the samples in identical order across all the devices, given a seed and a canonical number of nodes. This ordering holds true even if you checkpoint and resume training of the same epoch on a different number of nodes.
- Instant Mid-Epoch Resumption
  Waiting while your data loader spins to resume from where you left off can be costly! StreamingDataset now lets you resume immediately.
- NEW StreamingDataLoader
  A StreamingDataLoader is a drop-in replacement for your PyTorch DataLoader with mid-epoch resumption: it resumes from where you left off without spinning the dataloader.
- Support for Oracle Cloud Infrastructure (OCI) blob storage
  Streaming now supports OCI blob storage as a storage backend. Pass the OCI blob storage path to the StreamingDataset class as either
  oci://<bucket_name>@<namespace>/<folder_name>/<filename>
  or
  oci://<bucket_name>/<folder_name>/<filename>
  For example:

      from streaming import StreamingDataset

      remote = 'oci://<bucket>@<namespace>/<path>'
      local = '/tmp/dataset/'
      train_dataset = StreamingDataset(local=local, remote=remote, split='train')

  Streaming expects the credentials to be present in ~/.oci/config.
- Support for public AWS S3 buckets
  Streaming now supports public AWS S3 buckets, which can be accessed without credentials, in addition to the already supported private AWS S3 buckets. Instantiate the StreamingDataset class with an AWS S3 bucket as follows:

      from streaming import StreamingDataset

      remote = 's3://<bucket>/<path>'
      local = '/tmp/dataset/'
      train_dataset = StreamingDataset(local=local, remote=remote, split='train')
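The elastic, deterministic ordering described in the first feature above can be pictured with a small standard-library sketch. This is only a simplified illustration, not Streaming's actual shuffle algorithm: it ignores the canonical number of nodes, shards, and workers, and the names global_order, rank_samples, and collective_order are hypothetical. The idea it shows is that if the global order depends only on the seed and epoch, then the devices collectively visit the samples in the same order at any world size, including after a mid-epoch resume on a different number of devices.

```python
import random


def global_order(num_samples, seed, epoch):
    """Return one fixed sample order for a given (seed, epoch) pair."""
    # Hypothetical way to mix seed and epoch into a single integer seed.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_samples))
    rng.shuffle(order)
    return order


def rank_samples(order, rank, world_size, start=0):
    """Samples one device visits, skipping the first `start` consumed samples."""
    return order[start + rank::world_size]


def collective_order(order, world_size, start=0):
    """Interleave every device's samples step by step."""
    per_rank = [rank_samples(order, r, world_size, start) for r in range(world_size)]
    merged = []
    # Rank 0 always holds the longest slice, so it bounds the step count.
    for step in range(len(per_rank[0])):
        for samples in per_rank:
            if step < len(samples):
                merged.append(samples[step])
    return merged


order = global_order(num_samples=10, seed=7, epoch=0)

# The collective order is the same on 2 devices and on 4 devices.
assert collective_order(order, 2) == collective_order(order, 4) == order

# Consume 4 samples on 2 devices, then resume the same epoch on 4 devices.
resumed = collective_order(order, 2)[:4] + collective_order(order, 4, start=4)
assert resumed == order
```

Resuming here only needs the number of samples already consumed, which is why no replay of the earlier part of the epoch is required in this toy model.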
API changes
- The class Dataset has been renamed as class StreamingDataset (#37).
  - Similarly, the built-in classes for the most popular datasets have also been renamed. For example,
    - C4 renamed as StreamingC4
    - EnWiki renamed as StreamingEnWiki
    - Pile renamed as StreamingPile
    - ADE20K renamed as StreamingADE20K
    - CIFAR10 renamed as StreamingCIFAR10
    - COCO renamed as StreamingCOCO
    - ImageNet renamed as StreamingImageNet
- The parameter prefetch in class Dataset has been renamed as predownload in class StreamingDataset (#37).
- The parameter retry in class Dataset has been renamed as download_retry in class StreamingDataset (#37).
- The parameter timeout in class Dataset has been renamed as download_timeout in class StreamingDataset (#37).
- The parameter hash in class Dataset has been renamed as validate_hash in class StreamingDataset (#37).
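When updating existing call sites, the parameter renames listed above amount to a key mapping. The following is a minimal sketch, assuming your old Dataset arguments are held in a dict of keyword arguments; RENAMED_PARAMS and migrate_kwargs are hypothetical helpers and not part of Streaming. It covers only the names, so check the documentation for any change in defaults or accepted values.

```python
# Old Dataset parameter name -> new StreamingDataset parameter name (#37).
RENAMED_PARAMS = {
    'prefetch': 'predownload',
    'retry': 'download_retry',
    'timeout': 'download_timeout',
    'hash': 'validate_hash',
}


def migrate_kwargs(old_kwargs):
    """Return a copy of old_kwargs with renamed parameters updated.

    Hypothetical helper: parameters that were not renamed pass through as-is.
    """
    new_kwargs = {}
    for name, value in old_kwargs.items():
        new_name = RENAMED_PARAMS.get(name, name)
        if new_name in new_kwargs:
            # Both the old and the new spelling were supplied.
            raise ValueError(f'duplicate parameter after rename: {new_name}')
        new_kwargs[new_name] = value
    return new_kwargs


old = {'local': '/tmp/dataset/', 'split': 'train', 'prefetch': 1000, 'retry': 2}
print(migrate_kwargs(old))
```

The resulting dict could then be unpacked into the renamed class, for example StreamingDataset(**migrate_kwargs(old)).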
What's Changed
- Bump nbsphinx from 0.8.9 to 0.8.10 by @dependabot in #73
- Bump sphinx-argparse from 0.3.2 to 0.4.0 by @dependabot in #74
- The Pile (conversion + streaming dataset) by @knighton in #71
- [Docs] Switch back to RTD search by @bandish-shah in #83
- make pyright precommit check actually run by @dblalock in #84
- Fixed stale URL references by @bandish-shah in #85
- Bump sphinx-copybutton from 0.5.0 to 0.5.1 by @dependabot in #78
- Bump pandoc from 2.2 to 2.3 by @dependabot in #79
- Bump sphinxcontrib-katex from 0.9.0 to 0.9.3 by @dependabot in #80
- Bump sphinxext-opengraph from 0.7.2 to 0.7.3 by @dependabot in #81
- Support for concat option in C4 Dataset by @karan6181 in #77
- Elastic world size deterministic shuffle with mid-epoch resumption by @knighton in #37
- Support for S3 public bucket by @karan6181 in #88
- Add OCI Cloud Storage support by @karan6181 in #86
- Make StreamingDataset state_dict() more flexible by @knighton in #90
- Bump version to 0.2.0 by @karan6181 in #92
Full Changelog: v0.1.2...v0.2.0