v0.2.0

github-actions released this 09 Dec 06:44
1067f1b

🚀 Streaming v0.2.0

Streaming v0.2.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.0

New Features

  1. Elastic world size deterministic shuffle

    Whether or not shuffling is enabled, StreamingDataset now traverses the samples in the same global order across all devices, given a seed and a canonical number of nodes. This ordering holds even if you checkpoint and resume training of the same epoch on a different number of nodes.

  2. Instant Mid-Epoch Resumption

    Waiting while your data loader spins to resume from where you left off can be costly! StreamingDataset now lets you resume immediately.

  3. NEW StreamingDataLoader

    StreamingDataLoader is a drop-in replacement for the PyTorch DataLoader that adds mid-epoch resumption: it resumes from where you left off without spinning the data loader.

  4. Support for Oracle Cloud Infrastructure (OCI) blob storage

    Streaming now supports OCI blob storage as a storage backend. Pass the remote location to StreamingDataset as either oci://<bucket_name>@<namespace>/<folder_name>/<filename> or oci://<bucket_name>/<folder_name>/<filename>. For example:

    from streaming import StreamingDataset
    
    remote = 'oci://<bucket>@<namespace>/<path>'
    local = '/tmp/dataset/'
    
    train_dataset = StreamingDataset(local=local, remote=remote, split='train')

    Streaming expects the OCI credentials to be present at ~/.oci/config.

  5. Support for public AWS S3 buckets

    In addition to the already supported private AWS S3 buckets, Streaming now supports public AWS S3 buckets, which can be accessed without credentials. Instantiate StreamingDataset with an AWS S3 bucket as follows:

    from streaming import StreamingDataset
    
    remote = 's3://<bucket>/<path>'
    local = '/tmp/dataset/'
    
    train_dataset = StreamingDataset(local=local, remote=remote, split='train')
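
The remote URI shapes shown for OCI and S3 above can be illustrated with a small standard-library parser. This is only a sketch of how the pieces of such a URI relate to each other, not Streaming's own parsing code; the names RemoteLocation and split_remote are hypothetical and exist only for this example.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class RemoteLocation(NamedTuple):
    """Hypothetical container for the parts of a remote URI."""
    scheme: str               # 'oci' or 's3'
    bucket: str
    namespace: Optional[str]  # only used by OCI; None when omitted
    path: str                 # object path inside the bucket, no leading '/'


def split_remote(remote: str) -> RemoteLocation:
    """Split 'oci://<bucket>@<namespace>/<path>', 'oci://<bucket>/<path>'
    or 's3://<bucket>/<path>' into its parts (illustrative helper)."""
    parts = urlsplit(remote)
    if parts.scheme not in ('oci', 's3'):
        raise ValueError(f'unsupported scheme: {parts.scheme!r}')
    # For OCI the authority may be '<bucket>@<namespace>';
    # for S3 it is just '<bucket>'.
    bucket, sep, namespace = parts.netloc.partition('@')
    if sep and parts.scheme != 'oci':
        raise ValueError('only oci:// URIs may carry a namespace')
    return RemoteLocation(parts.scheme, bucket, namespace or None,
                          parts.path.lstrip('/'))
```

For instance, split_remote('oci://my-bucket/data/train') yields a namespace of None, while the '@' form carries the namespace explicitly. The bucket and namespace names here are placeholders.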
    

API changes

  • The class Dataset has been renamed to StreamingDataset (#37).
    • Similarly, the built-in classes for popular datasets have been renamed. For example,
      • C4 renamed as StreamingC4
      • EnWiki renamed as StreamingEnWiki
      • Pile renamed as StreamingPile
      • ADE20K renamed as StreamingADE20K
      • CIFAR10 renamed as StreamingCIFAR10
      • COCO renamed as StreamingCOCO
      • ImageNet renamed as StreamingImageNet
  • The parameter prefetch in class Dataset has been renamed as predownload in class StreamingDataset (#37).
  • The parameter retry in class Dataset has been renamed as download_retry in class StreamingDataset (#37).
  • The parameter timeout in class Dataset has been renamed as download_timeout in class StreamingDataset (#37).
  • The parameter hash in class Dataset has been renamed as validate_hash in class StreamingDataset (#37).
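
When migrating call sites from v0.1.x, the parameter renames listed above can be applied mechanically. The sketch below is a hypothetical helper (migrate_kwargs is not part of Streaming) that rewrites a dictionary of old Dataset keyword arguments into the new StreamingDataset spelling, assuming the four renames above are the only ones that matter for your code.

```python
# Old `Dataset` keyword -> new `StreamingDataset` keyword, as listed above.
RENAMED_KWARGS = {
    'prefetch': 'predownload',
    'retry': 'download_retry',
    'timeout': 'download_timeout',
    'hash': 'validate_hash',
}


def migrate_kwargs(old_kwargs: dict) -> dict:
    """Return a copy of old_kwargs with v0.1.x names replaced by v0.2.0 names.

    Hypothetical helper for illustration; keys that were not renamed
    are passed through unchanged.
    """
    new_kwargs = {}
    for name, value in old_kwargs.items():
        new_name = RENAMED_KWARGS.get(name, name)
        if new_name in new_kwargs:
            # Refuse to guess when both the old and new spelling are present.
            raise ValueError(
                f'both old and new spellings given for {new_name!r}')
        new_kwargs[new_name] = value
    return new_kwargs
```

The result can then be unpacked into the renamed class, for example StreamingDataset(**migrate_kwargs(old_kwargs)). Raising on a duplicate spelling is a design choice of this sketch, so that a half-migrated call site fails loudly instead of silently picking one value.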

What's Changed

Full Changelog: v0.1.2...v0.2.0