Add functionality to load/save distisets to/from disk (#673)
* Add functionality to load/save distisets to/from disk

* Add tests for saving/loading distiset from disk

* Add functionality to load/save distisets to/from disk

* Update docs

* Include code blocks from Examples in docstrings

* Add tests for the dataset card

* Fix call to yaml.safe_load found in code review

* Copy path movements from hugging face load_from_disk definition

* Add universal_pathlib dependency to better deal with remote paths when calling Distiset.load_from_disk

* Fix download of distiset and add option to write the data to a user specified dir

* Remove parameter in test as it isn't really tested with a remote filesystem

* Remove unnecessary markdown extension and fix type from variables

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Update src/distilabel/distiset.py

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>

* Cast Path to str

---------

Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
plaguss and gabrielmbmb authored May 29, 2024
1 parent bce7da1 commit 7e9230b
Showing 4 changed files with 435 additions and 40 deletions.
25 changes: 25 additions & 0 deletions docs/sections/learn/advanced/distiset.md
@@ -70,6 +70,31 @@ distiset.push_to_hub(
)
```

### Save and load from disk

Save the [`Distiset`][distilabel.distiset.Distiset] to disk. By default, the dataset card, the pipeline config file and the logs are saved alongside it; each one can be disabled with its corresponding flag:

```python
distiset.save_to_disk(
"my-dataset",
save_card=True,
save_pipeline_config=True,
save_pipeline_log=True
)
```

Load a [`Distiset`][distilabel.distiset.Distiset] that was saved with [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk] in the same way:

```python
from distilabel.distiset import Distiset

distiset = Distiset.load_from_disk("my-dataset")
```

Note that these methods work like `datasets.load_from_disk` and `datasets.Dataset.save_to_disk`, and the arguments are passed directly to those functions. This means you can also use the `storage_options` argument to save your [`Distiset`][distilabel.distiset.Distiset] to your cloud provider, including the distilabel artifacts (`pipeline.yaml`, `pipeline.log` and the `README.md` with the dataset card). You can read more in the `datasets` documentation [here](https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets).

Take a look at the remaining arguments at [`Distiset.save_to_disk`][distilabel.distiset.Distiset.save_to_disk].
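
As a rough sketch of the `storage_options` usage described above, the snippet below saves a `Distiset` to a remote path and loads it back. The helper names (`build_s3_storage_options`, `roundtrip`), the environment variable names and the bucket path are hypothetical and only for illustration. The `key`/`secret` entries follow the `s3fs` convention shown in the `datasets` documentation; other providers expect different keys.

```python
import os


def build_s3_storage_options() -> dict:
    """Collect S3 credentials from the environment.

    Assumption: the variable names below are illustrative, and the
    "key"/"secret" entries follow the s3fs convention.
    """
    return {
        "key": os.environ.get("AWS_ACCESS_KEY_ID", ""),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
    }


def roundtrip(distiset, path: str, storage_options: dict):
    """Save `distiset` to `path` and load it back with the same options."""
    # Writes the dataset plus, by default, the card, pipeline config and log.
    distiset.save_to_disk(path, storage_options=storage_options)
    # For a Distiset instance this is the same as Distiset.load_from_disk(...).
    return type(distiset).load_from_disk(path, storage_options=storage_options)


# Hypothetical usage, assuming `distiset` came from a pipeline run:
# reloaded = roundtrip(
#     distiset, "s3://my-bucket/my-dataset", build_s3_storage_options()
# )
```

Reading the credentials from the environment keeps them out of the script. The same `storage_options` dictionary is passed to both calls, so the load reads from wherever the save wrote.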

## Dataset card

Having this special type of dataset comes with an added advantage when calling [`Distiset.push_to_hub`][distilabel.distiset.Distiset], which is the automatically generated dataset card in the Hugging Face Hub. Note that it is enabled by default, but can be disabled by setting `generate_card=False`:
1 change: 1 addition & 0 deletions pyproject.toml
@@ -34,6 +34,7 @@ dependencies = [
"typer >= 0.9.0",
"tblib >= 3.0.0",
"orjson >= 3.10.0",
"universal_pathlib >= 0.2.2",
]
dynamic = ["version"]
