TinyStories

This tutorial shows how to use NeMo Curator's Python API to curate the TinyStories dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4, using only words that 3- to 4-year-olds typically understand. Because the dataset is small, it is well suited to creating and validating data curation pipelines on a local machine.

For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.

Walkthrough

For a detailed walkthrough of this tutorial, please see the following blog post:

Usage

After installing the NeMo Curator package, run the following command:

python tutorials/tinystories/main.py

This downloads the validation split of the TinyStories dataset and then starts the data curation pipeline.
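
To give a rough idea of what a curation pipeline over short stories involves, here is a minimal sketch in plain Python. It does not use NeMo Curator's API and is not what `main.py` does internally. The function names (`split_stories`, `filter_short`, `drop_exact_duplicates`, `curate`), the `min_words` threshold, and the `<|endoftext|>` story separator are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator, List, Set

# Assumption: stories in the raw text are separated by this token.
SEPARATOR = "<|endoftext|>"


def split_stories(raw_text: str, separator: str = SEPARATOR) -> List[str]:
    """Split one large text blob into individual, whitespace-stripped stories."""
    return [part.strip() for part in raw_text.split(separator) if part.strip()]


def filter_short(stories: Iterable[str], min_words: int = 5) -> Iterator[str]:
    """Keep only stories with at least `min_words` whitespace-separated words."""
    for story in stories:
        if len(story.split()) >= min_words:
            yield story


def drop_exact_duplicates(stories: Iterable[str]) -> Iterator[str]:
    """Yield each distinct story once, keeping first occurrences in order."""
    seen: Set[str] = set()
    for story in stories:
        digest = hashlib.sha256(story.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield story


def curate(raw_text: str, min_words: int = 5) -> List[str]:
    """Run the three hypothetical stages in order: split, filter, deduplicate."""
    stories = split_stories(raw_text)
    return list(drop_exact_duplicates(filter_short(stories, min_words)))


if __name__ == "__main__":
    sample = (
        "Once upon a time there was a cat.<|endoftext|>"
        "Too short.<|endoftext|>"
        "Once upon a time there was a cat.<|endoftext|>"
        "Lily went to the park with her dog."
    )
    for story in curate(sample):
        print(story)
```

Each stage is a small generator, so stages can be reordered or swapped out independently, which is the same general idea behind composing steps in a curation pipeline.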