Improve naming: JSON shards are actually JSONL, etc. #537

Merged
merged 8 commits into from
Dec 15, 2023

@knighton knighton commented Dec 15, 2023

In this fleeting moment of Christmastime joy in which we Purge the dev branch,

  • JSON shards were named incorrectly from the beginning. They are actually JSONL. We can fix that while maintaining backwards compatibility with all existing serialized datasets. Let's take this opportunity (i.e., a single big change when dev is merged, which is already priced in) to do so.

  • Formerly there were Reader/Writer base classes, with two sub-base classes each: JointReader/JointWriter and SplitReader/SplitWriter. If you watch GitHub file access patterns for long enough, you start to see trends indicative of people very reasonably but very mistakenly believing this has something to do with dataset splits (it is actually about whether a given shard format is stored as a single file or split over multiple files). Now, it's Shard/Writer -> MonoShard/MonoWriter and DualShard/DualWriter, which eliminates that possibility of confusion.

  • Nobody outside of the Streaming repo ever deals in individual shards. Furthermore, the relevant classes are not and have never been publicly exposed. This is a safe move.
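
To illustrate the first point, here is a minimal sketch of why the data is JSONL rather than JSON, and of one way a legacy format name could keep resolving after the rename. The names `FORMAT_ALIASES`, `canonical_format`, `encode_jsonl`, and `decode_jsonl` are hypothetical and are not taken from the Streaming codebase; the actual compatibility mechanism in this PR may differ.

```python
import json

# Hypothetical alias table: the legacy name resolves to the corrected one,
# so metadata written by older versions would still load.
FORMAT_ALIASES = {'json': 'jsonl'}


def canonical_format(name: str) -> str:
    """Return the corrected format name for a possibly-legacy name."""
    return FORMAT_ALIASES.get(name, name)


def encode_jsonl(samples: list) -> str:
    """Serialize samples as JSONL: one JSON object per line."""
    return ''.join(json.dumps(sample) + '\n' for sample in samples)


def decode_jsonl(text: str) -> list:
    """Parse JSONL text back into a list of samples."""
    return [json.loads(line) for line in text.splitlines() if line]


samples = [{'a': 1}, {'b': 2}]
text = encode_jsonl(samples)

# Each line is valid JSON, but the file as a whole is not one JSON document.
try:
    json.loads(text)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```

The try/except at the end is the crux of the naming fix: a file of newline-delimited objects parses line by line, while a plain JSON parser rejects it as a whole.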

@knighton knighton merged commit 3972c9d into dev Dec 15, 2023
5 checks passed
@knighton knighton deleted the james/shard-lingo branch December 15, 2023 15:47
karan6181 pushed a commit that referenced this pull request Jan 26, 2024
* Stdize docstrings, also fix ordering of get_sample_data, decode_sample.

* Terminology: "joint" -> "mono".

* "split" -> "dual" to stop confusing people (SplitWriter != dataset splits)

* "Reader" -> "Shard". They manage shards. They do more than read.

* Fix filenames accordingly.

* Finally, JSON -> JSONL.

* Switch order of decorators...

* Fix markdown code.
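
The renames above ("Reader" -> "Shard", "joint" -> "mono", "split" -> "dual") can be pictured with the sketch below. It is an illustration only: the class bodies, the `filenames` method, and the `.meta` suffix are assumptions made for this example, not the Streaming implementation. The point is that mono versus dual describes how many files back a single shard and has nothing to do with dataset splits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    """Hypothetical base class: a shard knows which files back it."""

    basename: str

    def filenames(self) -> List[str]:
        raise NotImplementedError


class MonoShard(Shard):
    """A shard stored as a single file (formerly 'joint')."""

    def filenames(self) -> List[str]:
        return [self.basename]


class DualShard(Shard):
    """A shard stored across two files (formerly 'split').

    The '.meta' suffix is an assumed name for the second file.
    """

    def filenames(self) -> List[str]:
        return [self.basename, self.basename + '.meta']


mono = MonoShard('shard.00000.mds')
dual = DualShard('shard.00000.jsonl')
```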