Improve naming: JSON shards are actually JSONL, etc. #537

knighton · 2023-12-15T15:14:26Z

In this fleeting moment of Christmastime joy in which we Purge the dev branch,

JSON shards have been misnamed from the beginning: they are actually JSONL. We can fix that while maintaining backwards compatibility with all existing serialized datasets. Let's take this opportunity (i.e., a single big change when dev is merged, which is already priced in) to do so.
Formerly there were Reader/Writer base classes, each with two sub-base classes: JointReader/JointWriter and SplitReader/SplitWriter. If you watch GitHub file access patterns for long enough, you start to see trends indicating that people very reasonably, but very mistakenly, believe this has something to do with dataset splits (it is actually about whether a given shard format is stored in a single file or split over multiple files). Now the base classes are Shard/Writer, with sub-base classes MonoShard/MonoWriter and DualShard/DualWriter, which eliminates that confusion.
Nobody outside of the Streaming repo ever deals in individual shards. Furthermore, the relevant classes are not, and have never been, publicly exposed. This is a safe move.
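
For readers unfamiliar with the JSON vs. JSONL distinction, here is a minimal sketch. The sample data, `read_jsonl`, `is_single_json_document`, and the `FORMAT_ALIASES` table are all hypothetical illustrations, not code from this PR; the alias table is only one way backwards compatibility with already-serialized datasets could be kept.

```python
import json

# JSONL: one JSON value per line. The file as a whole is NOT one JSON document.
jsonl_text = '{"id": 0, "text": "a"}\n{"id": 1, "text": "b"}\n'


def is_single_json_document(text):
    """Return True if the whole text parses as exactly one JSON value."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def read_jsonl(text):
    """Parse newline-delimited JSON into a list of samples."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical alias table (an assumption, not the PR's mechanism): format
# names stored in existing datasets resolve to the corrected name.
FORMAT_ALIASES = {'json': 'jsonl'}


def canonical_format(name):
    """Map a legacy format name to its current name; pass others through."""
    return FORMAT_ALIASES.get(name, name)
```

With this sketch, `read_jsonl(jsonl_text)` yields two samples, while parsing the same text as a single JSON document fails, which is why "JSON" was the wrong name for these shards.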

* Stdize docstrings, also fix ordering of get_sample_data, decode_sample.
* Terminology: "joint" -> "mono".
* "split" -> "dual" to stop confusing people (SplitWriter != dataset splits)
* "Reader" -> "Shard". They manage shards. They do more than read.
* Fix filenames accordingly.
* Finally, JSON -> JSONL.
* Switch order of decorators...
* Fix markdown code.
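
The "mono" vs. "dual" terminology can be pictured roughly as follows. This is a sketch only: the class names come from the description, but the fields, the `filenames` method, and the assumption that "dual" means exactly a data file plus a metadata file are hypothetical and not taken from the Streaming source.

```python
from dataclasses import dataclass
from typing import List


class Shard:
    """Base class: manages one shard, however many files it occupies."""

    def filenames(self) -> List[str]:
        raise NotImplementedError


@dataclass
class MonoShard(Shard):
    """A shard stored in a single file (formerly "joint")."""

    data_file: str  # hypothetical field name

    def filenames(self) -> List[str]:
        return [self.data_file]


@dataclass
class DualShard(Shard):
    """A shard stored across two files (formerly "split").

    Assumption: one file for sample data and one for metadata.
    """

    data_file: str  # hypothetical field name
    meta_file: str  # hypothetical field name

    def filenames(self) -> List[str]:
        return [self.data_file, self.meta_file]
```

Neither class says anything about train/val dataset splits, which is the confusion the rename is meant to remove.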

knighton added 8 commits December 15, 2023 05:42

Stdize docstrings, also fix ordering of get_sample_data, decode_sample.

ba2ae1a

Terminology: "joint" -> "mono".

7ba2e32

"split" -> "dual" to stop confusing people (SplitWriter != dataaset s…

6199644

…plits)

"Reader" -> "Shard". They manage shards. They do more than read.

e2f3a83

Fix filenames accordingly.

7cf0ef3

Finally, JSON -> JSONL.

3b742c0

Switch order of decorators...

9fab950

Fix markdown code.

a830f53

knighton requested review from karan6181 and bandish-shah as code owners December 15, 2023 15:31

knighton merged commit 3972c9d into dev Dec 15, 2023
5 checks passed

knighton deleted the james/shard-lingo branch December 15, 2023 15:47
