Improve naming: JSON shards are actually JSONL, etc. #537
In this fleeting moment of Christmastime joy in which we Purge the dev branch:

JSON shards were named incorrectly from the beginning. They are actually JSONL. We can fix that while maintaining backwards compatibility with all existing serialized datasets. Let's take this opportunity (i.e., a single big change when dev is merged, which is already priced-in) to do so.
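To make the JSON vs. JSONL distinction concrete, here is a minimal sketch using only the Python standard library. The `FORMAT_ALIASES` table and `canonical_format` helper are hypothetical names invented for illustration; they show one way an old `json` format name could keep resolving after the rename, and are not necessarily how this PR does it.

```python
import json

# One JSON object per line is JSONL; the payload as a whole is not valid JSON.
samples = [{'id': 0, 'text': 'hello'}, {'id': 1, 'text': 'world'}]
jsonl_payload = ''.join(json.dumps(sample) + '\n' for sample in samples)


def is_single_json_document(payload: str) -> bool:
    """Return True if the whole payload parses as one JSON document."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


# Decoding line by line recovers the samples.
decoded = [json.loads(line) for line in jsonl_payload.splitlines()]

# Hypothetical alias table (an assumption, not the PR's actual mechanism):
# datasets serialized under the old name map to the corrected name.
FORMAT_ALIASES = {'json': 'jsonl'}


def canonical_format(name: str) -> str:
    """Map a legacy format name to its corrected name, if it has one."""
    return FORMAT_ALIASES.get(name, name)
```

With an alias like this, an index written before the rename would still load, while newly written datasets could use the corrected name.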
Formerly there were Reader/Writer base classes, with two sub-base classes each: JointReader/JointWriter and SplitReader/SplitWriter. If you watch GitHub file access patterns for long enough, you start to see trends indicative of people very reasonably but very mistakenly believing this has something to do with dataset splits (it is actually about whether a given shard format is stored in a single file or split over multiple files). Now, it's Shard/Writer -> MonoShard/MonoWriter and DualShard/DualWriter, which eliminates that possibility of confusion.
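As a rough illustration of the Mono/Dual distinction, the sketch below models only the shard side. The class names come from this PR, but the fields, the `files()` method, and the example file names are assumptions made for illustration and do not reflect the real implementation.

```python
from dataclasses import dataclass
from typing import List


class Shard:
    """Base class: a shard knows which files it is stored in."""

    def files(self) -> List[str]:
        raise NotImplementedError


@dataclass
class MonoShard(Shard):
    """A shard format stored entirely in a single file."""

    data_file: str  # hypothetical field name

    def files(self) -> List[str]:
        return [self.data_file]


@dataclass
class DualShard(Shard):
    """A shard format stored across two files (here: data plus metadata)."""

    data_file: str  # hypothetical field name
    meta_file: str  # hypothetical field name

    def files(self) -> List[str]:
        return [self.data_file, self.meta_file]
```

Under these assumptions, `MonoShard('shard.00000.mds').files()` yields one path and `DualShard('shard.00000.jsonl', 'shard.00000.jsonl.meta').files()` yields two, which has nothing to do with train/val dataset splits.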
Nobody outside of the Streaming repo ever deals in individual shards. Furthermore, the relevant classes are not and have never been publicly exposed. This is a safe move.