Canonical File Transformations #585
Conversation
Can you please also add the delta branch name under branches for files, so that CI runs for the delta branch when you create a PR.
- index_download_num_procs
- index_download_procs_per_cpu
- index_download_max_procs
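These three knobs presumably resolve to a single worker count for index downloading. A minimal sketch of one plausible resolution order (explicit count wins, then a per-CPU ratio, capped by a hard maximum); the function name and precedence are assumptions, not the PR's actual logic:

```python
import os


def resolve_index_download_procs(num_procs=None, procs_per_cpu=None, max_procs=None):
    """Hypothetical: collapse the three knobs into one process count.

    num_procs:     explicit process count; wins if set (assumed semantics).
    procs_per_cpu: scale the count with the number of available CPUs.
    max_procs:     hard upper bound applied last.
    """
    if num_procs is not None:
        procs = num_procs
    elif procs_per_cpu is not None:
        procs = max(1, int(procs_per_cpu * (os.cpu_count() or 1)))
    else:
        procs = os.cpu_count() or 1
    if max_procs is not None:
        procs = min(procs, max_procs)
    return max(1, procs)
```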
LGTM.
Some testing issues, in addition to these comments.
Canonical file transformations
1. StreamingDataset
1.1. Organization of StreamingDataset init args
epoch_size
streams
remote
local
split
index_size <- new
index_hashes <- new
allow_schema_mismatch <- new
allow_unsafe_types
allow_unchecked_resumption <- new
download_retry
download_timeout
download_max_size <- new
validate_hash
keep_phases <- new
predownload
cache_limit
shuffle_seed
sampling_method
sampling_granularity
partition_algo
num_canonical_nodes
batch_size
shuffle
shuffle_algo
shuffle_seed
shuffle_block_size
batching_method
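Read as a constructor, the grouping above might look like the following sketch. The arg names come from the list; the grouping comments, types, and defaults are illustrative assumptions, not the PR's actual values:

```python
from typing import Any, Optional, Sequence


class StreamingDataset:
    """Sketch of the reorganized init args, grouped as in the outline."""

    def __init__(
        self,
        *,
        # What to stream.
        epoch_size: Optional[int] = None,
        streams: Optional[Sequence[Any]] = None,
        remote: Optional[str] = None,
        local: Optional[str] = None,
        split: Optional[str] = None,
        # Index integrity (new).
        index_size: Optional[int] = None,
        index_hashes: Optional[dict] = None,
        # Safety checks.
        allow_schema_mismatch: bool = False,
        allow_unsafe_types: bool = False,
        allow_unchecked_resumption: bool = False,
        # Downloading and caching.
        download_retry: int = 2,
        download_timeout: float = 60,
        download_max_size: Optional[int] = None,
        validate_hash: Optional[str] = None,
        keep_phases: Optional[Any] = None,
        predownload: Optional[int] = None,
        cache_limit: Optional[int] = None,
        # Sampling.
        # (shuffle_seed appears under both sampling and shuffling in the
        # outline; a real signature can list it only once.)
        shuffle_seed: int = 9176,
        sampling_method: str = 'balanced',
        sampling_granularity: int = 1,
        partition_algo: str = 'orig',
        num_canonical_nodes: Optional[int] = None,
        batch_size: Optional[int] = None,
        # Shuffling.
        shuffle: bool = False,
        shuffle_algo: str = 'py1e',
        shuffle_block_size: Optional[int] = None,
        batching_method: str = 'random',
    ) -> None:
        ...
```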
1.2. StreamingDataset interface
2. Stream
2.1. Stream divided into three shockingly modularizable parts
Stream still has pretty much the same API, imported from the same place. However, under the hood, it inherits most of its functionality from two base classes or mixins: StreamWeightConf (everything to do with weighting streams) and StreamDirConf (handling for all other stream args). What remains for Stream itself to do is index loading and owning the shards.
2.2. Stream > Shard > StreamDirConf > Stream > ...
The shard API is rewritten to be a lot nicer, with methods like download() and evict() which take no args. It does this by keeping a reference back to its owning StreamDirConf, which mostly consists of Stream arguments that are shared by all its Shards. And recalling that StreamDirConf is an ancestor of Stream, which technically owns its Shards, we now cue triumphant "it's the circle of life" Lion King music.
Alternative 1: look up the owning Stream and pass large numbers of its arguments to every call to a Shard method, which would be unpleasant and would not help anything.
Alternative 2: have the functionality live in Stream, and look up the shard on every call to a shard method, which would be unworkable because there is a class hierarchy of Shards.
Alternative 3: do it functionally, which would present the same annoyances around Shard subclassing, with the added annoyance that everything needed from both the stream and the shard would have to be passed in as args.
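The back-reference pattern described above can be sketched as follows. The class names come from the PR; the fields and method bodies are illustrative stand-ins, not the actual implementation:

```python
class StreamDirConf:
    """Holds the Stream args shared by all of a Stream's Shards (sketch)."""

    def __init__(self, remote: str, local: str, download_timeout: float = 60):
        self.remote = remote
        self.local = local
        self.download_timeout = download_timeout


class Shard:
    """Keeps a back-reference to its owning conf, so methods take no args."""

    def __init__(self, conf: StreamDirConf, name: str):
        self.conf = conf  # the back-reference: "it's the circle of life"
        self.name = name

    def download(self) -> str:
        # All shared settings come from the conf, not from call arguments.
        return f'{self.conf.remote}/{self.name} -> {self.conf.local}/{self.name}'

    def evict(self) -> None:
        ...  # would delete this shard's local files under self.conf.local


class Stream(StreamDirConf):
    """Stream inherits from StreamDirConf and owns its Shards."""

    def __init__(self, remote: str, local: str):
        super().__init__(remote, local)
        self.shards = [Shard(self, 'shard.00000.mds')]
```

Because StreamDirConf is an ancestor of Stream, passing the Stream itself as the conf closes the loop without Shard methods ever needing to know about Stream's full interface.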
2.3. New Stream args replicated into StreamingDataset
Title.
2.4. Stream init sequence is redesigned to parallelize index downloading
Although there is a faster way within our reach here: using pool.imap_unordered, noting the IDs of Streams whose indexes have been downloaded in a shared memory array, and initting Streams in that order. Future PR.
2.5. StreamDirConf interface (inherited by Stream)
2.6. StreamWeightConf interface (inherited by Stream)
2.7. Stream interface (inherits from StreamDirConf and StreamWeightConf)
3. Shard
3.1. The new 3-phase shard lifecycle
(Zip != Raw != Can) = shard file.
3.2. Terminology for prepping shards
You fetch, you access, then finally you evict.
Fetching consists of downloading and unpacking.
Unpacking consists of decompressing and canonicalizing.
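The terminology above composes as a pipeline: fetch = download + unpack, and unpack = decompress + canonicalize. A toy sketch (the file suffixes and stub bodies are made up for illustration, not the PR's actual naming):

```python
def download(remote: str) -> str:
    """Pull the compressed (zip) phase of a shard file to the local cache."""
    return remote + '.zip'  # stub: pretend we downloaded it


def decompress(zip_path: str) -> str:
    """Zip phase -> raw phase."""
    return zip_path.removesuffix('.zip')


def canonicalize(raw_path: str) -> str:
    """Raw phase -> canonical phase, optimized for random access."""
    return raw_path + '.can'


def unpack(zip_path: str) -> str:
    """Unpacking consists of decompressing and canonicalizing."""
    return canonicalize(decompress(zip_path))


def fetch(remote: str) -> str:
    """Fetching consists of downloading and unpacking."""
    return unpack(download(remote))
```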
3.3. Mapping use cases to phases to cache
3.4. Phaser (i.e., shard file phase cacher-deleter) API
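As a rough mental model of a "phase cacher-deleter", the Phaser would decide, per shard file, which phases to keep cached and which to delete. This sketch is a guess at the shape of that API, not the PR's actual interface:

```python
from enum import Enum


class Phase(Enum):
    """The three possible forms of a shard file."""
    ZIP = 'zip'  # compressed, as downloaded
    RAW = 'raw'  # decompressed
    CAN = 'can'  # canonicalized for fast random access


class Phaser:
    """Hypothetical sketch: given which phases to keep, decide what to delete."""

    def __init__(self, keep):
        self.keep = set(keep)  # e.g. {Phase.CAN} to keep only canonical files

    def cull(self, present):
        """Return the phases currently on disk that should be deleted."""
        return set(present) - self.keep
```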
3.5. Relating shard formats, shard file phases, and use cases in practice
3.6. Writers and Shards correspond, as before
Writers write Streaming dataset directories, one shard at a time.
Shards read individual shards of Streaming dataset directories.
3.7. Shard composition
A shard is realized as one or more files.
A file has from one to three forms, which are called phases.
At a high level, the phases all contain the same information, but stored differently for different use cases, e.g. minimizing download size vs minimizing random access time.
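The shard > file > phase composition described above might be modeled like this. The class names come from the outline; the fields are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ShardFilePhase:
    """One form of a shard file on disk (fields assumed for illustration)."""
    filename: str          # e.g. 'shard.00000.mds.zstd'
    size: Optional[int] = None


@dataclass
class ShardFile:
    """A shard file and its one to three phases (zip, raw, canonical)."""
    zip: Optional[ShardFilePhase] = None
    raw: Optional[ShardFilePhase] = None
    can: Optional[ShardFilePhase] = None

    def phases(self) -> List[ShardFilePhase]:
        """The phases this file actually has, in lifecycle order."""
        return [p for p in (self.zip, self.raw, self.can) if p is not None]


@dataclass
class Shard:
    """A shard is realized as one or more files."""
    files: List[ShardFile] = field(default_factory=list)
```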
3.8. Shard interface (internal)
3.9. ShardFile interface (internal)
3.10. ShardFilePhase interface (internal)
4. Appendix
Source tree changes