
Critical path to a formal pull request upstream #21

Closed
8 tasks done
bnlawrence opened this issue Jan 6, 2025 · 9 comments
Comments


bnlawrence commented Jan 6, 2025

Minutes of our meeting on the necessary steps before the upstream pull request:

  • @bnlawrence Add the pseudo chunking option to the contiguous storage with keyword control over chunk size.
  • @davidhassell to convince himself that this will work in cf-python (we are trying to ensure we don't have to make any foreseeable changes here, because they could complicate the move upstream).
  • @valeriupredoi will review our new pyactivestorage API (as above, to avoid foreseeable changes).
  • @bnlawrence to merge V's remote http branch into the h5netcdf branch and archive the others.
  • We had to keep the files open in DatasetID to satisfy h5netcdf unit tests, but we really don't want to do that at scale. @davidhassell is going to look at the relevant ticket (Why not caching h5py dataset? h5netcdf/h5netcdf#251) on h5netcdf (@bnlawrence will create another branch on our pyfive which can be used for re-exposing the issue).
  • @bnlawrence to update our main branch and then open a pull request onto it, so that @davidhassell and @valeriupredoi can do a code review (but we won't merge onto our main; we'll make the upstream pull request from our branch).
  • @valeriupredoi will have a look at adding unit tests for the new h5d.py code.
  • @valeriupredoi will look into how we can properly incorporate S3 and remote HTTP testing, and discuss this upstream.
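The pseudo-chunking in the first action above can be sketched as follows. This is an illustrative helper only, assuming the idea is to tile a contiguous dataset into fixed-size read regions; the function name and `chunk_shape` keyword are hypothetical, not the eventual pyfive API:

```python
from itertools import product


def iter_pseudo_chunks(shape, chunk_shape):
    """Yield slice tuples that tile a contiguous dataset of `shape`
    into pseudo-chunks of at most `chunk_shape` along each axis."""
    if len(shape) != len(chunk_shape):
        raise ValueError("shape and chunk_shape must have the same rank")
    # Start offsets along each axis, e.g. extent 5 with chunk 4 -> 0, 4
    starts = [range(0, n, c) for n, c in zip(shape, chunk_shape)]
    for origin in product(*starts):
        # Clip each pseudo-chunk at the dataset boundary
        yield tuple(
            slice(o, min(o + c, n))
            for o, c, n in zip(origin, chunk_shape, shape)
        )


# Example: a 5x6 dataset tiled into 4x4 pseudo-chunks gives 4 tiles.
tiles = list(iter_pseudo_chunks((5, 6), (4, 4)))
```

Each tile can then be read with a single contiguous byte-range request, which is the point of pseudo-chunking over object storage.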
@bnlawrence bnlawrence added this to the h5netcdf ready milestone Jan 6, 2025
bnlawrence commented:

Some ticket updates:

bnlawrence commented:

"Bryan will archive the others" ... I'm just going to delete them, but record their hashes here so we can get them back in the unlikely event we ever need to:


bnlawrence commented Jan 9, 2025

Actions from today's meeting, where most of our conversation was around the expected behaviour when threading:

  • I need to write up "the problem" which David has exposed (I killed off the action above; this is the new one).
  • @davidhassell to create a simple unit test with just Dask and pyfive
  • @valeriupredoi to create a new branch in pyfive, based on the h5netcdf branch, which has his whizzy new S3 testing stuff.
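A minimal shape for the threading unit test might look like the sketch below, using only the standard library with a plain binary file standing in for a pyfive dataset (neither Dask nor pyfive is assumed installed here). Each worker opens its own handle, mimicking the per-thread open/close we currently do in DatasetID:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# A stand-in dataset: 100 bytes whose values equal their offsets.
payload = bytes(range(100))
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)


def read_slice(start, stop):
    # Each thread opens its own handle, so there is no shared
    # seek pointer to race on.
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(stop - start)


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(read_slice, i * 25, (i + 1) * 25) for i in range(4)]
    results = [fut.result() for fut in futures]

assert b"".join(results) == payload
os.unlink(path)
```

The real test would replace `read_slice` with a pyfive `__getitem__` driven from Dask tasks, but the concurrency shape is the same.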


bnlawrence commented Jan 9, 2025

The problems I think we have are encapsulated here:

```python
import pyfive
import s3fs

# Note: an S3 endpoint URL is passed via client_kwargs; the first
# positional argument of S3FileSystem is not a URL.
s3 = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "http://some-s3-server/"})

# thread zone 1
with s3.open('my-bucket/my-file.txt', 'rb') as f:
    with pyfive.File(f) as hfile:
        uwind = hfile['zonal_velocity']
    ### thread zone 2
    r = uwind[x:y]  # where x and y are thread dependent
    ###
rr = uwind[xx:yy]
## end of zone 2
```
  1. We know that threading around r requires us to open and close the posix file f within each thread's DatasetID instance, so we would assume we have to do the same thing for S3. The question then arises: these handles all share the same s3 parent, and we are not doing real seeks, so does it work? What happens if several threads request different blocks at the "same time"? How thread safe is this? What is going on with caching?
  2. What about threading around rr? We assume that is effectively the same problem (all the real caching happens in the s3 instance, not the f instance).

We assume that threading higher up the stack would be ok, albeit expensive with caching etc.
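One reason to hope the S3 case is benign: an S3 GET with a byte range names its offset explicitly, rather than advancing a shared file position. The POSIX analogue is `os.pread`, sketched below with a local file; this is an analogy for the access pattern, not a claim about s3fs internals:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# 1024 bytes of known content standing in for an object in a bucket.
payload = bytes(range(256)) * 4
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

shared_fd = os.open(path, os.O_RDONLY)


def range_read(i):
    # pread reads at an absolute offset without touching the shared
    # file position, so many threads can share one descriptor safely,
    # much as range requests share one S3 "parent".
    return os.pread(shared_fd, 128, i * 128)


with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(range_read, range(8)))

os.close(shared_fd)
os.unlink(path)
assert b"".join(chunks) == payload
```

Whether s3fs's block cache preserves this property under concurrent access is exactly the question we need to answer empirically.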

bnlawrence commented:

Progress update:

bnlawrence commented:

Pseudo chunking delivered in f450776.

Killed the relevant branch.

bnlawrence commented:

We have a show-stopper issue: variable-length strings. I was aware of this (but thought we could live without it for now, #16), but @davidhassell has shown it is a real problem for real data we use: #29.

bnlawrence commented:

(Vlen support dealt with.)

bnlawrence commented:

Pull request submitted.
