
Critical path to a formal pull request upstream #21

Closed
8 tasks done
bnlawrence opened this issue Jan 6, 2025 · 9 comments
Comments


bnlawrence commented Jan 6, 2025

Minutes of our meeting on the necessary steps before the upstream pull request:

  • @bnlawrence Add the pseudo chunking option to the contiguous storage with keyword control over chunk size.
  • @davidhassell to convince himself that this will work in cf-python (we are trying to ensure we don't have to make any foreseeable changes here, because they could complicate the move upstream).
  • @valeriupredoi will review our new pyactivestorage API (as above, to avoid foreseeable changes).
  • @bnlawrence to merge V's remote http branch into the h5netcdf branch and archive the others.
  • We had to keep the files open in DatasetID to satisfy h5netcdf unit tests, but we really don't want to do that at scale. @davidhassell is going to look at the relevant ticket (Why not caching h5py dataset? h5netcdf/h5netcdf#251) on h5netcdf (@bnlawrence will create another branch on our pyfive which can be used for re-exposing the issue).
  • @bnlawrence to update our main branch and then open a pull request onto it, so that @davidhassell and @valeriupredoi can do a code review (but we won't merge onto our main; we'll make the upstream pull request from our branch).
  • @valeriupredoi will have a look at adding unit tests for the new h5d.py code.
  • @valeriupredoi will look into how we can properly incorporate S3 and remote HTTP testing, and discuss this upstream.
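The pseudo-chunking in the first action above can be sketched as follows. This is an illustrative helper only, assuming the idea is to tile a contiguous dataset into fixed-size read regions; the function name and `chunk_shape` keyword are hypothetical, not the eventual pyfive API:

```python
from itertools import product


def iter_pseudo_chunks(shape, chunk_shape):
    """Yield slice tuples that tile a contiguous dataset of `shape`
    into pseudo-chunks of at most `chunk_shape` along each axis."""
    if len(shape) != len(chunk_shape):
        raise ValueError("shape and chunk_shape must have the same rank")
    # Start offsets along each axis, e.g. extent 5 with chunk 4 -> 0, 4
    starts = [range(0, n, c) for n, c in zip(shape, chunk_shape)]
    for origin in product(*starts):
        # Clip each pseudo-chunk at the dataset boundary
        yield tuple(
            slice(o, min(o + c, n))
            for o, c, n in zip(origin, chunk_shape, shape)
        )


# Example: a 5x6 dataset tiled into 4x4 pseudo-chunks gives 4 tiles.
tiles = list(iter_pseudo_chunks((5, 6), (4, 4)))
```

Each tile can then be read with a single contiguous byte-range request, which is the point of pseudo-chunking over object storage.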
@bnlawrence bnlawrence added this to the h5netcdf ready milestone Jan 6, 2025
bnlawrence commented:

Some ticket updates:

bnlawrence commented:

"Bryan will archive the others" ... I'm just going to delete them, but record their hashes here so we can get them back in the unlikely event we ever need to:


bnlawrence commented Jan 9, 2025

Actions from today's meeting, where most of our conversation was around the expected behaviour when threading:

  • I need to write up "the problem" which David has exposed (I killed off the action above; this is the new one).
  • @davidhassell to create a simple unit test with just Dask and pyfive
  • @valeriupredoi to create a new branch in pyfive, based on the h5netcdf branch, which has his whizzy new S3 testing stuff.
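A minimal shape for the threading unit test might look like the sketch below, using only the standard library with a plain binary file standing in for a pyfive dataset (neither Dask nor pyfive is assumed installed here). Each worker opens its own handle, mimicking the per-thread open/close we currently do in DatasetID:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# A stand-in dataset: 100 bytes whose values equal their offsets.
payload = bytes(range(100))
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)


def read_slice(start, stop):
    # Each thread opens its own handle, so there is no shared
    # seek pointer to race on.
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(stop - start)


with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(read_slice, i * 25, (i + 1) * 25) for i in range(4)]
    results = [fut.result() for fut in futures]

assert b"".join(results) == payload
os.unlink(path)
```

The real test would replace `read_slice` with a pyfive `__getitem__` driven from Dask tasks, but the concurrency shape is the same.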


bnlawrence commented Jan 9, 2025

The problems I think we have are encapsulated here:

```python
import pyfive
import s3fs

# Note: an S3 endpoint URL is passed via client_kwargs; the first
# positional argument of S3FileSystem is not a URL.
s3 = s3fs.S3FileSystem(client_kwargs={"endpoint_url": "http://some-s3-server/"})

# thread zone 1
with s3.open('my-bucket/my-file.txt', 'rb') as f:
    with pyfive.File(f) as hfile:
        uwind = hfile['zonal_velocity']
    ### thread zone 2
    r = uwind[x:y]  # where x and y are thread dependent
    ###
rr = uwind[xx:yy]
## end of zone 2
```
  1. We know that threading around r requires us to open and close the posix file f within each thread's DatasetID instance, so we would assume we have to do the same thing for S3. The question then arises: these handles all share the same s3 parent, and we are not doing real seeks, so does it work? What happens if several threads request different blocks at the "same time"? How thread safe is this? What is going on with caching?
  2. What about threading around rr? We assume that is effectively the same problem (all the real caching happens in the s3 instance, not the f instance).

We assume that threading higher up the stack would be ok, albeit expensive with caching etc.
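One reason to hope the S3 case is benign: an S3 GET with a byte range names its offset explicitly, rather than advancing a shared file position. The POSIX analogue is `os.pread`, sketched below with a local file; this is an analogy for the access pattern, not a claim about s3fs internals:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# 1024 bytes of known content standing in for an object in a bucket.
payload = bytes(range(256)) * 4
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)

shared_fd = os.open(path, os.O_RDONLY)


def range_read(i):
    # pread reads at an absolute offset without touching the shared
    # file position, so many threads can share one descriptor safely,
    # much as range requests share one S3 "parent".
    return os.pread(shared_fd, 128, i * 128)


with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(range_read, range(8)))

os.close(shared_fd)
os.unlink(path)
assert b"".join(chunks) == payload
```

Whether s3fs's block cache preserves this property under concurrent access is exactly the question we need to answer empirically.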

bnlawrence commented:

Progress update:

bnlawrence commented:

Pseudo chunking delivered in f450776.

Killed the relevant branch.

bnlawrence commented:

We have a show-stopper issue: variable-length strings. I was aware of this (but thought we could live without it for now, #16), but @davidhassell has shown it is a real problem for real data we use: #29.

bnlawrence commented:

(Vlen support dealt with.)

bnlawrence commented:

Pull request submitted.
