
Improving API: part 1: functionality for input pyfive.high_level.Dataset #241

Merged
merged 30 commits into from
Feb 28, 2025

Conversation

valeriupredoi
Collaborator

@valeriupredoi valeriupredoi commented Feb 25, 2025

Description

Contribution towards #231

This allows an object of type pyfive.high_level.Dataset to be passed to Active, e.g. (the test I included):

    uri = "tests/test_data/cesm2_native.nc"
    ncvar = "TREFHT"
    ds = pyfive.File(uri)[ncvar]
    av = Active(ds)
    av._method = "min"
    assert av.method([3,444]) == 3
    av_slice_min = av[3:5]
    assert av_slice_min == np.array(258.62814, dtype="float32")

Before you get started

Checklist

  • This pull request has a descriptive title and labels
  • This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
  • Unit tests have been added (if codecov test fails)
  • Any changed dependencies have been added or removed correctly (if need be)
  • All tests pass

@valeriupredoi valeriupredoi changed the base branch from main to pyfive February 25, 2025 16:51
@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@valeriupredoi valeriupredoi changed the title Improving API Improving API: part 1: functionality for input pyfive.high_level.Dataset Feb 27, 2025
@valeriupredoi valeriupredoi added the enhancement New feature or request label Feb 27, 2025
@bnlawrence
Collaborator

Before I review the code itself, for me the big issue is the high level API.

In #241 we suggested that we could just do things like av.mean(...), but didn't spell out that the various methods (like mean) would have to be explicitly included in the library API. In the snippet above, you've suggested we do what we used to do, which is assign a method (av._method="min") and then call av.method(...).

Given the library code itself has to explicitly implement mean in both the python and reductionist stacks, and we have an ongoing wacasoft discussion about how we can evolve the methods which are supported, I think it's ok to be very explicit and fully document all the supported methods directly without the indirection of setting a method and using the "method" attribute.

That means, I'd prefer to see us explicitly allowing av.mean, av.min, av.max, av.count for now, accepting we will need to add new methods as we get storage support.
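The explicit API proposed above could be sketched roughly as follows. This is a hypothetical illustration, not the library's actual implementation: the class and helper names are invented, and the reductions run locally in memory, whereas the real library would push them down to the storage layer (python or reductionist stacks).

```python
import numpy as np

class Active:
    """Illustrative stand-in: applies an explicit, named reduction to a slice.

    Hypothetical sketch only -- the real Active class dispatches the
    reduction to storage rather than computing it locally.
    """

    def __init__(self, data):
        self._data = np.asarray(data)

    def _reduce(self, func, index):
        # Slice first, then reduce; the real library would do both remotely.
        return func(self._data[index])

    def min(self, index):
        return self._reduce(np.min, index)

    def max(self, index):
        return self._reduce(np.max, index)

    def mean(self, index):
        return self._reduce(np.mean, index)

    def count(self, index):
        return self._reduce(np.size, index)

av = Active([3.0, 444.0, 2.5, 100.0])
print(av.min(slice(0, 2)))    # 3.0
print(av.count(slice(None)))  # 4
```

Each supported reduction is a named, documented method, so adding storage support for a new reduction means adding (and documenting) one new method, with no _method indirection.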

@valeriupredoi
Collaborator Author

> Before I review the code itself, for me the big issue is the high level API.
>
> In #241 we suggested that we could just do things like av.mean(...), but didn't spell out that the various methods (like mean) would have to be explicitly included in the library API. In the snippet above, you've suggested we do what we used to do, which is assign a method (av._method="min") and then call av.method(...).
>
> Given the library code itself has to explicitly implement mean in both the python and reductionist stacks, and we have an ongoing wacasoft discussion about how we can evolve the methods which are supported, I think it's ok to be very explicit and fully document all the supported methods directly without the indirection of setting a method and using the "method" attribute.
>
> That means, I'd prefer to see us explicitly allowing av.mean, av.min, av.max, av.count for now, accepting we will need to add new methods as we get storage support.

yes, that is indeed on the TODO list - I was planning on tackling that in a part 2 PR, i.e. review and merge this first, then part 2 (explicit stats), and then part 3 (clean-up: get rid of all those versions etc) 🍺

    def __load_nc_file(self):
-        """ Get the netcdf file and it's b-tree"""
+        """ Get the netcdf file and its b-tree"""
        ncvar = self.ncvar
        # in all cases we need an open netcdf file to get at attributes
Collaborator

I am not sure it's an open netcdf file any more, is it?

Collaborator Author

done in 61cf5dd

@@ -310,10 +307,6 @@ def _get_selection(self, *args):
# hopefully fix pyfive to get a dtype directly
Collaborator

I am not sure the docstring for this method, and this comment are quite right any more.

Collaborator Author

done in 4cdeb1b

@@ -362,13 +355,6 @@ def _from_storage(self, ds, indexer, chunks, out_shape, out_dtype, compressor, f
# Because we do this, we need to read the dataset b-tree now, not as we go, so
# it is already in cache. If we remove the thread pool from here, we probably
# wouldn't need to do it before the first one.
Collaborator

This comment is irrelevant now?

Collaborator

> This comment is irrelevant now?

That's my understanding

Collaborator Author

done in 9aca259

chunk = chunk.reshape(-1, order='A')
chunk = chunk.reshape(shape, order=order)
else:
class storeinfo: pass
Collaborator

Why don't we import the class so it's more obvious what it's for?

Collaborator Author

I tried this just now, there is unfortunately flakiness involved:

E       AssertionError: assert 265.90347 == array(258.62814, dtype=float32)
E        +  where array(258.62814, dtype=float32) = <built-in function array>(258.62814, dtype='float32')
E        +    where <built-in function array> = np.array

tests/unit/test_active.py:104: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
Treating input <HDF5 dataset "TREFHT": shape (12, 4, 8), type "float32"> as variable object.
Reducing chunk of object <class 'pyfive.high_level.Dataset'>
Reducing chunk of object <class 'pyfive.high_level.Dataset'>XXX StI byte offset 12394 StI size 128
XXX 
StI byte offset 12522actual offset  StI size12394 actual size 128
 128
actual offset 12522 actual size 128

with this implementation:

            from pyfive.h5d import StoreInfo as storeinfo
            storeinfo.byte_offset = offset
            storeinfo.size = size
            print("XXX", "StI byte offset", storeinfo.byte_offset, "StI size", storeinfo.size)
            print("actual offset", offset, "actual size", size)

Collaborator Author

StoreInfo does not seem to be safe the way it is implemented at the moment: its values get frozen in at times, and don't update on the go
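The flaky behaviour above is consistent with mutating attributes on a class object rather than on instances: every assignment overwrites a single shared value, so concurrent chunk reductions can see whichever offset was written last. A minimal sketch of the pitfall (illustrative only; StoreInfoLike and SafeStoreInfo are invented names, not pyfive's actual StoreInfo):

```python
# Setting attributes on the class means all "copies" share one value.
class StoreInfoLike:
    byte_offset = None
    size = None

# Chunk A records its location on the class ...
StoreInfoLike.byte_offset = 12394
a = StoreInfoLike

# ... then chunk B overwrites the very same class attribute.
StoreInfoLike.byte_offset = 12522
b = StoreInfoLike

print(a.byte_offset, b.byte_offset)  # 12522 12522 -- chunk A's offset is lost

# Per-instance state avoids the clash: each chunk gets its own record.
class SafeStoreInfo:
    def __init__(self, byte_offset, size):
        self.byte_offset = byte_offset
        self.size = size

a = SafeStoreInfo(12394, 128)
b = SafeStoreInfo(12522, 128)
print(a.byte_offset, b.byte_offset)  # 12394 12522
```

With a thread pool issuing chunk reads concurrently, the shared-class version makes the result depend on interleaving, which would explain the intermittently wrong reduced value in the captured test output.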

Collaborator

@bnlawrence bnlawrence left a comment

Most of my comments are about documentation. This looks good to go as part of a sequence of events.

Collaborator

@davidhassell davidhassell left a comment

Thanks V - just some very minor suggestions


valeriupredoi and others added 5 commits February 28, 2025 12:36
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
av._method = "min"
assert av.method([3,444]) == 3
av_slice_min = av[3:5]
assert av_slice_min == np.array(249.6583, dtype="float32")
Collaborator Author

@bnlawrence @davidhassell this test works a treat, many thanks for reminding me about my own work that I forgot 🤣 I am still keeping the actual real-world test with Reductionist for now though, just so we are fully covered, until the end (i.e. when we're done with the work on Pyfive)

@valeriupredoi valeriupredoi merged commit 5177b06 into pyfive Feb 28, 2025
9 checks passed
@valeriupredoi valeriupredoi deleted the new_api_pyfive branch February 28, 2025 13:53