
Improving API: part 1: functionality for input pyfive.high_level.Dataset #241

Merged
merged 30 commits into from
Feb 28, 2025

Conversation

valeriupredoi
Collaborator

@valeriupredoi valeriupredoi commented Feb 25, 2025

Description

Contribution towards #231

This allows an object of type pyfive.high_level.Dataset to be passed to Active, e.g. (the test I included):

    uri = "tests/test_data/cesm2_native.nc"
    ncvar = "TREFHT"
    ds = pyfive.File(uri)[ncvar]
    av = Active(ds)
    av._method = "min"
    assert av.method([3,444]) == 3
    av_slice_min = av[3:5]
    assert av_slice_min == np.array(258.62814, dtype="float32")

Before you get started

Checklist

  • This pull request has a descriptive title and labels
  • This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
  • Unit tests have been added (if codecov test fails)
  • Any changed dependencies have been added or removed correctly (if need be)
  • All tests pass

@valeriupredoi valeriupredoi changed the base branch from main to pyfive February 25, 2025 16:51
@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@valeriupredoi valeriupredoi changed the title Improving API Improving API: part 1: functionality for input pyfive.high_level.Dataset Feb 27, 2025
@valeriupredoi valeriupredoi added the enhancement New feature or request label Feb 27, 2025
@bnlawrence
Collaborator

Before I review the code itself, for me the big issue is the high level API.

In #241 we suggested that we could just do things like av.mean(...), but didn't spell out that the various methods (like mean) would have to be explicitly included in the library API. In the snippet above, you've suggested we do what we used to do, which is assign a method (av._method="min") and then call av.method(...).

Given the library code itself has to explicitly implement mean in both the python and reductionist stacks, and we have an ongoing wacasoft discussion about how we can evolve the methods which are supported, I think it's ok to be very explicit and fully document all the supported methods directly without the indirection of setting a method and using the "method" attribute.

That means, I'd prefer to see us explicitly allowing av.mean, av.min, av.max, av.count for now, accepting we will need to add new methods as we get storage support.
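The explicit API proposed above could be sketched roughly as follows. This is a hypothetical illustration, not the library's actual implementation: the class and helper names are invented, and the reductions run locally in memory, whereas the real library would push them down to the storage layer (python or reductionist stacks).

```python
import numpy as np

class Active:
    """Illustrative stand-in: applies an explicit, named reduction to a slice.

    Hypothetical sketch only -- the real Active class dispatches the
    reduction to storage rather than computing it locally.
    """

    def __init__(self, data):
        self._data = np.asarray(data)

    def _reduce(self, func, index):
        # Slice first, then reduce; the real library would do both remotely.
        return func(self._data[index])

    def min(self, index):
        return self._reduce(np.min, index)

    def max(self, index):
        return self._reduce(np.max, index)

    def mean(self, index):
        return self._reduce(np.mean, index)

    def count(self, index):
        return self._reduce(np.size, index)

av = Active([3.0, 444.0, 2.5, 100.0])
print(av.min(slice(0, 2)))    # 3.0
print(av.count(slice(None)))  # 4
```

Each supported reduction is a named, documented method, so adding storage support for a new reduction means adding (and documenting) one new method, with no _method indirection.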

@valeriupredoi
Collaborator Author

> Before I review the code itself, for me the big issue is the high level API.
>
> In #241 we suggested that we could just do things like av.mean(...), but didn't spell out that the various methods (like mean) would have to be explicitly included in the library API. In the snippet above, you've suggested we do what we used to do, which is assign a method (av._method="min") and then call av.method(...).
>
> Given the library code itself has to explicitly implement mean in both the python and reductionist stacks, and we have an ongoing wacasoft discussion about how we can evolve the methods which are supported, I think it's ok to be very explicit and fully document all the supported methods directly without the indirection of setting a method and using the "method" attribute.
>
> That means, I'd prefer to see us explicitly allowing av.mean, av.min, av.max, av.count for now, accepting we will need to add new methods as we get storage support.

yes, that is indeed on the TODO list - I was planning on tackling that in a part 2 PR, i.e. review and merge this first, then part 2 (explicit stats), and then part 3 (clean-up: get rid of all those versions etc) 🍺

    def __load_nc_file(self):
-        """ Get the netcdf file and it's b-tree"""
+        """ Get the netcdf file and its b-tree"""
        ncvar = self.ncvar
        # in all cases we need an open netcdf file to get at attributes
Collaborator

I am not sure it's an open netcdf file any more, is it?

Collaborator Author

done in 61cf5dd

@@ -310,10 +307,6 @@ def _get_selection(self, *args):
# hopefully fix pyfive to get a dtype directly
Collaborator

I am not sure the docstring for this method, and this comment are quite right any more.

Collaborator Author

done in 4cdeb1b

@@ -362,13 +355,6 @@ def _from_storage(self, ds, indexer, chunks, out_shape, out_dtype, compressor, f
# Because we do this, we need to read the dataset b-tree now, not as we go, so
# it is already in cache. If we remove the thread pool from here, we probably
# wouldn't need to do it before the first one.
Collaborator

This comment is irrelevant now?

Collaborator

> This comment is irrelevant now?

That's my understanding

Collaborator Author

done in 9aca259

chunk = chunk.reshape(-1, order='A')
chunk = chunk.reshape(shape, order=order)
else:
class storeinfo: pass
Collaborator

Why don't we import the class so it's more obvious what it's for?

Collaborator Author

I tried this just now, there is unfortunately flakiness involved:

E       AssertionError: assert 265.90347 == array(258.62814, dtype=float32)
E        +  where array(258.62814, dtype=float32) = <built-in function array>(258.62814, dtype='float32')
E        +    where <built-in function array> = np.array

tests/unit/test_active.py:104: AssertionError
----------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------
Treating input <HDF5 dataset "TREFHT": shape (12, 4, 8), type "float32"> as variable object.
Reducing chunk of object <class 'pyfive.high_level.Dataset'>
Reducing chunk of object <class 'pyfive.high_level.Dataset'>XXX StI byte offset 12394 StI size 128
XXX 
StI byte offset 12522actual offset  StI size12394 actual size 128
 128
actual offset 12522 actual size 128

with this implementation:

            from pyfive.h5d import StoreInfo as storeinfo
            storeinfo.byte_offset = offset
            storeinfo.size = size
            print("XXX", "StI byte offset", storeinfo.byte_offset, "StI size", storeinfo.size)
            print("actual offset", offset, "actual size", size)

Collaborator Author

StoreInfo does not seem to be safe the way it is implemented at the moment: its values get frozen in at times, and don't update on the go
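The flaky behaviour above is consistent with mutating attributes on a class object rather than on instances: every assignment overwrites a single shared value, so concurrent chunk reductions can see whichever offset was written last. A minimal sketch of the pitfall (illustrative only; StoreInfoLike and SafeStoreInfo are invented names, not pyfive's actual StoreInfo):

```python
# Setting attributes on the class means all "copies" share one value.
class StoreInfoLike:
    byte_offset = None
    size = None

# Chunk A records its location on the class ...
StoreInfoLike.byte_offset = 12394
a = StoreInfoLike

# ... then chunk B overwrites the very same class attribute.
StoreInfoLike.byte_offset = 12522
b = StoreInfoLike

print(a.byte_offset, b.byte_offset)  # 12522 12522 -- chunk A's offset is lost

# Per-instance state avoids the clash: each chunk gets its own record.
class SafeStoreInfo:
    def __init__(self, byte_offset, size):
        self.byte_offset = byte_offset
        self.size = size

a = SafeStoreInfo(12394, 128)
b = SafeStoreInfo(12522, 128)
print(a.byte_offset, b.byte_offset)  # 12394 12522
```

With a thread pool issuing chunk reads concurrently, the shared-class version makes the result depend on interleaving, which would explain the intermittently wrong reduced value in the captured test output.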

Collaborator

@bnlawrence bnlawrence left a comment

Most of my comments are about documentation. This looks good to go as part of a sequence of events.

Collaborator

@davidhassell davidhassell left a comment

Thanks V - just some very minor suggestions


valeriupredoi and others added 5 commits February 28, 2025 12:36
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
Co-authored-by: David Hassell <davidhassell@users.noreply.github.com>
av._method = "min"
assert av.method([3,444]) == 3
av_slice_min = av[3:5]
assert av_slice_min == np.array(249.6583, dtype="float32")
Collaborator Author

@bnlawrence @davidhassell this test works a treat, many thanks for reminding me about my own work that I forgot 🤣 I am still keeping the actual real-world test with Reductionist for now though, just so we are fully covered, until the end (i.e. when we're done with the work on Pyfive)

@valeriupredoi valeriupredoi merged commit 5177b06 into pyfive Feb 28, 2025
9 checks passed
@valeriupredoi valeriupredoi deleted the new_api_pyfive branch February 28, 2025 13:53