
to_zarr_custom creates large temporary folders #53

Closed
tcompa opened this issue Jun 22, 2022 · 8 comments
Assignees: tcompa
Labels: Backlog (Backlog issues we may eventually fix, but aren't a priority)

Comments

tcompa (Collaborator) commented Jun 22, 2022

Branching from fractal-analytics-platform/fractal-client#62 (comment)

We currently use a custom function to allow overwrite=True in dask.array.to_zarr(), due to the issue described in dask/dask#5942. Our current code is

import shutil


def to_zarr_custom(newzarrurl=None, array=None, component="", overwrite=False):

    [...]

    if overwrite:
        # Write the array to a temporary component first ...
        tmp_suffix = "_TEMPORARY_ZARR_ARRAY"
        array.to_zarr(
            newzarrurl,
            component=component + tmp_suffix,
            dimension_separator="/",
        )
        # ... then remove the original component and move the temporary one
        # into its place.
        shutil.rmtree(newzarrurl + component)
        shutil.move(
            newzarrurl + component + tmp_suffix, newzarrurl + component
        )
    else:
        array.to_zarr(newzarrurl, component=component, dimension_separator="/")

Each time our custom to_zarr_custom is called with overwrite=True, it first writes a level to a temporary zarr subfolder, then removes the original one, and finally moves the temporary one to the original path.

In a realistic case with 23 wells of about 45G each (say the highest-resolution level takes ~30G per well), roughly 700G of temporary folders are created, hopefully not all at the same time. When disk space is tight, when several tasks happen to finish around the same time (unlikely), or for even larger datasets, creating such huge temporary folders could become a problem.

Proposed solution: when overwrite=True, we may first remove the original subfolder and then write the new one directly into its correct path.
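
A minimal sketch of this proposal, reusing the path conventions of to_zarr_custom above (the function name and the existence check are illustrative, not the actual implementation):

import os
import shutil


def to_zarr_overwrite_in_place(array, newzarrurl, component=""):
    # Remove the existing component first ...
    target = newzarrurl + component
    if os.path.exists(target):
        shutil.rmtree(target)
    # ... then write directly into its final path, with no temporary copy.
    # Trade-off: if the write fails midway, the original data is already gone
    # (see the discussion below).
    array.to_zarr(newzarrurl, component=component, dimension_separator="/")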

tcompa self-assigned this Jun 22, 2022
tcompa changed the title from "Custom to_zarr has creates large temporary folders" to "to_zarr_custom creates large temporary folders" Jun 22, 2022
jluethi (Collaborator) commented Jun 27, 2022

This overhead could become quite limiting in large experiments, as you describe. Still, I'd be a bit worried about a solution that removes the originals before the new data has been saved. We should have a fail state that allows the user to rerun the task, or the parts of the task that failed. Otherwise, if say a cluster node fails at a bad moment, the experiment loses data.

Is this actually still an issue in dask? The corresponding PR mentions that they believe the issue doesn't exist anymore: dask/dask#7379

tcompa (Collaborator, Author) commented Jun 27, 2022

> Is this actually still an issue in dask? The corresponding PR mentions that they believe the issue doesn't exist anymore: dask/dask#7379

The PR was abandoned (and rightly so, as it would amount to loading the whole dataset into memory before saving to disk, if I remember correctly), and the original issue is still open. The last message reads "[..] I think there will be a problem for array larger than memory, because if i understood correctly the graph all data must be loaded before to start storage.", and there has been no further activity since.

Thus yes, I'd say this is still an open issue.
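
For illustration, here is a minimal sketch of the failure mode as I understand it from dask/dask#5942 (lazily reading from a zarr store and writing back to the same location with overwrite=True); the store name and the exact symptoms are assumptions, and this is not the reproducer posted on that issue:

import dask.array as da
import numpy as np

# Write an initial array to a local zarr store.
original = da.from_array(np.arange(16).reshape(4, 4), chunks=2)
original.to_zarr("example.zarr", component="0", dimension_separator="/")

# Lazily read it back and derive a new array from it.
source = da.from_zarr("example.zarr", component="0")
result = source + 1  # any lazy computation that reads from the store

# Writing back to the same component requires overwrite=True, but then the
# destination is cleared before all of the lazy reads have executed, so the
# stored result can end up corrupted instead of equal to source + 1.
result.to_zarr(
    "example.zarr", component="0", dimension_separator="/", overwrite=True
)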

jluethi (Collaborator) commented Jun 27, 2022

How much effort would it be to create a small test case showing this issue, and then report back with it to the dask issue (or maybe better, open a new one referencing this)? I think that issue stalled because there wasn't a clear test to see it fail.
Or maybe we should look for other people who have already created workarounds for this? We can't be the only ones hitting this.

tcompa (Collaborator, Author) commented Jun 27, 2022

I saw this comment right after pinging dask people: dask/dask#5942 (comment)

> How much effort would it be to create a small test case showing this issue, and then report back with it to the dask issue (or maybe better, open a new one referencing this)? I think that issue stalled because there wasn't a clear test to see it fail.

There already is a very clear test to see it fail: dask/dask#5942 (comment).
Still, if someone answers on that issue, I am happy to create a new test (starting from here). It's not strictly needed, but I'll do it if it helps keep the discussion alive.

> Or maybe we should look for other people who have already created workarounds for this?

The only workaround that has been shared is the one in the (abandoned) PR, and my understanding is that it doesn't scale to arrays larger than memory.
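
For context, a rough sketch of why a load-then-write workaround does not scale (this is how I read the approach discussed in that PR, not its actual code; array, newzarrurl and component are reused from the snippet above):

import dask.array as da

# Materialize the full array eagerly: this removes the read/overwrite race,
# because nothing is read lazily from the store afterwards, but it requires
# the entire array (tens of GB per well in our case) to fit in memory.
in_memory = array.compute()

# Writing the in-memory result back with overwrite=True is then safe.
da.from_array(in_memory).to_zarr(
    newzarrurl, component=component, dimension_separator="/", overwrite=True
)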

tcompa (Collaborator, Author) commented Jul 29, 2022

To be re-assessed in view of #20.

jluethi added the Backlog label (Backlog issues we may eventually fix, but aren't a priority) Aug 4, 2022
tcompa (Collaborator, Author) commented Sep 1, 2022

To be re-assessed in view of #27.

jacopo-exact transferred this issue from fractal-analytics-platform/fractal-client Sep 6, 2022
tcompa (Collaborator, Author) commented Sep 19, 2022

This was removed from several tasks as part of #79 and #80.
#83 will close this issue.

tcompa (Collaborator, Author) commented Sep 26, 2022

Closed with #95. Let's re-open if any weird behavior appears.

tcompa closed this as completed Sep 26, 2022
Repository owner moved this from TODO to Done in Fractal Project Management Sep 26, 2022
jluethi moved this from Done to Done Archive in Fractal Project Management Oct 5, 2022