
to_zarr_custom creates large temporary folders #53

Closed
tcompa opened this issue Jun 22, 2022 · 8 comments
Assignees: tcompa
Labels: Backlog (Backlog issues we may eventually fix, but aren't a priority)

Comments

tcompa (Collaborator) commented Jun 22, 2022

Branching from fractal-analytics-platform/fractal-client#62 (comment)

We currently use a custom function to allow overwrite=True in dask.array.to_zarr(), due to the issue described in dask/dask#5942. Our current code is

import shutil


def to_zarr_custom(newzarrurl=None, array=None, component="", overwrite=False):

    [...]

    if overwrite:
        # Write the array to a temporary component first ...
        tmp_suffix = "_TEMPORARY_ZARR_ARRAY"
        array.to_zarr(
            newzarrurl,
            component=component + tmp_suffix,
            dimension_separator="/",
        )
        # ... then remove the original component and move the temporary one
        # into its place.
        shutil.rmtree(newzarrurl + component)
        shutil.move(
            newzarrurl + component + tmp_suffix, newzarrurl + component
        )
    else:
        array.to_zarr(newzarrurl, component=component, dimension_separator="/")

Each time our custom to_zarr_custom is called with overwrite=True, it first writes a level to a temporary zarr subfolder, then removes the original one, and finally moves the temporary one to the original path.

In a realistic case with 23 wells of about 45G each (say the highest-resolution level takes ~30G per well), roughly 700G of temporary folders are created, hopefully not all at the same time. When disk space is tight, when several tasks happen to finish around the same time (unlikely), or for even larger datasets, creating such huge temporary folders could become a problem.

Proposed solution: when overwrite=True, we may first remove the original subfolder and then write the new one directly into its correct path.
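
A minimal sketch of this proposal, reusing the path conventions of to_zarr_custom above (the function name and the existence check are illustrative, not the actual implementation):

import os
import shutil


def to_zarr_overwrite_in_place(array, newzarrurl, component=""):
    # Remove the existing component first ...
    target = newzarrurl + component
    if os.path.exists(target):
        shutil.rmtree(target)
    # ... then write directly into its final path, with no temporary copy.
    # Trade-off: if the write fails midway, the original data is already gone
    # (see the discussion below).
    array.to_zarr(newzarrurl, component=component, dimension_separator="/")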

tcompa self-assigned this Jun 22, 2022
tcompa changed the title from "Custom to_zarr has creates large temporary folders" to "to_zarr_custom creates large temporary folders" Jun 22, 2022
jluethi (Collaborator) commented Jun 27, 2022

This overhead could become quite limiting in large experiments, as you describe. Still, I'd be a bit worried about a solution that removes the originals before the new data has been saved. We should have a fail state that allows the user to rerun the task, or the parts of the task that failed. Otherwise, if say a cluster node fails at a bad moment, the experiment loses data.

Is this actually still an issue in dask? The corresponding PR mentions that they believe the issue doesn't exist anymore: dask/dask#7379

tcompa (Collaborator, Author) commented Jun 27, 2022

> Is this actually still an issue in dask? The corresponding PR mentions that they believe the issue doesn't exist anymore: dask/dask#7379

The PR was abandoned (and rightly so, as it would amount to loading the whole dataset into memory before saving to disk, if I remember correctly), and the original issue is still open. The last message reads "[..] I think there will be a problem for array larger than memory, because if i understood correctly the graph all data must be loaded before to start storage.", and there has been no further activity since.

Thus yes, I'd say this is still an open issue.
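
For illustration, here is a minimal sketch of the failure mode as I understand it from dask/dask#5942 (lazily reading from a zarr store and writing back to the same location with overwrite=True); the store name and the exact symptoms are assumptions, and this is not the reproducer posted on that issue:

import dask.array as da
import numpy as np

# Write an initial array to a local zarr store.
original = da.from_array(np.arange(16).reshape(4, 4), chunks=2)
original.to_zarr("example.zarr", component="0", dimension_separator="/")

# Lazily read it back and derive a new array from it.
source = da.from_zarr("example.zarr", component="0")
result = source + 1  # any lazy computation that reads from the store

# Writing back to the same component requires overwrite=True, but then the
# destination is cleared before all of the lazy reads have executed, so the
# stored result can end up corrupted instead of equal to source + 1.
result.to_zarr(
    "example.zarr", component="0", dimension_separator="/", overwrite=True
)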

jluethi (Collaborator) commented Jun 27, 2022

How much effort would it be to create a small test case showing this issue, and then report back with it to the dask issue (or maybe better, open a new one referencing this)? I think that issue stalled because there wasn't a clear test to see it fail.
Or maybe we should look for other people who have already created workarounds for this? We can't be the only ones hitting this.

tcompa (Collaborator, Author) commented Jun 27, 2022

I saw this comment right after pinging dask people: dask/dask#5942 (comment)

> How much effort would it be to create a small test case showing this issue, and then report back with it to the dask issue (or maybe better, open a new one referencing this)? I think that issue stalled because there wasn't a clear test to see it fail.

There already is a very clear test to see it fail: dask/dask#5942 (comment).
Still, if someone answers on that issue, I am happy to create a new test (starting from here). It's not strictly needed, but I'll do it if it helps keep the discussion alive.

> Or maybe we should look for other people who have already created workarounds for this?

The only workaround that has been shared is the one in the (abandoned) PR, and my understanding is that it doesn't scale to arrays larger than memory.
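
For context, a rough sketch of why a load-then-write workaround does not scale (this is how I read the approach discussed in that PR, not its actual code; array, newzarrurl and component are reused from the snippet above):

import dask.array as da

# Materialize the full array eagerly: this removes the read/overwrite race,
# because nothing is read lazily from the store afterwards, but it requires
# the entire array (tens of GB per well in our case) to fit in memory.
in_memory = array.compute()

# Writing the in-memory result back with overwrite=True is then safe.
da.from_array(in_memory).to_zarr(
    newzarrurl, component=component, dimension_separator="/", overwrite=True
)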

tcompa (Collaborator, Author) commented Jul 29, 2022

To be re-assessed in view of #20.

jluethi added the Backlog label (Backlog issues we may eventually fix, but aren't a priority) Aug 4, 2022
tcompa (Collaborator, Author) commented Sep 1, 2022

To be re-assessed in view of #27.

jacopo-exact transferred this issue from fractal-analytics-platform/fractal-client Sep 6, 2022
tcompa (Collaborator, Author) commented Sep 19, 2022

This was removed from several tasks as part of #79 and #80.
#83 will close this issue.

tcompa (Collaborator, Author) commented Sep 26, 2022

Closed with #95. Let's re-open if any weird behavior appears.

tcompa closed this as completed Sep 26, 2022
Repository owner moved this from TODO to Done in Fractal Project Management Sep 26, 2022
jluethi moved this from Done to Done Archive in Fractal Project Management Oct 5, 2022