to_zarr_custom creates large temporary folders #53
Comments
This overhead could become quite limiting in large experiments, as you describe. I'd be a bit worried about a solution that removes the originals before the new data has been saved. We should have a fail state that allows the user to rerun the task (or the parts of the task that failed); otherwise, if say a cluster node fails at a bad moment, the experiment loses data. Is this actually still an issue in dask? The corresponding PR mentions that they believe the issue doesn't exist anymore: dask/dask#7379
The PR was abandoned (and rightly so, as it would correspond to loading the whole data into memory before saving to disk, if I remember correctly), and the original issue is still open. The last message reads "[..] I think there will be a problem for array larger than memory, because if i understood correctly the graph all data must be loaded before to start storage.", and there has been no further activity since. Thus yes, I'd say this is still an open issue.
How much effort would it be to create a small test case we could use to show this issue, and report back with that to the zarr issue (or, maybe better, create a new one referencing this)? I think that issue stalled because there wasn't a clear test to see it fail.
I saw this comment right after pinging dask people: dask/dask#5942 (comment)
There is a very clear test to see it fail: dask/dask#5942 (comment).
The only one that was shared is the one in the (abandoned) PR, and my understanding is that it doesn't scale to large files.
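For context, my understanding of the failure mode in dask/dask#5942 is that a dask array cannot safely be written back to the same zarr store it is lazily read from. A minimal sketch of that situation, as an illustration of this understanding rather than the exact test from the linked comment (the file name, shape, and chunking are arbitrary):

```python
import numpy as np
import zarr
import dask.array as da

# Create a small zarr array on disk and open it lazily with dask.
zarr.save_array("data.zarr", np.arange(16, dtype="float64").reshape(4, 4), chunks=(2, 2))
x = da.from_zarr("data.zarr")

# Write a result derived from x back onto the same store. With overwrite=True
# the target array is recreated before the lazy source chunks are read, so the
# computation can consume already-overwritten data (or fail), depending on the
# dask/zarr versions involved.
y = x + 1
y.to_zarr("data.zarr", overwrite=True)
```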
To be re-assessed in view of #20.
To be re-assessed in view of #27.
Closed with #95. Let's re-open if any weird behavior appears.
Branching from fractal-analytics-platform/fractal-client#62 (comment)
We currently use a custom function to allow `overwrite=True` in `dask.array.to_zarr()`, due to the issue described in dask/dask#5942. Our current code works as follows: each time our custom `to_zarr_custom` is called, it first writes a level to a temporary zarr subfolder, then removes the original one, then moves the temporary one to the original one's path.

In a realistic case with 23 wells of about 45G each (let's say that the high-resolution level takes ~30G), about 700G of temporary folders are created (hopefully not all of them at the same time). When disk space is tight, when (unlikely) several tasks finish around the same time, or for even larger datasets, creating such huge temporary folders could be an issue.
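As an illustration of that pattern, a minimal sketch of such a wrapper, with hypothetical names and signature (the actual `to_zarr_custom` in our code base may differ):

```python
import os
import shutil

import dask.array as da


def to_zarr_custom(array, target_path, **kwargs):
    """Hypothetical sketch of the temp-write-then-swap pattern described above."""
    temp_path = target_path.rstrip("/") + "_tmp"

    # 1) Write the new data to a temporary zarr folder.
    #    This is where the extra temporary disk usage per level comes from.
    da.to_zarr(array, temp_path, **kwargs)

    # 2) Remove the original folder, only after the new data is safely on disk.
    if os.path.isdir(target_path):
        shutil.rmtree(target_path)

    # 3) Move the temporary folder to the original path.
    shutil.move(temp_path, target_path)
```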
Proposed solution: when `overwrite=True`, we may first remove the original subfolder and then write the new one directly into its correct path.
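A minimal sketch of that alternative, again with hypothetical names: remove the existing subfolder first, then let `dask.array.to_zarr()` write directly to the final path, so no temporary copy is created.

```python
import os
import shutil

import dask.array as da


def write_level_overwrite(array, target_path, **kwargs):
    """Hypothetical sketch of the proposed remove-first approach."""
    # Remove the existing zarr subfolder (if any) before writing, so the new
    # data goes straight to its final path and no temporary copy is needed.
    if os.path.isdir(target_path):
        shutil.rmtree(target_path)
    da.to_zarr(array, target_path, **kwargs)
```

Note that this only works if `array` is not itself lazily read from `target_path`, and that a failure in the middle of the write would lose the original data, which is the concern raised in the first comment above.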