Replies: 1 comment 1 reply
-
Hard to say for sure. But typically, lines like the ones in your traceback indicate that one node in a Dask cluster was trying to read something from another worker, but that other worker had unexpectedly died. There could be many reasons for this, but excessive memory usage by your worker processes is a common culprit. At this point, you'd typically want to look at the metrics and logs of the Dask worker pods. Unfortunately, you don't have access to that information since you're running on the Planetary Computer's AKS cluster and we don't expose it. Deploying your own compute in your own Azure subscription would let you collect all of that information. But if you want to keep using the PC Hub, you'll likely need to do some additional diagnostics to understand what's going on in your cluster just before workers start dying.
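One thing you can try from the notebook side is to snapshot worker memory and pull recent logs through the distributed Client while the job runs. A minimal sketch, assuming `cluster` is your existing dask_gateway cluster object (the exact metric keys can vary between distributed versions):

```python
from distributed import Client

client = Client(cluster)  # `cluster` is assumed to be your dask_gateway GatewayCluster

# Snapshot per-worker memory use; workers approaching their limit are a hint
# that they may soon be killed and restarted.
info = client.scheduler_info()
for addr, worker in info["workers"].items():
    used = worker["metrics"].get("memory", 0) / 1e9
    limit = worker.get("memory_limit", 0) / 1e9
    print(f"{addr}: {used:.2f} GB used / {limit:.2f} GB limit")

# Pull recent logs from the workers and scheduler to look for restarts or
# lost-worker messages just before the failure.
worker_logs = client.get_worker_logs(n=50)      # dict: worker address -> log lines
scheduler_logs = client.get_scheduler_logs(n=50)
```

Running something like this periodically (or right before the polygons where the job tends to die) would at least tell you whether memory is climbing toward the limit before workers disappear.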
-
I finally have my job running smoothly. The job loops over the Brazilian jurisdiction polygons listed below and runs a parallelized Monte Carlo-based raster analysis for each one, and it has produced correct results for a smaller test case. However, now that I've scaled it to all polygons, I am getting strange behavior: back-to-back runs of identical code break at different polygons (i.e., at different points in the total workload), which indicates that it's not a particular polygon that is causing the problem. The polygons I've seen it stop at include 4.6_2, 4.6_2, 4.8_2, 4.6_2, 4.20_2, ... So it generally stops within a certain range of polygons (and I have confirmed that these can all be computed successfully when I limit the dataset to just them), which suggests to me that the problem has to do with wall-clock time rather than with progress through the data or anything particular embedded in the data.
The traceback just shows that my code's call to the .compute() method of a dask object (my main output for each polygon) immediately leads to a whole stack of interconnected dask, dask_gateway, distributed, and tornado errors, all of which seem to suggest that an IO stream closed and/or the client shut down (e.g., concurrent.futures._base.CancelledError: [...] distributed.client - ERROR - cannot schedule new futures after shutdown, tornado.iostream.StreamClosedError: Stream is closed, RuntimeError: cannot schedule new futures after shutdown, etc.). I'm not sure how to diagnose this, or whether it could trace back to something I've done wrong, because it doesn't break at an identical point in the workflow.
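For reference, here is a simplified sketch of the loop's structure (the data, function, and variable names are placeholders, not my real code):

```python
import dask.array as da

# `jurisdiction_polygons` and `build_monte_carlo_analysis` are stand-ins for the
# real data and analysis code; the stub below just builds a lazy dask computation.
jurisdiction_polygons = {"4.6_2": None, "4.8_2": None, "4.20_2": None}

def build_monte_carlo_analysis(polygon):
    # Stand-in for the real parallelized Monte Carlo raster analysis.
    return da.random.random((1000, 1000), chunks=(250, 250)).mean()

results = {}
for polygon_id, polygon in jurisdiction_polygons.items():
    lazy_result = build_monte_carlo_analysis(polygon)
    # The failure happens here: .compute() raises CancelledError /
    # StreamClosedError / "cannot schedule new futures after shutdown",
    # at a different polygon on each run.
    results[polygon_id] = lazy_result.compute()
```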
Any ideas what might be going on and/or how to resolve it?
Thanks!
FULL TRACEBACK: