Replies: 1 comment 1 reply
-
Hard to say for sure. But typically, lines like the ones in your traceback indicate that one node in a Dask cluster was trying to read something from another worker, but that other worker had unexpectedly died. There could be many reasons for this, but excessive memory usage by your worker processes is a common culprit. At this point, you'd typically want to look at the metrics and logs of the Dask worker pods. Unfortunately, you don't have access to that information since you're running on the Planetary Computer's AKS cluster and we don't expose it. Deploying your own compute in your own Azure subscription would let you collect all of that information. But if you want to keep using the PC Hub, you'll likely need to do some additional diagnostics to understand what's going on in your cluster just before workers start dying.
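One thing you can try from the notebook side is to snapshot worker memory and pull recent logs through the distributed Client while the job runs. A minimal sketch, assuming `cluster` is your existing dask_gateway cluster object (the exact metric keys can vary between distributed versions):

```python
from distributed import Client

client = Client(cluster)  # `cluster` is assumed to be your dask_gateway GatewayCluster

# Snapshot per-worker memory use; workers approaching their limit are a hint
# that they may soon be killed and restarted.
info = client.scheduler_info()
for addr, worker in info["workers"].items():
    used = worker["metrics"].get("memory", 0) / 1e9
    limit = worker.get("memory_limit", 0) / 1e9
    print(f"{addr}: {used:.2f} GB used / {limit:.2f} GB limit")

# Pull recent logs from the workers and scheduler to look for restarts or
# lost-worker messages just before the failure.
worker_logs = client.get_worker_logs(n=50)      # dict: worker address -> log lines
scheduler_logs = client.get_scheduler_logs(n=50)
```

Running something like this periodically (or right before the polygons where the job tends to die) would at least tell you whether memory is climbing toward the limit before workers disappear.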
-
I finally have my job running smoothly. The job loops over the Brazilian jurisdiction polygons listed below and runs a parallelized Monte Carlo-based raster analysis for each one, and it has produced correct results for a smaller test case. However, now that I've scaled it to all polygons, I am getting strange behavior: back-to-back runs of identical code break at different polygons (i.e., at different points in the total workload), which indicates that it's not a particular polygon that is causing the problem. The polygons I've seen it stop at include 4.6_2, 4.6_2, 4.8_2, 4.6_2, 4.20_2, ... So it generally stops within a certain range of polygons (and I have confirmed that these can all be computed successfully when I limit the dataset to just them), which suggests to me that the problem has to do with wall-clock time rather than with progress through the data or anything particular embedded in the data.
The traceback just shows that my code's call to the .compute() method of a dask object (my main output for each polygon) immediately leads to a whole stack of interconnected dask, dask_gateway, distributed, and tornado errors, all of which seem to suggest that an IO stream closed and/or the client shut down (e.g., concurrent.futures._base.CancelledError: [...] distributed.client - ERROR - cannot schedule new futures after shutdown, tornado.iostream.StreamClosedError: Stream is closed, RuntimeError: cannot schedule new futures after shutdown, etc.). I'm not sure how to diagnose this, or whether it could trace back to something I've done wrong, because it doesn't break at an identical point in the workflow.
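For reference, here is a simplified sketch of the loop's structure (the data, function, and variable names are placeholders, not my real code):

```python
import dask.array as da

# `jurisdiction_polygons` and `build_monte_carlo_analysis` are stand-ins for the
# real data and analysis code; the stub below just builds a lazy dask computation.
jurisdiction_polygons = {"4.6_2": None, "4.8_2": None, "4.20_2": None}

def build_monte_carlo_analysis(polygon):
    # Stand-in for the real parallelized Monte Carlo raster analysis.
    return da.random.random((1000, 1000), chunks=(250, 250)).mean()

results = {}
for polygon_id, polygon in jurisdiction_polygons.items():
    lazy_result = build_monte_carlo_analysis(polygon)
    # The failure happens here: .compute() raises CancelledError /
    # StreamClosedError / "cannot schedule new futures after shutdown",
    # at a different polygon on each run.
    results[polygon_id] = lazy_result.compute()
```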
Any ideas what might be going on and/or how to resolve it?
Thanks!
FULL TRACEBACK: