Patch Distributed UCX comms to allow configuring connect timeout #80
cuda_visible_device, cuda_context_created.device_info, os.getpid()
)

connect_timeout = dask.config.get("distributed.comm.ucx.connect-timeout", None)
This line is added.
# that don't override ucx_config or existing slots in the
# environment, so the user's external environment can safely
# override things here.
ucp.init(options=ucx_config, env_takes_precedence=True, connect_timeout=connect_timeout)
And `connect_timeout` is added to this line as well. This and the line above are the only changes to the original UCX comms in Distributed.
@pentschev can you also provide an example of using this config option?
In Dask you just either set:
Or via environment variable:
I don't think this is the place to document that, though; we should do that in Dask-CUDA. I'll try to do that tomorrow.
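For reference, here is a minimal sketch of what the two forms could look like. The key name comes from this PR; the value `30` is an arbitrary illustration, and treating it as seconds is an assumption, since neither the unit nor the default is stated in this thread. The environment variable name is derived from Dask's usual `DASK_` prefix and double-underscore nesting convention rather than copied from the original comment.

```python
import dask

KEY = "distributed.comm.ucx.connect-timeout"

# Assumption: 30 is an arbitrary illustrative value, presumably in seconds;
# this thread does not state the unit or the default.
with dask.config.set({KEY: 30}):
    # Same lookup the patched ucx.py performs before calling ucp.init().
    timeout = dask.config.get(KEY, None)

print(timeout)

# Environment-variable form (set before the worker/scheduler process starts),
# following Dask's DASK_ prefix and double-underscore nesting convention:
#   DASK_DISTRIBUTED__COMM__UCX__CONNECT_TIMEOUT=30
```

Using `dask.config.set` as a context manager keeps the override local to the block; calling it without `with` would apply the value for the rest of the process.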
Thanks @quasiben and @galipremsagar.
/merge
As promised (a day late), I've opened rapidsai/dask-cuda#1428 to include this in the Dask-CUDA docs.
A new configuration for the UCX comms module was introduced in rapidsai/rapids-dask-dependency#80. It is designed to help with timeouts in larger clusters, and sometimes even small ones depending on the architecture. This change documents that new configuration.
Authors:
- Peter Andreas Entschev (https://github.com/pentschev)
Approvers:
- Benjamin Zaitlen (https://github.com/quasiben)
URL: #1428
This feature is required for RAPIDS 25.02, so we cannot upstream it and wait for a Distributed release in time for the RAPIDS release.
The code here is a verbatim copy of Distributed's ucx.py, only adding a new variable `distributed.comm.ucx.connect-timeout` that allows controlling the timeout of the UCX exchange peer info procedure (implemented in rapidsai/ucx-py#1103).