Patch Distributed UCX comms to allow configuring connect timeout #80

Merged 2 commits on Jan 21, 2025

pentschev
Member

This feature is required for RAPIDS 25.02, so we cannot upstream it and wait for a Distributed release in time for the RAPIDS release.

The code here is a verbatim copy of Distributed's ucx.py, adding only a new configuration option, distributed.comm.ucx.connect-timeout, which controls the timeout of UCX's peer info exchange procedure (implemented in rapidsai/ucx-py#1103).

@pentschev pentschev requested a review from a team as a code owner January 21, 2025 18:25
@pentschev pentschev added the feature request (New feature or request) and non-breaking (Introduces a non-breaking change) labels Jan 21, 2025
cuda_visible_device, cuda_context_created.device_info, os.getpid()
)

connect_timeout = dask.config.get("distributed.comm.ucx.connect-timeout", None)
Member Author

This line is added.

# that don't override ucx_config or existing slots in the
# environment, so the user's external environment can safely
# override things here.
ucp.init(options=ucx_config, env_takes_precedence=True, connect_timeout=connect_timeout)
Member Author

connect_timeout is added to this line as well. This and the line above are the only changes to the original UCX comms in Distributed.

@quasiben
Member

@pentschev can you also provide an example of using this config option?

@pentschev
Member Author

> @pentschev can you also provide an example of using this config option?

In Dask you can either set it in Python:

dask.config.set({"distributed.comm.ucx.connect-timeout": 120})

Or via environment variable:

DASK_DISTRIBUTED__COMM__UCX__CONNECT_TIMEOUT=120

I don't think this is the place to document that, though; we should do that in Dask-CUDA. I'll try to do that tomorrow.
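
For reference, the two forms above name the same option. Dask's documented convention is to take environment variables prefixed with DASK_, lowercase them, turn double underscores into nesting, and treat underscores and hyphens in key names interchangeably. A minimal sketch of that mapping follows; env_var_to_config_key is a hypothetical helper written for illustration, not part of Dask, and it only mimics the convention rather than reproducing Dask's implementation:

```python
# Illustrative sketch only: mimics Dask's documented naming convention for
# environment variables. `env_var_to_config_key` is a hypothetical helper,
# not a Dask API.


def env_var_to_config_key(name: str) -> str:
    """Map a DASK_* environment variable name to a dotted config key."""
    prefix = "DASK_"
    if not name.startswith(prefix):
        raise ValueError(f"not a Dask config variable: {name!r}")
    # Drop the prefix, lowercase, and turn double underscores into nesting dots.
    key = name[len(prefix):].lower().replace("__", ".")
    # Dask treats "_" and "-" in key names interchangeably; normalize to "-"
    # here so the result can be compared with the hyphenated spelling.
    return key.replace("_", "-")


key = env_var_to_config_key("DASK_DISTRIBUTED__COMM__UCX__CONNECT_TIMEOUT")
print(key)
```

Note that the environment variable generally needs to be set before the worker or scheduler process starts, since that is when Dask collects its configuration.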

@pentschev
Member Author

Thanks @quasiben and @galipremsagar.

@pentschev
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 4b8d9f0 into rapidsai:branch-25.02 Jan 21, 2025
9 checks passed
@pentschev pentschev deleted the ucx-connect-timeout branch January 22, 2025 15:33
@pentschev
Member Author

As promised (a day late), I've opened rapidsai/dask-cuda#1428 to include this in the Dask-CUDA docs.

rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Jan 23, 2025
A new configuration option for the UCX comms module was introduced in rapidsai/rapids-dask-dependency#80. It is designed to help with timeouts in larger clusters, and sometimes even in small ones, depending on the architecture. This change documents that new configuration.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #1428