Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to retry when looking for hot added device nodes #2040

Merged

Conversation

katiewasnothere
Copy link
Contributor

There are two main issues to solve when hot adding devices in the UVM:

  • We have no generic way to know when a device is actually ready to use and the relevant /dev nodes have been created.
  • We have no way to know how many /dev nodes to look for unless we had device and vendor specific handling in our code.

Given that our existing support for assigned devices attempts to handle devices in a very device- and vendor- generic way, there is a race condition between when we hot add devices to the UVM for the first time during container creation and when the /dev nodes have been created. We attempt to minimize this by adding retry logic for when we try to find related /dev nodes.

This does not entirely solve our problems, since there could be a case where we find only a subset of the desired /dev nodes and return. However, for our current testing and use cases 500ms seems to be enough to capture all /dev nodes. We should investigate alternatives to this approach.

@katiewasnothere katiewasnothere requested a review from a team as a code owner February 27, 2024 20:38
Copy link
Contributor

@helsaawy helsaawy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits aside, does the context used for devicePathsFromPCIPath have a timeout to prevent infinite looping?
also, rather than a long retry time, maybe we add an initial delay (~250ms or something) to allow the devices to show up?

@katiewasnothere
Copy link
Contributor Author

nits aside, does the context used for devicePathsFromPCIPath have a timeout to prevent infinite looping? also, rather than a long retry time, maybe we add an initial delay (~250ms or something) to allow the devices to show up?

@helsaawy I'll add in an explicit deadline before we call devicePathsFromPCIPath. For the initial delay, I'm hesitant to do that just because if the dev nodes have all shown up by the time we try to find them, which does sometimes happen, we're adding an unnecessary delay, which I'm concerned the perf team would be unhappy with.

Copy link
Contributor

@anmaxvl anmaxvl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, given other feedback is addressed.

Signed-off-by: Kathryn Baldauf <kabaldau@microsoft.com>
@katiewasnothere katiewasnothere force-pushed the kabaldau/retry_find_dev_nodes branch from 2120088 to 3d466f9 Compare March 7, 2024 02:22
@katiewasnothere katiewasnothere merged commit 02a899c into microsoft:main Mar 13, 2024
18 of 19 checks passed
@katiewasnothere katiewasnothere deleted the kabaldau/retry_find_dev_nodes branch March 13, 2024 17:46
princepereira pushed a commit to princepereira/hcsshim that referenced this pull request Aug 29, 2024
…_find_dev_nodes

Add the ability to retry when looking for hot added device nodes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants