Race between persistentdw and destroy_persistent #169
Comments
We determined that this is most likely a Flux issue and that there is no race condition here. What is happening is that Flux is picking different computes, and those computes are not attached to the same rabbit that holds the persistent filesystem. So if a persistent gfs2 filesystem is created on rabbit-0 and a compute attached to rabbit-1 then tries to use it, that compute cannot mount the filesystem. This does not appear to be an issue with Lustre, since it can be mounted from anywhere.
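The locality constraint described above can be illustrated with a small check. This is only a sketch under assumptions: the compute-to-rabbit map, the node names, and the `unreachableComputes` helper are all hypothetical and do not come from Flux or the NNF software. It flags which allocated computes sit behind a different rabbit than the one holding a gfs2 filesystem.

```go
package main

import "fmt"

// computeToRabbit is a hypothetical topology map (illustrative names only):
// each compute node is attached to exactly one rabbit.
var computeToRabbit = map[string]string{
	"compute-00": "rabbit-0",
	"compute-01": "rabbit-0",
	"compute-16": "rabbit-1",
}

// unreachableComputes returns the computes that cannot mount a gfs2
// filesystem living on the given rabbit because they are attached to a
// different rabbit. Computes missing from the topology map are also
// reported. Per the comment above, Lustre does not have this restriction.
func unreachableComputes(rabbit string, computes []string, topology map[string]string) []string {
	var bad []string
	for _, c := range computes {
		if topology[c] != rabbit {
			bad = append(bad, c)
		}
	}
	return bad
}

func main() {
	// Persistent gfs2 was created on rabbit-0, but the scheduler also
	// handed out a compute that sits behind rabbit-1.
	bad := unreachableComputes("rabbit-0", []string{"compute-00", "compute-16"}, computeToRabbit)
	fmt.Println("computes that cannot mount the gfs2 filesystem:", bad)
}
```

A scheduler that respected this constraint would only pick computes for which such a check returns an empty list.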
@bdevcich can this be closed?
Yes. Do you have the Flux issue we can link?
Flux issue: flux-framework/flux-coral2#170
With system-test, where workflows run in parallel (J>1), the persistent usage tests can race with the destroy case. The workflow using the persistent storage can't finish PreRun because the destroy workflow beats it. Both workflows then stay stuck until the usage workflow is removed. So we end up with:
The NnfAccess for the usage workflow says:
There's no ClientMount yet.
The destroy workflow says:
Could the destroy workflow, while in Proposal, check the DirectiveBreakdowns for any use of the persistent storage name before it is allowed to leave Proposal? That way, as long as any usage workflow is past Proposal, there should be a DirectiveBreakdown that contains the persistent name:
Then the destroy workflow can't leave Proposal until no DirectiveBreakdowns containing that persistent name are left.
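A minimal sketch of the proposed check, assuming each DirectiveBreakdown carries its raw `#DW` directive string and that usage directives have the form `#DW persistentdw name=<name>`. The `directiveBreakdown` type and the `directiveArgs` and `persistentInUse` helpers are hypothetical simplifications, not the actual NNF resource types or controller code.

```go
package main

import (
	"fmt"
	"strings"
)

// directiveBreakdown is a hypothetical, stripped-down stand-in for a
// DirectiveBreakdown; only its name and raw #DW directive are modeled.
type directiveBreakdown struct {
	Name      string
	Directive string // e.g. "#DW persistentdw name=my-gfs2"
}

// directiveArgs splits "#DW <command> key=value ..." into the command and a
// map of key=value arguments. Tokens without '=' are ignored in this sketch.
func directiveArgs(dw string) (string, map[string]string) {
	fields := strings.Fields(dw)
	args := map[string]string{}
	if len(fields) < 2 || fields[0] != "#DW" {
		return "", args
	}
	for _, f := range fields[2:] {
		if k, v, ok := strings.Cut(f, "="); ok {
			args[k] = v
		}
	}
	return fields[1], args
}

// persistentInUse returns the names of breakdowns that still reference the
// named persistent storage through a persistentdw directive. Under the
// proposal, a destroy_persistent workflow stays in Proposal while this
// list is non-empty.
func persistentInUse(name string, breakdowns []directiveBreakdown) []string {
	var users []string
	for _, b := range breakdowns {
		cmd, args := directiveArgs(b.Directive)
		if cmd == "persistentdw" && args["name"] == name {
			users = append(users, b.Name)
		}
	}
	return users
}

func main() {
	bds := []directiveBreakdown{
		{Name: "usage-wf-0", Directive: "#DW persistentdw name=my-gfs2"},
		{Name: "destroy-wf-0", Directive: "#DW destroy_persistent name=my-gfs2"},
		{Name: "other-wf-0", Directive: "#DW persistentdw name=other"},
	}
	if users := persistentInUse("my-gfs2", bds); len(users) > 0 {
		fmt.Println("destroy_persistent stays in Proposal; still used by:", users)
	} else {
		fmt.Println("destroy_persistent may leave Proposal")
	}
}
```

Matching on the parsed `name=` argument rather than a substring search avoids false positives such as `my-gfs2` matching `my-gfs2-old`.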