Race between persistentdw and destroy_persistent #169

Closed
bdevcich opened this issue Jun 20, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@bdevcich (Contributor)
With system-test running workflows in parallel (J>1), we can hit a case where the persistent-storage usage tests race with the destroy case. What happens is that the workflow using the persistent storage can't finish PreRun because the destroy workflow beats it there. Both workflows are then stuck until the usage workflow is removed.

So we end up with:

$ kubectl get workflows
NAME                         STATE      READY   STATUS       JOBID         AGE
fluxjob-172781307426766848   PreRun     false   DriverWait   fQGBxxiWiv3   30m
fluxjob-172781996030820352   Teardown   false   Error        fQGCH3r6VR9   29m

NnfAccess for the usage workflow says:

status:
  error:
    debugMessage: 'unable to create ClientMount resources: ClientMount.dataworkflowservices.github.io
      "default-fluxjob-172781307426766848-0-computes" is invalid: spec.mounts: Invalid
      value: 0: spec.mounts in body should have at least 1 items'
    severity: Minor
    type: Internal
    userMessage: unable to mount file system on client nodes

There's no ClientMount yet.

The destroy workflow says:

  message: 'DW Directive 0: User error: persistent storage cannot be deleted while
    in use'
  ready: false
  readyChange: "2024-06-20T20:09:44.290609Z"
  state: Teardown
  status: Error

Could the destroy workflow check the DirectiveBreakdowns for any use of the directive name before it is allowed to leave Proposal? That way, as long as any usage workflow is out of Proposal, there should be a DirectiveBreakdown that contains the persistent name:

  directive: '#DW persistentdw name=persistent-xfs-7c8c30f2'

Then the destroy workflow can't get out of Proposal until there are no DirectiveBreakdowns left that contain that persistent name.
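
For illustration, here is a rough sketch of the kind of check being proposed. It assumes a controller-runtime client, and the import path and field names for the DWS DirectiveBreakdown type (e.g. Spec.Directive holding the raw #DW string) are illustrative and may not match the real API exactly:

package sketch

import (
	"context"
	"strings"

	"sigs.k8s.io/controller-runtime/pkg/client"

	dwsv1alpha2 "github.com/DataWorkflowServices/dws/api/v1alpha2"
)

// persistentStorageInUse reports whether any DirectiveBreakdown still
// references the persistent storage name, i.e. some usage workflow has made
// it out of Proposal. The destroy workflow would stay in Proposal while this
// returns true.
func persistentStorageInUse(ctx context.Context, c client.Client, name string) (bool, error) {
	breakdowns := &dwsv1alpha2.DirectiveBreakdownList{}
	if err := c.List(ctx, breakdowns); err != nil {
		return false, err
	}

	for _, db := range breakdowns.Items {
		// Example directive: '#DW persistentdw name=persistent-xfs-7c8c30f2'.
		// A real implementation would parse directive arguments rather than
		// substring-match like this simplified sketch does.
		if strings.Contains(db.Spec.Directive, "persistentdw") &&
			strings.Contains(db.Spec.Directive, "name="+name) {
			return true, nil
		}
	}
	return false, nil
}

With a check like this gating the destroy workflow's exit from Proposal, destroy_persistent would wait until every usage workflow's DirectiveBreakdowns are gone.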

@bdevcich added the bug label on Jun 20, 2024
@bdevcich (Contributor, Author)

We determined that this is most likely a flux issue and there is no race condition here.

What is happening is that flux is picking computes that are not attached to the same rabbit that hosts the persistent filesystem.

So if persistent gfs2 is created on rabbit-0 and then a compute from rabbit-1 tries to use it, it cannot mount the filesystem on the compute.

This does not appear to be an issue with lustre, since it can be mounted from anywhere.
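
To make the locality constraint concrete, here is a tiny hypothetical illustration (not code from flux or nnf-sos): block-based filesystems like gfs2 or xfs are only reachable from computes attached to the rabbit that hosts the storage, while lustre, as a network filesystem, can be mounted from any compute:

// Hypothetical sketch of the mount-locality rule; the inputs are made up.
func computeCanMount(fsType, storageRabbit, computeRabbit string) bool {
	// Lustre is a network filesystem, so any compute node can mount it.
	if fsType == "lustre" {
		return true
	}
	// gfs2/xfs/raw storage is presented over the compute's direct connection
	// to its rabbit, so the compute must be attached to the hosting rabbit.
	return computeRabbit == storageRabbit
}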

@matthew-richerson (Contributor)

@bdevcich can this be closed?

@bdevcich (Contributor, Author)

> @bdevcich can this be closed?

Yes. Do you have the flux issue we can link?

github-project-automation bot moved this from 📋 Open to ✅ Closed in Issues Dashboard on Jul 10, 2024
@matthew-richerson (Contributor)

Flux issue: flux-framework/flux-coral2#170
