GEJobRunner handle jobs which don't finish cleanly (e.g. killed by qmaster) #111

Open
pjbriggs opened this issue Aug 15, 2019 · 0 comments

The GEJobRunner class currently cannot handle jobs which don't finish cleanly, for example because they have been killed by qmaster for exceeding time or resource limits (e.g. memory usage), or because they have been killed by an external qdel invocation.

One way to mitigate this is to trap and handle the signals that qmaster sends when jobs are killed in this manner. In principle this could be done by submitting jobs with the -notify option of qsub:

-notify
          ...
          This flag, when set, causes Grid Engine to send  "warning"  signals  to  a running job
          prior to sending the signals themselves. If a SIGSTOP is pending, the job will receive
          a SIGUSR1  several seconds  before  the  SIGSTOP.  If a SIGKILL is pending, the job
          will receive a SIGUSR2 several seconds before the SIGKILL.  This option provides the
          running job a configured time interval to do cleanup operations before receiving the
          SIGSTOP or SIGKILL.
          ...

The SIGUSR1 and SIGUSR2 signals could potentially be handled using the trap functionality in the bash wrapper scripts that GEJobRunner uses to manage each job, e.g.

#!/bin/bash
...
function handle_sigusr1() {
    # Write exit code file here and exit
    exit 1  # placeholder non-zero status
}
trap handle_sigusr1 SIGUSR1
...
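
A slightly fuller sketch of the same idea follows, with some assumptions stated up front. The names run_wrapped and handle_warning and the sentinel statuses 98 and 99 are hypothetical and not part of GEJobRunner. The job is started in the background and the wrapper blocks in wait, because bash only runs a trap handler once the current foreground command has finished, whereas a trapped signal interrupts wait immediately. Per the -notify description quoted above, SIGUSR2 is the warning that precedes SIGKILL, so that is probably the more relevant signal for jobs killed by qmaster or qdel.

```shell
#!/bin/bash
# Sketch only: run_wrapped, handle_warning and the sentinel codes 98/99 are
# hypothetical names/values, not part of GEJobRunner.

# Record a status in the exit code file, stop the job, and leave the wrapper.
handle_warning() {
    status="$1"
    # Pass the termination on to the job if it has been started.
    if [ -n "$job_pid" ]; then
        kill -TERM "$job_pid" 2>/dev/null
    fi
    echo "$status" > "$exit_code_file"
    exit "$status"
}

# Usage: run_wrapped EXIT_CODE_FILE COMMAND [ARGS...]
run_wrapped() {
    exit_code_file="$1"
    shift
    job_pid=""
    # With -notify, SIGUSR1 precedes SIGSTOP and SIGUSR2 precedes SIGKILL.
    trap 'handle_warning 98' USR1
    trap 'handle_warning 99' USR2
    # Run the job in the background and block in 'wait', so that a trapped
    # signal interrupts the wait and the handler runs straight away.
    "$@" &
    job_pid=$!
    wait "$job_pid"
    status=$?
    # Normal completion: restore default handling and record the real status.
    trap - USR1 USR2
    echo "$status" > "$exit_code_file"
    return "$status"
}
```

A wrapper script would then call something like `run_wrapped exit_code.txt my_command arg1 arg2` (file and command names are illustrative). The handler has to finish within the notification interval configured for the queue, otherwise the SIGKILL arrives before the exit code file is written.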

It's not clear if this would work in practice (but handling these situations would be useful).
