The GEJobRunner class currently cannot handle jobs which don't finish cleanly, for example because they have been killed by qmaster for exceeding time or resource limits (e.g. memory usage), or because they were killed by an external qdel invocation.
One possible mitigation is to trap and handle the signals sent by qmaster when jobs are killed in this manner. In principle this could be done by submitting jobs with the -notify option of qsub:
-notify
...
This flag, when set, causes Grid Engine to send "warning" signals to a running job
prior to sending the signals themselves. If a SIGSTOP is pending, the job will receive
a SIGUSR1 several seconds before the SIGSTOP. If a SIGKILL is pending, the job
will receive a SIGUSR2 several seconds before the SIGKILL. This option provides the
running job a configured time interval to do cleanup operations before receiving the
SIGSTOP or SIGKILL.
...
The SIGUSR1 and SIGUSR2 signals could potentially be handled by the trap functionality in the bash wrapper scripts used for managing each job within GEJobRunner, e.g.
#!/bin/bash
...
function handle_sigusr1() {
    # Write exit code file here and exit
    :  # no-op placeholder so the function body is valid bash
}
trap handle_sigusr1 SIGUSR1
...
It's not clear if this would work in practice (but handling these situations would be useful).
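To make the idea more concrete, here is a minimal sketch of what such a wrapper might look like. The names run_job and EXIT_CODE_FILE, and the choice of recording 128 plus the signal number as the exit code, are assumptions made for illustration and are not taken from GEJobRunner. The command is started in the background and waited on, because bash only runs a trap handler once the current foreground command has finished, whereas the wait builtin returns as soon as a trapped signal arrives. Whether Grid Engine actually delivers the warning signals to the wrapper in this way has not been tested here.

```shell
#!/bin/bash
# Sketch only: run_job, EXIT_CODE_FILE and the 128+N exit codes are
# illustrative assumptions, not part of GEJobRunner.

# File where the wrapper records the job's exit code (hypothetical name)
EXIT_CODE_FILE="${EXIT_CODE_FILE:-job.exit_code}"

run_job() {
    # Use a subshell so the traps and 'exit' stay local to this job
    (
        job_pid=""
        handle_signal() {
            # $1 is the signal name (USR1 or USR2): stop the job if it
            # has been started, then record 128 + signal number
            [ -n "$job_pid" ] && kill "$job_pid" 2>/dev/null
            code=$((128 + $(kill -l "$1")))
            echo "$code" > "$EXIT_CODE_FILE"
            exit "$code"
        }
        trap 'handle_signal USR1' SIGUSR1
        trap 'handle_signal USR2' SIGUSR2

        # Run the command in the background so that 'wait' can be
        # interrupted by a trapped signal
        "$@" &
        job_pid=$!
        wait "$job_pid"
        code=$?

        # Normal completion: record the command's own exit code
        echo "$code" > "$EXIT_CODE_FILE"
        exit "$code"
    )
}
```

A wrapper script would then call something like `run_job my_command arg1 arg2` (my_command being a placeholder) and the runner could read the exit code file afterwards, even if the job was stopped following a SIGUSR1 or SIGUSR2 warning.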