Hi, All.

I have a job that I have finished debugging and am ready to scale up and submit through `kbatch`. However, that of course brought its own set of issues that needed to be debugged. I have worked all of those out, as far as I can tell, but now I can no longer get the job to even start running (e.g., I have a job 'running' now for 5 days that has yet to even print the `print` statement at the top of the script).

I've been racking my brain trying to figure out what's going wrong and carefully rereading the `kbatch` docs. In doing that, it occurred to me to run `kbatch pod list`. To my chagrin, I find that jobs stretching back two weeks still have pods labeled 'Running' (12 total), even though I explicitly ran `kbatch job delete` for each one, and the kbatch docs state that that should 'delete a job, cancelling running pods'.

I'm wondering if all those still-running pods are sucking up resources, preventing me from relaunching my job, and if so, how I can actually cancel them and release the resources for reuse. (I picked through the kbatch docs but did not see anything I could easily repurpose to do this myself, though I did see that the docstring for `_backend.py` states that 'kbatch users do not have access to the Kubernetes API', which gives me the impression that I may not be able to cancel them myself...)

Thanks for any advice you can provide!
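For concreteness, the kind of check I have in mind looks something like this. It's a minimal sketch that uses only the two commands above; the job name, the assumption that `kbatch job delete` takes the job name, and the assumption that `kbatch pod list` prints a text table whose rows include the pod status are all just illustrative on my part.

```python
# Minimal sketch: delete a job, then poll `kbatch pod list` until nothing is
# reported as Running. Assumes `kbatch job delete` accepts the job name and
# that the pod list output is a text table containing a status column.
import subprocess
import time


def count_running_pods() -> int:
    """Count lines of `kbatch pod list` output that mention a Running pod."""
    out = subprocess.run(
        ["kbatch", "pod", "list"], capture_output=True, text=True, check=True
    ).stdout
    return sum(1 for line in out.splitlines() if "Running" in line)


def delete_and_wait(job_name: str, timeout_s: int = 600) -> None:
    """Delete a job, then wait until no pods at all show up as Running."""
    subprocess.run(["kbatch", "job", "delete", job_name], check=True)
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        n = count_running_pods()
        print(f"{n} pod(s) still Running")
        if n == 0:
            return
        time.sleep(30)
    print("Timed out; the pods may need to be cleaned up on the server side.")
```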
Replies: 1 comment

NEVERMIND! The problem causing the job to run indefinitely was entirely my dumb fault (not catching an edge case that caused an infinite while loop). Nevertheless, I'm still curious whether I can/should be closing still-running pods from previous jobs. Thanks!
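For anyone curious about the infinite-loop failure mode: the sketch below is purely illustrative (not the actual job code), but it shows the general shape, a while loop that re-queues a failing item forever unless an edge-case guard bounds the retries.

```python
# Purely illustrative: a while loop that never terminates because a failing
# item keeps getting re-queued. Bounding the retries guarantees the loop exits.
from collections import deque


def do_work(item):
    # Stand-in for the real per-item work; fails on a hypothetical bad input.
    if item is None:
        raise ValueError("cannot process an empty item")
    return item * 2


def process_all(items, max_retries=3):
    """Process items, retrying failures a bounded number of times."""
    queue = deque((item, 0) for item in items)
    results, failed = [], []
    while queue:
        item, attempts = queue.popleft()
        try:
            results.append(do_work(item))
        except ValueError:
            if attempts + 1 < max_retries:
                queue.append((item, attempts + 1))  # bounded re-queue
            else:
                failed.append(item)  # give up instead of spinning forever
    return results, failed


if __name__ == "__main__":
    print(process_all([1, 2, None, 4]))  # -> ([2, 4, 8], [None])
```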