Design and plan integration of MPI job to dev cluster #8
Comments
@chuckbelisle I edited the description to reflect the next steps: integrating the oms web-ui with the Kubernetes MPIJob manifest. Both currently run separately and are not yet integrated. Is that the scope of work you want to target? If not, feel free to clarify.
Some discovery of the current Golang API code that handles the template replacement: https://github.com/openmpp/go/blob/1b06e1a3e2d52d054ebc5073a5b3520aaed1732a/oms/runModel.go#L545C5-L545C5 https://github.com/openmpp/go/blob/1b06e1a3e2d52d054ebc5073a5b3520aaed1732a/oms/runModel.go#L564C76-L564C76 One option would be to do the template replacement on the Kubernetes manifest file as a separate step, then feed the result into the current template process, which would just take the filename as an argument with kubectl as the command.
Learning Golang to be able to modify the templates.
Made initial commit with a partial solution. It involves translating the job run request arguments submitted to the openmpp web service into an MPIJob manifest and deploying it. Please refer to this commit for details: jacek-dudek/StatCan-aaw-kubeflow-containers@1becd31
@jacek-dudek for testing you can start a jupyterlab notebook in aaw-dev, and create these files manually, just be aware that the ompp install directory will get reset when you stop your notebook. You only need to go through the full commit / docker build and deploy when you need to do something that requires root permissions on the container or when you're ready to complete the ticket.
@jacek-dudek when you have a few moments on Monday could you please update this task?
Found a workaround for the issue of piping script output into a kubectl client command.
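The comment does not say what the workaround was. One common approach, shown here only as an assumption, is to have kubectl read the manifest from standard input with `kubectl apply -f -`. The `runWithStdin` helper below is hypothetical; it feeds a string to any command's stdin and returns the combined output.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

// runWithStdin starts the named command, writes input to its standard
// input, and returns whatever the command printed to stdout and stderr.
func runWithStdin(input string, name string, args ...string) (string, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdin = strings.NewReader(input)
	var out bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &out
	err := cmd.Run()
	return out.String(), err
}

func main() {
	// Placeholder text standing in for the output of the manifest script.
	manifest := "kind: MPIJob\n"

	// "-f -" tells kubectl to read the manifest from standard input.
	out, err := runWithStdin(manifest, "kubectl", "apply", "-f", "-")
	if err != nil {
		fmt.Println("kubectl failed:", err)
	}
	fmt.Print(out)
}
```

Because the command name is a parameter, the helper can be tried locally with something like `cat` before pointing it at a cluster.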
Example of error I'm currently getting when trying to apply MPIJob manifests:
Pat mentioned it looks like an access control issue that needs to be resolved with a role binding.
@chuckbelisle Honestly, I think we're good to close this issue. We have models running in MPI mode on the cluster (although with a primitive implementation).
Epic: #2
You can submit an MPIJob to the AAW dev/prod clusters with a manifest like the example here: https://github.com/StatCan/aaw-private/issues/95
This manifest (which requires the user to specify a name and namespace for testing) connects to blob storage (mounted by default with the appropriate label) and creates a coordinator pod and X worker pods (with customized resources) in the default AAW user node pool. Smaller requests will launch onto existing nodes very quickly, while larger requests will provision a new node (the default user nodes currently have about 64 cores and 256 GB of memory).
An existing model can be configured with the OpenM++ UI, with the run parameters saved into its accompanying database, which the MPIJob will reference and save results to.
Next steps would be to update the UI backend to be able to submit the manifest from the UI directly, possibly using a golang text/template to configure the manifest options.