
Design and plan integration of MPI job to dev cluster #8

Closed
chuckbelisle opened this issue Sep 20, 2023 · 9 comments
@chuckbelisle
Contributor

chuckbelisle commented Sep 20, 2023

Epic: #2

You can submit an MPIJOB to the AAW dev/prod clusters with a manifest like the example here: https://github.com/StatCan/aaw-private/issues/95

This manifest (which requires the user to specify a name/namespace for testing) will connect to blob storage (mounted by default with the appropriate label), then create a coordinator pod and X worker pods (with customized resources) in the default AAW user node pool. Smaller requests will launch onto existing nodes very quickly, while larger requests will provision a new node (the default user nodes are currently ~64 cores with 256 GB of memory).

An existing model can be configured through the OpenM++ UI, with the run parameters saved into its accompanying database; the MPIJob will reference that database and save its results there.

Next steps would be to update the UI backend to be able to submit the manifest from the UI directly, possibly using a Go text/template to configure the manifest options.

@vexingly

vexingly commented Sep 20, 2023

@chuckbelisle I edited the description to reflect the next steps of integrating the oms web UI with the Kubernetes MPIJob manifest, which are both currently running separately but are not integrated. Is that the scope of work you want to target? If not, feel free to clarify.

@vexingly

Some discovery notes on the current Golang API code that handles the template replacement:

https://github.com/openmpp/go/blob/1b06e1a3e2d52d054ebc5073a5b3520aaed1732a/oms/runModel.go#L545C5-L545C5
Here is where it does the variable replacement. One issue is that all of the model arguments are included in a single variable (which contains the number of sub-samples and threads that we would need), so this needs to be expanded.
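Since the arguments currently arrive as one combined string, a first step could be to break them back into individual name/value pairs so sub-sample and thread counts can be pulled out separately. A minimal sketch, assuming the arguments alternate flag/value as in the mpirun command shown later in this thread (the helper itself is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// splitModelArgs breaks a single combined argument string back into
// name/value pairs, so that values such as -OpenM.SubValues and
// -OpenM.Threads can be extracted individually for the MPIJob manifest.
// Assumes flags and values strictly alternate, which holds for the
// openM++ run arguments seen in this thread.
func splitModelArgs(combined string) map[string]string {
	tokens := strings.Fields(combined)
	args := make(map[string]string)
	for i := 0; i+1 < len(tokens); i += 2 {
		args[tokens[i]] = tokens[i+1]
	}
	return args
}

func main() {
	args := splitModelArgs("-OpenM.SubValues 16 -OpenM.Threads 4 -OpenM.RunName demo")
	fmt.Println(args["-OpenM.SubValues"], args["-OpenM.Threads"]) // prints: 16 4
}
```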

https://github.com/openmpp/go/blob/1b06e1a3e2d52d054ebc5073a5b3520aaed1732a/oms/runModel.go#L564C76-L564C76
Here is where it reads the template line by line and removes all of the spaces, which will of course destroy the manifest file, so a different solution will be necessary. I'm not sure if this was done for security or some other reason, but either the Kubernetes manifest needs to be handled by a separate template method, or the logic for how the template file is broken down will need to be changed.

An option would be to do the template replacement on the Kubernetes manifest file as a separate step, then feed that into the current template process, which would just take the filename as an argument with kubectl as the command.

@KrisWilliamson
Contributor

Learning Golang to be able to modify the templates.

@jacek-dudek
Collaborator

Made initial commit with a partial solution. It involves translating the job run request arguments submitted to the openmpp web service into an MPIJob manifest and deploying it.

Please refer to this commit for details: jacek-dudek/StatCan-aaw-kubeflow-containers@1becd31

@vexingly

vexingly commented Oct 3, 2023

@jacek-dudek for testing, you can start a JupyterLab notebook in aaw-dev and create these files manually; just be aware that the ompp install directory will get reset when you stop your notebook.

You only need to go through the full commit / docker build and deploy when you need to do something that requires root permissions on the container or when you're ready to complete the ticket.

@chuckbelisle
Contributor Author

@jacek-dudek when you have a few moments on Monday could you please update this task?

@jacek-dudek
Collaborator

Found a workaround for the issue of piping script output into a kubectl client command.
Resolved some syntax issues with the manifests being output by studying the kubectl errors that were returned.
Currently researching why running an MPI-enabled model on the AAW cluster doesn't work.

chuckbelisle mentioned this issue Oct 17, 2023
@jacek-dudek
Collaborator

jacek-dudek commented Oct 17, 2023

Example of the error I'm currently getting when trying to apply MPIJob manifests:

(base) jovyan@testing-openmpp-0:/opt/openmpp/1.15.4/etc$ kubectl apply -f temp.yaml
Error from server (Forbidden): error when retrieving current configuration of:
Resource: "kubeflow.org/v1, Resource=mpijobs", GroupVersionKind: "kubeflow.org/v1, Kind=MPIJob"
Name: "riskpaths", Namespace: "aaw-team"
Object: &{map["apiVersion":"kubeflow.org/v1" "kind":"MPIJob" "metadata":map["annotations":map["kubectl.kubernetes.io/last-applied-configuration":""] "name":"riskpaths" "namespace":"aaw-team"] "spec":map["mpiReplicaSpecs":map["Launcher":map["replicas":'\x01' "template":map["metadata":map["labels":map["data.statcn.gc.ca/inject-blob-volumes":"true" "sidecar.istio.io/inject":"false"]] "spec":map["containers":[map["command":["mpirun" "-n" "7" "--bind-to" "none" "-wdir" "/home/jovyan/models/bin" "-OpenM.RunStamp" "2023_10_16_17_14_17_436" "-OpenM.LogToConsole" "true" "-OpenM.LogToFile" "false" "-OpenM.SetName" "Default" "-OpenM.LogRank" "true" "-OpenM.MessageLanguage" "en-US" "-OpenM.RunName" "RiskPaths_Default_2023_10_16_13_12_54_034"] "image":"k8scc01covidacr.azurecr.io/ompp-run-ubuntu:d1174244baa102b26f0850c3883e7b15c50a3b64" "name":"riskpaths-launcher" "resources":map["limits":map["cpu":"2" "memory":"2Gi"] "requests":map["cpu":"250m" "memory":"250Mi"]]]]]]] "Worker":map["replicas":'\a' "template":map["spec":map["containers":[map["image":"k8scc01covidacr.azurecr.io/ompp-run-ubuntu:d1174244baa102b26f0850c3883e7b15c50a3b64" "name":"riskpaths-worker" "resources":map["limits":map["cpu":"2" "memory":"2Gi"] "requests":map["cpu":"2" "memory":"1Gi"]]]]]]]] "runPolicy":map["cleanPodPolicy":"Running"]]]}
from server for: "temp.yaml": mpijobs.kubeflow.org "riskpaths" is forbidden: User "system:serviceaccount:charles-belisle:default-editor" cannot get resource "mpijobs" in API group "kubeflow.org" in the namespace "aaw-team"

Pat mentioned it looks like an access control issue that needs to be resolved with a role binding.

@Souheil-Yazji
Contributor

@chuckbelisle Honestly, I think we're good to close this issue. We have models running in MPI mode (although it's a primitive implementation) on the cluster.
